-
Notifications
You must be signed in to change notification settings - Fork 206
Snarls and chains
Variation graphs can become very large and complicated and it is often useful to break down the graph and describe the commonly occurring nested substructures. Variations in the graph often have a common topological motif, with the variant described by one or more nodes that are flanked by two nodes representing conserved sequence. These topological motifs are called snarls. Also see our explainer videos
In the following example, the variation graph represents a single nucleotide variant (SNP). The two alleles of the SNP are represented by nodes 2 and 3, and the conserved sequence flanking the SNP are represented by nodes 1 and 4.
Similarly, for an insertion/deletion (indel), the inserted sequence is represented by node 2, while the deletion is represented by the edge between nodes 1 and 3.
A duplication can be represented with a node that is connected to itself by an edge from the start of the node to the end of the node. A path can traverse the node multiple times by taking this looping edge.
An inversion is represented by a node that can be traversed in either direction in a left-to-right (or right-to-left) traversal of the graph.
Each of these graphs share the common motif where the variant is represented by a subgraph representing the variable sequence, flanked by two nodes representing conserved sequence. We refer to this topological motif as a "snarl". The term "bubble" is also common in the literature. A snarl is a generalization of a bubble. Generally, we use the term snarl to refer to the subgraph not including the two flanking nodes
Formally, a snarl is defined by a pair of node sides,
- separable: splitting the boundary nodes into their two node sides disconnects the snarl from the rest of the graph,
and
- minimal: there is no node z within the snarl that is separable with the boundary nodes.
In this example, the subgraph between nodes 4 and 6 is a snarl, highlighted in green.
By splitting the boundary nodes into their two node sides, the subgraph containing the snarl becomes unreachable with the rest of the graph.
This snarl is delimited by the right side of node 4 and the left side of node 6.
For specificity, the snarl may be written as "Snarl
Snarls often occur successively with a shared boundary node between them. A sequence of consecutive nodes and snarls is called a "chain".
In this example, the graph contains two snarls, snarl 1-4 and snarl 4-6. The two snarls are part of the same chain, chain 1-6. This chain is comprised of the two snarls and nodes 1, 4, and 6.
Snarls and chains may be nested inside of each other. Conceptually, this occurs when multiple variants affect the same locus, such as a SNP that is nested within an insertion. In a graph, a snarl contains another snarl if all of the nodes in the latter snarl are contained in the subgraph of the former. A snarl contains a chain if it contains all of the chains component nodes and snarls.
The nesting relationship of snarls and chains in a graph is described by its "snarl decomposition" or "snarl tree". Each node, snarl, and chain in the graph is represented as a node in the snarl tree. A chain is a child of a snarl if it is contained in the snarl and no other descendant of the snarl. A node or snarl is a child of a chain if it is a component of the chain. Every node and snarl is a child of a chain; a chain containing just one node is called a “trivial chain”. Because of this, the snarl tree is composed of alternating layers of snarls and chains, usually starting with a top-level chain as the root.
In this example, chain 1-10 is the top-level chain containing the entire variation graph. Chain 1-10 is comprised of 6 children: nodes 1, 4, 9, and 10, and snarls 1-4 and 4-9. Snarl 1-4 contains two children, chains 2 and 3, each of which are trivial chains containing only one node. Between nodes 9 and 10 is a "trivial snarl" containing only an edge between the boundary nodes. These are generally omitted from the snarl tree as they are implied by two consecutive nodes.
Nodes, snarls, and chains all have the property that they are connected to the rest of the graph by two node sides. Based on this property, snarls and chains can be treated as nodes within their parents, obscuring the presence of a subgraph between the boundaries of the child. A “netgraph” is a view of a snarl where its child chains are replaced by nodes.
This example shows the netgraph-view of snarl 1-12. In this view, chains 2-7 and 8-11 are replaced by a single node that represents the subgraph of the chain. Because these chains are also reachable by two node sides, the chains can be replaced by a single node without changing the topology of the snarl.
In general, variation graphs build from human data tend to have very long top-level chains with few, simple snarls and shallow nesting. However, there can be larger more complicated snarl structures that need to be taken into account when building data structures and algorithms based on the snarl decomposition.
- Snarls/chains that are not start-end connected It is possible for a valid snarl or chain to not have a valid path connecting its boundary nodes. This means that there is no valid path within its parent chain from the start of the chain to the end of the chain.
- Looping chains It is possible for the top-level chain to have the same node as its boundary nodes. This is possible for cyclic chromosomes.
In this example, the chain 1-1 can loop by taking the edge from 11 to 1.
- Looping chains with non start-end connected snarls This is usually an artifact of the way snarls and chains are found in the graph.
Intuitively in this example, there is a top level chain between nodes 1 and 7.
However, the snarl-finding algorithm will try to root the snarl tree on the longest sequence of nodes in the chain, which can actually be found within the snarl.
In this example, the snarl
This can also happen when the boundary nodes are chosen to be the two alleles of a SNP.
- Changing direction within a chain in a snarl If there edges that allow a path to turn around within a chain, they won't be taken into account in the connectivity of the netgraph of its parent snarl. This isn't really an edge case but it can be confusing and is easy to forget about.
In this example, there is a path from the right side of node 1 back to the right side of node 1, but it can't be seen in the netgraph view.
In vg, we have two different data structures for representing snarls: the Snarl Manager and the Distance Index
In the Distance Index, the root-level structure of the snarl tree is a "snarl", which contains the entire graph. Each top-level chain is a child of this root snarl. There may be connectivity within the root snarl if a connected component of the graph doesn't decompose into a single chain. Each child of the root snarl is therefore not necessarily a separate connected component.
The Distance Index's definition of snarl does not include the boundary nodes of the snarl. The bound of a snarl is a "sentinel" that represents the inner node side of the boundary node, but not technically the node itself.