Skip to content

Snarls and chains

Xian Chang edited this page May 1, 2025 · 3 revisions

Variation graphs can become very large and complicated and it is often useful to break down the graph and describe the commonly occurring nested substructures. Variations in the graph often have a common topological motif, with the variant described by one or more nodes that are flanked by two nodes representing conserved sequence. These topological motifs are called snarls. Also see our explainer videos

Representing variants in a variation graph

In the following example, the variation graph represents a single nucleotide variant (SNP). The two alleles of the SNP are represented by nodes 2 and 3, and the conserved sequence flanking the SNP are represented by nodes 1 and 4.

snp

Similarly, for an insertion/deletion (indel), the inserted sequence is represented by node 2, while the deletion is represented by the edge between nodes 1 and 3.

indel

A duplication can be represented with a node that is connected to itself by an edge from the start of the node to the end of the node. A path can traverse the node multiple times by taking this looping edge.

duplication

An inversion is represented by a node that can be traversed in either direction in a left-to-right (or right-to-left) traversal of the graph.

inversion

Each of these graphs share the common motif where the variant is represented by a subgraph representing the variable sequence, flanked by two nodes representing conserved sequence. We refer to this topological motif as a "snarl". The term "bubble" is also common in the literature. A snarl is a generalization of a bubble. Generally, we use the term snarl to refer to the subgraph not including the two flanking nodes

The snarl decomposition

Snarls

Formally, a snarl is defined by a pair of node sides, ${x_{{l,r}}, y_{{l,r}}}$, that delimit a subgraph between them. The nodes x and y are referred to as the boundary nodes of the snarl. Two node sides define a snarl if they are:

  1. separable: splitting the boundary nodes into their two node sides disconnects the snarl from the rest of the graph,

and

  1. minimal: there is no node z within the snarl that is separable with the boundary nodes.

snarl_example

In this example, the subgraph between nodes 4 and 6 is a snarl, highlighted in green. By splitting the boundary nodes into their two node sides, the subgraph containing the snarl becomes unreachable with the rest of the graph. This snarl is delimited by the right side of node 4 and the left side of node 6. For specificity, the snarl may be written as "Snarl $4_r-6_l$" to refer to the node sides, or "Snarl 4fd-6fd" to refer to the traversal of the nodes. However, as there is only one snarl with nodes 4 and 6 as boundary nodes, we will generally refer to it as "Snarl 4-6".

Chains

Snarls often occur successively with a shared boundary node between them. A sequence of consecutive nodes and snarls is called a "chain".

chain_example

In this example, the graph contains two snarls, snarl 1-4 and snarl 4-6. The two snarls are part of the same chain, chain 1-6. This chain is comprised of the two snarls and nodes 1, 4, and 6.

Nesting

Snarls and chains may be nested inside of each other. Conceptually, this occurs when multiple variants affect the same locus, such as a SNP that is nested within an insertion. In a graph, a snarl contains another snarl if all of the nodes in the latter snarl are contained in the subgraph of the former. A snarl contains a chain if it contains all of the chains component nodes and snarls.

nested_snarl_example

The snarl tree

The nesting relationship of snarls and chains in a graph is described by its "snarl decomposition" or "snarl tree". Each node, snarl, and chain in the graph is represented as a node in the snarl tree. A chain is a child of a snarl if it is contained in the snarl and no other descendant of the snarl. A node or snarl is a child of a chain if it is a component of the chain. Every node and snarl is a child of a chain; a chain containing just one node is called a “trivial chain”. Because of this, the snarl tree is composed of alternating layers of snarls and chains, usually starting with a top-level chain as the root.

snarl_tree

In this example, chain 1-10 is the top-level chain containing the entire variation graph. Chain 1-10 is comprised of 6 children: nodes 1, 4, 9, and 10, and snarls 1-4 and 4-9. Snarl 1-4 contains two children, chains 2 and 3, each of which are trivial chains containing only one node. Between nodes 9 and 10 is a "trivial snarl" containing only an edge between the boundary nodes. These are generally omitted from the snarl tree as they are implied by two consecutive nodes.

Netgraphs

Nodes, snarls, and chains all have the property that they are connected to the rest of the graph by two node sides. Based on this property, snarls and chains can be treated as nodes within their parents, obscuring the presence of a subgraph between the boundaries of the child. A “netgraph” is a view of a snarl where its child chains are replaced by nodes.

netgraph

This example shows the netgraph-view of snarl 1-12. In this view, chains 2-7 and 8-11 are replaced by a single node that represents the subgraph of the chain. Because these chains are also reachable by two node sides, the chains can be replaced by a single node without changing the topology of the snarl.

Complex cases

In general, variation graphs build from human data tend to have very long top-level chains with few, simple snarls and shallow nesting. However, there can be larger more complicated snarl structures that need to be taken into account when building data structures and algorithms based on the snarl decomposition.

  • Snarls/chains that are not start-end connected It is possible for a valid snarl or chain to not have a valid path connecting its boundary nodes. This means that there is no valid path within its parent chain from the start of the chain to the end of the chain.

not_start_end_connected

  • Looping chains It is possible for the top-level chain to have the same node as its boundary nodes. This is possible for cyclic chromosomes.

looping_chain

In this example, the chain 1-1 can loop by taking the edge from 11 to 1.

  • Looping chains with non start-end connected snarls This is usually an artifact of the way snarls and chains are found in the graph.

bad_looping_chain

Intuitively in this example, there is a top level chain between nodes 1 and 7. However, the snarl-finding algorithm will try to root the snarl tree on the longest sequence of nodes in the chain, which can actually be found within the snarl. In this example, the snarl $6_r-3_l$ creates a valid snarl, even though its two boundary nodes are not reachable in a valid path through the snarl.

snarl_snp

This can also happen when the boundary nodes are chosen to be the two alleles of a SNP.

  • Changing direction within a chain in a snarl If there edges that allow a path to turn around within a chain, they won't be taken into account in the connectivity of the netgraph of its parent snarl. This isn't really an edge case but it can be confusing and is easy to forget about.

nested_looping

In this example, there is a path from the right side of node 1 back to the right side of node 1, but it can't be seen in the netgraph view.

Implementation details

In vg, we have two different data structures for representing snarls: the Snarl Manager and the Distance Index

In the Distance Index, the root-level structure of the snarl tree is a "snarl", which contains the entire graph. Each top-level chain is a child of this root snarl. There may be connectivity within the root snarl if a connected component of the graph doesn't decompose into a single chain. Each child of the root snarl is therefore not necessarily a separate connected component.

The Distance Index's definition of snarl does not include the boundary nodes of the snarl. The bound of a snarl is a "sentinel" that represents the inner node side of the boundary node, but not technically the node itself.

Clone this wiki locally