Getting alignment statistics with vg filter

Faith Okamoto edited this page May 27, 2025 · 2 revisions

The various mappers in vg (giraffe, map) produce GAMs that include metadata about each alignment. vg stats -a reports high-level statistics over a whole GAM; for per-read detail, vg filter has a --tsv-out option that writes a TSV with information about each read in a GAM, or in a filtered subset of it.

Syntax

The general syntax for using --tsv-out is:

vg filter --tsv-out FIELD mappings.gam > statistics.tsv
# Separate multiple fields with semicolons and quote the list so the shell passes it as one argument
vg filter --tsv-out "FIELD1;FIELD2" mappings.gam > statistics.tsv

The output file is a TSV whose first line is a header of column names. The first column name has a # prefix. Each subsequent line holds the requested fields for a single read in the GAM.
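The layout described above can be read with a few lines of Python. This is a minimal sketch, not part of vg: the helper name read_filter_tsv and the sample data are hypothetical, and it assumes only what this page states (tab-separated columns, one header line, a # prefix on the first column name).

```python
import csv
import io

def read_filter_tsv(handle):
    """Parse `vg filter --tsv-out` style output into a list of dicts, one per read."""
    # QUOTE_NONE: treat quote characters in read names or annotations as plain text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    # The first column name carries a '#' prefix; strip it so keys match field names.
    header[0] = header[0].lstrip("#")
    return [dict(zip(header, row)) for row in reader]

# Hypothetical sample standing in for statistics.tsv (fields "name;score").
sample = "#name\tscore\nread1\t150\nread2\t142\n"
rows = read_filter_tsv(io.StringIO(sample))
print(rows[0]["name"], rows[0]["score"])
```

All values come back as strings, so numeric fields such as score need an explicit int() or float() conversion before any arithmetic. For a real file, pass open("statistics.tsv", newline="") instead of the StringIO sample.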

Other vg filter options are still applied. For example, this command outputs name and score only for mapped reads whose names begin with hifi:

vg filter --name-prefix hifi --only-mapped \
    --tsv-out "name;score" mappings.gam > statistics.tsv

Available fields

Some statistics are pulled directly from the GAM, though not all GAM fields are available. Others are calculated on the fly from the information in the GAM. Statistics pulled from the GAM aren’t recalculated if missing. For example, unless --add-identity is used during vg inject, the resulting GAM won’t have an identity field. Asking vg filter to output the missing identity field will cause an error.

  • name: Read name (pulled from GAM)
  • score: Alignment score (pulled from GAM) - note that several options in vg filter can affect score, such as --rescore, --frac-score, and --substitutions
  • correctly_mapped: True if a read was correctly mapped, False otherwise (pulled from GAM) - requires a known-truth mapping location, e.g. for simulated reads
  • correctness: correct if a read was correctly mapped, off_reference if the read was marked as having no truth location, incorrect otherwise (pulled from GAM) - requires a known-truth mapping location, e.g. for simulated reads
  • softclip_start: number of base pairs soft-clipped off the beginning of a read (calculated on the fly)
  • softclip_end: number of base pairs soft-clipped off the end of a read (calculated on the fly) - NOT the index of a soft-clip position
  • cigar: the read's CIGAR string; X is a mismatch and all Ms are true matches
  • identity: identity score of mapping (pulled from GAM) - calculated as (# matches) / (# matches + # mismatches + # insertions), ignoring soft clips
  • is_perfect: 1 if an alignment is “perfect”, consisting of only matches and no mismatches, indels, or soft clips, 0 otherwise (calculated on the fly)
  • mapping_quality: MQ score (pulled from GAM)
  • sequence: read sequence (pulled from GAM)
  • length: length of read sequence (pulled from GAM)
  • time_used: time spent on mapping (pulled from GAM)
  • annotation: any annotations (pulled from GAM)
  • annotation.X: value of the X annotation (pulled from GAM)
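To make the identity formula and the is_perfect definition above concrete, here is a small Python sketch that applies them to a CIGAR string. It is an illustration only, not how vg computes these fields (identity is pulled from the GAM). It assumes the cigar field uses SAM-style length-prefixed operations with M for true matches, X for mismatches, I for insertions, D for deletions and S for soft clips, and that insertions are counted per base. All function names are hypothetical.

```python
import re

# Assumed CIGAR alphabet: SAM-style operations, each preceded by a length.
CIGAR_OP = re.compile(r"(\d+)([MX=IDSHNP])")

def cigar_counts(cigar):
    """Sum the lengths of each operation type in a CIGAR string."""
    counts = {}
    for length, op in CIGAR_OP.findall(cigar):
        counts[op] = counts.get(op, 0) + int(length)
    return counts

def identity_from_cigar(cigar):
    """matches / (matches + mismatches + insertions); soft clips are ignored."""
    c = cigar_counts(cigar)
    matches = c.get("M", 0)
    denominator = matches + c.get("X", 0) + c.get("I", 0)
    return matches / denominator if denominator else 0.0

def is_perfect_from_cigar(cigar):
    """1 if the alignment consists only of matches, 0 otherwise."""
    c = cigar_counts(cigar)
    return int(set(c) == {"M"})

# Hypothetical read: 5 soft-clipped bases, 90 matches, 2 mismatches, 3 inserted bases.
example = "5S90M2X3I"
print(identity_from_cigar(example))
print(is_perfect_from_cigar("100M"), is_perfect_from_cigar(example))
```

For the example read the identity is 90 / (90 + 2 + 3), and the soft clip does not enter the calculation. A value recomputed this way may differ from the identity stored in the GAM if the mapper counted events differently, so treat it as a cross-check rather than a replacement.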

Please request additional fields by opening an issue.