Getting alignment statistics with vg filter
The various mappers in vg (giraffe, map) create GAMs which include metadata about each alignment. In addition to the high-level statistics from vg stats -a, vg filter has a --tsv-out option to write a TSV with information about each read in a (possibly filtered subset of a) GAM.
The general syntax for using --tsv-out is:
vg filter --tsv-out FIELD mappings.gam > statistics.tsv
# To request several fields, separate them with semicolons and wrap the list in quotation marks
vg filter --tsv-out "FIELD1;FIELD2" mappings.gam > statistics.tsv
The output file is a TSV with a header line of column names. The first column name will have a # prefix. Each non-header line holds the requested fields for a single read in the GAM.
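As a rough illustration of working with this layout, the sketch below builds a small stand-in file and reads it back. The file name example_statistics.tsv, the read names, and the scores are all made up for illustration; real output from vg filter may differ in detail.

```shell
# Hypothetical stand-in for what `vg filter --tsv-out "name;score"` might
# write. The read names and scores are invented for this example.
printf '#name\tscore\n'  > example_statistics.tsv
printf 'read1\t150\n'   >> example_statistics.tsv
printf 'read2\t142\n'   >> example_statistics.tsv
printf 'read3\t98\n'    >> example_statistics.tsv

# Strip the leading "#" from the header line to get clean column names.
header=$(head -n 1 example_statistics.tsv | sed 's/^#//')

# Average the second column over all non-header lines.
mean_score=$(awk -F '\t' 'NR > 1 { sum += $2; n++ }
    END { if (n) printf "%.1f", sum / n }' example_statistics.tsv)

echo "columns: $header"
echo "mean score: $mean_score"
```

Skipping the first line (NR > 1) is enough to ignore the header here, since only that line carries the # prefix.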
Other vg filter options are still applied. For example, this command outputs name and score only for mapped reads whose names begin with hifi:
vg filter --name-prefix hifi --only-mapped \
--tsv-out "name;score" mappings.gam > statistics.tsv
Some statistics are pulled directly from the GAM, though not all GAM fields are available. Others are calculated on the fly from the information in the GAM. Statistics pulled from the GAM aren’t recalculated if missing. For example, unless --add-identity is used during vg inject, the resulting GAM won’t have an identity field. Asking vg filter to output the missing identity field will cause an error.
- name: read name (pulled from GAM)
- score: alignment score (pulled from GAM) - note that several options in vg filter can affect score, such as --rescore, --frac-score, and --substitutions
- correctly_mapped: True if a read was correctly mapped, False otherwise (pulled from GAM) - requires a known-truth mapping location, e.g. for simulated reads
- correctness: correct if a read was correctly mapped, off_reference if it was set to have no truth, incorrect otherwise (pulled from GAM) - requires a known-truth mapping location, e.g. for simulated reads
- softclip_start: number of base pairs soft-clipped off the beginning of a read (calculated on the fly)
- softclip_end: number of base pairs soft-clipped off the end of a read (calculated on the fly) - NOT the index of a soft-clip position
- cigar: the read's CIGAR string; X is a mismatch and all Ms are true matches
- identity: identity score of mapping (pulled from GAM) - calculated as (# matches) / (# matches + # mismatches + # insertions), ignoring soft clips
- is_perfect: 1 if an alignment is “perfect”, consisting of only matches and no mismatches, indels, or soft clips, 0 otherwise (calculated on the fly)
- mapping_quality: MQ score (pulled from GAM)
- sequence: read sequence (pulled from GAM)
- length: length of read sequence (pulled from GAM)
- time_used: time spent on mapping (pulled from GAM)
- annotation: any annotations (pulled from GAM)
- annotation.X: value of the X annotation (pulled from GAM)
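Because the column order follows whatever fields were requested, downstream scripts may be more robust if they look columns up by name rather than by position. The sketch below shows one way to do that with awk. The file example_reads.tsv, its values, and the mapping-quality cutoff of 30 are hypothetical and chosen only for illustration.

```shell
# Hypothetical stand-in for output of
#   vg filter --tsv-out "name;mapping_quality;is_perfect" mappings.gam
# All values below are invented for this example.
printf '#name\tmapping_quality\tis_perfect\n'  > example_reads.tsv
printf 'read1\t60\t1\n' >> example_reads.tsv
printf 'read2\t60\t0\n' >> example_reads.tsv
printf 'read3\t12\t0\n' >> example_reads.tsv
printf 'read4\t45\t1\n' >> example_reads.tsv

summary=$(awk -F '\t' '
    NR == 1 {
        # Drop the "#" prefix, then map each column name to its index.
        sub(/^#/, "", $1)
        for (i = 1; i <= NF; i++) col[$i] = i
        next
    }
    {
        total++
        if ($(col["is_perfect"]) == 1) perfect++
        # 30 is an arbitrary illustrative threshold, not a vg default.
        if ($(col["mapping_quality"]) >= 30) confident++
    }
    END { printf "%d %d %d", total + 0, perfect + 0, confident + 0 }
' example_reads.tsv)

# Prints: total reads, perfect alignments, reads with MQ >= 30
echo "$summary"
```

The same lookup-by-name pattern should carry over to any other combination of fields, as long as the header line is present.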
Please request additional fields by opening an issue.