In Talos, a tool for identifying clinically relevant variants in large cohorts, we use ClinVar ratings as a contributing factor in determining pathogenicity. During development of this tool we found that the default summaries generated by ClinVar were highly conservative; see the table here describing the aggregate classification logic.
This repository contains an alternative algorithm (described here) for re-aggregating the individual ClinVar submissions, generating decisions which favour clear assignment of pathogenic/benign ratings instead of defaulting to 'conflicting'. These ratings are not intended to replace ClinVar's own decisions, but may provide value by showing that, although conflicting submissions exist, there is a clear bias towards either benign or pathogenic ratings.
We aim to re-run this process monthly, and publish the resulting files on Zenodo. You can download this pre-generated bundle here: https://zenodo.org/records/16792026
- Hail Table and TSV of all revised decisions
- Hail Table and TSV of all Pathogenic missense changes, indexed on Transcript and Codon. This is usable as a PM5 annotation resource.
- `clinvar_decisions.tsv`: A tab-separated file with headers, containing our re-summarised ClinVar decisions. Columns:
  - `contig`: the chromosome or contig of the variant
  - `position`: the position of the variant on the contig
  - `reference`: the reference allele at the variant position
  - `alternate`: the alternate allele at the variant position
  - `clinical_significance`: the clinical significance of the variant, as determined by our algorithm
  - `gold_stars`: the number of gold stars assigned to the variant, indicating the quality of the evidence supporting the asserted significance
  - `allele_id`: the unique identifier for the variant in ClinVar, accessible directly via a URL like `http://www.ncbi.nlm.nih.gov/clinvar?term=XXXXXXX[alleleid]`, or through ClinVar's web page using an 'advanced search' field
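As a minimal sketch of consuming this file, the Python below reads the TSV with the standard library and builds the ClinVar lookup URL for each allele. It assumes the header names match the column list above exactly; the function names `read_decisions` and `clinvar_url` are hypothetical helpers, not part of the ClinvArbitration package, and the sample row is made up for illustration.

```python
import csv
import io
from typing import Iterator, TextIO

# Column names as described in this README; check them against the real file header.
EXPECTED_COLUMNS = [
    "contig",
    "position",
    "reference",
    "alternate",
    "clinical_significance",
    "gold_stars",
    "allele_id",
]


def clinvar_url(allele_id: str) -> str:
    """Build the direct ClinVar search URL for an AlleleID (pattern from this README)."""
    return f"http://www.ncbi.nlm.nih.gov/clinvar?term={allele_id}[alleleid]"


def read_decisions(handle: TextIO) -> Iterator[dict]:
    """Yield one dict per variant, converting the numeric columns to int."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing expected columns: {sorted(missing)}")
    for row in reader:
        row["position"] = int(row["position"])
        row["gold_stars"] = int(row["gold_stars"])
        yield row


if __name__ == "__main__":
    # Illustrative in-memory data only; not a real ClinVar record.
    sample = io.StringIO(
        "\t".join(EXPECTED_COLUMNS) + "\n"
        "chr1\t12345\tA\tG\tPathogenic\t2\t99999\n"
    )
    for decision in read_decisions(sample):
        print(decision["contig"], decision["position"], clinvar_url(decision["allele_id"]))
```

To read the downloaded file instead of the in-memory sample, open it with `open("clinvar_decisions.tsv", newline="")` and pass the handle to `read_decisions`.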
- `clinvar_decisions.pm5.tsv`: A tab-separated file with headers, containing our PM5 missense decisions. All ClinVar entries in this file are Pathogenic Missense changes. Columns:
  - `transcript`: the transcript ID of the gene in which the missense change occurs
  - `codon`: the codon position of the missense change in that transcript
  - `clinvar_alleles`: a `+`-delimited string, each entry being an `AlleleID::GoldStars` string, where `AlleleID` is the unique identifier for the ClinVar allele, and `GoldStars` is the number of stars assigned to that allele. e.g. `12345::3+67890::1`, indicating that allele `12345` has 3 stars, and allele `67890` has 1 star, and both affect the same codon in the same transcript.
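The `clinvar_alleles` encoding can be unpacked with two string splits. The sketch below is one way to do that in Python; `ClinvarAllele` and `parse_clinvar_alleles` are hypothetical names introduced here, and the input is the example string from the description above.

```python
from typing import List, NamedTuple


class ClinvarAllele(NamedTuple):
    allele_id: str   # ClinVar AlleleID, kept as a string
    gold_stars: int  # number of stars assigned to that allele


def parse_clinvar_alleles(value: str) -> List[ClinvarAllele]:
    """Split a '+'-delimited list of 'AlleleID::GoldStars' entries."""
    alleles = []
    for entry in value.split("+"):
        allele_id, stars = entry.split("::")
        alleles.append(ClinvarAllele(allele_id, int(stars)))
    return alleles


if __name__ == "__main__":
    parsed = parse_clinvar_alleles("12345::3+67890::1")
    # The best-supported allele at this codon is the one with the most stars.
    best = max(parsed, key=lambda allele: allele.gold_stars)
    print(parsed)
    print(best.allele_id)
```

A PM5 annotation step would typically look up a transcript and codon, parse the matching `clinvar_alleles` value this way, and then report or filter on the star ratings.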
We aim to generate data monthly, and publish the results on Zenodo. The latest version of the data can be found at:
A Nextflow workflow is provided to run the ClinvArbitration process locally. To use this process you will need the following reference files:
- a reference genome, in FASTA format
- a GFF3 file, containing gene annotations for the reference genome
- the files containing raw ClinVar submissions and variant details
A directory (`data`) and a script (`download_data.sh`) are provided to download and store the required files. Running this script from the `data` directory will download and unpack all required files. The download location matches the location expected by the Nextflow config, so you can run the workflow immediately after downloading.
The ClinVar Variant and Submission summary files are updated weekly. You should delete your local copy and re-download each time you run this workflow, to ensure you're capturing the latest data.
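If you want a quick check before a run, the Python sketch below lists files in the data directory that are more than a week old. It is only a rough proxy: it relies on file modification times, which may reflect the upstream timestamp rather than your download time depending on how the files were fetched. The helper name `stale_files` and the seven-day threshold are assumptions made for illustration.

```python
import time
from pathlib import Path
from typing import List, Optional

MAX_AGE_DAYS = 7  # ClinVar summary files are refreshed weekly
SECONDS_PER_DAY = 86400


def stale_files(
    directory: Path,
    max_age_days: float = MAX_AGE_DAYS,
    now: Optional[float] = None,
) -> List[Path]:
    """Return files directly inside `directory` last modified more than `max_age_days` ago."""
    current = time.time() if now is None else now
    cutoff = current - max_age_days * SECONDS_PER_DAY
    return sorted(
        path
        for path in directory.iterdir()
        if path.is_file() and path.stat().st_mtime < cutoff
    )


if __name__ == "__main__":
    data_dir = Path("data")  # default reference data directory named in this README
    if data_dir.is_dir():
        for path in stale_files(data_dir):
            print(f"stale, consider re-downloading: {path}")
```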
The ClinvArbitration workflow can be run containerised, or locally. By default, the reference data will be read from a directory called `data`, and the outputs written to a directory called `nextflow_outputs`.
Local execution requires:
- a Nextflow installation, to operate the workflow
- a Python environment, with the ClinvArbitration package and its dependencies installed
  - this can be actioned with `pip install .` from the root of this repository
- BCFtools, to annotate the ClinVar variants with gene information
nextflow -c nextflow/nextflow.config \
run nextflow/clinvarbitration.nf
A containerised execution requires:
- a Nextflow installation, to operate the workflow
- a Docker installation, to run the workflow in a container
Step 1: build the Docker image:
docker build -t clinvarbitration:local .
Step 2: run the workflow using the Docker image:
nextflow -c nextflow/nextflow.config \
run nextflow/clinvarbitration.nf \
-with-docker clinvarbitration:local
Internally at CPG, this workflow is run using CPG-Flow, an in-house Hail Batch based workflow executor. The following elements relate to that workflow:
- an example config file, with enough entries populated that a standard CPG user could dry-run the workflow locally
- a workflow runner script
- a definition of all workflow stages
The intention is that once the Docker image has been built from the Dockerfile within this repository, this workflow can be triggered like so:
analysis-runner \
--skip-repo-checkout \
--image australia-southeast1-docker.pkg.dev/cpg-common/images-dev/clinvarbitration:PR_24 \
--config new_clinvarbitration.toml \
--dataset seqr \
--description 'resummarise_clinvar' \
-o resummarise_clinvar \
--access-level test \
run_workflow
A config file is required containing a few entries, some relating to this workflow specifically, some relating to cpg-flow setup:
- `workflow.driver_image`: populated by analysis-runner, points to this Docker image
- `site_blacklist`: list of ClinVar submitters to ignore. Useful for removing noise, or for blinding to self-submissions
- `ref_fasta`: required to run `bcftools csq`. Must match the `genome_build`
- `genome_build`: used to decide whether ClinVar/Annotation is sourced using GRCh37 or GRCh38 (default)
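To illustrate how two of these entries behave, the Python sketch below resolves `genome_build` with its GRCh38 default and drops submissions from blacklisted submitters. It is not the workflow's actual code: the flat `dict` config, the `submitter` field name, and both function names are assumptions, since the nesting of these keys in the real TOML file is not shown here.

```python
from typing import Dict, Iterable, List

VALID_BUILDS = {"GRCh37", "GRCh38"}
DEFAULT_BUILD = "GRCh38"  # stated as the default in the list above


def resolve_genome_build(config: Dict) -> str:
    """Return the configured genome build, falling back to the default."""
    build = config.get("genome_build", DEFAULT_BUILD)
    if build not in VALID_BUILDS:
        raise ValueError(f"genome_build must be one of {sorted(VALID_BUILDS)}, got {build!r}")
    return build


def drop_blacklisted(submissions: Iterable[Dict], site_blacklist: Iterable[str]) -> List[Dict]:
    """Remove submissions whose (hypothetical) 'submitter' field is blacklisted."""
    blocked = set(site_blacklist)
    return [sub for sub in submissions if sub["submitter"] not in blocked]


if __name__ == "__main__":
    # Made-up config and submissions, for illustration only.
    config = {"site_blacklist": ["Example Lab"], "ref_fasta": "data/ref.fa"}
    submissions = [
        {"submitter": "Example Lab", "classification": "Benign"},
        {"submitter": "Another Lab", "classification": "Pathogenic"},
    ]
    print(resolve_genome_build(config))
    print(drop_blacklisted(submissions, config["site_blacklist"]))
```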
- ClinVar, for providing the data which this process is based on