A side-project from the AIP, containing the code, logic, and documentation for the ClinVar consequence reassessment process in use at the CPG

HudsonAlpha/ClinvArbitration

ClinVar, re-summarised

Motivation

Talos, a tool for identifying clinically relevant variants in large cohorts, uses ClinVar ratings as a contributing factor in determining pathogenicity. During development of this tool we determined that the default summaries generated by ClinVar are highly conservative; see the table here describing the aggregate classification logic.

Content

This repository contains an alternative algorithm (described here) for re-aggregating the individual ClinVar submissions, generating decisions which favour clear assignment of pathogenic/benign ratings instead of defaulting to 'conflicting'. These ratings are not intended as a replacement for ClinVar's own decisions, but may provide value by showing that, although conflicting submissions exist, there is a clear bias towards either benign or pathogenic ratings.

We aim to re-run this process monthly, and publish the resulting files on Zenodo. You can download this pre-generated bundle here: https://zenodo.org/records/16792026

Primary Outputs

  • Hail Table and TSV of all revised decisions
  • Hail Table and TSV of all Pathogenic missense changes, indexed on Transcript and Codon. This is usable as a PM5 annotation resource.

TSVs

  1. clinvar_decisions.tsv: A tab-separated file with headers, containing our re-summarised ClinVar decisions. Columns:

    • contig: the chromosome or contig of the variant
    • position: the position of the variant on the contig
    • reference: the reference allele at the variant position
    • alternate: the alternate allele at the variant position
    • clinical_significance: the clinical significance of the variant, as determined by our algorithm
    • gold_stars: the number of gold stars assigned to the variant, indicating the quality of the evidence supporting the asserted significance
    • allele_id: the unique identifier for the variant in ClinVar, accessible directly via URL like http://www.ncbi.nlm.nih.gov/clinvar?term=XXXXXXX[alleleid], or through ClinVar's web page using an 'advanced search' field
  2. clinvar_decisions.pm5.tsv: A tab-separated file with headers, containing our PM5 missense decisions. All ClinVar entries in this file are Pathogenic Missense changes. Columns:

    • transcript: the transcript ID of the gene in which the missense change occurs
    • codon: the codon position of the missense change in that transcript
    • clinvar_alleles: +-delimited String, each entry being an AlleleID::GoldStars string, where AlleleID is the unique identifier for the ClinVar allele, and GoldStars is the number of stars assigned to that allele. e.g. 12345::3+67890::1, indicating that allele 12345 has 3 stars, and allele 67890 has 1 star, and both affect the same codon in the same transcript.
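
The column layouts above can be read with the Python standard library alone. The following is a minimal sketch, not part of the ClinvArbitration package: the function names `read_decisions` and `parse_clinvar_alleles` are hypothetical, and the single data row used in the example is a made-up placeholder (including its `chr1` contig style), not a real ClinVar record. It assumes the `position` and `gold_stars` columns always hold integers.

```python
import csv
import io
from typing import Iterator, List, Tuple

# URL pattern taken from the allele_id column description above.
CLINVAR_URL = "http://www.ncbi.nlm.nih.gov/clinvar?term={allele_id}[alleleid]"


def read_decisions(handle) -> Iterator[dict]:
    """Yield one dict per row of clinvar_decisions.tsv, keyed by column header."""
    for row in csv.DictReader(handle, delimiter="\t"):
        # Assumption: these two columns are always integer-valued.
        row["position"] = int(row["position"])
        row["gold_stars"] = int(row["gold_stars"])
        row["clinvar_url"] = CLINVAR_URL.format(allele_id=row["allele_id"])
        yield row


def parse_clinvar_alleles(field: str) -> List[Tuple[str, int]]:
    """Split '12345::3+67890::1' into [('12345', 3), ('67890', 1)]."""
    pairs = []
    for entry in field.split("+"):
        allele_id, stars = entry.split("::")
        pairs.append((allele_id, int(stars)))
    return pairs


# Placeholder content standing in for a real clinvar_decisions.tsv file.
example = (
    "contig\tposition\treference\talternate\t"
    "clinical_significance\tgold_stars\tallele_id\n"
    "chr1\t12345\tA\tG\tPathogenic\t2\t99999\n"
)

for decision in read_decisions(io.StringIO(example)):
    print(decision["contig"], decision["position"], decision["clinvar_url"])

print(parse_clinvar_alleles("12345::3+67890::1"))
```

To read a downloaded file instead of the placeholder, pass an open file handle, e.g. `open("clinvar_decisions.tsv", newline="")`, to `read_decisions`.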

Usage

Download Results

We aim to generate data monthly, and publish the results on Zenodo. The latest version of the data can be found at:

https://zenodo.org/records/16777475

Local Running

Downloading input files

A Nextflow workflow is provided to run the ClinvArbitration process locally. To use this process you will need reference files:

  • a reference genome, in FASTA format
  • a GFF3 file, containing gene annotations for the reference genome
  • the files containing raw ClinVar submissions and variant details

A directory (data) and a script (download_data.sh) are provided to download and store the required files. Running this script from the data directory will download and unpack all required files. The location these files are downloaded to matches the expected location in the Nextflow config, so you can run the workflow immediately after downloading.

The ClinVar Variant and Submission summary files are updated weekly. You should delete your local copy and re-download each time you run this workflow, to ensure you're capturing the latest data.

Running the workflow

The ClinvArbitration workflow can be run containerised, or locally. By default, the reference data will be read from a directory called data, and the outputs written to a directory nextflow_outputs.

Local execution requires:

  • a Nextflow installation, to operate the workflow
  • a Python environment, with the ClinvArbitration package and its dependencies installed
    • this can be actioned with pip install . from the root of this repository
  • BCFtools, to annotate the ClinVar variants with gene information

nextflow -c nextflow/nextflow.config \
    run nextflow/clinvarbitration.nf

A containerised execution requires:

  • a Nextflow installation, to operate the workflow
  • a Docker installation, to run the workflow in a container

Step 1: build the Docker image:

docker build -t clinvarbitration:local .

Step 2: run the workflow using the Docker image:

nextflow -c nextflow/nextflow.config \
    run nextflow/clinvarbitration.nf \
    -with-docker clinvarbitration:local
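
If the workflow is launched from a Python script rather than a shell, the two invocations above can be assembled from one helper. This is a minimal sketch under stated assumptions: the helper names `build_nextflow_command` and `run_workflow` are hypothetical and not part of this repository, and the script is assumed to run from the repository root with Nextflow on the `PATH`. The arguments themselves are copied from the commands shown above.

```python
import subprocess
from typing import List, Optional


def build_nextflow_command(docker_image: Optional[str] = None) -> List[str]:
    """Return the argument list for the documented Nextflow invocation.

    With no image the command matches the local run; with an image it
    appends the -with-docker option used for the containerised run.
    """
    command = [
        "nextflow",
        "-c", "nextflow/nextflow.config",
        "run", "nextflow/clinvarbitration.nf",
    ]
    if docker_image is not None:
        command += ["-with-docker", docker_image]
    return command


def run_workflow(docker_image: Optional[str] = None) -> None:
    """Run the workflow; raises CalledProcessError if Nextflow exits non-zero."""
    subprocess.run(build_nextflow_command(docker_image), check=True)


if __name__ == "__main__":
    # Print the containerised command without executing it.
    print(" ".join(build_nextflow_command("clinvarbitration:local")))
```

Keeping command construction separate from execution means the argument list can be inspected or logged before anything is run.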

CPG-Flow

Internally at CPG, this workflow is run using CPG-Flow, an in-house Hail Batch-based workflow executor. The remainder of this section relates to that setup.

Once a Docker image has been built from the Dockerfile in this repository, the workflow can be triggered like so:

analysis-runner \
    --skip-repo-checkout \
    --image australia-southeast1-docker.pkg.dev/cpg-common/images-dev/clinvarbitration:PR_24 \
    --config new_clinvarbitration.toml \
    --dataset seqr \
    --description 'resummarise_clinvar' \
    -o resummarise_clinvar \
    --access-level test \
    run_workflow

A config file is required containing a few entries, some relating to this workflow specifically, some relating to cpg-flow setup:

  • workflow.driver_image: populated by analysis-runner, points to this docker image
  • site_blacklist: list of ClinVar submitters to ignore. Useful for removing noise, or for blinding the process to your own submissions
  • ref_fasta: required to run bcftools csq. Must match the genome_build
  • genome_build: used to decide whether ClinVar/Annotation is sourced using GRCh37 or GRCh38 (default)
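
The workflow-specific entries above can be sanity-checked before a run is submitted. The sketch below is illustrative only: `check_workflow_config` is a hypothetical helper, not part of this repository, and it assumes the entries have already been loaded into a flat Python dict. The real TOML file may nest these keys under tables, and treating `site_blacklist` as optional is also an assumption. The only facts taken from the list above are that `ref_fasta` is required and that `genome_build` is GRCh37 or GRCh38, defaulting to GRCh38.

```python
from typing import Any, Dict

VALID_GENOME_BUILDS = {"GRCh37", "GRCh38"}
DEFAULT_GENOME_BUILD = "GRCh38"


def check_workflow_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Validate the workflow-specific entries and fill in defaults.

    Assumes a flat dict; adapt the lookups if the real config nests
    these keys under TOML tables.
    """
    build = config.get("genome_build", DEFAULT_GENOME_BUILD)
    if build not in VALID_GENOME_BUILDS:
        raise ValueError(f"genome_build must be one of {sorted(VALID_GENOME_BUILDS)}")

    ref_fasta = config.get("ref_fasta")
    if not ref_fasta:
        raise ValueError("ref_fasta is required to run bcftools csq")

    # Assumption: an absent blacklist means no submitters are ignored.
    blacklist = config.get("site_blacklist", [])
    if not isinstance(blacklist, list) or not all(isinstance(s, str) for s in blacklist):
        raise TypeError("site_blacklist must be a list of submitter names")

    return {"genome_build": build, "ref_fasta": ref_fasta, "site_blacklist": blacklist}


# Placeholder values for illustration only.
print(check_workflow_config({"ref_fasta": "reference.fa", "site_blacklist": ["Example Submitter"]}))
```

The check cannot confirm that the FASTA actually matches the chosen build; that still needs to be verified by hand.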

Acknowledgements

  • ClinVar, for providing the data which this process is based on
