sergeyvilov/investigating-foundation-models-3utr

Investigating the performance of foundation models on human 3’UTR sequences

Sergey Vilov and Matthias Heinig

bioRxiv preprint

Code for data preprocessing and analysis

  • rbp_motifs : evaluate the models on RBP binding motif prediction (TASK 1)
  • variant_effect : evaluate the models on variants from ClinVar, gnomAD, eQTL, and CADD (TASK 2)
  • mpra : prediction of MPRA activity from (Griesemer et al., 2021) and (Siegel et al., 2022) (TASK 3)
  • half_life : prediction of mRNA half-life from (Agarwal and Kelley, 2022) (TASK 4)
  • dataset_prep : build the multispecies dataset from Zoonomia whole genome alignment
  • embeddings : generate embeddings for DNABERT, DNABERT-2, and NT, as well as embeddings and per-base zero-shot scores for StateSpace models
  • zero-shot-probs : derive per-base zero-shot scores for DNABERT, NT, PhyloP, and CADD models

The analysis data, scores for all models, and model weights can be found in our Zenodo repository

Links to the scripts used to generate paper figures and tables:

Fig. 1: ROC AUC scores for RBP binding motif predictions

Fig. 2: ROC curves for prediction of proxy-functional variants on ClinVar, gnomAD, eQTL, and CADD data using the best predictor for each model

Fig. 3: Pearson r correlation coefficient between mRNA half-life prediction and ground truth data from (Agarwal and Kelley, 2022)

Fig. S1: Distribution of 3’UTR length for 18,134 transcripts of the human genome

Fig. S2: Pearson r correlation between per-nucleotide probabilities predicted by each model and the ground truth probability for the Zoonomia dataset (Zoo-AL)

Fig. S3: Difference between ROC AUC scores based on the variant influence score (VIS) and the reference allele probability (pref), as a function of the maximum window W around the variant used to compute VIS

Table 1: Pearson r correlation coefficient between Ridge-based predictions from sequence embeddings and ground truth MPRA expression from (Griesemer et al., 2021)

Table S2: ROC AUC scores for RBP binding motif predictions, for all motifs, for proxy-functional motifs within the top 10% of conservation, and for proxy-functional motifs within the bottom 10% of conservation, as predicted by PhyloP-241way

Table S3: ROC AUC scores for ClinVar, gnomAD, eQTL, and CADD data computed based on zero-shot functionality scores for all models

Table S4: ROC AUC scores from MLP-based prediction of proxy-functional variants on ClinVar, gnomAD, eQTL, and CADD data using language model embeddings

Table S5: ROC AUC scores from prediction of proxy-functional variants on ClinVar, gnomAD, eQTL, and CADD data using alignment-based models

Table S6: Pearson r correlation coefficient between SVR-based predictions from sequence embeddings and ground truth MPRA activity from (Griesemer et al., 2021)

Table S7: Pearson r correlation coefficient between Ridge-based predictions from sequence embeddings and ground truth MPRA data from (Siegel et al., 2022)

Table S8: Pearson r correlation coefficient between SVR-based predictions from sequence embeddings and ground truth MPRA data from (Siegel et al., 2022)

Table S9: Pearson r correlation coefficient between mRNA half-life prediction and ground truth data from (Agarwal and Kelley, 2022), using different 3’UTR embeddings
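
The figures and tables above report two metrics: ROC AUC (for motif and variant classification) and Pearson r (for MPRA and half-life regression). The sketch below shows, in plain Python, how each metric is computed from a list of scores. It is an illustration only and not the code used in this repository. The function names `roc_auc` and `pearson_r` and the toy inputs are assumptions made for this example. In practice, `sklearn.metrics.roc_auc_score` and `scipy.stats.pearsonr` are the usual choices.

```python
from math import sqrt


def roc_auc(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) identity.

    labels: 1 for the positive class (e.g. proxy-functional), 0 otherwise.
    scores: higher score means "more likely positive".
    Tied scores receive the average rank of their tie group.
    """
    n = len(labels)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j to the end of the group of tied scores.
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    n_pos = sum(labels)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy data, not taken from the paper.
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
    print(pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

A ROC AUC of 0.5 corresponds to a score that does not separate the two classes, and 1.0 to perfect separation.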

Installation

  1. Create a new conda environment:

     conda create -n lm-3utr-models python=3.10
     conda activate lm-3utr-models

  2. Install PyTorch v2.0.1.

  3. Install the other requirements using pip:

     pip install -r requirements.txt

  4. To train DNABERT-2 models, also install Triton:

     pip install triton==2.0.0.dev20221202 --force --no-dependencies

Training DNABERT-2 is currently only possible on NVIDIA A100 GPUs because of the flash attention implementation it uses.
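
After installation, a quick check like the following can confirm which PyTorch version is active and whether the visible GPU is an A100. This is a minimal sketch and not part of the repository. The helper names `is_a100` and `check_environment` are hypothetical, and matching the substring "A100" in the device name is an assumption about how the GPU reports itself.

```python
import importlib.util


def is_a100(device_name: str) -> bool:
    """Heuristic: treat any device whose name contains 'A100' as an A100."""
    return "A100" in device_name.upper()


def check_environment() -> dict:
    """Collect basic facts about the PyTorch installation, if there is one."""
    report = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if report["torch_installed"]:
        import torch

        report["torch_version"] = torch.__version__
        report["cuda_available"] = torch.cuda.is_available()
        if report["cuda_available"]:
            name = torch.cuda.get_device_name(0)  # first visible GPU
            report["gpu_name"] = name
            report["is_a100"] = is_a100(name)
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

Run it inside the activated `lm-3utr-models` environment. If `is_a100` is missing or false, DNABERT-2 training is not expected to work on that machine, while the other steps are unaffected.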
