Investigating the performance of foundation models on human 3’UTR sequences

Sergey Vilov and Matthias Heinig

Code for data preprocessing and analysis

rbp_motifs : evaluate the models on RBP binding motif prediction (TASK 1)
variant_effect : evaluate the models on variants from ClinVar, gnomAD, eQTL, and CADD (TASK 2)
mpra : prediction of MPRA activity from (Griesemer et al., 2021) and (Siegel et al., 2022) (TASK 3)
half_life : prediction of mRNA half-life from (Agarwal and Kelley, 2022) (TASK 4)
dataset_prep : build the multispecies dataset from Zoonomia whole genome alignment
embeddings : generate embeddings for the DNABERT, DNABERT-2, and NT models, as well as embeddings and per-base zero-shot scores for the StateSpace models
zero-shot-probs : derive per-base zero-shot scores for DNABERT, NT, PhyloP, and CADD models

The analysis data, scores for all models, and model weights can be found in our Zenodo repository.

Links to the scripts used to generate paper figures and tables:

Fig. 1: ROC AUC scores for RBP binding motif predictions

Fig. 2: ROC curves for prediction of proxy-functional variants on ClinVar, gnomAD, eQTL, and CADD data using the best predictor for each model

Fig. 3: Pearson r correlation coefficient between mRNA half-life prediction and ground truth data from (Agarwal and Kelley, 2022)

Fig. S1: Distribution of 3’UTR length for 18,134 transcripts of the human genome

Fig. S2: Pearson r correlation between per-nucleotide probabilities predicted by each model and the ground truth probability for the Zoonomia dataset (Zoo-AL)

Fig. S3: Difference between ROC AUC scores based on the variant influence score (VIS) and the reference allele probability (pref), as a function of the maximum window W around the variant used to compute VIS

Table 1: Pearson r correlation coefficient between Ridge-based predictions from sequence embeddings and ground truth MPRA expression from (Griesemer et al., 2021)

Table S2: ROC AUC scores for RBP binding motif predictions, for all motifs, proxy-functional motifs within the top 10% conservation, proxy-functional motifs within the bottom 10% conservation, as predicted by PhyloP-241way

Table S3: ROC AUC scores for ClinVar, gnomAD, eQTL, and CADD data computed based on zero-shot functionality scores for all models

Table S4: ROC AUC scores from MLP-based prediction of proxy-functional variants on ClinVar, gnomAD, eQTL, and CADD data using language model embeddings

Table S5: ROC AUC scores from prediction of proxy-functional variants on ClinVar, gnomAD, eQTL, and CADD data using alignment-based models

Table S6: Pearson r correlation coefficient between SVR-based predictions from sequence embeddings and ground truth MPRA activity from (Griesemer et al., 2021)

Table S7: Pearson r correlation coefficient between Ridge-based predictions from sequence embeddings and ground truth MPRA data from (Siegel et al., 2022)

Table S8: Pearson r correlation coefficient between SVR-based predictions from sequence embeddings and ground truth MPRA data from (Siegel et al., 2022)

Table S9: Pearson r correlation coefficient between mRNA half-life prediction and ground truth data from (Agarwal and Kelley, 2022), using different 3’UTR embeddings
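
The figures and tables listed above report two metrics: ROC AUC for the classification tasks and the Pearson r correlation coefficient for the regression tasks. As a reminder of what each number measures, here is a minimal standard-library sketch of both definitions. It is not the code used to produce the paper's results (those scripts are the ones linked above), and the function names `roc_auc` and `pearson_r` are illustrative only.

```python
from math import sqrt


def roc_auc(labels, scores):
    """ROC AUC as the probability that a randomly chosen positive example
    scores higher than a randomly chosen negative one (ties count as 1/2).

    Illustrative O(n_pos * n_neg) definition; labels are 1 (positive) or 0.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)
```

For example, `roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])` compares four positive/negative pairs, three of which are ranked correctly, giving 0.75.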

Installation

Create a new conda environment:

conda create -n lm-3utr-models python=3.10
conda activate lm-3utr-models

Install PyTorch v2.0.1.
Install the other requirements using pip:

pip install -r requirements.txt

To train DNABERT-2 models, also install:

pip install triton==2.0.0.dev20221202 --force --no-dependencies

DNABERT-2 training currently works only on NVIDIA A100 GPUs because of the flash attention implementation it uses.
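
Before starting a DNABERT-2 training run, it can help to confirm that the environment matches the requirements above. The following is a minimal sketch, not part of the repository: the helper names are hypothetical, and detecting an A100 by looking for "A100" in the CUDA device name is an assumption about how the GPU reports itself.

```python
from importlib import metadata

REQUIRED_TORCH = "2.0.1"  # PyTorch version named in the installation steps


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_required(version, required=REQUIRED_TORCH):
    """Compare versions while ignoring local build tags such as '+cu118'."""
    return version is not None and version.split("+")[0] == required


def is_a100(device_name):
    """Assumption: A100 cards include 'A100' in their reported device name."""
    return "A100" in device_name


def check_environment():
    """Collect the facts relevant to the installation notes into a dict."""
    report = {
        "torch": installed_version("torch"),
        "triton": installed_version("triton"),
        "gpu": None,
    }
    report["torch_ok"] = matches_required(report["torch"])
    try:
        import torch  # imported lazily so the check also runs without PyTorch

        if torch.cuda.is_available():
            report["gpu"] = torch.cuda.get_device_name(0)
    except ImportError:
        pass
    report["a100"] = report["gpu"] is not None and is_a100(report["gpu"])
    return report
```

Calling `check_environment()` inside the activated `lm-3utr-models` environment returns a dictionary with the detected versions and GPU name; `torch_ok` and `a100` should both be true before attempting DNABERT-2 training.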

sergeyvilov/investigating-foundation-models-3utr