Sergey Vilov and Matthias Heinig
- rbp_motifs : evaluate the models on RBP binding motif prediction (TASK 1)
- variant_effect : evaluate the models on variants from ClinVar, gnomAD, eQTL, and CADD (TASK 2)
- mpra : prediction of MPRA activity from (Griesemer et al., 2021) and (Siegel et al., 2022) (TASK 3)
- half_life : prediction of mRNA half-life from (Agarwal and Kelley, 2022) (TASK 4)
- dataset_prep : build the multispecies dataset from Zoonomia whole genome alignment
- embeddings : generate embeddings for the DNABERT, DNABERT-2, and NT models, as well as embeddings and per-base zero-shot scores for the StateSpace models
- zero-shot-probs : derive per-base zero-shot scores for DNABERT, NT, PhyloP, and CADD models
The analysis data, scores for all models, and model weights can be found in our Zenodo repository
Fig. 1: ROC AUC scores for RBP binding motif predictions
Fig. S1: Distribution of 3’UTR length for 18,134 transcripts of the human genome
- Create a new conda environment:
conda create -n lm-3utr-models python=3.10
conda activate lm-3utr-models
- Install PyTorch v2.0.1
- Install the other requirements using pip:
pip install -r requirements.txt
- To train DNABERT-2 models, also install Triton:
pip install triton==2.0.0.dev20221202 --force --no-dependencies
Training DNABERT-2 is currently only possible on an NVIDIA A100 GPU because of the flash attention implementation it uses.
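
After installation, a quick check of the environment can catch a version or hardware mismatch before a long training run. The script below is a minimal sketch and not part of this repository: the function names `is_a100` and `check_environment` are hypothetical, and matching the substring "A100" in the CUDA device name is an assumption about how the GPU reports itself.

```python
import importlib.util
import sys


def is_a100(device_name: str) -> bool:
    """Return True if a CUDA device name looks like an NVIDIA A100.

    Assumption: the reported name contains the substring "A100".
    """
    return "a100" in device_name.lower()


def check_environment() -> list[str]:
    """Collect human-readable warnings about the current environment."""
    warnings = []

    # The conda environment above is created with Python 3.10.
    if sys.version_info[:2] != (3, 10):
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        warnings.append(f"Python 3.10 expected, found {found}")

    # Stop early if PyTorch is missing, so the import below cannot fail.
    if importlib.util.find_spec("torch") is None:
        warnings.append("PyTorch is not installed")
        return warnings

    import torch

    if not torch.__version__.startswith("2.0.1"):
        warnings.append(f"PyTorch 2.0.1 expected, found {torch.__version__}")

    if not torch.cuda.is_available():
        warnings.append("No CUDA device visible")
    elif not is_a100(torch.cuda.get_device_name(0)):
        name = torch.cuda.get_device_name(0)
        warnings.append(f"DNABERT-2 training needs an A100, found {name}")

    return warnings


if __name__ == "__main__":
    for message in check_environment():
        print("WARNING:", message)
```

Run it inside the activated `lm-3utr-models` environment; no output means none of the checks raised a warning. The GPU warning only matters if you plan to train DNABERT-2.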