Insa Belter, Maximilian Kühn, Felix Mucha, Nils Rekus, Floris Wittner
- Python 3.8 or higher (Python 3.12 is recommended). More information on how to install Python can be found here.
- pip Python package installer (usually included in Python installations)
- Clone the repository
- Create a virtual environment
python -m venv .venv
- Activate the virtual environment (the command below is for Linux/macOS; on Windows, run .venv\Scripts\activate instead)
source .venv/bin/activate
- Windows only: to use CUDA for NVIDIA GPU acceleration, install PyTorch with the following command first (for more information, see the PyTorch Installation Guide)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
- Install all other requirements
pip install -r requirements.txt
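After installation, a quick check along the following lines can confirm which Python version is active and whether PyTorch can see a CUDA device. This is a minimal sketch: `describe_environment` is a hypothetical helper and not part of the repository. It uses only `torch.__version__` and `torch.cuda.is_available()`.

```python
import importlib.util
import sys


def describe_environment():
    """Return a short summary of the active Python / PyTorch setup.

    Hypothetical helper for illustration; not part of the repository.
    """
    lines = [f"Python {sys.version_info.major}.{sys.version_info.minor}"]
    # Do not fail if the requirements have not been installed yet.
    if importlib.util.find_spec("torch") is None:
        lines.append("torch: not installed")
        return lines
    import torch

    lines.append(f"torch {torch.__version__}")
    lines.append(f"CUDA available: {torch.cuda.is_available()}")
    return lines


if __name__ == "__main__":
    print("\n".join(describe_environment()))
```

If the last line reports `CUDA available: False` on a machine with an NVIDIA GPU, the CPU-only build of PyTorch was probably installed.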
.
├── data
│   └── organism
│       └── cleanedData.pkl
├── ml_models
│   └── organism
│       ├── best_rnn_model.pt
│       ├── best_tcn_model.pt
│       └── best_encoder_model.pt
├── notebooks
│   ├── archive
│   └── *.ipynb
├── scripts
│   ├── archive
│   └── *.py
├── unit_tests
├── README.md
└── requirements.txt
- 00_Using_Classifiers
- Notebook for applying the best classifiers from each model architecture to new data
- 01_Data_Aggregation_and_Preparation
- Data aggregation: How was the data gathered?
- Data preparation: How was the data cleaned and split for training and testing purposes?
- Secondary Protein Structure as possible additional feature
- 02_Data_Exploration
- How many sequences do we have per organism?
- How are the amino acids distributed?
- What is the Codon Usage Bias?
- 03_Baseline_Classifiers
- Baseline classifiers derived from the Codon Usage Bias (CUB), used as a reference point for the trained machine learning classifiers
- 04_Index_based_Analysis_and_Classifier
- Index based analysis of amino acid sequences
- Classifier based on results of this analysis (Index based CUB)
- 05_Other_Statistical_Analysis_Approaches
- Correlation between neighbouring amino acids
- Amino acid analysis based on the chemical properties of codons
- 06: RNN
- 07: TCNN
- 08: Encoder-only Transformer
- 09_Accuracy_Results_Overview
- Training validation accuracies per model (RNN, Encoder, TCNN)
- Accuracy per (best) Model per Organism in comparison to baseline
- Index-based CUB vs Baseline Max CUB
- RNN vs Baseline Max CUB
- TCNN vs Baseline Max CUB
- Encoder vs Baseline Max CUB
- All accuracies in one diagram
- Max CUB, Index-based Max CUB, RNN, TCNN, Transformer
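The notebooks above compare the trained models against a baseline built from the Codon Usage Bias. As a rough illustration of the idea, the sketch below assumes that "Max CUB" means predicting, for every amino acid, the codon used most often for it in the training data. That reading, the function names, and the tiny codon table are assumptions made for this example and are not taken from the repository.

```python
from collections import Counter, defaultdict

# Small excerpt of the standard genetic code, enough for the demo below.
CODON_TO_AA = {
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "AAA": "K", "AAG": "K",
    "ATG": "M",
}


def iter_codons(seq):
    """Yield the complete codons of a coding sequence, in reading frame 0."""
    for i in range(0, len(seq) - len(seq) % 3, 3):
        yield seq[i:i + 3]


def codon_usage(sequences):
    """Relative frequency of each codon among the codons of its amino acid."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for codon in iter_codons(seq):
            aa = CODON_TO_AA.get(codon)
            if aa is not None:
                counts[aa][codon] += 1
    return {
        aa: {codon: n / sum(c.values()) for codon, n in c.items()}
        for aa, c in counts.items()
    }


def max_cub_table(usage):
    """Assumed baseline: map each amino acid to its most frequent codon."""
    return {aa: max(freqs, key=freqs.get) for aa, freqs in usage.items()}


def accuracy(sequences, table):
    """Fraction of codons that the lookup table predicts correctly."""
    correct = total = 0
    for seq in sequences:
        for codon in iter_codons(seq):
            aa = CODON_TO_AA.get(codon)
            if aa is None:
                continue
            total += 1
            correct += table.get(aa) == codon
    return correct / total if total else 0.0


train = ["ATGGCCGCCGCTAAA", "ATGGCCAAGAAA"]
usage = codon_usage(train)
table = max_cub_table(usage)
print(table)                   # most frequent codon per amino acid
print(accuracy(train, table))  # baseline accuracy on the same sequences
```

A lookup table like this needs no training beyond counting, so it gives a lower bound that a learned model has to beat to be worth its cost.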
- Data Aggregation
- data loading: Load FASTA files, check for corrupted sequences and save the cleaned data
- data splitting: Split the data into training, testing and validation sets
- codon usage bias: Calculation of Codon Usage Bias (CUB)
- ML helper: Helper functions for training the machine learning models
- ML evaluation: Evaluation functions for the machine learning models
- Files with classifier implementations for each model architecture
- Classifier Class
- Baseline
- RNN
- TCN
- Encoder & modified PyTorch Encoder (can output attention weights)
- Index Classifier
- Chemical Property: Chemical property analysis of codons
- change Table
- [WIP] secondary structure: Exploration of secondary structure as additional feature
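The data loading step listed above (read FASTA files, discard corrupted sequences) could look roughly like the following sketch. The validity criteria used here (only A/C/G/T, length divisible by three, ATG start codon, terminal stop codon) are assumptions for illustration; the checks in the repository's own scripts may differ, and `read_fasta` and `is_clean` are hypothetical names.

```python
import io

STOP_CODONS = {"TAA", "TAG", "TGA"}


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file or any iterable of lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks).upper()
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks).upper()


def is_clean(seq):
    """Assumed definition of an uncorrupted coding sequence."""
    return (
        len(seq) >= 6
        and len(seq) % 3 == 0
        and set(seq) <= set("ACGT")
        and seq.startswith("ATG")
        and seq[-3:] in STOP_CODONS
    )


demo = io.StringIO(
    ">ok\nATGGCC\nAAATAA\n"
    ">bad_length\nATGGCTAA\n"
    ">bad_symbol\nATGNNNTAA\n"
)
records = list(read_fasta(demo))
cleaned = {header: seq for header, seq in records if is_clean(seq)}
print(f"kept {len(cleaned)} of {len(records)} sequences")
```

For real files, pass an open file object instead of the `io.StringIO` demo input.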
Contains the cleaned sequence data for three organisms (E. coli, D. melanogaster, H. sapiens). The data is split into training, testing and validation sets and saved as a pickle file. Each split is also saved in a shuffled version.
The data folder also contains various files that track model training progress and model performance.
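To inspect one of the pickle files, something like the sketch below may be enough. The internal layout of `cleanedData.pkl` is not documented here, so the helper only reports what it finds instead of assuming particular keys. `load_cleaned_data` and `summarize` are hypothetical names, and `organism` in the path is the placeholder from the directory tree.

```python
import pickle
from pathlib import Path


def load_cleaned_data(path):
    """Load a pickle file. Only unpickle files from a source you trust."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


def summarize(obj):
    """Describe the top-level structure of a loaded object without assuming its layout."""
    if isinstance(obj, dict):
        return {key: type(value).__name__ for key, value in obj.items()}
    return type(obj).__name__


# Example usage (replace "organism" with the actual folder name):
# data = load_cleaned_data("data/organism/cleanedData.pkl")
# print(summarize(data))
```

If the file stores library-specific objects, the corresponding packages from requirements.txt must be installed before it can be unpickled.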