Synonymous Codon Prediction

Project Medical Data Science - SS 2024

Insa Belter, Maximilian Kühn, Felix Mucha, Nils Rekus, Floris Wittner

Installation

Requirements

Python 3.8 or higher (Python 3.12 is recommended). Installation instructions are available in the official Python documentation.
pip, the Python package installer (usually included with Python installations)

Setup

Clone the repository
Create a virtual environment
```
python -m venv .venv
```
Activate the virtual environment (the command below is for Linux and macOS; on Windows, run `.venv\Scripts\activate` instead)
```
source .venv/bin/activate
```
Windows only: to use CUDA for NVIDIA GPU acceleration, install PyTorch with the following command first (for more information, see the PyTorch installation guide)
```
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```
Install all other requirements
```
pip install -r requirements.txt
```
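
To confirm that the setup steps above worked, a short check along the following lines can be run inside the activated environment. This is only a sketch: the function name `check_environment` is hypothetical and not part of the repository. It uses the standard library only, so it also runs before the requirements are installed. Once PyTorch is installed, `torch.cuda.is_available()` reports whether CUDA can be used.

```python
import importlib.util
import sys


def check_environment(min_version=(3, 8)):
    """Return a dict describing whether the setup requirements appear to be met.

    Hypothetical helper for illustration; not part of the repository.
    """
    return {
        # The README asks for Python 3.8 or higher.
        "python_ok": sys.version_info >= min_version,
        # Inside a virtual environment, sys.prefix differs from sys.base_prefix.
        "in_venv": sys.prefix != sys.base_prefix,
        # Checks whether PyTorch can be found without importing it.
        "torch_installed": importlib.util.find_spec("torch") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {ok}")
```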

Project Content

Folder Structure

.
├── data
│   └── organism
│       └── cleanedData.pkl
├── ml_models
│   └── organism
│       ├── best_rnn_model.pt
│       ├── best_tcn_model.pt
│       └── best_encoder_model.pt
├── notebooks
│   ├── archive
│   └── *.ipynb
├── scripts
│   ├── archive
│   └── *.py
├── unit_tests
├── README.md
└── requirements.txt

Notebooks

00_Using_Classifiers
- Notebook for applying the best classifiers from each model architecture to new data
01_Data_Aggregation_and_Preparation
- Data aggregation: How was the data gathered?
- Data preparation: How was the data cleaned and split for training and testing purposes?
- Secondary protein structure as a possible additional feature
02_Data_Exploration
- How many sequences do we have per organism?
- How are the amino acids distributed?
- What is the Codon Usage Bias?
03_Baseline_Classifiers
- Baseline classifiers derived from the Codon Usage Bias (CUB), used as a reference point for the trained machine learning classifiers
04_Index_based_Analysis_and_Classifier
- Index-based analysis of amino acid sequences
- Classifier based on the results of this analysis (index-based CUB)
05_Other_Statistical_Analysis_Approaches
- Correlation between neighbouring amino acids
- Amino acid analysis based on the chemical properties of codons
06: RNN
- 06_1_RNN_Training
- 06_2_RNN_Testing
07: TCNN
- 07_1_TCN_Training
- 07_2_TCN_Testing
08: Encoder-only Transformer
- 08_1_Encoder_Training
- 08_2_Encoder_Testing
09_Accuracy_Results_Overview
- Training validation accuracies per model (RNN, Encoder, TCNN)
- Accuracy per (best) Model per Organism in comparison to baseline
  - Index-based CUB vs Baseline Max CUB
  - RNN vs Baseline Max CUB
  - TCNN vs Baseline Max CUB
  - Encoder vs Baseline Max CUB
- All accuracies in one diagram
  - Max CUB, Index-based Max CUB, RNN, TCNN, Transformer
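
The Codon Usage Bias and the Max CUB baseline mentioned in the notebooks above can be illustrated with a minimal sketch. It assumes that "Max CUB" means always predicting the most frequent synonymous codon for each amino acid; the notebooks may define it differently. All function names and the shortened codon table are hypothetical and chosen for illustration only.

```python
from collections import Counter, defaultdict

# Small excerpt of the standard genetic code; the full table has 64 codons.
CODON_TO_AA = {
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "AAA": "K", "AAG": "K",
    "ATG": "M",
}


def codon_usage_bias(sequences):
    """Relative frequency of each codon among the codons of its amino acid."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Walk the sequence codon by codon, ignoring a trailing partial codon.
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            aa = CODON_TO_AA.get(codon)
            if aa is not None:
                counts[aa][codon] += 1
    return {
        aa: {codon: n / sum(c.values()) for codon, n in c.items()}
        for aa, c in counts.items()
    }


def max_cub_predict(amino_acids, cub):
    """Predict the most frequent codon for every amino acid (assumed baseline)."""
    best = {aa: max(freqs, key=freqs.get) for aa, freqs in cub.items()}
    return [best[aa] for aa in amino_acids]


def accuracy(predicted, actual):
    """Fraction of positions where the predicted codon matches the actual one."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


# Toy data, not taken from the project.
train = ["ATGGCTGCTAAG", "ATGGCCGCTAAA", "GCTAAGAAG"]
cub = codon_usage_bias(train)
prediction = max_cub_predict("MAK", cub)
print(prediction)  # ['ATG', 'GCT', 'AAG']
```

Because such a baseline ignores sequence context entirely, it gives a lower bound that context-aware models such as the RNN, TCN and encoder are compared against.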

Scripts

Data Aggregation
- data loading: Load FASTA files, check for corrupted sequences and save the cleaned data
- data splitting: Data splitting for training, testing and validation
- codon usage bias: Calculation of Codon Usage Bias (CUB)
ML helper: Helper functions for training the machine learning models
ML evaluation: Evaluation functions for the machine learning models
Files with classifier implementations for each model architecture
- Classifier Class
- Baseline
- RNN
- TCN
- Encoder & Modified Pytorch Encoder (can output Attention weights)
- Index Classifier
Chemical Property: Chemical property analysis of codons
change Table
[WIP] secondary structure: Exploration of secondary structure as additional feature
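
As an illustration of the data loading step listed above, the following sketch parses FASTA text and filters out sequences that look corrupted. The validity criteria (length divisible by three, only A/C/G/T, ATG start codon, a single terminal stop codon) are assumptions about what "corrupted" could mean; the actual script may apply other rules. The names `parse_fasta` and `is_clean` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def parse_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dict."""
    records, header, chunks = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record.
            if header is not None:
                records[header] = "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records[header] = "".join(chunks)
    return records


def is_clean(seq):
    """Assumed validity check for a coding DNA sequence."""
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    if set(seq) - set("ACGT"):
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    # Start with ATG, end with a stop codon, no stop codon in between.
    return (
        codons[0] == "ATG"
        and codons[-1] in STOP_CODONS
        and not STOP_CODONS & set(codons[:-1])
    )


# Toy input, not taken from the project data.
fasta = ">ok\nATGGCT\nAAGTAA\n>bad_length\nATGGCTA\n>ambiguous\nATGNNNTAA\n"
records = parse_fasta(fasta)
cleaned = {name: seq for name, seq in records.items() if is_clean(seq)}
print(cleaned)  # {'ok': 'ATGGCTAAGTAA'}
```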

Data

Contains the cleaned sequence data for three organisms (E. coli, D. melanogaster, H. sapiens). The data is split into training, test and validation sets and saved as a pickle file. Each split is also saved in a shuffled version.

The data folder also contains various files that track the model training progress and the model performance.
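
The splitting and pickling described in this section could look roughly like the sketch below. The split ratios, the dictionary keys and the function name `split_data` are assumptions for illustration; the internal structure of `cleanedData.pkl` is not documented here and may differ.

```python
import pickle
import random


def split_data(sequences, train=0.8, test=0.1, seed=42):
    """Shuffle and split sequences into training, test and validation parts.

    The ratios and key names are assumed, not taken from the project.
    """
    items = list(sequences)
    # A seeded generator keeps the shuffle reproducible.
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train)
    n_test = int(len(items) * test)
    return {
        "train": items[:n_train],
        "test": items[n_train:n_train + n_test],
        "valid": items[n_train + n_test:],
    }


# Toy data standing in for cleaned sequences.
splits = split_data([f"seq{i}" for i in range(10)])

# Round-trip through pickle (in memory here; the project stores files on disk).
restored = pickle.loads(pickle.dumps(splits))
print({name: len(part) for name, part in restored.items()})
```

Pickle files can execute arbitrary code when loaded, so only files from a trusted source should be unpickled.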

Repository: maxikuehn/PMDS_Codons
