jimnoneill/PseudoGenius


PseudoGenius

PseudoGenius is an ML transformer-based package for binary classification of gene sequences from Mycobacterium species as pseudogenes or normal genes.

Installation

Clone this repository and navigate into the package directory:

git clone https://github.com/jimnoneill/PseudoGenius.git
cd PseudoGenius

Install the package:

cd PseudoGenius
pip install .

Usage

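Provide a list of "DNA\tAminoAcid" strings, where each string is a gene's DNA sequence and its amino acid sequence joined by a tab character. PseudoGenius classifies each entry as a potential pseudogene or a normal gene. The tool is intended to expedite the curation of genome annotations.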

PseudoGenius provides an easy way to classify gene sequences using a pre-trained model hosted on Hugging Face. To use the model for making predictions:

from pseudogenius.model import load_model, predict

# Load the pre-trained model from Hugging Face
tokenizer, model = load_model()

# List of DNA and protein sequences concatenated with tabs
dna_protein_list = [
    "DNA_sequence\tProtein_sequence",
    # ... add more sequences
]

# Get predictions
predictions = predict(model, tokenizer, dna_protein_list)
print(predictions)
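
As a hedged illustration of the input format described above, the sketch below builds the tab-joined strings from separate DNA and protein sequences. The helper name `build_inputs` and the toy sequences are hypothetical and not part of the PseudoGenius package; only the "DNA, tab, amino acids" layout comes from this README.

```python
from typing import Iterable, List, Tuple


def build_inputs(pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Join each (dna, protein) pair with a tab character (hypothetical helper)."""
    inputs = []
    for dna, protein in pairs:
        # Drop surrounding whitespace, e.g. newlines left over from file parsing
        dna, protein = dna.strip(), protein.strip()
        # A stray tab inside a sequence would make the string ambiguous to split
        if "\t" in dna or "\t" in protein:
            raise ValueError("sequences must not contain tab characters")
        inputs.append(f"{dna}\t{protein}")
    return inputs


# Made-up toy sequences, for illustration only
dna_protein_list = build_inputs([("ATGGCCTAA", "MA"), ("ATGTGA", "M")])
```

The resulting list has the same shape as `dna_protein_list` in the usage example and can be passed to `predict` in the same way.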

Model Evaluation

The model was trained on a dataset with the following label distribution:

Training set: Normal: 1923, Pseudos: 722

Test set: Normal: 220, Pseudos: 74

The confusion matrix for the model's predictions on the test set is shown below:

Confusion Matrix

[Image: pseudo_genius_confusion_matrix]
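
To put the label counts above in context, the short sketch below computes the share of pseudogenes in each split and the accuracy a trivial "always Normal" classifier would reach on the test set. The function names are illustrative and not part of the package; the counts are the ones listed in the label distribution.

```python
def pseudo_fraction(normal: int, pseudo: int) -> float:
    """Fraction of examples labelled as pseudogenes."""
    return pseudo / (normal + pseudo)


def majority_baseline_accuracy(normal: int, pseudo: int) -> float:
    """Accuracy of a classifier that always predicts the larger class."""
    return max(normal, pseudo) / (normal + pseudo)


# Counts taken from the label distribution in this README
train = {"normal": 1923, "pseudo": 722}
test = {"normal": 220, "pseudo": 74}

print(f"Train pseudogene share: {pseudo_fraction(**train):.1%}")
print(f"Test pseudogene share:  {pseudo_fraction(**test):.1%}")
print(f"Test majority baseline: {majority_baseline_accuracy(**test):.1%}")
```

Roughly 27% of the training examples and 25% of the test examples are pseudogenes, so test accuracy is best read against a majority-class baseline of about 75%.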

Training Your Own Model

If you wish to train your own model with custom data, PseudoGenius also includes the training code. Refer to the training script located at pseudogenius/training.py for details on how to train your model.

The model was trained on the Mycobacterium leprae GenBank file (here) and has shown consistent results on other Mycobacterium species. It has not been tested on species with lower GC content, such as E. coli.
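
Because the model has not been tested on lower-GC genomes, it may help to check the GC content of input sequences before relying on predictions. The sketch below is a generic GC-fraction calculation; the function name `gc_content` is an illustrative assumption and not part of the package, and this README does not imply any particular cutoff.

```python
def gc_content(dna: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence (0.0 to 1.0)."""
    seq = dna.upper()
    if not seq:
        raise ValueError("empty sequence")
    # Count G and C; any other character (A, T, N, ...) counts toward the length only
    gc = sum(seq.count(base) for base in "GC")
    return gc / len(seq)


# Made-up toy sequence, for illustration only
print(f"{gc_content('ATGGCCGCGTAA'):.2f}")
```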

Contributing

Contributions to PseudoGenius are welcome! Please refer to the contributing guidelines for more information.

License

This project is licensed under the MIT License - see the LICENSE file for details.
