PseudoGenius is a transformer-based machine learning package for the binary classification of gene sequences from Mycobacterium species as pseudogenes or normal genes.
Clone this repository and navigate into the package directory:
git clone https://github.com/yourusername/PseudoGenius.git
cd PseudoGenius
Install the package:
pip install .
Provide a list of strings, each holding a DNA sequence and its amino-acid sequence separated by a tab ("DNA\tAminoAcid"). PseudoGenius classifies each entry as a potential pseudogene or not. The tool is intended to speed up the curation of genome annotations.
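As a minimal sketch of the input format described above, the snippet below joins each DNA sequence and its amino-acid sequence with a single tab. The helper name `make_record` is hypothetical and not part of the PseudoGenius API, the whitespace stripping is an assumption made for illustration, and the sequences are placeholders.

```python
def make_record(dna: str, protein: str) -> str:
    """Join a DNA sequence and its amino-acid sequence with one tab.

    Hypothetical helper, not part of PseudoGenius. Removing embedded
    whitespace (including stray tabs) is an assumption made here so the
    record contains exactly one tab separator.
    """
    dna = "".join(dna.split())
    protein = "".join(protein.split())
    if not dna or not protein:
        raise ValueError("both a DNA and an amino-acid sequence are required")
    return f"{dna}\t{protein}"


# Placeholder sequences, for illustration only
pairs = [("ATGGCC AAG", "MAK"), ("ATGTTTTAA", "MF")]
dna_protein_list = [make_record(dna, protein) for dna, protein in pairs]
```

The resulting `dna_protein_list` has the same shape as the list passed to `predict` in the usage example.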
PseudoGenius provides an easy way to classify gene sequences using a pre-trained model hosted on Hugging Face. To use the model for making predictions:
from pseudogenius.model import load_model, predict
# Load the pre-trained model from Hugging Face
tokenizer, model = load_model()
# List of DNA and protein sequences concatenated with tabs
dna_protein_list = [
"DNA_sequence\tProtein_sequence",
# ... add more sequences
]
# Get predictions
predictions = predict(model, tokenizer, dna_protein_list)
print(predictions)
Model Evaluation
The model was trained on a dataset with the following label distribution:
Training set: Normal: 1923, Pseudos: 722
Test set: Normal: 220, Pseudos: 74
Figure: confusion matrix for the model's predictions on the test set.
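The label counts reported above imply a class imbalance that is worth keeping in mind when reading any accuracy figure. The sketch below only does arithmetic on those counts: it computes the share of pseudogenes in each split and the accuracy a trivial classifier would reach by always predicting the majority class. The function name `summarize` is illustrative.

```python
# Label counts as reported for the training and test sets
counts = {
    "train": {"normal": 1923, "pseudo": 722},
    "test": {"normal": 220, "pseudo": 74},
}


def summarize(split):
    """Return (total, pseudogene fraction, majority-class baseline accuracy)."""
    total = sum(split.values())
    pseudo_fraction = split["pseudo"] / total
    majority_baseline = max(split.values()) / total
    return total, pseudo_fraction, majority_baseline


for name, split in counts.items():
    total, frac, base = summarize(split)
    print(f"{name}: n={total}, pseudogene fraction={frac:.3f}, "
          f"majority baseline accuracy={base:.3f}")
```

On the test set this gives 294 sequences, of which roughly a quarter are pseudogenes, so a model has to exceed about 0.748 accuracy to beat the always-"normal" baseline.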
If you wish to train your own model with custom data, PseudoGenius also includes the training code. Refer to the training script located at pseudogenius/training.py for details on how to train your model.
The model was trained on the Mycobacterium leprae GenBank file and has shown consistent results on other Mycobacterium species. It has not been tested on species with lower GC content, such as E. coli.
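Because GC content is flagged above as a possible limit on where the model applies, one quick sanity check before running it on a new genome is to measure the GC content of the input sequences. The helper below is a hypothetical sketch, not part of PseudoGenius. Ignoring ambiguous bases such as N is an assumption, and no particular cut-off is implied.

```python
def gc_content(dna: str) -> float:
    """Return the fraction of G and C among the unambiguous bases A, C, G, T."""
    seq = dna.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("no A, C, G or T bases found")
    return gc / (gc + at)


# Placeholder sequences, for illustration only
sequences = ["ATGGCCAAG", "ATGTTTTAA"]
for seq in sequences:
    print(f"{seq}: GC content = {gc_content(seq):.2f}")
```

Comparing these values with those of the training genome gives a rough indication of how far a new dataset sits from the data the model has seen.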
Contributions to PseudoGenius are welcome! Please refer to the contributing guidelines for more information.
This project is licensed under the MIT License - see the LICENSE file for details.