PseudoGenius

PseudoGenius is a transformer-based machine learning package for binary classification of gene sequences from Mycobacterium species as pseudogenes or normal genes.

Installation

Clone this repository and navigate into the package directory:

git clone https://github.com/yourusername/PseudoGenius.git
cd PseudoGenius

Install the package:

pip install .

Usage

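Provide a list of "DNA\tAminoAcid" strings, each holding a DNA sequence and its amino acid sequence separated by a tab character. PseudoGenius classifies each entry as a potential pseudogene or a normal gene. The tool is intended to speed up curation of genome annotations.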
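As a minimal sketch of the input format, the helper below joins each DNA sequence with its amino acid sequence using a single tab. The function name `make_inputs` and the toy sequences are hypothetical and are not part of the PseudoGenius package.

```python
def make_inputs(pairs):
    """Join (dna, protein) pairs into tab-separated strings.

    Hypothetical helper for illustration; not part of PseudoGenius.
    """
    inputs = []
    for dna, protein in pairs:
        dna, protein = dna.strip(), protein.strip()
        # A stray tab inside either sequence would break the two-field format
        if "\t" in dna or "\t" in protein:
            raise ValueError("sequences must not contain tab characters")
        inputs.append(f"{dna}\t{protein}")
    return inputs


# Made-up toy sequences, for illustration only
pairs = [("ATGGCCTGA", "MA"), ("ATGAAATAA", "MK")]
dna_protein_list = make_inputs(pairs)
print(dna_protein_list)
```

The resulting list has the same shape as the `dna_protein_list` passed to `predict` in the usage example.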

PseudoGenius provides an easy way to classify gene sequences using a pre-trained model hosted on Hugging Face. To use the model for making predictions:

from pseudogenius.model import load_model, predict

# Load the pre-trained model from Hugging Face
tokenizer, model = load_model()

# List of DNA and protein sequences concatenated with tabs
dna_protein_list = [
    "DNA_sequence\tProtein_sequence",
    # ... add more sequences
]

# Get predictions
predictions = predict(model, tokenizer, dna_protein_list)
print(predictions)

Model Evaluation

The model was trained on a dataset with the following label distribution:

Training set: Normal: 1923, Pseudos: 722

Test set: Normal: 220, Pseudos: 74

The confusion matrix for the model's predictions on the test set is shown below:

[Figure: Confusion Matrix]
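The label counts reported above show that the classes are imbalanced, which matters when reading accuracy figures. The short sketch below computes the pseudogene fraction in each split and the accuracy a trivial "always predict Normal" classifier would reach on the test set. The function names are illustrative, and the only data used are the counts listed above.

```python
# Label counts as reported in this README
train = {"Normal": 1923, "Pseudos": 722}
test = {"Normal": 220, "Pseudos": 74}


def pseudo_fraction(counts):
    """Fraction of examples labelled as pseudogenes."""
    return counts["Pseudos"] / sum(counts.values())


def majority_baseline(counts):
    """Accuracy obtained by always predicting the most common label."""
    return max(counts.values()) / sum(counts.values())


print(f"train pseudo fraction: {pseudo_fraction(train):.3f}")          # 0.273
print(f"test pseudo fraction: {pseudo_fraction(test):.3f}")            # 0.252
print(f"test majority baseline: {majority_baseline(test):.3f}")        # 0.748
```

Any reported test accuracy is best compared against this baseline of roughly 0.748 rather than against 0.5.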

Training Your Own Model

If you wish to train your own model with custom data, PseudoGenius also includes the training code. Refer to the training script located at pseudogenius/training.py for details on how to train your model.

The model was trained on the Mycobacterium leprae GenBank file (here) and has shown consistent results on other Mycobacterium species. It has not been tested on species with lower GC content, such as E. coli.
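Because the model has not been tested on lower-GC genomes, it may be worth checking the GC content of your own sequences before relying on its predictions. The function below is a minimal sketch of that check; `gc_content` is a hypothetical helper, not part of PseudoGenius, and the example sequence is made up.

```python
def gc_content(dna):
    """Return the fraction of G and C among unambiguous bases (A, C, G, T).

    Ambiguity codes such as N are ignored. Hypothetical helper, not part
    of PseudoGenius.
    """
    dna = dna.upper()
    gc = dna.count("G") + dna.count("C")
    total = gc + dna.count("A") + dna.count("T")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return gc / total


# Made-up toy sequence: 5 of its 9 bases are G or C
print(f"{gc_content('ATGGCCTGA'):.3f}")
```

Comparing this value for your genome with that of the training genome gives a rough indication of how far your data sits from what the model has seen.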

Contributing

Contributions to PseudoGenius are welcome! Please refer to the contributing guidelines for more information.

License

This project is licensed under the MIT License - see the LICENSE file for details.

