AAAIM is an LLM-powered system for annotating biosimulation models with standardized ontology terms. It supports chemical, gene, and protein entity annotation.
```bash
# python = 3.12
# Install dependencies
pip install -r requirements.txt
```
Set up your LLM provider API keys:
```bash
# For OpenAI models (gpt-4o-mini, gpt-4.1-nano)
export OPENAI_API_KEY="your-openai-key"

# For OpenRouter models (meta-llama/llama-3.3-70b-instruct:free)
export OPENROUTER_API_KEY="your-openrouter-key"
```
Alternatively, you can set up a `.env` file that looks like the following:
```
LLAMA_API_KEY=<your-llama-api-key-here>
OPENROUTER_API_KEY=<your-openrouter-api-key-here>
```
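If AAAIM does not load the `.env` file automatically in your setup, the keys can be put into the environment with a small stdlib loader. This is a sketch; the `load_env` helper below is hypothetical (the `python-dotenv` package is the usual off-the-shelf choice):

```python
import os

def load_env(path=".env"):
    """Minimal .env loader (hypothetical helper): copy KEY=value lines into os.environ."""
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blank lines and comments
            key, _, value = line.partition("=")
            # setdefault: real environment variables take precedence over the file
            os.environ.setdefault(key.strip(), value.strip())
```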
AAAIM currently provides two main workflows, each supporting chemical, gene, and protein annotation:
- Purpose: Annotate models with no or limited existing annotations
- Input: All species in the model
- Output: Annotation recommendations for all species
- Metrics: Accuracy is NA when no existing annotations available
```python
from core import annotate_model

# Annotate all chemical species in a model
recommendations_df, metrics = annotate_model(
    model_file="path/to/model.xml",
    entity_type="chemical",
    database="chebi"
)

# Save results
recommendations_df.to_csv("chemical_annotation_results.csv", index=False)
```
```python
from core import annotate_model

# Annotate all gene species in a model
recommendations_df, metrics = annotate_model(
    model_file="path/to/model.xml",
    entity_type="gene",
    database="ncbigene",
    tax_id="9606"  # for human
)

# Save results
recommendations_df.to_csv("gene_annotation_results.csv", index=False)
```
```python
from core import annotate_model

# Annotate all protein species in a model
recommendations_df, metrics = annotate_model(
    model_file="path/to/model.xml",
    entity_type="protein",
    database="uniprot",
    tax_id="9606"  # for human
)

# Save results
recommendations_df.to_csv("protein_annotation_results.csv", index=False)
```
- Purpose: Evaluate and improve existing annotations
- Input: Only species that already have annotations
- Output: Validation and improvement recommendations
- Metrics: Accuracy calculated against existing annotations
```python
from core import curate_model

# Curate existing chemical annotations
curations_df, metrics = curate_model(
    model_file="path/to/model.xml",
    entity_type="chemical",
    database="chebi"
)

print(f"Chemical entities with existing annotations: {metrics['total_entities']}")
print(f"Accuracy: {metrics['accuracy']:.1%}")
```
```python
from core import curate_model

# Curate existing gene annotations
curations_df, metrics = curate_model(
    model_file="path/to/model.xml",
    entity_type="gene",
    database="ncbigene"
)

print(f"Gene entities with existing annotations: {metrics['total_entities']}")
print(f"Accuracy: {metrics['accuracy']:.1%}")
```
```python
from core import curate_model

# Curate existing protein annotations
curations_df, metrics = curate_model(
    model_file="path/to/model.xml",
    entity_type="protein",
    database="uniprot",
    tax_id="9606"  # for human
)

print(f"Protein entities with existing annotations: {metrics['total_entities']}")
print(f"Accuracy: {metrics['accuracy']:.1%}")
```
After running `annotate_model` or `curate_model`, review the resulting CSV file and edit the `update_annotation` column for each entity:

- `add`: Add the recommended annotation to the model for that entity.
- `delete`: Remove the annotation for that entity.
- `ignore` or `keep`: Leave the annotation unchanged (either keep the existing annotation or ignore the new suggestion).
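For bulk edits, the `update_annotation` column can also be set programmatically. A sketch with the stdlib `csv` module (the `accept_all` helper is hypothetical, and the other columns in your CSV may differ):

```python
import csv

def accept_all(in_path, out_path):
    """Set update_annotation to 'add' for every row of a recommendations CSV."""
    with open(in_path, newline="") as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        row["update_annotation"] = "add"  # accept every recommendation
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
```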
To apply your changes and save a new SBML model:
```python
from core.update_model import update_annotation

update_annotation(
    original_model_path="path/to/original_model.xml",
    recommendation_table="recommendations.csv",  # or a pandas DataFrame
    new_model_path="path/to/updated_model.xml",
    qualifier="is"  # (optional) bqbiol qualifier, default is 'is'
)
```
A summary of added and removed annotations is printed after the update.
```python
# More control over parameters
recommendations_df, metrics = annotate_model(
    model_file="path/to/model.xml",
    llm_model="meta-llama/llama-3.3-70b-instruct:free",  # the LLM used to predict annotations
    max_entities=100,      # maximum number of entities to annotate (None for all)
    entity_type="gene",    # type of entities to annotate ("chemical", "gene", "protein")
    database="ncbigene",   # database to use ("chebi", "ncbigene", "uniprot")
    method="direct",       # method used to find the ontology ID ("direct", "rag")
    top_k=3                # number of top candidates to return per entity (based on scores)
)
```
```bash
# Uses "tests/test_models/BIOMD0000000190.xml"
python examples/simple_example.py
```
After the LLM performs synonym normalization, direct dictionary matching finds ontology IDs and counts hits; the `top_k` candidates with the highest hit counts are returned.
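The hit counting can be sketched as follows. This is a simplified illustration, not AAAIM's implementation: `direct_match` is a hypothetical name, and the real lookup tables are the compressed `.lzma` dictionaries described below:

```python
from collections import Counter

def direct_match(synonyms, name2ids, top_k=3):
    """Count dictionary hits per ontology ID across the LLM-normalized synonyms."""
    hits = Counter()
    for name in synonyms:
        # name2ids maps a lowercased name to the ontology IDs it may refer to
        for ont_id in name2ids.get(name.lower(), []):
            hits[ont_id] += 1
    # return the top_k IDs with the highest hit counts
    return [ont_id for ont_id, _ in hits.most_common(top_k)]
```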
After the LLM performs synonym normalization, RAG with embeddings finds the `top_k` most similar ontology terms by cosine similarity.
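The similarity ranking amounts to the following self-contained sketch over plain Python lists (AAAIM actually stores embeddings in a vector database under `data/chroma_storage/`, and `rag_top_k` is a hypothetical name):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rag_top_k(query_vec, term_vecs, top_k=3):
    """Rank ontology terms by cosine similarity of their embeddings to the query."""
    ranked = sorted(term_vecs, key=lambda t: cosine(query_vec, term_vecs[t]), reverse=True)
    return ranked[:top_k]
```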
To use RAG, create embeddings of the ontology first:

```bash
cd data

# for ChEBI:
python load_data.py --database chebi --model default

# for NCBI Gene, specify the taxonomy id:
python load_data.py --database ncbigene --model default --tax_id 9606

# for UniProt, specify the taxonomy id:
python load_data.py --database uniprot --model default --tax_id 9606

# for KEGG:
python load_data.py --database kegg --model default
```
- ChEBI: Chemical Entities of Biological Interest
  - Entity Type: `chemical`
  - All terms in ChEBI are included.
- NCBI Gene: Gene annotation
  - Entity Type: `gene`
  - Only genes for common species are supported (those included in BiGG models).
- UniProt: Protein annotation
  - Entity Type: `protein`
  - Only proteins for human (9606) and mouse (10090) are supported for now.
- KEGG: Enzyme annotation
  - For reaction substrates and products.
- Rhea: Reaction annotation
- GO: Gene Ontology terms
- Location: `data/chebi/`
- Files:
  - `cleannames2chebi.lzma`: Mapping from clean names to ChEBI IDs
  - `chebi2label.lzma`: Mapping from ChEBI IDs to labels
  - `chebi2names.lzma`: ChEBI synonyms used for the RAG approach
- Source: ChEBI ontology downloaded from https://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl.gz
- Location: `data/ncbigene/`
- Files:
  - `names2ncbigene_bigg_organisms_protein-coding.lzma`: Mapping from names to NCBI gene IDs; includes only protein-coding genes from the 18 species covered in BiGG models, to limit file size
  - `ncbigene2label_bigg_organisms_protein-coding.lzma`: Mapping from NCBI gene IDs to labels (primary names)
  - `ncbigene2names_tax{tax_id}_protein-coding.lzma`: NCBI gene synonyms for the given tax_id, used for the RAG approach
- Source: NCBI Gene FTP site: https://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/
- Location: `data/uniprot/`
- Files:
  - `names2uniprot_human+mouse.lzma`: Mapping from synonyms to UniProt IDs; includes only human and mouse proteins for now
  - `uniprot2label_human+mouse.lzma`: Mapping from UniProt IDs to labels (primary names)
  - `uniprot2names_tax{tax_id}.lzma`: UniProt synonyms for the given tax_id, used for the RAG approach
- Source: UniProt downloads page: https://www.uniprot.org/help/downloads (Reviewed (Swiss-Prot) XML)
- Location: `data/kegg/`
- Files:
  - `chebi_to_kegg_map.lzma`: Mapping from ChEBI IDs to KEGG compound IDs
  - `parsed_kegg_reactions.lzma`: Dict of KEGG reactions and their attributes
- Source: KEGG REST API: https://rest.kegg.jp
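The `.lzma` files above appear to be lzma-compressed pickled dictionaries; if that assumption holds, one can be inspected with a few lines of stdlib Python (check the loading code under `core/` before relying on this format):

```python
import lzma
import pickle

def load_lzma_dict(path):
    """Load a dictionary from an lzma-compressed pickle file (assumed format)."""
    with lzma.open(path, "rb") as f:
        return pickle.load(f)
```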
```
aaaim/
├── core/
│   ├── __init__.py              # Main interface exports
│   ├── annotation_workflow.py   # Annotation workflow (models without annotations)
│   ├── curation_workflow.py     # Curation workflow (models with annotations)
│   ├── model_info.py            # Model parsing and context
│   ├── llm_interface.py         # LLM interaction
│   ├── database_search.py       # Database search functions
│   └── update_model.py          # Write annotations into the model
├── utils/
│   ├── constants.py
│   └── evaluation.py            # Functions for evaluation
├── examples/
│   └── simple_example.py        # Simple usage demo
├── data/
│   ├── chebi/                   # ChEBI compressed dictionaries
│   ├── ncbigene/                # NCBI Gene compressed dictionaries
│   ├── uniprot/                 # UniProt compressed dictionaries
│   └── chroma_storage/          # Database embeddings for RAG
└── tests/
    ├── test_models/             # Test models
    └── aaaim_evaluation.ipynb   # Evaluation notebook
```
- Multi-Database Support: GO, Rhea, mapping between ontologies
- Improve RAG for NCBI Gene: Test on other embedding models for genes
- Web Interface: User-friendly annotation tool