AAAIM is an LLM-powered system for annotating biosimulation models with standardized ontology terms. It supports both chemical and gene entity annotation.
```bash
# python = 3.12
# Install dependencies
pip install -r requirements.txt
```
Set up your LLM provider API keys:
```bash
# For OpenAI models (gpt-4o-mini, gpt-4.1-nano)
export OPENAI_API_KEY="your-openai-key"

# For OpenRouter models (meta-llama/llama-3.3-70b-instruct:free)
export OPENROUTER_API_KEY="your-openrouter-key"
```
AAAIM currently provides two main workflows for both chemical and gene annotation:

1. Annotation workflow (`annotate_model`)

- Purpose: Annotate models with no or limited existing annotations
- Input: All species in the model
- Output: Annotation recommendations for all species
- Metrics: Accuracy is not available (NA) when the model has no existing annotations
```python
from core import annotate_model

# Annotate all chemical species in a model
recommendations_df, metrics = annotate_model(
    model_file="path/to/model.xml",
    entity_type="chemical",
    database="chebi"
)

# Save results
recommendations_df.to_csv("chemical_annotation_results.csv", index=False)
```
```python
from core import annotate_model

# Annotate all gene species in a model
recommendations_df, metrics = annotate_model(
    model_file="path/to/model.xml",
    entity_type="gene",
    database="ncbigene"
)

# Save results
recommendations_df.to_csv("gene_annotation_results.csv", index=False)
```
2. Curation workflow (`curate_model`)

- Purpose: Evaluate and improve existing annotations
- Input: Only species that already have annotations
- Output: Validation and improvement recommendations
- Metrics: Accuracy is calculated against the existing annotations
```python
from core import curate_model

# Curate existing chemical annotations
curations_df, metrics = curate_model(
    model_file="path/to/model.xml",
    entity_type="chemical",
    database="chebi"
)

print(f"Chemical entities with existing annotations: {metrics['total_entities']}")
print(f"Accuracy: {metrics['accuracy']:.1%}")
```
```python
from core import curate_model

# Curate existing gene annotations
curations_df, metrics = curate_model(
    model_file="path/to/model.xml",
    entity_type="gene",
    database="ncbigene"
)

print(f"Gene entities with existing annotations: {metrics['total_entities']}")
print(f"Accuracy: {metrics['accuracy']:.1%}")
```
```python
from core import annotate_model

# More control over parameters
recommendations_df, metrics = annotate_model(
    model_file="path/to/model.xml",
    llm_model="meta-llama/llama-3.3-70b-instruct:free",  # LLM used to predict annotations
    max_entities=100,     # maximum number of entities to annotate (None for all)
    entity_type="gene",   # type of entities to annotate ("chemical", "gene")
    database="ncbigene",  # database to use ("chebi", "ncbigene")
    method="direct",      # method used to find the ontology ID ("direct", "rag")
    top_k=3               # number of top candidates to return per entity (based on scores)
)
```
```bash
# Uses "tests/test_models/BIOMD0000000190.xml"
python examples/simple_example.py
```
Direct (`method="direct"`): after the LLM performs synonym normalization, ontology IDs are found by direct dictionary matching, and hits are counted for each candidate ID. The top_k candidates with the highest hit counts are returned.
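To illustrate the hit-counting idea, here is a minimal sketch. The `direct_match` helper and the dictionary shape are hypothetical stand-ins, not AAAIM's actual internals:

```python
from collections import Counter

def direct_match(normalized_synonyms, name2ids, top_k=3):
    """Rank ontology IDs by how many normalized synonyms hit them.

    name2ids is a hypothetical dict mapping a clean name to a list of
    ontology IDs, standing in for dictionaries like cleannames2chebi.
    """
    hits = Counter()
    for name in normalized_synonyms:
        for ontology_id in name2ids.get(name.lower(), []):
            hits[ontology_id] += 1
    return hits.most_common(top_k)  # top_k (ID, hit count) pairs

# Two LLM-normalized synonyms, both hitting CHEBI:17234
name2ids = {"glucose": ["CHEBI:17234"], "d-glucose": ["CHEBI:17234", "CHEBI:4167"]}
print(direct_match(["glucose", "D-glucose"], name2ids, top_k=2))
# [('CHEBI:17234', 2), ('CHEBI:4167', 1)]
```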
RAG (`method="rag"`): after the LLM performs synonym normalization, retrieval-augmented generation (RAG) with embeddings is used to find the top_k ontology terms most similar to each name, ranked by cosine similarity.
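The retrieval step amounts to a nearest-neighbor search over term embeddings (AAAIM stores its embeddings under `data/chroma_storage/`). A minimal sketch, assuming the query and term embeddings are already available as NumPy arrays; `rag_match` is a hypothetical helper, not AAAIM's actual API:

```python
import numpy as np

def rag_match(query_vec, term_vecs, term_ids, top_k=3):
    """Return the top_k ontology terms most similar to the query by cosine similarity.

    query_vec: embedding of the normalized name, shape (d,).
    term_vecs: embeddings of ontology terms, shape (n, d).
    term_ids:  n ontology IDs aligned with the rows of term_vecs.
    """
    q = query_vec / np.linalg.norm(query_vec)
    t = term_vecs / np.linalg.norm(term_vecs, axis=1, keepdims=True)
    sims = t @ q                           # cosine similarity of each term to the query
    best = np.argsort(sims)[::-1][:top_k]  # indices of the top_k most similar terms
    return [(term_ids[i], float(sims[i])) for i in best]
```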
To use RAG, create embeddings of the ontology first:
```bash
cd data

# For ChEBI:
python load_data.py --database chebi --model default

# For NCBI Gene, specify the taxonomy ID:
python load_data.py --database ncbigene --model default --tax_id 9606
```
- ChEBI: Chemical Entities of Biological Interest
  - Entity type: `chemical`
  - Direct: dictionary of standard names to ontology IDs; returns the top_k candidates with the highest hit counts.
  - RAG: embeddings of ontology terms; returns the top_k most similar terms.
- NCBI Gene: gene annotation
  - Entity type: `gene`
  - Direct: dictionary of gene names to NCBI gene IDs; returns the top_k candidates with the highest hit counts.
  - RAG: not yet implemented.

Planned (see future work below):
- UniProt: protein annotation
- Rhea: reaction annotation
- GO: Gene Ontology terms
ChEBI:
- Location: `data/chebi/`
- Files:
  - `cleannames2chebi.lzma`: mapping from clean names to ChEBI IDs
  - `chebi2label.lzma`: mapping from ChEBI IDs to labels
  - `chebi2names.lzma`: ChEBI synonyms used for the RAG approach
- Source: ChEBI ontology downloaded from https://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl.gz
NCBI Gene:
- Location: `data/ncbigene/`
- Files:
  - `names2ncbigene_bigg_organisms_protein-coding.lzma`: mapping from names to NCBI gene IDs; to limit file size, includes only protein-coding genes from the 18 species covered in BiGG models
  - `ncbigene2label_bigg_organisms_protein-coding.lzma`: mapping from NCBI gene IDs to labels
  - `ncbigene2names_tax{tax_id}_protein-coding.lzma`: NCBI gene synonyms for the given tax_id, used for the RAG approach
- Source: data obtained from the NCBI Gene FTP site: https://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/
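The `.lzma` dictionaries in `data/chebi/` and `data/ncbigene/` can be inspected directly in Python. A minimal sketch, assuming (this is an assumption, not confirmed by this README) that the files are lzma-compressed pickles:

```python
import lzma
import pickle

# Assumption: each .lzma file is an lzma-compressed pickled dict.
# If AAAIM uses a different serializer, adjust the loading code accordingly.
with lzma.open("data/chebi/cleannames2chebi.lzma", "rb") as f:
    cleannames2chebi = pickle.load(f)

print(len(cleannames2chebi))          # number of clean names in the dictionary
print(cleannames2chebi.get("water"))  # ChEBI ID(s) for a sample clean name, if present
```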
```
aaaim/
├── core/
│   ├── __init__.py             # Main interface exports
│   ├── annotation_workflow.py  # Annotation workflow (models without annotations)
│   ├── curation_workflow.py    # Curation workflow (models with annotations)
│   ├── model_info.py           # Model parsing and context
│   ├── llm_interface.py        # LLM interaction
│   └── database_search.py      # Database search functions
├── utils/
│   ├── constants.py
│   └── evaluation.py           # Functions for evaluation
├── examples/
│   └── simple_example.py       # Simple usage demo
├── data/
│   ├── chebi/                  # ChEBI compressed dictionaries
│   ├── ncbigene/               # NCBI Gene compressed dictionaries
│   └── chroma_storage/         # Database embeddings for RAG
└── tests/
    ├── test_models/            # Test models
    └── aaaim_evaluation.ipynb  # Evaluation notebook
```
- Multi-Database Support: UniProt, GO, Rhea
- Improve RAG for NCBI Gene: test other embedding models for genes
- Web Interface: User-friendly annotation tool