A comprehensive approach to enriching and vectorizing the World Health Organization's ICD-11 medical classification system using advanced biomedical language models and embedding techniques.
The 11th Revision of the International Classification of Diseases (ICD-11) serves as the global diagnostic standard, organizing over 40,000 medical entities within a complex hierarchical structure. However, ICD-11 descriptions suffer from significant information gaps: 7,066 entries have empty descriptions and the average completeness score is only 14%, which severely hinders computational applications such as semantic search and automated coding.
How can we systematically enrich sparse ICD-11 descriptions and create high-quality vector representations for biomedical applications?
Our Solution: Leverage Llama3-OpenBioLLM-70B to generate comprehensive medical descriptions, then evaluate multiple embedding approaches to create the first open-source vectorization of ICD-11's complex hierarchical structure.
- 13,960 ICD-11 entities comprising 10,678 diseases (76%) and 3,282 classification entities (24%)
- Original limitations: 154-character average description length, 7,066 empty descriptions, 14% completeness score
- Enhanced dataset: 800-character average descriptions, zero empty entries, 78% completeness score
- WHO API Extraction using breadth-first search to crawl the hierarchical structure (sketched after this list)
- Bayesian Imputation for missing values while preserving medical relationships
- LLM Enhancement via Llama3-OpenBioLLM-70B with structured prompts (causes, symptoms, transmission, diagnosis)
- Multi-axis Validation across linguistic, medical, and hierarchical dimensions
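A minimal sketch of the first step, the breadth-first crawl of the WHO ICD API, might look as follows. It assumes an OAuth2 access token has already been obtained from the WHO ICD access management service and that each entity response carries `title`, `definition`, and `child` fields as in the public ICD API; the header values and field names are assumptions, not the project's exact code.

```python
from collections import deque
import requests

API_ROOT = "https://id.who.int/icd/entity"      # public ICD API root (assumption)
HEADERS = {
    "Authorization": "Bearer <ACCESS_TOKEN>",   # token from WHO ICD access management
    "Accept": "application/json",
    "Accept-Language": "en",
    "API-Version": "v2",
}

def crawl_icd11(root_uri: str = API_ROOT) -> dict:
    """Breadth-first traversal of the ICD-11 entity hierarchy."""
    entities, queue, seen = {}, deque([root_uri]), {root_uri}
    while queue:
        uri = queue.popleft()
        resp = requests.get(uri, headers=HEADERS, timeout=30)
        resp.raise_for_status()
        data = resp.json()
        entities[uri] = {
            "title": data.get("title", {}).get("@value", ""),
            "definition": data.get("definition", {}).get("@value", ""),
            "children": data.get("child", []),
        }
        for child_uri in data.get("child", []):
            if child_uri not in seen:           # avoid re-visiting shared nodes
                seen.add(child_uri)
                queue.append(child_uri)
    return entities
```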
We employ Llama3-OpenBioLLM-70B, chosen for its superior performance over proprietary alternatives like Med-PaLM 2 and GPT-4 on biomedical benchmarks. Our structured prompt ensures comprehensive coverage: Overview, Causes, Symptoms, Transmission, Diagnosis.
Generation Parameters:
- Temperature: 0.2 (deterministic reproducibility)
- Max tokens: 800 (information density)
- Structured medical format with standardized ordering
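As a rough illustration, the generation step could be run through the Hugging Face `transformers` text-generation pipeline with these parameters. The prompt below is an illustrative stand-in for the project's structured prompt, and loading the 70B checkpoint is assumed to be handled by multi-GPU `device_map="auto"` or quantization.

```python
from transformers import pipeline

# Assumes enough GPU memory (or quantization) for the 70B checkpoint.
generator = pipeline(
    "text-generation",
    model="aaditya/Llama3-OpenBioLLM-70B",
    device_map="auto",
)

# Illustrative prompt; the project's exact wording may differ.
PROMPT_TEMPLATE = (
    "You are a medical expert. Write a structured description of the ICD-11 "
    "entity '{title}' with the sections: Overview, Causes, Symptoms, "
    "Transmission, Diagnosis."
)

def enrich_description(title: str) -> str:
    out = generator(
        PROMPT_TEMPLATE.format(title=title),
        max_new_tokens=800,      # information density
        temperature=0.2,         # near-deterministic, reproducible output
        do_sample=True,
        return_full_text=False,
    )
    return out[0]["generated_text"]
```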
We systematically compare seven embedding approaches:
- TF-IDF: 3,000 features with SVD reduction
- FastText: Skip-gram model, window size 5
- BERT: General-domain baseline (bert-base-uncased)
- BioBERT: PubMed abstracts fine-tuning
- BioClinicalBERT: MIMIC-III clinical notes specialization
- PubMedBERT: From-scratch biomedical training
- GatorTron: 82-billion token clinical corpus
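To make the comparison concrete, here is a hedged sketch of how two of these representations could be produced: TF-IDF with SVD reduction via scikit-learn, and mean-pooled contextual embeddings from a biomedical BERT checkpoint. The `dmis-lab/biobert-v1.1` identifier is an assumption; swap in PubMedBERT, GatorTron, or another model under comparison.

```python
import torch
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import AutoModel, AutoTokenizer


def tfidf_embeddings(descriptions: list[str], dim: int = 300):
    """TF-IDF (3,000 features) followed by SVD dimensionality reduction."""
    matrix = TfidfVectorizer(max_features=3000).fit_transform(descriptions)
    svd = TruncatedSVD(n_components=min(dim, matrix.shape[1] - 1))
    return svd.fit_transform(matrix)


@torch.no_grad()
def bert_embeddings(descriptions: list[str], name: str = "dmis-lab/biobert-v1.1"):
    """Mean-pooled contextual embeddings from a (biomedical) BERT checkpoint."""
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name).eval()
    batch = tokenizer(descriptions, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state        # (batch, seq_len, hidden_dim)
    mask = batch["attention_mask"].unsqueeze(-1)     # zero out padding positions
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()
```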
Our evaluation spans multiple biomedical tasks to assess embedding quality:
- Intrinsic Quality: Silhouette scores, Calinski-Harabasz indices, Davies-Bouldin scores (sketched after this list)
- Comorbidity Detection: Correlation with disease co-occurrence patterns
- Symptom-Disease Matching: Clinical relevance of semantic relationships
- Hierarchical Consistency: Preservation of ICD-11's taxonomic structure
- Encyclopedia Definition Retrieval: Accuracy in matching external medical definitions
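For the first axis, the three intrinsic clustering indices can be computed directly with scikit-learn. A minimal sketch, assuming each embedding is labelled with its ICD-11 chapter (the choice of grouping variable is an assumption):

```python
import numpy as np
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

def intrinsic_quality(embeddings: np.ndarray, chapter_labels: np.ndarray) -> dict:
    """Cluster-quality indices using ICD-11 chapters as the reference grouping."""
    return {
        "silhouette": silhouette_score(embeddings, chapter_labels),               # higher is better
        "calinski_harabasz": calinski_harabasz_score(embeddings, chapter_labels), # higher is better
        "davies_bouldin": davies_bouldin_score(embeddings, chapter_labels),       # lower is better
    }
```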
Our LLM-generated descriptions show marked improvements across all validation metrics:
| Metric | Original ICD-11 | Enhanced Descriptions |
|---|---|---|
| Completeness Score | 14% | 78% |
| Average Length | 154 characters | 800 characters |
| Empty Descriptions | 7,066 entries | 0 entries |
| Medical Causality | Baseline | 6x improvement |
| Readability (Flesch-Kincaid) | 8.9 (intermediate) | 12-15.5 (academic) |
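A hedged sketch of how the length, readability, and novelty figures could be reproduced, using the `textstat` and `nltk` packages (the exact tooling behind the reported numbers is an assumption):

```python
import textstat
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

def description_metrics(original: str, enhanced: str) -> dict:
    """Length, Flesch-Kincaid grade, and BLEU overlap with the original description."""
    smoother = SmoothingFunction().method1
    return {
        "length_chars": len(enhanced),
        "flesch_kincaid_grade": textstat.flesch_kincaid_grade(enhanced),
        # Low BLEU against the original indicates substantial new content.
        "bleu_vs_original": sentence_bleu(
            [original.split()], enhanced.split(), smoothing_function=smoother
        ),
    }
```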
| Rank | Model | Score | Interpretation |
|---|---|---|---|
| 1 | TF-IDF | 0.28 | Excels at lexical overlap detection |
| 2 | GatorTron | 0.19 | Strong clinical context understanding |
| 3 | PubMedBERT | 0.17 | Balanced biomedical knowledge |

| Rank | Model | Agreement Rate | Consistency |
|---|---|---|---|
| 1 | PubMedBERT | 35% inter-model | High consensus |
| 2 | BioBERT | 35% inter-model | Medical specialization |
| 3 | GatorTron | 35% inter-model | Clinical notes expertise |

| Rank | Model | 1-Symbol Accuracy | 4-Symbol Accuracy |
|---|---|---|---|
| 1 | PubMedBERT | 92.33% | 75.13% |
| 2 | BioBERT | 89.95% | 66.14% |
| 3 | BERT | 83.07% | 51.85% |
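One plausible reading of the 1- and 4-symbol accuracies is nearest-neighbour retrieval followed by a match on the first n characters of the ICD-11 code. The sketch below implements that reading; it is an interpretation, not the project's exact protocol.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def prefix_accuracy(
    query_emb: np.ndarray,     # embeddings of query texts (e.g. external definitions)
    corpus_emb: np.ndarray,    # embeddings of ICD-11 entities
    query_codes: list[str],    # gold ICD-11 code for each query
    corpus_codes: list[str],   # ICD-11 code of each corpus entity
    n_symbols: int = 4,
) -> float:
    """Share of queries whose nearest entity matches on the first n code symbols."""
    nearest = cosine_similarity(query_emb, corpus_emb).argmax(axis=1)
    hits = [
        query_codes[i][:n_symbols] == corpus_codes[j][:n_symbols]
        for i, j in enumerate(nearest)
    ]
    return float(np.mean(hits))
```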
- Contextual models (PubMedBERT, BioBERT, GatorTron) excel at semantic understanding and hierarchical relationships
- Traditional methods (TF-IDF, FastText) outperform in direct lexical matching tasks like comorbidity detection
- Domain-specific training provides crucial advantages - PubMedBERT's from-scratch biomedical training shows superior performance
- Task-dependent performance highlights the need for embedding selection based on specific biomedical applications
├── data/                                  # Raw, processed, and analyzed datasets
│   ├── 1-extraction/                      # Scripts for data extraction
│   ├── 2-processing/                      # Notebooks for data cleaning and generation
│   └── 3-analysis/                        # Notebooks for data analysis and visualizations
├── embeddings/                            # Embeddings, analysis scripts, and visualizations
│   ├── embedding_analysis.py
│   ├── embeddings visuals/                # Visualizations related to embeddings
│   └── resulting ICD-11 csv embeddings/   # Stored ICD-11 embeddings
├── evaluation/                            # Model evaluation notebooks and results
│   ├── comorbidity score evaluation/
│   ├── encyclopedia definition metric evaluation/
│   ├── non-medical-terms/
│   ├── symptoms benchmark/
│   └── visualizations/                    # Visualizations of evaluation results
├── misc./                                 # Miscellaneous scripts and documentation
├── models/                                # Notebooks for training various embedding models
├── report-presentation/                   # Project reports and presentations
├── requirements.txt                       # Python dependencies
├── LICENSE                                # Project license
└── README.md                              # Project overview and instructions
- ✅ Vocabulary Richness: Superior Type-Token Ratio and Lexical Diversity
- ✅ Academic Readability: Flesch-Kincaid scores 12-15.5 (vs. 8.9 original)
- ✅ Information Novelty: Low BLEU scores (0.01-0.055) confirm substantial new content
- ✅ Medical Relevance: 6x increase in causal medical terminology
- ✅ Structured Composition: Proper allocation of text to medical components
- ✅ Domain Alignment: Clustering along five established medical semantic axes
- ✅ Clinical Accuracy: Enhanced descriptions follow medical best practices
- ✅ Treatment Information: Average 2% appropriate treatment content inclusion
- ✅ Content Preservation: Mean cosine similarity 0.636 across hierarchy levels
- ✅ Semantic Drift Management: Expected variation in diverse subcategories
- ✅ Anatomical Distribution: Proper emphasis on complex body systems (Head/Face, Brain)
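The content-preservation figure (mean cosine similarity of 0.636 across hierarchy levels) can be illustrated with a simple parent-child similarity check. A minimal sketch, assuming an embedding lookup keyed by entity ID and a list of parent-child edges (both assumptions about the data layout):

```python
import numpy as np
from numpy.linalg import norm

def mean_parent_child_similarity(
    embeddings: dict[str, np.ndarray],   # entity ID -> embedding vector
    edges: list[tuple[str, str]],        # (parent ID, child ID) pairs from the hierarchy
) -> float:
    """Average cosine similarity between each parent and its direct children."""
    sims = []
    for parent, child in edges:
        p, c = embeddings[parent], embeddings[child]
        sims.append(float(p @ c / (norm(p) * norm(c))))
    return float(np.mean(sims))
```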
Clinical Decision Support
- Enhanced semantic search across medical conditions
- Improved automated coding and documentation
- Better disease similarity detection for differential diagnosis
Research Applications
- Large-scale epidemiological studies with semantic disease grouping
- Drug discovery through disease mechanism understanding
- Clinical trial patient matching based on condition similarity
Healthcare AI Systems
- Foundation for medical chatbots and virtual assistants
- Semantic search engines for medical literature
- Automated medical coding systems for hospitals
Data Integration
- Cross-system medical data harmonization
- Electronic health record semantic enhancement
- Medical knowledge graph construction
Biomedical NLP
- Benchmark dataset for medical embedding evaluation
- Framework for hierarchical medical knowledge representation
- Open-source alternative to proprietary medical language models
- Graph Neural Networks: Integrate GNNs with self-supervised objectives for better hierarchical relationships
- Foundation Model Integration: Explore GPT-4.1, Claude Sonnet 4, and Grok 3 for enhanced embeddings
- Scaling Law Analysis: Investigate optimal embedding dimensions and model capacity
- Multi-language Support: Extend to non-English ICD-11 implementations
- Temporal Dynamics: Capture evolving medical knowledge and terminology
- Cross-domain Transfer: Apply methodology to other medical taxonomies (SNOMED CT, MeSH)
- Real-time Clinical Systems: Optimize for production deployment in hospitals
- Patient-facing Applications: Improve handling of informal medical language
- Regulatory Compliance: Ensure adherence to medical data privacy requirements
This work builds upon and extends several key research areas:
Medical Language Models: Leverages advances in domain-specific biomedical transformers while providing open-source alternatives to proprietary systems like Med-PaLM 2.
Hierarchical Embeddings: Addresses the unique challenges of ICD-11's complex, interconnected structure compared to simpler taxonomies like ICD-10.
Biomedical Evaluation: Introduces comprehensive evaluation frameworks specifically designed for medical embedding assessment across multiple clinical tasks.
Note: ordering is alphabetical by surname; all team members contributed equally.
If you use this work in your research, please cite:
@article{icd11vectorization2025,
  title={Descriptions are all you need: Semantic Vectorization of Hierarchical Medical Knowledge (ICD-11) via Large Language Models},
  author={Lomele, Marco and Legotkin, Gleb and Koldyshev, Ilia and Caretti, Giorgio and Mantovani, Giovanni and Ruzzante, Leonardo},
  institution={Bocconi University},
  year={2025}
}

This project is licensed under the GPL 3.0 License; see the LICENSE file for details.
- WHO ICD-11 Official: https://icd.who.int/
- Llama3-OpenBioLLM: https://huggingface.co/aaditya/Llama3-OpenBioLLM-70B
- Enhanced Dataset: Available upon request for research purposes
- Evaluation Benchmarks: inspired by ICD2Vec, https://www.sciencedirect.com/science/article/pii/S1532046423000825