Validate that supporting text quotes in your data actually appear in their cited references.
This tool fetches scientific publications (currently PubMed/PMC) and verifies that quoted text (supporting_text) can be found in the referenced document using deterministic substring matching.
# Using uv (recommended)
uv pip install linkml-reference-validator
# Using pip
pip install linkml-reference-validator
# Validate a single quote against a reference
linkml-reference-validator validate text \
  "protein functions in cell cycle regulation" \
  PMID:12345678
# Validate a data file using LinkML validation
linkml-validate -s schema.yaml data.yaml \
  --validate-plugins linkml_reference_validator.plugins.ReferenceValidationPlugin
Scientific data often includes claims supported by quotes from publications. But how do you know the quotes are accurate?
Before:
gene_function:
  gene: TP53
  function: "regulates cell cycle"
  evidence:
    reference: PMID:12345678
    supporting_text: "TP53 is critical for cell cycle regulation"  # Is this really in the paper?
After validation:
$ linkml-reference-validator validate text \
"TP53 is critical for cell cycle regulation" \
PMID:12345678
✓ Valid: True
✓ Supporting text validated successfully in PMID:12345678
Note: The CLI was restructured in v1.x to use nested commands (validate text, validate data, cache reference). The old hyphenated commands (validate-text, validate-data, cache-reference) still work for backward compatibility but are deprecated.
Validate a single quote against a reference without needing a schema.
linkml-reference-validator validate text <TEXT> <REFERENCE_ID> [OPTIONS]
Example:
# Basic validation
linkml-reference-validator validate text \
"protein functions in cell cycle regulation" \
PMID:12345678
# With editorial notes (ignored in matching)
linkml-reference-validator validate text \
"protein [X] functions in cell cycle regulation" \
PMID:12345678
# Multi-part quote with omitted text
linkml-reference-validator validate text \
"protein functions ... cell cycle regulation" \
  PMID:12345678
Options:
- --cache-dir PATH - Directory for caching references (default: references_cache)
- --verbose - Show detailed validation information
- --help - Show help message
Exit Codes:
- 0 - Validation successful
- 1 - Validation failed
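Because the result is reported through the exit code, the command is easy to script. The following is a minimal Python sketch, assuming the linkml-reference-validator executable is on PATH; the helper names build_command and quote_is_valid are illustrative and not part of the package.

```python
import subprocess
from typing import List, Optional


def build_command(text: str, reference_id: str, cache_dir: Optional[str] = None) -> List[str]:
    """Assemble the `validate text` invocation documented above."""
    cmd = ["linkml-reference-validator", "validate", "text", text, reference_id]
    if cache_dir is not None:
        # --cache-dir is the only optional flag this sketch passes through
        cmd += ["--cache-dir", cache_dir]
    return cmd


def quote_is_valid(text: str, reference_id: str, cache_dir: Optional[str] = None) -> bool:
    """Run the CLI; exit code 0 means the quote validated, anything else is a failure."""
    completed = subprocess.run(build_command(text, reference_id, cache_dir))
    return completed.returncode == 0
```

Passing the arguments as a list avoids shell quoting problems when the quote itself contains brackets or ellipses.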
Validate an entire data file using a LinkML schema.
linkml-reference-validator validate data <DATA_FILE> --schema <SCHEMA> [OPTIONS]
Example:
Schema (gene_schema.yaml):
id: https://example.org/genes
name: gene-schema
classes:
  GeneFunction:
    attributes:
      gene:
        range: string
      function:
        range: string
      evidence:
        range: Evidence
  Evidence:
    attributes:
      reference:
        range: Reference
        implements:
          - linkml:authoritative_reference  # Marks this as a reference field
      supporting_text:
        range: string
        implements:
          - linkml:excerpt  # Marks this as text to validate
  Reference:
    attributes:
      id:
        identifier: true
        range: string
      title:
        range: string
Data (gene_data.yaml):
gene: TP53
function: "regulates cell cycle"
evidence:
  reference:
    id: PMID:12345678
    title: "TP53 in cell cycle control"
  supporting_text: "TP53 protein functions in cell cycle regulation"
Validation:
linkml-reference-validator validate data \
gene_data.yaml \
--schema gene_schema.yaml
# Output:
Validating gene_data.yaml against schema gene_schema.yaml
Cache directory: references_cache
✓ All validations passed!
Options:
- --schema PATH (required) - Path to LinkML schema
- --target-class CLASS - Specific class to validate
- --cache-dir PATH - Directory for caching references
- --verbose - Show detailed output
- --help - Show help message
Download and cache references for offline use.
linkml-reference-validator cache reference <REFERENCE_ID> [OPTIONS]
Example:
# Cache a single reference
linkml-reference-validator cache reference PMID:12345678
# Output:
Fetching PMID:12345678...
✓ Successfully cached PMID:12345678
Title: TP53 in cell cycle control
Authors: Smith J, Doe A, Johnson K
Content type: full_text_xml
  Content length: 45231 characters
Use Cases:
- Pre-fetch references before validation
- Build offline reference library
- Verify reference availability
The recommended way to use this tool is as a LinkML validation plugin with the standard linkml-validate command.
1. Install both packages:
uv pip install linkml linkml-reference-validator
2. Create your schema with interface markers:
# my_schema.yaml
id: https://example.org/my-schema
name: my-schema
prefixes:
  linkml: https://w3id.org/linkml/
classes:
  Evidence:
    attributes:
      reference:
        range: Reference
        implements:
          - linkml:authoritative_reference  # <-- This marks it as a reference
      supporting_text:
        range: string
        implements:
          - linkml:excerpt  # <-- This marks it as text to validate
3. Validate using linkml-validate:
linkml-validate \
--schema my_schema.yaml \
--validate-plugins linkml_reference_validator.plugins.ReferenceValidationPlugin \
  my_data.yaml
✅ Integrated validation - Combines schema validation + reference validation in one command
✅ Standard LinkML workflow - Uses familiar LinkML tools
✅ Flexible schema design - Works with any schema using the interface pattern
✅ Rich error reporting - Shows exactly where validation fails in your data
reference:
  id: PMID:12345678
supporting_text: "protein functions in cells"
Fetches:
- Abstract (always)
- Full text from PMC (when available)
- Metadata (title, authors, journal, year, DOI)
ID Formats:
- PMID:12345678
- 12345678 (assumes PMID)
- DOI - DOI:10.1038/nature12345
- URLs - Web pages and online documents
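To make the listed ID formats concrete, here is a small Python sketch of how such identifiers could be told apart. It is an illustration only: classify_reference_id is a hypothetical helper, and how the tool itself parses identifiers is an assumption not confirmed here.

```python
import re
from typing import Tuple


def classify_reference_id(raw: str) -> Tuple[str, str]:
    """Return (kind, value) for the ID formats listed above (hypothetical helper)."""
    value = raw.strip()
    # A bare run of digits is assumed to be a PMID
    if re.fullmatch(r"\d+", value):
        return ("PMID", value)
    # Check URLs before splitting on ":" because URLs contain a colon too
    if value.lower().startswith(("http://", "https://")):
        return ("URL", value)
    prefix, sep, rest = value.partition(":")
    if sep and prefix.upper() in {"PMID", "DOI"} and rest.strip():
        return (prefix.upper(), rest.strip())
    raise ValueError(f"Unrecognized reference ID: {raw!r}")
```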
For the validator to work, your LinkML schema must:
Use implements: [linkml:authoritative_reference] on slots that contain references:
classes:
  Evidence:
    attributes:
      reference:  # Can be nested object
        range: Reference
        implements:
          - linkml:authoritative_reference
OR use a flat structure:
classes:
  Evidence:
    attributes:
      reference_id:  # Can be flat string
        range: string
        implements:
          - linkml:authoritative_reference
Use implements: [linkml:excerpt] on slots containing quoted text:
classes:
  Evidence:
    attributes:
      supporting_text:  # The quote to validate
        range: string
        implements:
          - linkml:excerpt
If using nested references, define the Reference class:
classes:
  Reference:
    attributes:
      id:
        identifier: true
        range: string
      title:  # Optional: validates if provided
        range: string

evidence:
  reference:
    id: PMID:12345678
    title: "Study of Protein X"
  supporting_text: "protein functions in cell cycle regulation"

evidence:
  reference_id: PMID:12345678
  supporting_text: "protein functions in cell cycle regulation"

statement:
  text: "Protein X has multiple functions"
  evidence:
    - reference:
        id: PMID:11111111
      supporting_text: "protein functions in cell cycle"
    - reference:
        id: PMID:22222222
      supporting_text: "protein regulates DNA repair"

Use square brackets for editorial insertions that should be ignored during matching:
supporting_text: "protein [X] functions in cell cycle regulation"
# Matches: "protein functions in cell cycle regulation"
# Ignores: "X"
Use cases:
- [sic] - Original spelling
- [emphasis added] - Added emphasis
- [gene name] - Clarifications
- [...] - Omitted content markers
Use ellipsis for gaps in quoted text:
supporting_text: "protein functions ... in cell cycle regulation"
# Matches both parts independently:
# - "protein functions"
# - "in cell cycle regulation"
Requirements:
- Both parts must appear in the reference (order independent)
- Each part must be a substring match after normalization
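The rules for editorial notes and multi-part quotes can be sketched in a few lines of Python. This is an illustration of the described behavior, not the package's implementation: the function names are hypothetical, and _normalize is a simplified stand-in that only lowercases, strips punctuation, and collapses whitespace.

```python
import re
from typing import List


def _normalize(text: str) -> str:
    """Simplified normalization: lowercase, punctuation to spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())


def quote_parts(supporting_text: str) -> List[str]:
    """Drop [bracketed] notes, then split on '...' into independently matched parts."""
    without_notes = re.sub(r"\[[^\]]*\]", " ", supporting_text)
    parts = (_normalize(part) for part in without_notes.split("..."))
    return [part for part in parts if part]


def quote_matches(supporting_text: str, reference_content: str) -> bool:
    """True when every part is a substring of the normalized reference text."""
    parts = quote_parts(supporting_text)
    if not parts:
        # Nothing left to match, e.g. the whole quote was an editorial note
        return False
    haystack = _normalize(reference_content)
    return all(part in haystack for part in parts)
```

Because each part is checked on its own, the parts may appear in any order in the reference, which mirrors the "order independent" requirement above.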
Before matching, text is normalized:
- Greek letters spelled out (α→alpha, β→beta, etc.)
- Lowercased
- Punctuation removed
- Extra whitespace collapsed
Examples:
"T-Cell Receptor" → "t cell receptor"
"TP53 (p53) protein" → "tp53 p53 protein"
"α-catenin" → "alpha catenin"
"β-actin" → "beta actin"
"γ-tubulin" → "gamma tubulin"
Greek Letter Support:
All Greek letters (both uppercase and lowercase) are converted to their spelled-out English equivalents. This ensures:
- Bidirectional matching: "α-catenin" in a query matches "alpha-catenin" in the reference, and vice versa
- Preserved distinctions: "α-catenin" and "β-catenin" remain distinct (not collapsed to just "catenin")
- Consistent behavior: Works with any Greek letter commonly used in biomedical nomenclature
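The normalization steps can be approximated with the following Python sketch. It is a hedged illustration rather than the validator's actual code: the normalize function and the _GREEK table are assumptions, and the sketch lowercases before spelling out Greek letters so that a single lowercase table also covers uppercase letters.

```python
import re

_GREEK_NAMES = [
    "alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta", "theta",
    "iota", "kappa", "lambda", "mu", "nu", "xi", "omicron", "pi",
    "rho", "sigma", "tau", "upsilon", "phi", "chi", "psi", "omega",
]
# Map each lowercase Greek letter to its English name
_GREEK = dict(zip("αβγδεζηθικλμνξοπρστυφχψω", _GREEK_NAMES))
_GREEK["ς"] = "sigma"  # word-final sigma


def normalize(text: str) -> str:
    """Lowercase, spell out Greek letters, drop punctuation, collapse whitespace."""
    lowered = text.lower()
    spelled = "".join(_GREEK.get(ch, ch) for ch in lowered)
    no_punct = re.sub(r"[^\w\s]", " ", spelled)
    return " ".join(no_punct.split())
```

With this sketch, "α-catenin" and "alpha-catenin" normalize to the same string, while "β-catenin" stays distinct.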
References are automatically cached to disk to:
- Speed up repeated validations
- Reduce API calls to PubMed
- Enable offline validation
references_cache/
├── PMID_12345678.md
├── PMID_98765432.md
└── PMC_7654321.md
Cache files are stored as Markdown with YAML frontmatter for easy readability and compatibility:
---
reference_id: PMID:12345678
title: TP53 in cell cycle control
authors:
- Smith J
- Doe A
- Johnson K
journal: Nature
year: '2024'
doi: 10.1038/nature12345
content_type: full_text_xml
---
# TP53 in cell cycle control
**Authors:** Smith J, Doe A, Johnson K
**Journal:** Nature (2024)
**DOI:** [10.1038/nature12345](https://doi.org/10.1038/nature12345)
## Content
[Full text content follows...]
Note: The validator still supports reading legacy .txt format cache files for backward compatibility.
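Since cache files are plain Markdown with a frontmatter block, they can be inspected with a few lines of Python. The sketch below is illustrative: cache_path and split_frontmatter are hypothetical helpers, and the file naming rule (replace ":" with "_" and append ".md") is inferred from the directory listing above rather than confirmed.

```python
from pathlib import Path
from typing import Tuple


def cache_path(cache_dir: str, reference_id: str) -> Path:
    """Guess the cache file for a reference ID (naming rule is an assumption)."""
    return Path(cache_dir) / (reference_id.replace(":", "_") + ".md")


def split_frontmatter(markdown: str) -> Tuple[str, str]:
    """Split a cache file into (frontmatter_text, body_text)."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", markdown  # no frontmatter block present
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    return "", markdown  # opening marker without a closing one
```

The frontmatter text returned here can then be handed to any YAML parser, for example PyYAML's yaml.safe_load, to read fields such as title or content_type.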
# Use custom cache directory
linkml-reference-validator validate text \
"quote" PMID:123 \
--cache-dir /path/to/cache
# Pre-cache references
linkml-reference-validator cache reference PMID:12345678
# Force re-fetch (bypass cache)
linkml-reference-validator cache reference PMID:12345678 --force
Schema (gene.yaml):
id: https://example.org/genes
name: gene-schema
classes:
  GeneFunctionStatement:
    tree_root: true
    attributes:
      gene_symbol:
        range: string
      function_description:
        range: string
      evidence:
        range: Evidence
  Evidence:
    attributes:
      reference:
        range: Reference
        implements:
          - linkml:authoritative_reference
      supporting_text:
        range: string
        implements:
          - linkml:excerpt
  Reference:
    attributes:
      id:
        identifier: true
Data (tp53.yaml):
gene_symbol: TP53
function_description: "tumor suppressor"
evidence:
  reference:
    id: PMID:12345678
  supporting_text: "TP53 functions as a tumor suppressor"
Validation:
linkml-validate \
  --schema gene.yaml \
  --validate-plugins linkml_reference_validator.plugins.ReferenceValidationPlugin \
  tp53.yaml
# Check if a quote is in a paper
linkml-reference-validator validate text \
"protein kinase activity regulates cell proliferation" \
PMID:12345678
# With editorial note
linkml-reference-validator validate text \
"protein kinase [PKA] activity regulates cell proliferation" \
PMID:12345678
# Multi-part quote
linkml-reference-validator validate text \
"protein kinase activity ... regulates cell proliferation" \
  PMID:12345678
Data (gene_annotations.yaml):
- gene_symbol: BRCA1
  annotations:
    - function: "DNA repair"
      evidence:
        reference:
          id: PMID:11111111
        supporting_text: "BRCA1 plays a critical role in DNA repair"
    - function: "tumor suppressor"
      evidence:
        reference:
          id: PMID:22222222
        supporting_text: "BRCA1 functions as a tumor suppressor"
- gene_symbol: TP53
  annotations:
    - function: "cell cycle regulation"
      evidence:
        reference:
          id: PMID:33333333
        supporting_text: "TP53 regulates cell cycle checkpoints"
Validation:
linkml-reference-validator validate data \
gene_annotations.yaml \
--schema gene_schema.yaml \
--verbose
# Output shows validation for each reference:
# ✓ PMID:11111111 - "BRCA1 plays a critical role in DNA repair"
# ✓ PMID:22222222 - "BRCA1 functions as a tumor suppressor"
# ✓ PMID:33333333 - "TP53 regulates cell cycle checkpoints"
✅ Exact substring match (after normalization)
supporting_text: "protein functions in cells"
reference_content: "The protein functions in cells during mitosis."
# ✓ PASS - exact substring found
✅ Multi-part match
supporting_text: "protein functions ... during mitosis"
reference_content: "The protein functions in cells during mitosis."
# ✓ PASS - both parts found
✅ Editorial notes ignored
supporting_text: "protein [X] functions"
reference_content: "The protein functions in cells."
# ✓ PASS - [X] ignored in matching
✅ Case and punctuation normalized
supporting_text: "T-Cell Receptor"
reference_content: "The t cell receptor binds antigens."
# ✓ PASS - normalized to "t cell receptor"
❌ Text not in reference
supporting_text: "protein inhibits apoptosis"
reference_content: "The protein functions in cells."
# ✗ FAIL - "inhibits apoptosis" not found
❌ Partial multi-part match
supporting_text: "protein functions ... inhibits apoptosis"
reference_content: "The protein functions in cells."
# ✗ FAIL - second part not found
❌ Reference not accessible
supporting_text: "any quote"
reference_id: PMID:99999999
# ✗ FAIL - reference doesn't exist or can't be fetched
❌ Title mismatch (when title provided)
reference:
  id: PMID:12345678
  title: "Wrong Title"
supporting_text: "correct quote"
# ✗ FAIL - title doesn't match fetched reference
Default: references_cache/ in current directory
# Custom cache location
export REFERENCE_CACHE_DIR=/path/to/cache
linkml-reference-validator validate text "quote" PMID:123
# Or use CLI option
linkml-reference-validator validate text "quote" PMID:123 \
  --cache-dir /path/to/cache
The tool respects NCBI API rate limits (3 requests/second without API key).
Optional: Set email for NCBI Entrez (recommended):
export NCBI_EMAIL="your_email@example.com"
Optional: Use NCBI API key for higher rate limits:
export NCBI_API_KEY="your_api_key_here"
Causes:
- PMID doesn't exist
- Network connectivity issues
- NCBI API temporarily unavailable
Solutions:
# Verify PMID exists on PubMed
# Check network connection
# Try again later (NCBI may be down)
Causes:
- Abstract not available
- Article behind paywall (no PMC access)
- Retracted article
Solutions:
# Check if article has abstract on PubMed
# Look for PMC full text availability
# Try a different reference
Causes:
- Quote is incorrect or paraphrased
- Text only in figures/tables (not extracted)
- Text uses different terminology
- Unicode characters normalized out
Solutions:
# Verify exact quote from PDF/HTML
# Try shorter, more specific quote
# Check if text is in figure caption
# Use editorial notes for differences: "protein [X] functions"
Cause:
- Entire supporting_text is in brackets: "[editorial note]"
Solution:
# Include actual quote text
supporting_text: "protein functions [in cells]"
# Not just: "[editorial note]"
- First validation: ~2-3 seconds (includes fetch + cache)
- Cached validation: ~10-50ms
- Batch validation: ~50ms per reference (cached)
- Pre-cache references:
  # Cache all references before validation
  for pmid in PMID:111 PMID:222 PMID:333; do
    linkml-reference-validator cache reference "$pmid"
  done
- Reuse cache directory:
  # Share cache across projects
  export REFERENCE_CACHE_DIR=~/.reference_cache
- Use verbose mode to see what's slow:
  linkml-reference-validator validate data data.yaml \
    --schema schema.yaml \
    --verbose
# Clone repository
git clone https://github.com/linkml/linkml-reference-validator
cd linkml-reference-validator
# Install with dev dependencies
uv sync --group dev
# Run tests
just test
# Run specific test
uv run pytest tests/test_cli.py::test_validate_text_command_success
linkml-reference-validator/
├── src/linkml_reference_validator/
│   ├── cli.py                              # CLI commands
│   ├── models.py                           # Data models
│   ├── validation/
│   │   └── supporting_text_validator.py    # Core validation logic
│   ├── etl/
│   │   └── reference_fetcher.py            # Reference fetching
│   └── plugins/
│       └── reference_validation_plugin.py  # LinkML plugin
├── tests/
│   ├── fixtures/                           # Test reference files
│   ├── test_cli.py                         # CLI tests
│   ├── test_e2e_integration.py             # End-to-end tests
│   └── ...
├── justfile                                # Development commands
└── pyproject.toml                          # Project configuration
# All tests
just test
# Just pytest
just pytest
# With coverage
uv run pytest --cov=src/linkml_reference_validator
# Specific test file
uv run pytest tests/test_cli.py
# Doctests
just doctest
While the CLI is recommended, you can also use the Python API:
from linkml_reference_validator.validation.supporting_text_validator import (
SupportingTextValidator
)
from linkml_reference_validator.models import ReferenceValidationConfig
# Create validator
config = ReferenceValidationConfig(cache_dir="my_cache")
validator = SupportingTextValidator(config)
# Validate text
result = validator.validate(
supporting_text="protein functions in cell cycle regulation",
reference_id="PMID:12345678",
)
print(result.is_valid) # True/False
print(result.message)  # Validation message
- PubMed only - Currently only supports PMID references (DOI and URLs coming soon)
- Text extraction - Only extracts text from abstracts and main article text (not figures, tables, or supplementary materials)
- Unicode normalization - Greek letters are spelled out (e.g., α → alpha, β → beta); other special symbols may be removed during normalization
- No fuzzy matching - Uses deterministic substring matching only (intentional design choice)
- English-centric - Text normalization assumes English text
- Greek letters: "α-catenin" matches "alpha catenin" (the letter is spelled out, not dropped)
- Chemical formulas: "H₂O" becomes "h o" or "h2o"
- Hyphens: "T-cell" matches "t cell"
- Abbreviations: Must match exactly as they appear (normalized)
| Manual | linkml-reference-validator |
|---|---|
| ❌ Time consuming | ✅ Automated |
| ❌ Error prone | ✅ Consistent |
| ❌ Not scalable | ✅ Validates 100s of quotes |
| ❌ Not reproducible | ✅ Cached, versioned |
linkml-reference-validator uses deterministic substring matching, not fuzzy matching:
✅ Predictable - Same input always gives same result
✅ Explainable - Easy to understand why validation passed/failed
✅ No false positives - Won't accept paraphrased text
✅ Fast - No complex similarity calculations
Contributions welcome! See CONTRIBUTING.md for guidelines.
- DOI support
- URL/webpage support
- Better Unicode handling
- Performance improvements for large batches
- More comprehensive error messages
If you use this tool in your research, please cite:
@software{linkml_reference_validator,
  title = {linkml-reference-validator: Validation of supporting text from references},
  author = {Mungall, Chris},
  year = {2024},
  url = {https://github.com/linkml/linkml-reference-validator}
}
Apache 2.0 - see LICENSE
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: Full Documentation
- LinkML - Modeling language for linked data
- linkml-validator - Core LinkML validation
- ai-gene-reviews - Inspiration for this project