Skip to content

linkml/linkml-reference-validator

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

linkml-reference-validator

Tests Coverage Python

Validate that supporting text quotes in your data actually appear in their cited references.

This tool fetches scientific publications (currently PubMed/PMC) and verifies that quoted text (supporting_text) can be found in the referenced document using deterministic substring matching.


Quick Start

Installation

# Using uv (recommended)
uv pip install linkml-reference-validator

# Using pip
pip install linkml-reference-validator

Basic Usage

# Validate a single quote against a reference
linkml-reference-validator validate text \
  "protein functions in cell cycle regulation" \
  PMID:12345678

# Validate a data file using LinkML validation
linkml-validate -s schema.yaml data.yaml \
  --validate-plugins linkml_reference_validator.plugins.ReferenceValidationPlugin

Why Use This Tool?

Scientific data often includes claims supported by quotes from publications. But how do you know the quotes are accurate?

Before:

gene_function:
  gene: TP53
  function: "regulates cell cycle"
  evidence:
    reference: PMID:12345678
    supporting_text: "TP53 is critical for cell cycle regulation"  # Is this really in the paper?

After validation:

$ linkml-reference-validator validate text \
    "TP53 is critical for cell cycle regulation" \
    PMID:12345678

✓ Valid: True
✓ Supporting text validated successfully in PMID:12345678

CLI Commands

Note: The CLI was restructured in v1.x to use nested commands (validate text, validate data, cache reference). The old hyphenated commands (validate-text, validate-data, cache-reference) still work for backward compatibility but are deprecated.

1. validate text - Quick Single Quote Validation

Validate a single quote against a reference without needing a schema.

linkml-reference-validator validate text <TEXT> <REFERENCE_ID> [OPTIONS]

Example:

# Basic validation
linkml-reference-validator validate text \
  "protein functions in cell cycle regulation" \
  PMID:12345678

# With editorial notes (ignored in matching)
linkml-reference-validator validate text \
  "protein [X] functions in cell cycle regulation" \
  PMID:12345678

# Multi-part quote with omitted text
linkml-reference-validator validate text \
  "protein functions ... cell cycle regulation" \
  PMID:12345678

Options:

  • --cache-dir PATH - Directory for caching references (default: references_cache)
  • --verbose - Show detailed validation information
  • --help - Show help message

Exit Codes:

  • 0 - Validation successful
  • 1 - Validation failed

2. validate data - Full Data File Validation

Validate an entire data file using a LinkML schema.

linkml-reference-validator validate data <DATA_FILE> --schema <SCHEMA> [OPTIONS]

Example:

Schema (gene_schema.yaml):

id: https://example.org/genes
name: gene-schema

classes:
  GeneFunction:
    attributes:
      gene:
        range: string
      function:
        range: string
      evidence:
        range: Evidence

  Evidence:
    attributes:
      reference:
        range: Reference
        implements:
          - linkml:authoritative_reference  # Marks this as a reference field
      supporting_text:
        range: string
        implements:
          - linkml:excerpt  # Marks this as text to validate

  Reference:
    attributes:
      id:
        identifier: true
        range: string
      title:
        range: string

Data (gene_data.yaml):

gene: TP53
function: "regulates cell cycle"
evidence:
  reference:
    id: PMID:12345678
    title: "TP53 in cell cycle control"
  supporting_text: "TP53 protein functions in cell cycle regulation"

Validation:

linkml-reference-validator validate data \
  gene_data.yaml \
  --schema gene_schema.yaml

# Output:
Validating gene_data.yaml against schema gene_schema.yaml
Cache directory: references_cache

✓ All validations passed!

Options:

  • --schema PATH (required) - Path to LinkML schema
  • --target-class CLASS - Specific class to validate
  • --cache-dir PATH - Directory for caching references
  • --verbose - Show detailed output
  • --help - Show help message

3. cache reference - Pre-cache References

Download and cache references for offline use.

linkml-reference-validator cache reference <REFERENCE_ID> [OPTIONS]

Example:

# Cache a single reference
linkml-reference-validator cache reference PMID:12345678

# Output:
Fetching PMID:12345678...
✓ Successfully cached PMID:12345678
  Title: TP53 in cell cycle control
  Authors: Smith J, Doe A, Johnson K
  Content type: full_text_xml
  Content length: 45231 characters

Use Cases:

  • Pre-fetch references before validation
  • Build offline reference library
  • Verify reference availability

Integration with linkml-validate

The recommended way to use this tool is as a LinkML validation plugin with the standard linkml-validate command.

Setup

1. Install both packages:

uv pip install linkml linkml-reference-validator

2. Create your schema with interface markers:

# my_schema.yaml
id: https://example.org/my-schema
name: my-schema

prefixes:
  linkml: https://w3id.org/linkml/

classes:
  Evidence:
    attributes:
      reference:
        range: Reference
        implements:
          - linkml:authoritative_reference  # <-- This marks it as a reference
      supporting_text:
        range: string
        implements:
          - linkml:excerpt  # <-- This marks it as text to validate

3. Validate using linkml-validate:

linkml-validate \
  --schema my_schema.yaml \
  --validate-plugins linkml_reference_validator.plugins.ReferenceValidationPlugin \
  my_data.yaml

Why Use the Plugin?

Integrated validation - Combines schema validation + reference validation in one command ✅ Standard LinkML workflow - Uses familiar LinkML tools ✅ Flexible schema design - Works with any schema using the interface pattern ✅ Rich error reporting - Shows exactly where validation fails in your data


Supported Reference Formats

Currently Supported

PubMed IDs (PMID)

reference:
  id: PMID:12345678
  supporting_text: "protein functions in cells"

Fetches:

  • Abstract (always)
  • Full text from PMC (when available)
  • Metadata (title, authors, journal, year, DOI)

ID Formats:

  • PMID:12345678
  • 12345678 (assumes PMID)

Coming Soon

  • DOI - DOI:10.1038/nature12345
  • URLs - Web pages and online documents

Schema Requirements

For the validator to work, your LinkML schema must:

1. Mark Reference Fields

Use implements: [linkml:authoritative_reference] on slots that contain references:

classes:
  Evidence:
    attributes:
      reference:           # Can be nested object
        range: Reference
        implements:
          - linkml:authoritative_reference

OR use a flat structure:

classes:
  Evidence:
    attributes:
      reference_id:        # Can be flat string
        range: string
        implements:
          - linkml:authoritative_reference

2. Mark Excerpt Fields

Use implements: [linkml:excerpt] on slots containing quoted text:

classes:
  Evidence:
    attributes:
      supporting_text:     # The quote to validate
        range: string
        implements:
          - linkml:excerpt

3. Define Reference Structure

If using nested references, define the Reference class:

classes:
  Reference:
    attributes:
      id:
        identifier: true
        range: string
      title:              # Optional: validates if provided
        range: string

Data Formats

Nested Reference (Recommended)

evidence:
  reference:
    id: PMID:12345678
    title: "Study of Protein X"
  supporting_text: "protein functions in cell cycle regulation"

Flat Reference ID

evidence:
  reference_id: PMID:12345678
  supporting_text: "protein functions in cell cycle regulation"

Multiple Evidence Items

statement:
  text: "Protein X has multiple functions"
  evidence:
    - reference:
        id: PMID:11111111
      supporting_text: "protein functions in cell cycle"
    - reference:
        id: PMID:22222222
      supporting_text: "protein regulates DNA repair"

Text Matching Syntax

Editorial Notes [...]

Use square brackets for editorial insertions that should be ignored during matching:

supporting_text: "protein [X] functions in cell cycle regulation"
# Matches: "protein functions in cell cycle regulation"
# Ignores: "X"

Use cases:

  • [sic] - Original spelling
  • [emphasis added] - Added emphasis
  • [gene name] - Clarifications
  • [...] - Omitted content markers

Omitted Text ...

Use ellipsis for gaps in quoted text:

supporting_text: "protein functions ... in cell cycle regulation"
# Matches both parts independently:
# - "protein functions"
# - "in cell cycle regulation"

Requirements:

  • Both parts must appear in the reference (order independent)
  • Each part must be a substring match after normalization

Text Normalization

Before matching, text is normalized:

  • Greek letters spelled out (α→alpha, β→beta, etc.)
  • Lowercased
  • Punctuation removed
  • Extra whitespace collapsed

Examples:

"T-Cell Receptor"     → "t cell receptor"
"TP53 (p53) protein"  → "tp53 p53 protein"
"α-catenin"          → "alpha catenin"
"β-actin"            → "beta actin"
"γ-tubulin"          → "gamma tubulin"

Greek Letter Support:

All Greek letters (both uppercase and lowercase) are converted to their spelled-out English equivalents. This ensures:

  • Bidirectional matching: "α-catenin" in a query matches "alpha-catenin" in the reference, and vice versa
  • Preserved distinctions: "α-catenin" and "β-catenin" remain distinct (not collapsed to just "catenin")
  • Consistent behavior: Works with any Greek letter commonly used in biomedical nomenclature

Caching

References are automatically cached to disk to:

  • Speed up repeated validations
  • Reduce API calls to PubMed
  • Enable offline validation

Cache Structure

references_cache/
├── PMID_12345678.md
├── PMID_98765432.md
└── PMC_7654321.md

Cache File Format

Cache files are stored as Markdown with YAML frontmatter for easy readability and compatibility:

---
reference_id: PMID:12345678
title: TP53 in cell cycle control
authors:
- Smith J
- Doe A
- Johnson K
journal: Nature
year: '2024'
doi: 10.1038/nature12345
content_type: full_text_xml
---

# TP53 in cell cycle control
**Authors:** Smith J, Doe A, Johnson K
**Journal:** Nature (2024)
**DOI:** [10.1038/nature12345](https://doi.org/10.1038/nature12345)

## Content

[Full text content follows...]

Note: The validator still supports reading legacy .txt format cache files for backward compatibility.

Cache Management

# Use custom cache directory
linkml-reference-validator validate text \
  "quote" PMID:123 \
  --cache-dir /path/to/cache

# Pre-cache references
linkml-reference-validator cache reference PMID:12345678

# Force re-fetch (bypass cache)
linkml-reference-validator cache reference PMID:12345678 --force

Examples

Example 1: Simple Gene Function Validation

Schema (gene.yaml):

id: https://example.org/genes
name: gene-schema

classes:
  GeneFunctionStatement:
    tree_root: true
    attributes:
      gene_symbol:
        range: string
      function_description:
        range: string
      evidence:
        range: Evidence

  Evidence:
    attributes:
      reference:
        range: Reference
        implements:
          - linkml:authoritative_reference
      supporting_text:
        range: string
        implements:
          - linkml:excerpt

  Reference:
    attributes:
      id:
        identifier: true

Data (tp53.yaml):

gene_symbol: TP53
function_description: "tumor suppressor"
evidence:
  reference:
    id: PMID:12345678
  supporting_text: "TP53 functions as a tumor suppressor"

Validation:

linkml-validate \
  --schema gene.yaml \
  --validate-plugins linkml_reference_validator.plugins.ReferenceValidationPlugin \
  tp53.yaml

Example 2: Quick Text Check

# Check if a quote is in a paper
linkml-reference-validator validate text \
  "protein kinase activity regulates cell proliferation" \
  PMID:12345678

# With editorial note
linkml-reference-validator validate text \
  "protein kinase [PKA] activity regulates cell proliferation" \
  PMID:12345678

# Multi-part quote
linkml-reference-validator validate text \
  "protein kinase activity ... regulates cell proliferation" \
  PMID:12345678

Example 3: Batch Validation with Multiple References

Data (gene_annotations.yaml):

- gene_symbol: BRCA1
  annotations:
    - function: "DNA repair"
      evidence:
        reference:
          id: PMID:11111111
        supporting_text: "BRCA1 plays a critical role in DNA repair"
    - function: "tumor suppressor"
      evidence:
        reference:
          id: PMID:22222222
        supporting_text: "BRCA1 functions as a tumor suppressor"

- gene_symbol: TP53
  annotations:
    - function: "cell cycle regulation"
      evidence:
        reference:
          id: PMID:33333333
        supporting_text: "TP53 regulates cell cycle checkpoints"

Validation:

linkml-reference-validator validate data \
  gene_annotations.yaml \
  --schema gene_schema.yaml \
  --verbose

# Output shows validation for each reference:
# ✓ PMID:11111111 - "BRCA1 plays a critical role in DNA repair"
# ✓ PMID:22222222 - "BRCA1 functions as a tumor suppressor"
# ✓ PMID:33333333 - "TP53 regulates cell cycle checkpoints"

Validation Rules

What Passes

Exact substring match (after normalization)

supporting_text: "protein functions in cells"
reference_content: "The protein functions in cells during mitosis."
# ✓ PASS - exact substring found

Multi-part match

supporting_text: "protein functions ... during mitosis"
reference_content: "The protein functions in cells during mitosis."
# ✓ PASS - both parts found

Editorial notes ignored

supporting_text: "protein [X] functions"
reference_content: "The protein functions in cells."
# ✓ PASS - [X] ignored in matching

Case and punctuation normalized

supporting_text: "T-Cell Receptor"
reference_content: "The t cell receptor binds antigens."
# ✓ PASS - normalized to "t cell receptor"

What Fails

Text not in reference

supporting_text: "protein inhibits apoptosis"
reference_content: "The protein functions in cells."
# ✗ FAIL - "inhibits apoptosis" not found

Partial multi-part match

supporting_text: "protein functions ... inhibits apoptosis"
reference_content: "The protein functions in cells."
# ✗ FAIL - second part not found

Reference not accessible

supporting_text: "any quote"
reference_id: PMID:99999999
# ✗ FAIL - reference doesn't exist or can't be fetched

Title mismatch (when title provided)

reference:
  id: PMID:12345678
  title: "Wrong Title"
supporting_text: "correct quote"
# ✗ FAIL - title doesn't match fetched reference

Configuration

Cache Directory

Default: references_cache/ in current directory

# Custom cache location
export REFERENCE_CACHE_DIR=/path/to/cache
linkml-reference-validator validate text "quote" PMID:123

# Or use CLI option
linkml-reference-validator validate text "quote" PMID:123 \
  --cache-dir /path/to/cache

NCBI API Settings

The tool respects NCBI API rate limits (3 requests/second without API key).

Optional: Set email for NCBI Entrez (recommended):

export NCBI_EMAIL="[email protected]"

Optional: Use NCBI API key for higher rate limits:

export NCBI_API_KEY="your_api_key_here"

Troubleshooting

Common Issues

"Could not fetch reference: PMID:12345678"

Causes:

  • PMID doesn't exist
  • Network connectivity issues
  • NCBI API temporarily unavailable

Solutions:

# Verify PMID exists on PubMed
# Check network connection
# Try again later (NCBI may be down)

"No content available for reference"

Causes:

  • Abstract not available
  • Article behind paywall (no PMC access)
  • Retracted article

Solutions:

# Check if article has abstract on PubMed
# Look for PMC full text availability
# Try a different reference

"Supporting text not found"

Causes:

  • Quote is incorrect or paraphrased
  • Text only in figures/tables (not extracted)
  • Text uses different terminology
  • Unicode characters normalized out

Solutions:

# Verify exact quote from PDF/HTML
# Try shorter, more specific quote
# Check if text is in figure caption
# Use editorial notes for differences: "protein [X] functions"

"Query is empty after removing brackets"

Cause:

  • Entire supporting_text is in brackets: "[editorial note]"

Solution:

# Include actual quote text
supporting_text: "protein functions [in cells]"
# Not just: "[editorial note]"

Performance

Benchmarks

  • First validation: ~2-3 seconds (includes fetch + cache)
  • Cached validation: ~10-50ms
  • Batch validation: ~50ms per reference (cached)

Tips for Speed

  1. Pre-cache references:

    # Cache all references before validation
    for pmid in PMID:111 PMID:222 PMID:333; do
      linkml-reference-validator cache reference $pmid
    done
  2. Reuse cache directory:

    # Share cache across projects
    export REFERENCE_CACHE_DIR=~/.reference_cache
  3. Use verbose mode to see what's slow:

    linkml-reference-validator validate data data.yaml \
      --schema schema.yaml \
      --verbose

Development

Installation for Development

# Clone repository
git clone https://github.com/linkml/linkml-reference-validator
cd linkml-reference-validator

# Install with dev dependencies
uv sync --group dev

# Run tests
just test

# Run specific test
uv run pytest tests/test_cli.py::test_validate_text_command_success

Project Structure

linkml-reference-validator/
├── src/linkml_reference_validator/
│   ├── cli.py                    # CLI commands
│   ├── models.py                 # Data models
│   ├── validation/
│   │   └── supporting_text_validator.py  # Core validation logic
│   ├── etl/
│   │   └── reference_fetcher.py  # Reference fetching
│   └── plugins/
│       └── reference_validation_plugin.py  # LinkML plugin
├── tests/
│   ├── fixtures/                 # Test reference files
│   ├── test_cli.py              # CLI tests
│   ├── test_e2e_integration.py  # End-to-end tests
│   └── ...
├── justfile                      # Development commands
└── pyproject.toml               # Project configuration

Running Tests

# All tests
just test

# Just pytest
just pytest

# With coverage
uv run pytest --cov=src/linkml_reference_validator

# Specific test file
uv run pytest tests/test_cli.py

# Doctests
just doctest

API Usage (Python)

While the CLI is recommended, you can also use the Python API:

from linkml_reference_validator.validation.supporting_text_validator import (
    SupportingTextValidator
)
from linkml_reference_validator.models import ReferenceValidationConfig

# Create validator
config = ReferenceValidationConfig(cache_dir="my_cache")
validator = SupportingTextValidator(config)

# Validate text
result = validator.validate(
    supporting_text="protein functions in cell cycle regulation",
    reference_id="PMID:12345678",
)

print(result.is_valid)  # True/False
print(result.message)   # Validation message

Limitations

Current Limitations

  1. PubMed only - Currently only supports PMID references (DOI and URLs coming soon)
  2. Text extraction - Only extracts text from abstracts and main article text (not figures, tables, or supplementary materials)
  3. Unicode normalization - Greek letters and special symbols are removed during normalization (e.g., α → a, β → b)
  4. No fuzzy matching - Uses deterministic substring matching only (intentional design choice)
  5. English-centric - Text normalization assumes English text

Known Edge Cases

  • Greek letters: "α-catenin" matches "a catenin" or "catenin"
  • Chemical formulas: "H₂O" becomes "h o" or "h2o"
  • Hyphens: "T-cell" matches "t cell"
  • Abbreviations: Must match exactly as they appear (normalized)

Comparison to Other Tools

vs. Manual Verification

Manual linkml-reference-validator
❌ Time consuming ✅ Automated
❌ Error prone ✅ Consistent
❌ Not scalable ✅ Validates 100s of quotes
❌ Not reproducible ✅ Cached, versioned

vs. Fuzzy Matching Tools

linkml-reference-validator uses deterministic substring matching, not fuzzy matching:

Predictable - Same input always gives same result ✅ Explainable - Easy to understand why validation passed/failed ✅ No false positives - Won't accept paraphrased text ✅ Fast - No complex similarity calculations


Contributing

Contributions welcome! See CONTRIBUTING.md for guidelines.

Priority Areas

  • DOI support
  • URL/webpage support
  • Better Unicode handling
  • Performance improvements for large batches
  • More comprehensive error messages

Citation

If you use this tool in your research, please cite:

@software{linkml_reference_validator,
  title = {linkml-reference-validator: Validation of supporting text from references},
  author = {Mungall, Chris},
  year = {2024},
  url = {https://github.com/linkml/linkml-reference-validator}
}

License

Apache 2.0 - see LICENSE


Support


Related Projects

About

Validate that supporting text quotes in your data actually appear in their cited references

Topics

Resources

License

Code of conduct

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •