MedCrawler is a Python package that provides asynchronous interfaces for crawling medical literature databases. It currently supports crawling data from PubMed (via NCBI E-utilities) and ClinicalTrials.gov (via their API v2).
- Asynchronous HTTP requests for efficient data retrieval
- Built-in rate limiting and retry strategies with exponential backoff
- Caching with time-based expiration
- Batch processing capabilities
- Comprehensive error handling
- Date-based filtering for both PubMed and ClinicalTrials.gov
- Well-defined abstract interfaces for easy extension to other sources
```bash
pip install git+https://github.com/yourusername/MedCrawler.git
```
Alternatively, add the repository as a submodule to your project:

```bash
git submodule add https://github.com/yourusername/MedCrawler.git
git submodule update --init --recursive
```

Then install the package in editable mode:

```bash
pip install -e ./MedCrawler
```
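If the installation succeeded, the package should import cleanly. Assuming `__init__.py` exposes the package version, as the project structure below suggests, a quick smoke test is:

```python
import medcrawler

# Print the installed version (assumes __version__ is exported from __init__.py)
print(medcrawler.__version__)
```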
A minimal end-to-end example searches PubMed and prints article metadata:

```python
import asyncio

from medcrawler import PubMedCrawler


async def main():
    async with PubMedCrawler() as crawler:
        # Search for articles
        async for pmid in crawler.search("cancer treatment", max_results=5):
            # Fetch metadata for each article
            metadata = await crawler.get_item(pmid)
            print(f"Title: {metadata['title']}")
            print(f"Authors: {', '.join(metadata['authors'])}")
            print(f"Abstract: {metadata['abstract'][:100]}...")
            print("\n" + "-" * 50 + "\n")


if __name__ == "__main__":
    asyncio.run(main())
```
The package includes a demonstration script that showcases its functionality:
```bash
python main.py --source pubmed --query "diabetes" --max 10

python main.py --source clinicaltrials --query "covid" --max 5 --recent
```
Available options:
- `--source`: `pubmed`, `clinicaltrials`, or `all` (default: `all`)
- `--query`: Search query string (default: `cancer`)
- `--max`: Maximum number of results (default: `5`)
- `--from-date`: Start date for filtering results (format depends on source)
- `--to-date`: End date for filtering results (format depends on source)
- `--recent`: Shorthand for setting `--from-date` to 90 days ago
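The date flags combine with the others. The example below assumes the PubMed date convention (`YYYY/MM/DD`) described later in this README:

```bash
python main.py --source pubmed --query "diabetes" --from-date 2023/01/01 --to-date 2023/06/30 --max 10
```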
The `BaseCrawler` class provides core functionality used by all crawler implementations:
```python
from medcrawler import CrawlerConfig
from medcrawler.base import BaseCrawler

# Create a custom configuration
config = CrawlerConfig(
    user_agent="YourApp/1.0",
    email="your.email@example.com",
    api_key="your-api-key",  # Optional
    min_interval=0.5  # Seconds between requests
)

# Use the crawler with the configuration
async with YourCrawler(config) as crawler:
    ...  # Your code here
```
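The same `config` object should also work with the bundled crawlers; this is an assumption based on the `config=None` constructor parameter shown in the extension example further down:

```python
from medcrawler import PubMedCrawler

# Reuse the configuration created above with a bundled crawler
async with PubMedCrawler(config) as crawler:
    ...
```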
```python
from medcrawler import PubMedCrawler

async with PubMedCrawler() as crawler:
    # Search with date filtering (YYYY/MM/DD format)
    async for pmid in crawler.search(
        query="cancer treatment",
        max_results=10,
        from_date="2023/01/01",
        to_date="2023/12/31"
    ):
        metadata = await crawler.get_item(pmid)

    # Batch retrieval for efficiency
    pmids = ["12345678", "23456789", "34567890"]
    results = await crawler.get_items_batch(pmids)
```
```python
from medcrawler import ClinicalTrialsCrawler

async with ClinicalTrialsCrawler() as crawler:
    # Search with date filtering (YYYY-MM-DD format)
    async for nct_id in crawler.search(
        query="covid vaccine",
        max_results=10,
        from_date="2023-01-01",
        to_date="2023-12-31"
    ):
        metadata = await crawler.get_item(nct_id)
```
You can implement your own crawler by extending the `BaseCrawler` class:
```python
from typing import Dict, Any, AsyncGenerator, Set, Optional

from medcrawler.base import BaseCrawler


class YourCrawler(BaseCrawler):
    def __init__(self, config=None):
        super().__init__("https://your-api-base-url.com", config)

    async def search(
        self,
        query: str,
        max_results: Optional[int] = None,
        old_item_ids: Optional[Set[str]] = None,
        from_date: Optional[str] = None,
        to_date: Optional[str] = None
    ) -> AsyncGenerator[str, None]:
        # Implementation here
        ...

    async def get_metadata_request_params(self, item_id: str) -> Dict:
        # Implementation here
        ...

    async def get_metadata_endpoint(self) -> str:
        # Implementation here
        ...

    def extract_metadata(self, response_data: Any) -> Dict[str, Any]:
        # Implementation here
        ...
```
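Once the abstract methods are implemented, the custom crawler can be driven exactly like the bundled ones. The snippet below is illustrative only; `YourCrawler` is the placeholder class defined above and the query string is arbitrary:

```python
import asyncio


async def run_custom_crawler():
    async with YourCrawler() as crawler:
        async for item_id in crawler.search("your query", max_results=10):
            metadata = await crawler.get_item(item_id)
            print(metadata)


asyncio.run(run_custom_crawler())
```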
```
medcrawler/
├── __init__.py          # Package version and exports
├── base.py              # Base crawler implementation
├── pubmed.py            # PubMed crawler
├── clinical_trials.py   # ClinicalTrials.gov crawler
└── config.py            # Configuration handling
```
Run the test suite with:

```bash
pytest
```
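Standard pytest options work as usual, for example verbose output or keyword-based test selection (assuming test names mention the crawler they cover):

```bash
pytest -v -k pubmed
```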
This project follows PEP 8 style guidelines and uses:
- Black for code formatting
- isort for import sorting
- pytest for testing
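A typical local formatting pass uses the tools' standard command-line entry points (not a project-specific script):

```bash
black .
isort .
```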
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request