A sophisticated multi-agent system built with CrewAI for extracting Continuing Professional Development (CPD) training activities from professional medical websites like RCPI (Royal College of Physicians of Ireland).
✅ WORKING: Successfully retrieves comprehensive CPD activity data from RCPI calendar
✅ PROVEN: CrewAI multi-agent concept validated with YAML-based configuration
📋 READY: Clean JSON output with 8 activities and 42 total CPD credits extracted
The system successfully extracts:
- Event titles and descriptions
- Dates and times
- Locations (physical/online/hybrid)
- CPD credit values
- Event formats and categories
- Clean JSON output without unnecessary metadata fields
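The fields listed above can be modelled as a small typed record. This is a minimal sketch, not code from the project: the `CpdActivity` class and `total_credits` helper are hypothetical names, and the field names are assumed to match the keys in the sample JSON output.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass(frozen=True)
class CpdActivity:
    """One extracted CPD activity (hypothetical model, not part of the project)."""

    title: str
    date: str          # ISO date, e.g. "2025-08-30"
    time: str          # 24-hour clock, e.g. "10:30"
    location: str
    format: str
    cpd_credits: float
    category: str

    @classmethod
    def from_dict(cls, raw: dict[str, Any]) -> "CpdActivity":
        # Raises KeyError if any expected field is missing from the record.
        return cls(**{f.name: raw[f.name] for f in fields(cls)})


def total_credits(activities: list[CpdActivity]) -> float:
    """Sum the CPD credits across all activities."""
    return sum(a.cpd_credits for a in activities)
```

Building the record through `from_dict` makes a missing key fail loudly at load time instead of surfacing later as a silent gap in the data.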
Sample output from the RCPI calendar (excerpt):
{
  "activities": [
    {
      "title": "The Royal College of Physicians of Ireland Intern Open Day",
      "date": "2025-08-30",
      "time": "10:30",
      "location": "Number Six Kildare Street and Online",
      "format": "Hybrid (In-person and Online)",
      "cpd_credits": 6,
      "category": "College Meetings"
    },
    {
      "title": "The Faculty of Occupational Medicine Autumn Conference",
      "date": "2025-09-26",
      "time": "09:30",
      "location": "Number Six Kildare Street and Online",
      "format": "Hybrid Conference",
      "cpd_credits": 6,
      "category": "Occupational Medicine"
    }
  ],
  "metadata": {
    "total_activities": 8,
    "source_url": "https://www.rcpi.ie/Calendar/calendar?Type=CPDEvents",
    "extraction_timestamp": "2025-08-22T20:49:49.826817",
    "schema_version": "1.0"
  }
}
- Framework: CrewAI - Multi-agent AI orchestration
- AI Provider: Anthropic Claude 3.5 Sonnet via API
- Web Automation: Playwright - Modern browser automation
- Language: Python 3.12+
- Configuration: YAML-based agent and task definitions
# Check that Python 3.12+ is installed
python3 --version
# Set up environment variables
export ANTHROPIC_API_KEY="your_claude_api_key_here"
# Install dependencies
make install
# or: pip install -r requirements.txt
# Install Playwright browser
playwright install chromium
# Run CPD scraper on RCPI website
make run URL="https://www.rcpi.ie/Calendar/calendar?Type=CPDEvents"
# Run integration tests
make test
# Clean output files
make clean
# View available commands
make help
The system uses 4 specialized CrewAI agents configured via YAML files:
🌐 Web Content Exploration Specialist
├── Playwright-powered web navigation and content extraction
├── Comprehensive page structure analysis
└── CPD-related content identification
🔍 AI-Powered CPD Content Analyst
├── Claude AI analysis of extracted web content
├── Intelligent activity identification and data extraction
└── Contextual information parsing
📄 CPD Data Formatting Specialist
├── Transforms extracted data into clean JSON structure
├── Removes unnecessary validation fields
└── Ensures consistent schema compliance
💾 Data Persistence and Verification Specialist
├── Saves formatted JSON to activities.json
├── File integrity verification
└── Operation success reporting
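The save-and-verify step of the last agent can be sketched in plain Python. This is only an illustration under stated assumptions: `save_and_verify` is a hypothetical helper, and "integrity verification" is interpreted here as a simple write, re-read, and compare round trip, which may differ from what the agent actually does.

```python
import json
import tempfile
from pathlib import Path


def save_and_verify(payload: dict, path: Path) -> bool:
    """Write payload as JSON, then re-read the file to confirm it is intact."""
    path.write_text(
        json.dumps(payload, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
    # Re-parse what was written; a truncated or corrupt file raises here.
    reloaded = json.loads(path.read_text(encoding="utf-8"))
    # The file is considered valid only if it round-trips to the same data
    # and the top-level "activities" value is a list.
    return reloaded == payload and isinstance(reloaded.get("activities"), list)


if __name__ == "__main__":
    sample = {"activities": [{"title": "Example event", "cpd_credits": 6}]}
    with tempfile.TemporaryDirectory() as tmp:
        ok = save_and_verify(sample, Path(tmp) / "activities.json")
        print("saved and verified" if ok else "verification failed")
```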
All agent behavior is defined in easily tweakable YAML files:
- config/agents.yaml: Agent roles, goals, and backstories
- config/tasks.yaml: Task descriptions, requirements, and expected outputs
This YAML-based approach makes it easy to refine prompts and agent behavior without code changes.
While the core CPD data extraction works reliably, speaker information extraction remains a challenge and is among the first items on the roadmap:
Phase 1: Complete Data Extraction
- Retrieve detailed event links for each activity
- Extract speaker/facilitator details from individual event pages
- Capture event images and promotional graphics
- Add full event descriptions and learning objectives
- Multi-website support beyond RCPI
Phase 2: Infrastructure & Deployment
- AWS Deployment strategy and infrastructure planning
- Containerization with Docker for consistent deployment
- Scheduled execution with AWS Lambda or ECS
- Data storage with DynamoDB or RDS
- API endpoint for accessing extracted data
Phase 3: Scale & Reliability
- Robust error handling and retry mechanisms
- Data deduplication and change detection
- Performance monitoring and alerting
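One possible shape for the planned deduplication and change detection is sketched below. Nothing here exists in the project yet: the function names are hypothetical, and treating the normalised title plus date as an activity's identity is an assumption that would need checking against real calendar data.

```python
import hashlib
import json


def activity_key(activity: dict) -> tuple[str, str]:
    """Identity of an activity: normalised title plus date (an assumption)."""
    return (activity["title"].strip().lower(), activity["date"])


def fingerprint(activity: dict) -> str:
    """Stable hash over all fields, used to spot edits between runs."""
    canonical = json.dumps(activity, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(activities: list[dict]) -> list[dict]:
    """Keep the first occurrence of each (title, date) pair."""
    seen: set[tuple[str, str]] = set()
    unique = []
    for activity in activities:
        key = activity_key(activity)
        if key not in seen:
            seen.add(key)
            unique.append(activity)
    return unique


def detect_changes(previous: list[dict], current: list[dict]) -> dict[str, list]:
    """Compare two runs and report added, removed, and changed activities."""
    old = {activity_key(a): fingerprint(a) for a in previous}
    new = {activity_key(a): fingerprint(a) for a in current}
    return {
        "added": sorted(k for k in new if k not in old),
        "removed": sorted(k for k in old if k not in new),
        "changed": sorted(k for k in new if k in old and new[k] != old[k]),
    }
```

Hashing a canonical JSON form (sorted keys) keeps the fingerprint independent of field order, so only a real change in a field value marks an activity as changed.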
This project demonstrates CrewAI's strength for complex, multi-step processes:
- ✅ Easy Configuration: YAML files make prompt tweaking simple
- ✅ Agent Coordination: Automatic task handoffs between specialists
- ✅ Tool Integration: Seamless browser automation and AI analysis
- ✅ Error Resilience: Built-in retry and fallback mechanisms
The project includes integration tests:
# Run full integration test suite
make test
# Run quick validation (if activities.json exists)
python3 test_quick.py
# Run complete pipeline test
python3 test_integration.py