🤖 CrewAI Multi-Agent CPD Scraper

A sophisticated multi-agent system built with CrewAI for extracting Continuing Professional Development (CPD) training activities from professional medical websites like RCPI (Royal College of Physicians of Ireland).

🎯 Project Status & Achievements

✅ WORKING: Successfully retrieves comprehensive CPD activity data from RCPI calendar
✅ PROVEN: CrewAI multi-agent concept validated with YAML-based configuration
⚠️ PARTIAL: Speaker extraction framework built but not yet reliable (requires detailed page navigation)
📋 READY: Clean JSON output with 8 activities and 42 total CPD credits extracted

Current Capabilities

The system successfully extracts:

Event titles and descriptions
Dates and times
Locations (physical/online/hybrid)
CPD credit values
Event formats and categories
Clean JSON output without unnecessary metadata fields

Sample Output

Here's what the system successfully extracts from RCPI:

{
  "activities": [
    {
      "title": "The Royal College of Physicians of Ireland Intern Open Day",
      "date": "2025-08-30",
      "time": "10:30",
      "location": "Number Six Kildare Street and Online",
      "format": "Hybrid (In-person and Online)",
      "cpd_credits": 6,
      "category": "College Meetings"
    },
    {
      "title": "The Faculty of Occupational Medicine Autumn Conference",
      "date": "2025-09-26",
      "time": "09:30",
      "location": "Number Six Kildare Street and Online",
      "format": "Hybrid Conference",
      "cpd_credits": 6,
      "category": "Occupational Medicine"
    }
  ],
  "metadata": {
    "total_activities": 8,
    "source_url": "https://www.rcpi.ie/Calendar/calendar?Type=CPDEvents",
    "extraction_timestamp": "2025-08-22T20:49:49.826817",
    "schema_version": "1.0"
  }
}
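
As a rough illustration of how this output might be consumed downstream, the sketch below parses a trimmed copy of the sample, checks that dates and times are ISO-formatted, and totals the CPD credits. The `summarize` helper is hypothetical and not part of the project; a real run would read activities.json instead of the inline string.

```python
import json
from datetime import date, time

# Trimmed copy of the sample output above; a real run would read activities.json.
SAMPLE = """
{
  "activities": [
    {"title": "The Royal College of Physicians of Ireland Intern Open Day",
     "date": "2025-08-30", "time": "10:30", "cpd_credits": 6,
     "category": "College Meetings"},
    {"title": "The Faculty of Occupational Medicine Autumn Conference",
     "date": "2025-09-26", "time": "09:30", "cpd_credits": 6,
     "category": "Occupational Medicine"}
  ]
}
"""


def summarize(payload: dict) -> dict:
    """Return the activity count, total credits, and first/last event date."""
    activities = payload["activities"]
    dates = []
    for activity in activities:
        # Both calls raise ValueError if the value is not ISO-formatted.
        dates.append(date.fromisoformat(activity["date"]))
        time.fromisoformat(activity["time"])
    return {
        "count": len(activities),
        "total_credits": sum(a["cpd_credits"] for a in activities),
        "first_date": min(dates).isoformat(),
        "last_date": max(dates).isoformat(),
    }


summary = summarize(json.loads(SAMPLE))
print(summary)
```

The totals here cover only the two activities shown in the sample, not the full set of 8 reported in the metadata.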

🏗️ Tech Stack

Framework: CrewAI - Multi-agent AI orchestration
AI Provider: Anthropic Claude 3.5 Sonnet via API
Web Automation: Playwright - Modern browser automation
Language: Python 3.12+
Configuration: YAML-based agent and task definitions

🚀 Quick Start

Prerequisites

# Check that Python 3.12+ is installed
python3 --version

# Set up environment variables
export ANTHROPIC_API_KEY="your_claude_api_key_here"

Installation

# Install dependencies
make install
# or: pip install -r requirements.txt

# Install Playwright browser
playwright install chromium

Usage

# Run CPD scraper on RCPI website
make run URL="https://www.rcpi.ie/Calendar/calendar?Type=CPDEvents"

# Run integration tests
make test

# Clean output files
make clean

# View available commands
make help

🤖 Agent Architecture

The system uses 4 specialized CrewAI agents configured via YAML files:

🌐 Web Content Exploration Specialist
├── Playwright-powered web navigation and content extraction
├── Comprehensive page structure analysis
└── CPD-related content identification

🔍 AI-Powered CPD Content Analyst  
├── Claude AI analysis of extracted web content
├── Intelligent activity identification and data extraction
└── Contextual information parsing

📄 CPD Data Formatting Specialist
├── Transforms extracted data into clean JSON structure
├── Removes unnecessary validation fields
└── Ensures consistent schema compliance

💾 Data Persistence and Verification Specialist
├── Saves formatted JSON to activities.json
├── File integrity verification
└── Operation success reporting
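
The last two stages (formatting and persistence) can be sketched in plain Python without CrewAI. This is not the project's implementation: the function names are hypothetical, the allowed keys are simply those visible in the sample output, and the `confidence` field stands in for whatever validation fields the real formatter drops.

```python
import json
import tempfile
from pathlib import Path

# Keys taken from the sample output; anything else is treated as droppable here.
ACTIVITY_KEYS = ("title", "date", "time", "location", "format", "cpd_credits", "category")


def format_activities(raw: list[dict]) -> list[dict]:
    """Keep only the schema keys, in a fixed order."""
    return [{key: item[key] for key in ACTIVITY_KEYS if key in item} for item in raw]


def save_and_verify(activities: list[dict], path: Path) -> bool:
    """Write the JSON file, then re-read it to confirm it round-trips."""
    payload = {"activities": activities}
    path.write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")
    return json.loads(path.read_text(encoding="utf-8")) == payload


raw = [{
    "title": "The Faculty of Occupational Medicine Autumn Conference",
    "date": "2025-09-26",
    "time": "09:30",
    "cpd_credits": 6,
    "confidence": 0.9,  # hypothetical validation field, not from the project
}]

clean = format_activities(raw)
with tempfile.TemporaryDirectory() as tmp:
    ok = save_and_verify(clean, Path(tmp) / "activities.json")
print(clean, ok)
```

Re-reading the file after writing is one simple way to interpret "file integrity verification"; the actual agent may check something different.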

YAML Configuration Files

All agent behavior is defined in easily tweakable YAML files:

config/agents.yaml - Agent roles, goals, and backstories
config/tasks.yaml - Task descriptions, requirements, and expected outputs

This YAML-based approach makes it easy to refine prompts and agent behavior without code changes.
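
Because prompts live in YAML, a typo or blank field can go unnoticed until an agent misbehaves. The sketch below shows one way to check that every agent defines a role, goal, and backstory. It operates on a plain dict standing in for the parsed contents of config/agents.yaml; the agent keys, goal text, and backstory text are hypothetical, and only the role names come from this README.

```python
REQUIRED_AGENT_FIELDS = ("role", "goal", "backstory")


def missing_fields(agents: dict) -> dict:
    """Map each agent name to the required fields it lacks or leaves blank."""
    problems = {}
    for name, spec in agents.items():
        missing = [
            field for field in REQUIRED_AGENT_FIELDS
            if not str(spec.get(field, "")).strip()
        ]
        if missing:
            problems[name] = missing
    return problems


# Stand-in for the parsed YAML; keys and prompt text are illustrative only.
agents = {
    "web_explorer": {
        "role": "Web Content Exploration Specialist",
        "goal": "Navigate the calendar page and extract its content",
        "backstory": "An experienced browser automation engineer",
    },
    "data_formatter": {
        "role": "CPD Data Formatting Specialist",
        "goal": "",  # blank on purpose to show a reported problem
    },
}

problems = missing_fields(agents)
print(problems)
```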

🎯 Current Limitations & Next Steps

Speaker Information Challenge

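While the core CPD data extraction works reliably, speaker information extraction is not yet reliable. Speaker details are not available on the calendar listing, so extracting them requires navigating to each activity's individual event page.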

Planned Enhancements

Phase 1: Complete Data Extraction

Retrieve detailed event links for each activity
Extract speaker/facilitator details from individual event pages
Capture event images and promotional graphics
Add full event descriptions and learning objectives
Multi-website support beyond RCPI

Phase 2: Infrastructure & Deployment

AWS Deployment strategy and infrastructure planning
Containerization with Docker for consistent deployment
Scheduled execution with AWS Lambda or ECS
Data storage with DynamoDB or RDS
API endpoint for accessing extracted data

Phase 3: Scale & Reliability

Robust error handling and retry mechanisms
Data deduplication and change detection
Performance monitoring and alerting
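
Deduplication and change detection are not implemented yet. One possible approach, shown here as a sketch under stated assumptions, is to identify an activity by its normalized title and date and to fingerprint the full record with a hash so edits between runs can be spotted. The identity rule, the function names, and the changed credit value in the demo data are all assumptions for illustration.

```python
import hashlib
import json


def activity_key(activity: dict) -> tuple:
    """Identity of an activity: normalized title plus date (an assumption)."""
    return (activity["title"].strip().lower(), activity["date"])


def fingerprint(activity: dict) -> str:
    """Stable hash of the whole record, used to spot edits between runs."""
    canonical = json.dumps(activity, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(activities: list[dict]) -> list[dict]:
    """Keep the first record seen for each identity key."""
    seen = {}
    for activity in activities:
        seen.setdefault(activity_key(activity), activity)
    return list(seen.values())


def diff_runs(previous: list[dict], current: list[dict]) -> dict:
    """Classify identity keys as added, removed, or changed between two runs."""
    old = {activity_key(a): fingerprint(a) for a in previous}
    new = {activity_key(a): fingerprint(a) for a in current}
    return {
        "added": sorted(k for k in new if k not in old),
        "removed": sorted(k for k in old if k not in new),
        "changed": sorted(k for k in new if k in old and new[k] != old[k]),
    }


OPEN_DAY = "The Royal College of Physicians of Ireland Intern Open Day"
CONFERENCE = "The Faculty of Occupational Medicine Autumn Conference"

previous = [{"title": OPEN_DAY, "date": "2025-08-30", "cpd_credits": 6}]
current = [
    {"title": OPEN_DAY, "date": "2025-08-30", "cpd_credits": 5},  # hypothetical edit
    {"title": CONFERENCE, "date": "2025-09-26", "cpd_credits": 6},
    {"title": "  " + CONFERENCE.lower() + " ", "date": "2025-09-26", "cpd_credits": 6},
]

unique = deduplicate(current)
changes = diff_runs(previous, unique)
print(len(unique), changes)
```

Keying on title and date would merge two distinct sessions that share both, so a real implementation might prefer the event URL once detailed event links are collected in Phase 1.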

CrewAI Findings

This project demonstrates CrewAI's strength for complex, multi-step processes:

✅ Easy Configuration: YAML files make prompt tweaking simple
✅ Agent Coordination: Automatic task handoffs between specialists
✅ Tool Integration: Seamless browser automation and AI analysis
✅ Error Resilience: Built-in retry and fallback mechanisms

🧪 Testing

The system includes some integration tests:

# Run full integration test suite
make test

# Run quick validation (if activities.json exists)
python3 test_quick.py

# Run complete pipeline test
python3 test_integration.py

bayshanntech/experiment_activities_scraping_crewai
