bayshanntech/experiment_activities_scraping_crewai

🤖 CrewAI Multi-Agent CPD Scraper

A sophisticated multi-agent system built with CrewAI for extracting Continuing Professional Development (CPD) training activities from professional medical websites like RCPI (Royal College of Physicians of Ireland).

🎯 Project Status & Achievements

WORKING: Successfully retrieves comprehensive CPD activity data from RCPI calendar
PROVEN: CrewAI multi-agent concept validated with YAML-based configuration
⚠️ PARTIAL: Speaker extraction framework built but not yet reliable (requires detailed page navigation)
📋 READY: Clean JSON output with 8 activities and 42 total CPD credits extracted

Current Capabilities

The system successfully extracts:

  • Event titles and descriptions
  • Dates and times
  • Locations (physical/online/hybrid)
  • CPD credit values
  • Event formats and categories
  • Clean JSON output without unnecessary metadata fields

Sample Output

Here's what the system successfully extracts from RCPI:

{
  "activities": [
    {
      "title": "The Royal College of Physicians of Ireland Intern Open Day",
      "date": "2025-08-30",
      "time": "10:30",
      "location": "Number Six Kildare Street and Online",
      "format": "Hybrid (In-person and Online)",
      "cpd_credits": 6,
      "category": "College Meetings"
    },
    {
      "title": "The Faculty of Occupational Medicine Autumn Conference",
      "date": "2025-09-26",
      "time": "09:30",
      "location": "Number Six Kildare Street and Online",
      "format": "Hybrid Conference",
      "cpd_credits": 6,
      "category": "Occupational Medicine"
    }
  ],
  "metadata": {
    "total_activities": 8,
    "source_url": "https://www.rcpi.ie/Calendar/calendar?Type=CPDEvents",
    "extraction_timestamp": "2025-08-22T20:49:49.826817",
    "schema_version": "1.0"
  }
}
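As a rough illustration of how this output could be consumed, the sketch below totals CPD credits per category. It assumes only the field names shown in the sample above; the `summarize` helper is hypothetical and not part of the repository.

```python
import json
from collections import defaultdict


def summarize(payload: dict) -> dict:
    """Count activities and total CPD credits, grouped by category."""
    activities = payload.get("activities", [])
    credits_by_category = defaultdict(int)
    for activity in activities:
        # Treat a missing or null credit value as 0 rather than failing.
        category = activity.get("category", "Uncategorised")
        credits_by_category[category] += activity.get("cpd_credits") or 0
    return {
        "activities": len(activities),
        "total_credits": sum(credits_by_category.values()),
        "credits_by_category": dict(credits_by_category),
    }


# Trimmed copy of the two sample activities shown above.
sample = json.loads("""
{
  "activities": [
    {"title": "The Royal College of Physicians of Ireland Intern Open Day",
     "cpd_credits": 6, "category": "College Meetings"},
    {"title": "The Faculty of Occupational Medicine Autumn Conference",
     "cpd_credits": 6, "category": "Occupational Medicine"}
  ]
}
""")

summary = summarize(sample)
print(summary)
```

To run the same summary against a real extraction, load the saved file with `json.load` instead of the inline sample.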

🏗️ Tech Stack

  • Framework: CrewAI - Multi-agent AI orchestration
  • AI Provider: Anthropic Claude 3.5 Sonnet via API
  • Web Automation: Playwright - Modern browser automation
  • Language: Python 3.12+
  • Configuration: YAML-based agent and task definitions

🚀 Quick Start

Prerequisites

# Check that Python 3.12+ is installed
python3 --version

# Set up environment variables
export ANTHROPIC_API_KEY="your_claude_api_key_here"

Installation

# Install dependencies
make install
# or: pip install -r requirements.txt

# Install Playwright browser
playwright install chromium

Usage

# Run CPD scraper on RCPI website
make run URL="https://www.rcpi.ie/Calendar/calendar?Type=CPDEvents"

# Run integration tests
make test

# Clean output files
make clean

# View available commands
make help

🤖 Agent Architecture

The system uses 4 specialized CrewAI agents configured via YAML files:

🌐 Web Content Exploration Specialist
├── Playwright-powered web navigation and content extraction
├── Comprehensive page structure analysis
└── CPD-related content identification

🔍 AI-Powered CPD Content Analyst
├── Claude AI analysis of extracted web content
├── Intelligent activity identification and data extraction
└── Contextual information parsing

📄 CPD Data Formatting Specialist
├── Transforms extracted data into clean JSON structure
├── Removes unnecessary validation fields
└── Ensures consistent schema compliance

💾 Data Persistence and Verification Specialist
├── Saves formatted JSON to activities.json
├── File integrity verification
└── Operation success reporting
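
The last two stages (formatting, then persistence with verification) can be approximated in plain Python. This is a minimal sketch, not the agents' actual implementation: the helper names are hypothetical, the schema is assumed from the sample output, and `validation_status` and `confidence` are invented stand-ins for the "unnecessary validation fields" mentioned above.

```python
import json
import tempfile
from pathlib import Path

# Assumed schema: the per-activity field names from the sample output.
ACTIVITY_FIELDS = (
    "title", "date", "time", "location", "format", "cpd_credits", "category",
)


def format_activity(raw: dict) -> dict:
    """Keep only schema fields; fields absent from the input become None."""
    return {field: raw.get(field) for field in ACTIVITY_FIELDS}


def save_and_verify(activities: list, path: Path) -> bool:
    """Write the JSON file, then re-read it to confirm it round-trips."""
    payload = {"activities": activities}
    path.write_text(
        json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return json.loads(path.read_text(encoding="utf-8")) == payload


# Hypothetical analyst output carrying extra fields that should be dropped.
raw = {
    "title": "Example Event",
    "date": "2025-09-26",
    "cpd_credits": 6,
    "validation_status": "ok",
    "confidence": 0.9,
}
clean = format_activity(raw)

with tempfile.TemporaryDirectory() as tmp:
    ok = save_and_verify([clean], Path(tmp) / "activities.json")

print(clean)
print("verified:", ok)
```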

YAML Configuration Files

All agent behavior is defined in easily tweakable YAML files:

  • config/agents.yaml - Agent roles, goals, and backstories
  • config/tasks.yaml - Task descriptions, requirements, and expected outputs

This YAML-based approach makes it easy to refine prompts and agent behavior without code changes.
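
Because prompts live in YAML, a typo or an empty field can go unnoticed until a run fails. The sketch below shows one way to sanity-check agent definitions before launching the crew. It is an illustration only: the dictionary stands in for what a YAML parser (for example PyYAML's `yaml.safe_load`) would return for `config/agents.yaml`, and the agent names and prompt text are hypothetical.

```python
# The three keys named above for config/agents.yaml.
REQUIRED_AGENT_KEYS = ("role", "goal", "backstory")


def validate_agents(config: dict) -> list:
    """Return a list of problems; an empty list means every agent is complete."""
    problems = []
    for name, spec in config.items():
        spec = spec or {}  # a bare "agent_name:" entry parses to None
        for key in REQUIRED_AGENT_KEYS:
            value = spec.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"{name}: missing or empty '{key}'")
    return problems


# Hypothetical parsed configuration; names and text are illustrative only.
agents_config = {
    "web_explorer": {
        "role": "Web Content Exploration Specialist",
        "goal": "Navigate the target site and collect CPD-related content",
        "backstory": "An experienced web researcher.",
    },
    "content_analyst": {
        "role": "AI-Powered CPD Content Analyst",
        "goal": "",  # deliberately empty to trigger a finding
    },
}

problems = validate_agents(agents_config)
for problem in problems:
    print(problem)
```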

🎯 Current Limitations & Next Steps

Speaker Information Challenge

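While the core CPD data extraction works reliably, speaker information extraction is not yet reliable. Speaker details live on the individual event pages rather than the calendar listing, so extracting them requires detailed page navigation.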

Planned Enhancements

Phase 1: Complete Data Extraction

  • Retrieve detailed event links for each activity
  • Extract speaker/facilitator details from individual event pages
  • Capture event images and promotional graphics
  • Add full event descriptions and learning objectives
  • Multi-website support beyond RCPI

Phase 2: Infrastructure & Deployment

  • AWS Deployment strategy and infrastructure planning
  • Containerization with Docker for consistent deployment
  • Scheduled execution with AWS Lambda or ECS
  • Data storage with DynamoDB or RDS
  • API endpoint for accessing extracted data

Phase 3: Scale & Reliability

  • Robust error handling and retry mechanisms
  • Data deduplication and change detection
  • Performance monitoring and alerting
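
Deduplication and change detection are not implemented yet. One possible approach is sketched below, under the assumption that title plus date identifies an activity; the function names are hypothetical.

```python
import hashlib
import json


def activity_key(activity: dict) -> tuple:
    """Identity of an activity: normalised title plus date (an assumption)."""
    return (activity["title"].strip().lower(), activity["date"])


def fingerprint(activity: dict) -> str:
    """Stable hash of all fields, used to notice edits to a known activity."""
    canonical = json.dumps(activity, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(activities: list) -> list:
    """Keep the first occurrence of each (title, date) pair."""
    unique = {}
    for activity in activities:
        unique.setdefault(activity_key(activity), activity)
    return list(unique.values())


def detect_changes(previous: list, current: list) -> dict:
    """Compare two runs and report added, removed, and changed activities."""
    old = {activity_key(a): fingerprint(a) for a in previous}
    new = {activity_key(a): fingerprint(a) for a in current}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in new.keys() & old.keys() if new[k] != old[k]),
    }


# Made-up data: one event changes its credit value and one new event appears.
previous_run = [{"title": "Autumn Conference", "date": "2025-09-26", "cpd_credits": 6}]
current_run = deduplicate([
    {"title": "Autumn Conference", "date": "2025-09-26", "cpd_credits": 5},
    {"title": "autumn conference ", "date": "2025-09-26", "cpd_credits": 5},
    {"title": "Intern Open Day", "date": "2025-08-30", "cpd_credits": 6},
])

changes = detect_changes(previous_run, current_run)
print(changes)
```

Hashing the canonical JSON keeps the comparison independent of key order, so a field reordered between runs is not reported as a change.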

CrewAI Findings

This project demonstrates CrewAI's strength for complex, multi-step processes:

  • Easy Configuration: YAML files make prompt tweaking simple
  • Agent Coordination: Automatic task handoffs between specialists
  • Tool Integration: Seamless browser automation and AI analysis
  • Error Resilience: Built-in retry and fallback mechanisms

🧪 Testing

The system includes an integration test suite and a quick validation script:

# Run full integration test suite
make test

# Run quick validation (if activities.json exists)
python3 test_quick.py

# Run complete pipeline test
python3 test_integration.py

About

Working RCPI activities scraper using CrewAI multi-agent framework
