An intelligent web agent that automatically extracts and visualizes course information from educational websites using OpenAI's GPT models.
- 🌐 Automated web scraping of educational platforms
- 🤖 AI-powered content extraction and analysis
- 📊 Dynamic data visualization with interactive tables
- 🔄 Support for multiple educational platforms (NPTEL, DeepLearning.ai, etc.)
- 📸 Automatic screenshot capture of scraped websites
- 🧩 Modular and extensible architecture
- 🔗 Interactive course link generation
- 🖼️ Automatic image extraction
- 📱 Clean data presentation
- 🛠️ Customizable scraping configurations
- Grabbing course data from deeplearning.ai (22nd April, 2025)
- Grabbing course data from deeplearning.ai with context (22nd April, 2025)
- Python 3.12 or higher
- Jupyter Notebook or Google Colab
- OpenAI API key
- Web Browser
- Clone the repository:

  ```shell
  git clone https://github.com/smaranjitghose/CourseMiner.git
  cd CourseMiner
  ```
- Create and activate a virtual environment:

  ```shell
  # Windows
  python -m venv env
  .\env\Scripts\activate

  # Linux/macOS
  python3 -m venv env
  source env/bin/activate
  ```
- Install dependencies using `requirements.txt`:

  ```shell
  pip install -r requirements.txt
  ```
- Install Playwright browsers:

  ```shell
  python -m playwright install
  ```
- Set up your OpenAI API key in a `.env` file:

  ```shell
  OPENAI_API_KEY=your-api-key-here
  ```
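The notebook reads this key from the environment at runtime, typically via python-dotenv's `load_dotenv()`. As an illustration of what that loading amounts to, here is a stdlib-only sketch (the helper name `load_env` is ours, not part of the project):

```python
import os

def load_env(path: str = ".env") -> None:
    """Parse simple KEY=VALUE lines from a .env file into os.environ."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, comments, and lines without a KEY=VALUE pair
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Don't overwrite variables already set in the real environment
            os.environ.setdefault(key.strip(), value.strip())
```

In practice, simply keep `.env` in the project root (and out of version control) and let python-dotenv pick it up.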
- Launch Jupyter Notebook:

  ```shell
  jupyter notebook
  ```

- Open the example notebook: `/course_miner.ipynb`
- Define Pydantic models for your target website:

  ```python
  from typing import List

  from pydantic import BaseModel

  # Model for an individual course
  class CourseModel(BaseModel):
      title: str
      description: str
      instructors: List[str]
      course_url: str
      image_url: str

  # Model for the full collection of scraped courses
  class CourseCollection(BaseModel):
      courses: List[CourseModel]
  ```
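These models are what turn free-form LLM output into checked, structured data: if the model returns a field with the wrong name or type, validation fails loudly instead of producing a broken table. A minimal sketch (the models are repeated so the snippet runs on its own, and the payload is a hypothetical example of the JSON the LLM is asked to return):

```python
from typing import List

from pydantic import BaseModel

class CourseModel(BaseModel):
    title: str
    description: str
    instructors: List[str]
    course_url: str
    image_url: str

class CourseCollection(BaseModel):
    courses: List[CourseModel]

# Hypothetical payload shaped like the LLM's structured output
raw = {
    "courses": [
        {
            "title": "Intro to Prompt Engineering",
            "description": "Short course on prompting basics.",
            "instructors": ["Jane Doe"],
            "course_url": "https://example.edu/courses/prompting",
            "image_url": "https://example.edu/img/prompting.png",
        }
    ]
}

# Raises a ValidationError if any field is missing or mistyped
collection = CourseCollection(**raw)
print(collection.courses[0].title)  # → Intro to Prompt Engineering
```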
- Update the base URL and target URL in the designated cells:

  ```python
  # Set your target website URLs
  base_url = "https://example.edu"
  target_url = "https://example.edu/courses"
  ```
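Why a separate base URL? Links scraped from a page are often relative, and the base is typically what resolves them into clickable absolute course links. A minimal sketch with the standard library (the relative path is hypothetical):

```python
from urllib.parse import urljoin

base_url = "https://example.edu"

# A relative link as it might appear in the scraped page source
relative_link = "courses/intro-to-ml"

# Resolve it against the base URL to get an absolute, clickable link
absolute = urljoin(base_url + "/", relative_link)
print(absolute)  # → https://example.edu/courses/intro-to-ml
```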
- Run the browser agent by executing the notebook cells in order
- View the extracted course information in the interactive visualization table
- **API Errors**
  - Verify API key validity
  - Check internet connection
  - Confirm API usage limits
- **Scraping Issues**
  - Ensure the website is accessible
  - Check for rate limiting or blocking
  - Adjust wait times for dynamic content
- **Model Errors**
  - Verify the Pydantic model matches the website structure
  - Check field names and types
  - Ensure the system prompt is properly formatted
Contributions are welcome! Please follow these steps:
- Fork the project
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is licensed under the MIT License — see the [LICENSE](LICENSE) file for details.
Made with ❤️ by Smaranjit Ghose