🔍 CourseMiner

An intelligent web agent that automatically extracts and visualizes course information from educational websites using OpenAI's GPT models.

🌟 Features

🌐 Automated web scraping of educational platforms
🤖 AI-powered content extraction and analysis
📊 Dynamic data visualization with interactive tables
🔄 Support for multiple educational platforms (NPTEL, DeepLearning.ai, etc.)
📸 Automatic screenshot capture of scraped websites
🧩 Modular and extensible architecture
🔗 Interactive course link generation
🖼️ Automatic image extraction
📱 Clean data presentation
🛠️ Customizable scraping configurations

🖼️ Demo

Grabbing course data from NPTEL (22nd April, 2025)
Grabbing course data from deeplearning.ai (22nd April, 2025)
Grabbing course data from deeplearning.ai with context (22nd April, 2025)

🔧 Prerequisites

Python 3.12 or higher
Jupyter Notebook or Google Colab
OpenAI API key
Web Browser

📥 Setup

Clone the repository:

git clone https://github.com/smaranjitghose/CourseMiner.git
cd CourseMiner

Create and activate virtual environment:

# Windows
python -m venv env
.\env\Scripts\activate
# Linux/Mac
python3 -m venv env
source env/bin/activate

Install dependencies using requirements.txt:

pip install -r requirements.txt

Install Playwright browsers:

python -m playwright install

Set up your OpenAI API key in a .env file:

OPENAI_API_KEY=your-api-key-here
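
To confirm the key is actually visible to Python before running the agent, a stdlib-only check such as the one below may help. This is a minimal sketch, not the notebook's own loading code: `load_env_file` is a hypothetical helper written for this example, and the notebook may read the file differently (for instance through a package such as python-dotenv).

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env file and copy them into os.environ.

    Hypothetical helper for illustration. Variables that are already set in
    the environment are left untouched.
    """
    loaded = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded


load_env_file()
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    print("OPENAI_API_KEY is not set; check your .env file")
```

Keep the .env file out of version control so the key is never committed.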

💡 Usage

Launch Jupyter Notebook:

jupyter notebook

Open the example notebook:

CourseMiner.ipynb

Define Pydantic models for your target website:

from typing import List

from pydantic import BaseModel

# Define the model for individual courses
class CourseModel(BaseModel):
    title: str
    description: str
    instructors: List[str]
    course_url: str
    image_url: str

# Define the collection model
class CourseCollection(BaseModel):
    courses: List[CourseModel]
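
To see how these models behave on extracted data, the sketch below validates one well-formed payload and one incomplete payload. The sample dictionaries are invented for illustration and do not come from any real site. Models are built with plain keyword arguments, which should work on both Pydantic v1 and v2.

```python
from typing import List

from pydantic import BaseModel, ValidationError


class CourseModel(BaseModel):
    title: str
    description: str
    instructors: List[str]
    course_url: str
    image_url: str


class CourseCollection(BaseModel):
    courses: List[CourseModel]


# Hypothetical payload shaped like the data the agent is expected to return.
good_payload = {
    "courses": [
        {
            "title": "Intro to Machine Learning",
            "description": "A first course on ML basics.",
            "instructors": ["A. Teacher"],
            "course_url": "https://example.edu/courses/ml",
            "image_url": "https://example.edu/img/ml.png",
        }
    ]
}
collection = CourseCollection(**good_payload)
print(collection.courses[0].title)

# A payload with missing fields fails validation instead of passing silently.
bad_payload = {"courses": [{"title": "Missing the other fields"}]}
try:
    CourseCollection(**bad_payload)
    validation_failed = False
except ValidationError as exc:
    validation_failed = True
    missing_fields = [error["loc"][-1] for error in exc.errors()]
    print("Invalid course data, problem fields:", missing_fields)
```

If a target site does not expose a field (for example, no instructor list), it may be easier to adjust the model than to fight validation errors.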

Update the base URL and target URL in the designated cells:

# Set your target website URLs
base_url = "https://example.edu"
target_url = "https://example.edu/courses"
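
One plausible use of a separate base URL is turning relative course links found on the page into absolute ones. The sketch below assumes that purpose; `absolutize` is a hypothetical helper, and the notebook may combine the two URLs differently.

```python
from urllib.parse import urljoin

base_url = "https://example.edu"
target_url = "https://example.edu/courses"


def absolutize(link, base=base_url):
    """Resolve a possibly relative link against the base URL (hypothetical helper)."""
    # Normalise to exactly one trailing slash so relative paths resolve from the site root.
    return urljoin(base.rstrip("/") + "/", link)


print(absolutize("/courses/ml-101"))
print(absolutize("courses/ml-101"))
print(absolutize("https://other.example.org/course"))
```

Links that are already absolute pass through unchanged, so the helper can be applied to every extracted link.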

Run the browser agent by executing the notebook cells in order
View the extracted course information in the interactive visualization table
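
As a rough illustration of how extracted courses could become a table with clickable links, the sketch below builds an HTML string with the standard library. It is not the notebook's actual visualization code: `courses_to_html` and the sample data are assumptions made for this example.

```python
from html import escape


def courses_to_html(courses):
    """Render a list of course dicts as an HTML table with linked titles.

    Hypothetical helper; expects 'title', 'course_url', and 'instructors' keys.
    """
    rows = []
    for course in courses:
        title = escape(course["title"])
        url = escape(course["course_url"], quote=True)
        instructors = escape(", ".join(course["instructors"]))
        rows.append(
            f'<tr><td><a href="{url}" target="_blank">{title}</a></td>'
            f"<td>{instructors}</td></tr>"
        )
    header = "<tr><th>Course</th><th>Instructors</th></tr>"
    return "<table>" + header + "".join(rows) + "</table>"


sample_courses = [
    {
        "title": "Intro to ML <beta>",
        "course_url": "https://example.edu/courses/ml",
        "instructors": ["A. Teacher", "B. Mentor"],
    }
]
table_html = courses_to_html(sample_courses)
print(table_html)
```

In a notebook, a string like this can be rendered with `IPython.display.HTML(table_html)`. Escaping matters here because titles and URLs come from scraped pages.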

🛠️ Troubleshooting

API Errors
- Verify API key validity
- Check internet connection
- Confirm API usage limits
Scraping Issues
- Ensure website is accessible
- Check for rate limiting or blocking
- Adjust wait times for dynamic content
Model Errors
- Verify Pydantic model matches website structure
- Check field names and types
- Ensure system prompt is properly formatted
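
For rate limiting and transient API or network failures, retrying with increasing wait times is a common remedy. The sketch below shows generic exponential backoff with jitter. `retry_with_backoff` is a hypothetical helper, not part of CourseMiner, and the attempt count and delays are arbitrary starting points.

```python
import random
import time


def retry_with_backoff(operation, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call operation() and retry on failure with exponentially growing waits.

    Hypothetical helper. Waits roughly base_delay, 2x, 4x, ... seconds between
    attempts, plus up to 0.5 s of random jitter, and re-raises the last error
    once all attempts are used up.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as exc:  # In real code, narrow this to the errors you expect.
            last_error = exc
            if attempt == attempts - 1:
                break
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            sleep(delay)
    raise last_error
```

Wrap only the call that talks to the website or the API, for example `retry_with_backoff(lambda: fetch_page(target_url))`, where `fetch_page` stands in for whatever function performs the request.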

🤝 Contributing

Contributions are welcome! Please follow these steps:

Fork the project
Create your feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

Made with ❤️ by Smaranjit Ghose
