A comprehensive web crawling tool with image extraction capabilities and keyword-based content filtering. Built with Streamlit and the crawl4ai library, this application efficiently crawls websites, extracts relevant content and images based on keywords, and presents the results in an interactive web interface.
- Interactive Web UI: User-friendly Streamlit interface for configuring and running crawls
- Keyword-Based Filtering: Extract only content relevant to specified keywords
- Real-Time Progress Tracking: Live updates during the crawling process
- Image Extraction: Automatically download and categorize images from websites
- Content Analysis: Analyze keyword frequency and relevance across pages
- Highlighted Content: View extracted content with keyword highlights
- Result Organization: Structured storage of crawl results for easy access
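The keyword filtering, frequency analysis, and highlighting features listed above can be illustrated with a minimal sketch. This is not the application's actual code: the function names `keyword_counts`, `is_relevant`, and `highlight`, the whole-word matching rule, and the `**` highlight markers are all assumptions made for illustration.

```python
import re
from collections import Counter


def keyword_counts(text, keywords):
    """Count case-insensitive, whole-word occurrences of each keyword (hypothetical helper)."""
    counts = Counter()
    for kw in keywords:
        pattern = re.compile(r"\b" + re.escape(kw) + r"\b", re.IGNORECASE)
        counts[kw] = len(pattern.findall(text))
    return counts


def is_relevant(text, keywords, min_hits=1):
    """Treat a page as relevant if it contains at least min_hits keyword matches in total."""
    return sum(keyword_counts(text, keywords).values()) >= min_hits


def highlight(text, keywords):
    """Wrap each keyword match in ** markers; the marker choice is an assumption."""
    if not keywords:
        return text
    # Try longer keywords first so overlapping keywords match the longest form.
    ordered = sorted(keywords, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(r"**\1**", text)


if __name__ == "__main__":
    # Illustrative page texts, not real crawl output.
    pages = {
        "https://example.com/a": "A crawler parses HTML with python.",
        "https://example.com/b": "Recipes for bread.",
    }
    keywords = ["python", "crawler"]
    for url, text in pages.items():
        if is_relevant(text, keywords):
            print(url, dict(keyword_counts(text, keywords)))
            print(highlight(text, keywords))
```

Whole-word matching keeps the counts conservative (for example, "crawlers" does not count as a match for "crawler"); a real implementation might prefer stemming or substring matching instead.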
- Python 3.8+
- pip package manager
- Chromium browser
- Clone the repository:
git clone https://github.com/lahiruramesh/web-snapper.git
cd web-snapper
- Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install the required packages:
pip install -r requirements.txt
- Start the application:
streamlit run app.py
- Alternatively, build and run the application with Docker:
docker build -t web-snapper .
docker run -p 8501:8501 -v "$(pwd)/crawler_results:/app/crawler_results" web-snapper
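The application is served at http://localhost:8501. When started with streamlit run, it opens in your default web browser automatically; when running in Docker, open the URL manually.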