A web-based crawler that extracts useful data from websites using Scrapy, with a Flask interface for real-time control and display.
Manual data collection is time-consuming and error-prone. As more decisions depend on data, this project offers a way to crawl websites and extract structured content through an easy-to-use web interface, combining automation with usability.
This project combines:
- Scrapy for crawling and data extraction
- Flask for user interaction
- Python threading and subprocess to run crawls in the background without blocking the web interface
It demonstrates techniques with practical applications in search engines, research, and analytics.
- Users input a URL in the Flask UI.
- Scrapy crawls and extracts:
  - Titles, body text, links, image URLs
  - Word/image counts
- Data is saved in JSON and shown on the web interface.
- The crawler can be started, stopped, and reset from the UI.
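
The extract-and-save steps above could look roughly like the sketch below. It is not the project's actual code: `build_record`, `append_record`, the field names, and the `crawled_at` timestamp key are all hypothetical, and the sketch uses only the standard library. In a Scrapy spider, the title, text, links, and image URLs would typically come from selectors on the response, which is omitted here.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def build_record(url, title, body_text, links, image_urls):
    """Assemble one page record with the extracted fields and their counts.

    Field names are assumptions for illustration, not the project's schema.
    """
    links = list(links)
    image_urls = list(image_urls)
    return {
        "url": url,
        "title": title,
        "body_text": body_text,
        "links": links,
        "image_urls": image_urls,
        "word_count": len(body_text.split()),  # whitespace-separated words
        "image_count": len(image_urls),
        "crawled_at": datetime.now(timezone.utc).isoformat(),
    }


def append_record(path, record):
    """Append a record to a JSON file that holds a list of records.

    Returns the number of records stored after the append.
    """
    path = Path(path)
    if path.exists():
        records = json.loads(path.read_text(encoding="utf-8"))
    else:
        records = []
    records.append(record)
    path.write_text(json.dumps(records, indent=2), encoding="utf-8")
    return len(records)
```

Rewriting the whole file on every append keeps the output a valid JSON list that the web interface can load directly. For large crawls, one JSON object per line would likely scale better.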
- Web Crawling (Scrapy) with robots.txt compliance
- Content Extraction: text, links, images
- Flask Interface: start/stop controls, real-time data view
- Threaded Execution for non-blocking performance
- Subprocess Control with safe termination
- JSON Storage with timestamps and structure
- Logging & Error Handling
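
As an illustration of the subprocess control and safe termination listed above, here is a minimal sketch. The `CrawlerController` class and its method names are hypothetical, and the command passed in (for example `["scrapy", "crawl", "<spider name>"]`) is an assumption about how the crawler is launched. `subprocess.Popen` returns immediately, so a Flask request handler calling `start()` would not block. The lock guards against two requests starting or stopping the crawler at the same time.

```python
import subprocess
import sys
import threading


class CrawlerController:
    """Start and stop a single crawler subprocess (hypothetical helper)."""

    def __init__(self, command):
        # command is an argument list, e.g. ["scrapy", "crawl", "<spider name>"]
        self.command = command
        self._process = None
        self._lock = threading.Lock()

    def is_running(self):
        with self._lock:
            return self._process is not None and self._process.poll() is None

    def start(self):
        """Launch the crawler; return False if one is already running."""
        with self._lock:
            if self._process is not None and self._process.poll() is None:
                return False
            self._process = subprocess.Popen(self.command)
            return True

    def stop(self, timeout=5.0):
        """Ask the crawler to exit, force-killing it if it ignores the request."""
        with self._lock:
            process = self._process
            self._process = None
        if process is None or process.poll() is not None:
            return False
        process.terminate()  # polite request first
        try:
            process.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            process.kill()  # escalate after the grace period
            process.wait()
        return True
```

A stand-in command such as `CrawlerController([sys.executable, "-c", "import time; time.sleep(30)"])` exercises the controller without Scrapy. A reset action could call `stop()` and then clear the saved JSON output.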
- Technical: Built with open-source tools (Python, Scrapy, Flask)
- Economic: No licensing cost; runs on standard hardware
Hardware:
- PC/Laptop
- 4GB RAM minimum
- Internet access
Software:
- Python 3.8+
- Flask
- Scrapy
- JSON viewer/editor
- Google Chrome (for testing)