A web crawler that uses WebDriver to extract and parse HTML content from web pages with intelligent duplicate detection and template pattern recognition.
- 🌐 Multi-URL Crawling: Crawl multiple URLs in a single session
- 🔍 Intelligent Duplicate Detection: Automatically identifies and filters duplicate content patterns across domains
- 📋 Template Pattern Recognition: Detects variable patterns in content (e.g., "42 comments" → "{count} comments")
- 🌳 Structured HTML Tree: Provides a filtered HTML tree view with duplicates marked
- ⚡ WebDriver Integration: Uses WebDriver to handle dynamically rendered content
- 📊 Verbose Output: Detailed HTML tree analysis with filtering information
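
SmartCrawler's internals are not shown here, but the core idea behind template pattern recognition can be sketched in a few lines of Python: replace variable tokens (here, just numbers) with a placeholder so that strings differing only in those tokens collapse to one template. The function name `to_template` is illustrative, not part of SmartCrawler's API:

```python
import re

def to_template(text: str) -> str:
    """Replace numeric tokens with a {count} placeholder so that
    strings differing only in numbers map to the same template."""
    return re.sub(r"\d+", "{count}", text)

# "42 comments" and "7 comments" collapse to the same template,
# so they can be treated as two instances of one pattern.
assert to_template("42 comments") == to_template("7 comments")
```

A real implementation would likely normalize more token classes (dates, usernames, IDs), but the principle is the same: compare templates, not raw strings, when deciding what counts as duplicate content.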
 
- Install SmartCrawler - Download from releases or build from source
- Set up WebDriver - Install Firefox or Chrome and the corresponding WebDriver
- Start crawling - Run SmartCrawler with your target URLs
 
```sh
# Basic usage
smart-crawler --link "https://example.com"

# Multiple URLs with verbose output
smart-crawler --link "https://example.com" --link "https://another.com" --verbose

# Template detection mode
smart-crawler --link "https://example.com" --template --verbose
```

Choose your operating system for detailed setup instructions:
- Windows Setup - Complete Windows installation guide
- macOS Setup - macOS installation and setup
- Linux Setup - Linux installation for various distributions
 
- CLI Options - Complete command-line reference and examples
 
- Development Guide - Setup, building, testing, and contributing instructions
 
- Operating System: Windows 10+, macOS 10.15+, or Linux
- Browser: Firefox (recommended) or Chrome
- WebDriver: GeckoDriver (Firefox) or ChromeDriver (Chrome)
- Memory: 512MB RAM minimum, 1GB recommended
 
```sh
# Crawl a single URL
smart-crawler --link "https://example.com"

# Crawl with detailed output
smart-crawler --link "https://example.com" --verbose

# Template detection mode
smart-crawler --link "https://example.com" --template --verbose

# Multiple URLs
smart-crawler --link "https://site1.com" --link "https://site2.com"
```

```sh
# Start Firefox WebDriver
geckodriver --port 4444

# Start Chrome WebDriver
chromedriver --port=4444
```

For developers interested in contributing to SmartCrawler or building from source:
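
Both geckodriver and chromedriver implement the W3C WebDriver `GET /status` endpoint, so you can confirm the driver is accepting sessions before starting a crawl. The helper below is a minimal sketch using only the Python standard library (`webdriver_ready` is a hypothetical name, not part of SmartCrawler):

```python
import json
import urllib.request

def webdriver_ready(port: int = 4444) -> bool:
    """Return True if a WebDriver server on localhost reports
    it is ready to accept a new session (W3C /status endpoint)."""
    url = f"http://127.0.0.1:{port}/status"
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            status = json.load(resp)
        # Per the WebDriver spec, the body is {"value": {"ready": true, ...}}
        return bool(status.get("value", {}).get("ready", False))
    except OSError:
        # Connection refused or timeout: no driver listening on this port
        return False
```

If this returns `False`, start geckodriver or chromedriver on port 4444 as shown above before running SmartCrawler.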
- Development Guide - Complete setup, building, testing, and contributing instructions
 
This project is licensed under the GPL-3.0 license - see the LICENSE file for details.
If you encounter issues:

- Check the getting started guides for your operating system
- Review the CLI options documentation
- Search existing GitHub issues
- Create a new issue with detailed error information
 
Note: SmartCrawler is designed for ethical web scraping and research purposes. Always respect websites' robots.txt files and terms of service.