GitHub - meta4r/Offline-Website-Generator---Anti-Cloudflare---: Crawls->Scrapes->Parses an offline html version of the site.

🌐 Offline Site Generator Web Crawler + Scraper + Parser

A Python-based multithreaded web crawler that scrapes and saves web pages, with support for different crawling strategies, proxies, and user agents. It uses cloudscraper to bypass common anti-bot mechanisms and BeautifulSoup to parse HTML, so that websites can be reconstructed accurately for offline access.
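Reconstructing a site for offline use requires mapping each URL to a file on disk. The sketch below shows one way this could be done with urllib.parse and pathlib. The helper name `url_to_local_path` and the naming rules (index.html for directory-style URLs, .html for extension-less pages) are assumptions for illustration, not necessarily what main.py does.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def url_to_local_path(url: str, output_dir: str) -> Path:
    """Map a page or asset URL to a file path under output_dir (hypothetical helper)."""
    parsed = urlparse(url)
    path = parsed.path  # the query string and fragment are ignored
    if not path or path.endswith("/"):
        # Directory-style URLs ("", "/", "/docs/") are stored as index pages.
        path += "index.html"
    elif not PurePosixPath(path).suffix:
        # Extension-less pages ("/about") get .html so a browser opens them.
        path += ".html"
    # Drop empty and dot segments so the result stays inside output_dir.
    parts = [p for p in path.split("/") if p not in ("", ".", "..")]
    return Path(output_dir, parsed.netloc, *parts)


print(url_to_local_path("https://example.com/docs/", "offline_site"))
```

Keeping the host name as the first directory level means assets from different domains (for example a CDN) would not overwrite each other.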

Bypasses advanced protections and restrictions, including:

Cloudflare protection, 403 Forbidden errors, and TLS, JA3, and HTTP/2 fingerprinting.

🚀 Features

Multi-Threading
Parsing of HTML, CSS, images, PDFs, and other resources.
Offline Site Generation
Advanced Protection Bypass: Resistant to Cloudflare protection, 403 Forbidden errors, and TLS, JA3, and HTTP/2 fingerprinting.
Rotating Proxies and User Agents on every request.
Breadth-first search (BFS) and Depth-first search (DFS).
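The difference between the two crawling strategies comes down to which end of the frontier the next URL is taken from. The following is a minimal sketch on a hypothetical in-memory link graph (`LINKS`), not the project's actual traversal code; `crawl_order` is an illustrative name.

```python
from collections import deque

# Hypothetical link graph standing in for pages that would be fetched.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1"],
    "/a/1": [],
    "/a/2": [],
    "/b/1": [],
}


def crawl_order(start, strategy="bfs"):
    """Return the order in which URLs are visited for the given strategy."""
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        # BFS takes the oldest URL (queue); DFS takes the newest (stack).
        url = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(url)
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


print(crawl_order("/", "bfs"))
print(crawl_order("/", "dfs"))
```

BFS visits all pages at one depth before going deeper, which tends to suit shallow mirrors of a site; DFS follows one branch to its end first.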

📦 Requirements

Python 3
cloudscraper
BeautifulSoup (PyPI package: beautifulsoup4)
requests
urllib.parse (part of the Python standard library; no installation needed)

🛠️ Configuration

Before running the crawler, configure the following variables as needed:

START_URL: The starting URL for the crawler.
OUTPUT_DIR: The directory where crawled content will be saved.
PROXIES: A list of proxies to use during crawling. A public HTTP proxy list is available at https://raw.githubusercontent.com/TheSpeedX/SOCKS-List/master/http.txt.
USER_AGENTS: A list of user agents to randomize requests.
MAX_RETRIES: Maximum number of retries for each URL.
TIMEOUT: Timeout for HTTP requests in seconds.
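As an illustration, the variables above might be set as follows. Every value here is a placeholder rather than a project default, and `request_options` is a hypothetical helper showing how a proxy and user agent could be rotated per request. The returned keys match the keyword arguments accepted by `requests` (and therefore by a cloudscraper session, which builds on it).

```python
import random

# Placeholder values only; replace them with your own.
START_URL = "https://example.com"
OUTPUT_DIR = "offline_site"
PROXIES = [
    "http://203.0.113.10:8080",  # documentation-range addresses, not real proxies
    "http://203.0.113.11:3128",
]
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:126.0) Gecko/20100101 Firefox/126.0",
]
MAX_RETRIES = 3
TIMEOUT = 10  # seconds


def request_options():
    """Pick a random proxy and user agent for a single request (hypothetical helper)."""
    proxy = random.choice(PROXIES)
    return {
        "headers": {"User-Agent": random.choice(USER_AGENTS)},
        "proxies": {"http": proxy, "https": proxy},
        "timeout": TIMEOUT,
    }


print(request_options())
```

The resulting dictionary could then be unpacked into a call such as `session.get(url, **request_options())`, assuming `session` is a requests-compatible session object.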

🔄 Customization

Max Depth: The maximum depth to crawl can be adjusted by changing the max_depth parameter in the WebCrawler class instantiation.
Max Threads: The number of threads used for crawling can be set by adjusting the max_threads parameter.
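To show what these two parameters typically control, here is a minimal sketch of a depth-limited crawl that fetches each level with a thread pool. It does not reproduce the WebCrawler class itself, whose internals are not shown here; `crawl`, `fetch_links`, and the `LINKS` graph are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical link graph standing in for real pages.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1"],
}


def fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return LINKS.get(url, [])


def crawl(start, max_depth=2, max_threads=4):
    """Visit pages level by level, at most max_depth links away from start."""
    seen = {start}
    level = [start]
    visited = []
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        for _depth in range(max_depth + 1):
            if not level:
                break
            visited.extend(level)
            next_level = []
            # Pages at the same depth are fetched concurrently; map keeps input order.
            for links in pool.map(fetch_links, level):
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        next_level.append(link)
            # Links found on pages at max_depth are collected but never visited.
            level = next_level
    return visited


print(crawl("/", max_depth=1, max_threads=2))
```

A larger max_threads value speeds up each level but also increases the request rate seen by the target site and by the proxies.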

📜 Logging

The crawler uses Python's built-in logging module to log events. Logs are printed to the console with timestamps and log levels (INFO, WARNING, ERROR).
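A console logging setup matching this description could look like the sketch below. The exact format string and the logger name "crawler" are assumptions; the script may use different ones.

```python
import logging

# Assumed format: timestamp, level, message.
LOG_FORMAT = "%(asctime)s - %(levelname)s - %(message)s"

logging.basicConfig(level=logging.INFO, format=LOG_FORMAT)
logger = logging.getLogger("crawler")  # hypothetical logger name

logger.info("Crawling %s", "https://example.com")
logger.warning("Retrying %s", "https://example.com/about")
logger.error("Giving up on %s", "https://example.com/missing")
```

By default `basicConfig` attaches a handler that writes to the console (standard error); passing `filename=` instead would send the same records to a file.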

🔒 Error Handling

The script retries each URL up to MAX_RETRIES times.
Uses exception handling to catch and log HTTP-related errors.
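The retry behaviour described above can be sketched as a small loop. This is an illustration under assumptions: `fetch_with_retries` and its `fetch` callable are hypothetical, and the sketch catches `OSError`, which is the base class of `requests.exceptions.RequestException` as well as of the built-in `ConnectionError` and `TimeoutError`.

```python
import logging
import time

logger = logging.getLogger("crawler")  # hypothetical logger name
MAX_RETRIES = 3  # placeholder value


def fetch_with_retries(fetch, url, max_retries=MAX_RETRIES, delay=0.0):
    """Call fetch(url) up to max_retries times; return None if every attempt fails."""
    for attempt in range(1, max_retries + 1):
        try:
            return fetch(url)
        except OSError as exc:
            # Log the failure and try again unless this was the last attempt.
            logger.warning("Attempt %d/%d failed for %s: %s", attempt, max_retries, url, exc)
            if attempt < max_retries:
                time.sleep(delay)
    logger.error("Giving up on %s after %d attempts", url, max_retries)
    return None
```

In a real crawl, `fetch` would wrap the HTTP call, and a non-zero `delay` (possibly combined with a new proxy on each attempt) would give a blocked or overloaded server time to recover.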

📝 License

This project is licensed under the MIT License.
