Web Crawler
Description
This project is a Web Crawler designed to traverse the web and scrape relevant data while respecting the rules set out in "robots.txt" files.
Key Components
CrawlerController
This is the main control center for the web crawler. It coordinates the other components and manages the overall crawling process from start to finish.
ScrapedData
This class represents a single page's worth of scraped data: the URL, the page title, and the paragraph text extracted from it.
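A minimal sketch of what this data holder might look like, assuming it carries the three fields described under WebCrawler (the field names here are illustrative, not taken from the project):

```java
// Illustrative sketch of the ScrapedData value object; the actual class may
// use different field names or be a plain class with getters and setters.
public record ScrapedData(String url, String title, String paragraphText) {
}
```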
UrlDepth
The UrlDepth class encapsulates a URL along with its depth in the crawl hierarchy.
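A minimal sketch of such a holder (the real class may carry additional state):

```java
// Illustrative sketch: a URL paired with its distance from the seed page.
// Seed pages sit at depth 0, links found on them at depth 1, and so on.
public record UrlDepth(String url, int depth) {
}
```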
RobotstxtParser
This component reads and interprets "robots.txt" files. It fetches the robots.txt file for a given URL, parses the directives within it, and checks whether a given URL is allowed to be crawled under those rules.
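The sketch below shows the general idea under simplifying assumptions: it only honors "Disallow" rules in the "User-agent: *" group, and the method names (fetch, isAllowed) are illustrative rather than taken from the project.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;

// Simplified sketch: fetches robots.txt for a host and records the Disallow
// rules that apply to all user agents. The real parser likely also handles
// Allow lines, Crawl-delay, and per-agent groups.
public class RobotstxtParser {
    private final List<String> disallowedPaths = new ArrayList<>();

    public void fetch(String baseUrl) throws Exception {
        URI robotsUri = URI.create(baseUrl).resolve("/robots.txt");
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(robotsUri).GET().build();
        String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

        boolean appliesToAllAgents = false;
        for (String line : body.split("\\R")) {
            line = line.trim();
            if (line.toLowerCase().startsWith("user-agent:")) {
                appliesToAllAgents = line.substring("user-agent:".length()).trim().equals("*");
            } else if (appliesToAllAgents && line.toLowerCase().startsWith("disallow:")) {
                String path = line.substring("disallow:".length()).trim();
                if (!path.isEmpty()) {
                    disallowedPaths.add(path);
                }
            }
        }
    }

    // Returns true if no recorded Disallow rule matches the URL's path.
    public boolean isAllowed(String url) {
        String path = URI.create(url).getPath();
        return disallowedPaths.stream().noneMatch(path::startsWith);
    }
}
```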
UserAgentGenerator
This component generates random user-agent strings that are used to identify the web crawler to the servers it interacts with.
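A minimal sketch of how such a generator can work, assuming it draws from a fixed pool of strings (the pool contents and method name are illustrative; the real project may load its strings from elsewhere):

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch: returns a random user-agent string from a small pool.
public class UserAgentGenerator {
    private static final List<String> USER_AGENTS = List.of(
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
        "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0"
    );

    public String randomUserAgent() {
        return USER_AGENTS.get(ThreadLocalRandom.current().nextInt(USER_AGENTS.size()));
    }
}
```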
WebCrawler
The core class where the actual crawling happens. It manages multiple threads for efficient crawling, respects crawl delays, and stores visited links and scraped data. The scraped data includes the URL, title of the page, and paragraph text.
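The sketch below illustrates a depth-limited, multi-threaded crawl loop with a shared visited set and a politeness delay. It assumes jsoup for HTML parsing (not confirmed by this README), prints results instead of handing them to the repository, and omits the robots.txt and user-agent integration; all method names and the pool size are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Simplified, level-by-level crawl loop. The real class also consults
// RobotstxtParser, uses UserAgentGenerator, and stores ScrapedData
// in the repository instead of printing it.
public class WebCrawler {
    private final Set<String> visited = ConcurrentHashMap.newKeySet();
    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final long crawlDelayMillis;

    public WebCrawler(long crawlDelayMillis) {
        this.crawlDelayMillis = crawlDelayMillis;
    }

    public void crawl(String seedUrl, int maxDepth) throws Exception {
        List<String> frontier = List.of(seedUrl);
        for (int depth = 0; depth <= maxDepth && !frontier.isEmpty(); depth++) {
            List<Callable<List<String>>> tasks = new ArrayList<>();
            for (String url : frontier) {
                if (visited.add(url)) {            // skip URLs we have already seen
                    tasks.add(() -> fetchPage(url));
                }
            }
            List<String> nextFrontier = new ArrayList<>();
            for (Future<List<String>> result : pool.invokeAll(tasks)) {
                nextFrontier.addAll(result.get());
            }
            frontier = nextFrontier;               // links one level deeper
        }
        pool.shutdown();
    }

    // Downloads one page, reports the scraped fields, and returns its links.
    private List<String> fetchPage(String url) {
        List<String> links = new ArrayList<>();
        try {
            Thread.sleep(crawlDelayMillis);        // politeness delay between requests
            Document doc = Jsoup.connect(url).get();
            System.out.printf("%s | %s | %d chars of paragraph text%n",
                    url, doc.title(), doc.select("p").text().length());
            for (Element link : doc.select("a[href]")) {
                links.add(link.attr("abs:href"));
            }
        } catch (Exception e) {
            System.err.println("Failed to crawl " + url + ": " + e.getMessage());
        }
        return links;
    }
}
```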
ScrapedDataRepository
This component stores and manages the scraped data. It is designed to handle large amounts of data and provides methods for querying and retrieving what has been stored.
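A minimal in-memory sketch of the idea, reusing the ScrapedData record sketched above; the actual repository presumably persists to a database, and the method names here are illustrative:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative in-memory store; a real implementation would likely back
// these operations with a database.
public class ScrapedDataRepository {
    private final List<ScrapedData> store = new CopyOnWriteArrayList<>();

    public void save(ScrapedData data) {
        store.add(data);
    }

    // Example query: pages whose title contains the given keyword.
    public List<ScrapedData> findByTitleContaining(String keyword) {
        return store.stream()
                .filter(d -> d.title().toLowerCase().contains(keyword.toLowerCase()))
                .toList();
    }

    public List<ScrapedData> findAll() {
        return List.copyOf(store);
    }
}
```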