A Python script that checks URL availability while respecting robots.txt
rules, designed for ADW developers who need to verify the accessibility of their web content.
- Checks multiple URLs for availability
- Respects `robots.txt` rules before crawling
- Custom user-agent support (ADW-bot)
- Detailed status reporting for each URL
- Error handling for network issues
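bot.py itself isn't reproduced in this README; as a rough sketch (the session setup, function name, and messages below are illustrative, not code taken from the script), the custom user-agent and network error handling could look like this with `requests`:

```python
import requests

# Every request the checker sends identifies itself as ADW-bot.
session = requests.Session()
session.headers.update({"User-Agent": "ADW-bot"})

def check(url):
    """Report whether a URL is reachable, without letting one failure stop the run."""
    try:
        response = session.get(url, timeout=10)
        response.raise_for_status()
        print(f"{url} is reachable (status {response.status_code})")
    except requests.exceptions.RequestException as exc:
        # Covers DNS errors, refused connections, timeouts, and HTTP error statuses.
        print(f"{url} could not be reached: {exc}")
```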
Requirements:
- Python 3.x
- The `requests` library
- Clone this repository or download the script
- Install the required dependencies: `pip install requests`
- Edit the `URLchecks` list in the script to include the URLs you want to check (a minimal example follows these steps)
- Run the script: `python bot.py`
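For reference, a minimal `URLchecks` list might look like the following; the variable name is the one the script uses, but the URLs here are just the ones from the sample output further down:

```python
# Replace this list wholesale with the URLs you want checked.
URLchecks = [
    "https://adw-development.github.io/teddy.html",
    "https://adw-development.github.io/",
]
```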
The script will:
- Attempt to find and parse robots.txt for each domain
- Check if the ADW-bot is allowed to crawl each URL
- Report whether each URL is accessible or returns an error
- Wait 10 seconds between checks (adjustable in the code)
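As a sketch of that flow (not the actual contents of bot.py): this version uses the standard-library `urllib.robotparser` and looks for `robots.txt` at the site root, whereas the sample output below shows the real script finding it under `/public/`, so treat the details and function names as assumptions:

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

import requests

USER_AGENT = "ADW-bot"
SLEEP_SECONDS = 10  # pause between checks, as noted above

def is_allowed(url):
    """Fetch the site's robots.txt and ask whether ADW-bot may crawl this URL."""
    parts = urlparse(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # downloads and parses robots.txt
    return parser.can_fetch(USER_AGENT, url)

def run_checks(urls):
    print("Conducting URL checks...")
    for url in urls:
        if is_allowed(url):
            print(f"ADW-bot is allowed to crawl {url}")
            try:
                response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
                response.raise_for_status()
                print(f"The URL: {url} was successfully reached.")
            except requests.exceptions.RequestException as exc:
                print(f"Error while reaching {url}: {exc}")
        else:
            print(f"ADW-bot is not allowed to crawl {url}")
        time.sleep(SLEEP_SECONDS)  # wait between checks
```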
Example output:

Conducting URL checks...
robots.txt found at https://adw-development.github.io/public/robots.txt
ADW-bot is allowed to crawl https://adw-development.github.io/teddy.html
The URL: https://adw-development.github.io/teddy.html was successfully reached. no errors while doing so.
robots.txt found at https://adw-development.github.io/public/robots.txt
ADW-bot is allowed to crawl https://adw-development.github.io/
The URL: https://adw-development.github.io/ was successfully reached. no errors while doing so.
- Modify the sleep time at the end of the script if you want a shorter or longer delay between checks
- Add additional URLs to the `URLchecks` list (the list is replaced wholesale, so include every URL you want checked); a sketch of both changes follows
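Both changes are small; roughly, they might sit in the script like this (names other than `URLchecks` are made up for illustration):

```python
import time

# 1. The URLs to check: replace the whole list with your own entries.
URLchecks = [
    "https://adw-development.github.io/",
    "https://example.com/new-page.html",  # hypothetical extra URL to check
]

# 2. The delay between checks: the script ships with a 10-second pause.
DELAY_SECONDS = 5

for url in URLchecks:
    ...  # robots.txt check and fetch for this URL, as sketched earlier
    time.sleep(DELAY_SECONDS)
```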