Excel API Web Scraper is a Python package that automates scraping, downloading, and storing Excel files from NYC InfoHub. It features a modular, object-oriented design with built-in security checks (virus scanning and MIME-type validation) for downloaded Excel files, and an option to skip antivirus scans on Windows when ClamAV isn't readily available.
- Asynchronous HTTP/2 downloads via `httpx.AsyncClient`
- Recursive subpage discovery with Selenium
- Parallel CPU-bound hashing with `ProcessPoolExecutor`
- Detailed logging with a rotating file handler
- Progress tracking via `tqdm`
- `SecurityManager` for optional virus scanning (ClamAV) and file-type checks
- Skips ClamAV scanning on Windows by default, avoiding setup complexities while still functioning seamlessly
- Web Scraping with Selenium: Automates loading InfoHub pages (and subpages) in a headless Chrome browser to discover Excel file links.
- Retries for Slow Connections: Uses `tenacity` to retry downloads when timeouts or transient errors occur.
- Sub-Page Recursion: Follows a regex-based pattern to find and crawl subpages (e.g., graduation results, attendance data).
- HTTP/2 Async Downloads: Downloads Excel files using `httpx` in streaming mode, allowing concurrent I/O while efficiently handling large files.
- Year Filtering: Only keeps Excel files that have at least one year >= 2018 in the link (skips older or irrelevant data).
- Parallel Hashing: Uses `ProcessPoolExecutor` to compute SHA-256 hashes in parallel, fully utilizing multi-core CPUs without blocking the async loop (see the sketch after this list).
- Security Checks: In-memory virus scanning with ClamAV (via `pyclamd` or `clamd`); MIME-type validation with `python-magic`, ensuring files are truly Excel; scanning is skipped on Windows by default (see below).
- Prevents Redundant Downloads: Compares new file hashes with stored hashes; downloads only if the file has changed.
- Progress & Logging: `tqdm` progress bars during downloads and hashing; detailed logs in `logs/excel_fetch.log` (rotated at 5 MB, up to 2 backups).
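As a rough illustration of the hashing and change-detection features above, here is a minimal sketch of parallel SHA-256 hashing plus the hash-comparison step. The function names, example paths, and `hashes/` layout are assumptions for illustration, not the package's exact API:

```python
import hashlib
import os
from concurrent.futures import ProcessPoolExecutor


def sha256_of_file(path: str) -> str:
    """Compute a SHA-256 hex digest by streaming the file in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_changed(path: str, new_hash: str, hash_dir: str = "hashes") -> bool:
    """Compare the new hash against the stored .hash file; update it if different."""
    hash_path = os.path.join(hash_dir, os.path.basename(path) + ".hash")
    if os.path.exists(hash_path):
        with open(hash_path) as f:
            if f.read().strip() == new_hash:
                return False  # unchanged -> caller can skip re-saving
    os.makedirs(hash_dir, exist_ok=True)
    with open(hash_path, "w") as f:
        f.write(new_hash)
    return True


if __name__ == "__main__":
    # Illustrative paths; only hash files that actually exist
    paths = [p for p in ["data/graduation/example.xlsx"] if os.path.exists(p)]
    # Hash files on multiple CPU cores without blocking the async download loop
    with ProcessPoolExecutor() as pool:
        for path, digest in zip(paths, pool.map(sha256_of_file, paths)):
            print(path, "changed" if file_changed(path, digest) else "unchanged")
```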
By default, ClamAV scanning is not performed on Windows to avoid environment complexities; ClamAV is primarily a Linux/UNIX daemon. The `SecurityManager` class checks `platform.system()` and, if it returns `Windows`, short-circuits scanning and reports a clean status. This behavior can be overridden by setting:

`security_manager = SecurityManager(skip_windows_scan=False)`

...in which case the code will attempt a normal ClamAV call. To make this work on Windows, you'd typically need:
- WSL or Docker to run the ClamAV daemon, or
- Another setup that exposes a ClamAV socket/port for Python to connect to.
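A minimal sketch of what this platform check and override might look like, assuming the `clamd` library for the actual scan (the class internals here are illustrative, not the project's exact implementation):

```python
import io
import platform


class SecurityManager:
    """Illustrative sketch: skip ClamAV on Windows unless told otherwise."""

    def __init__(self, skip_windows_scan: bool = True):
        self.skip_windows_scan = skip_windows_scan

    def scan_for_viruses(self, data: bytes) -> bool:
        # Short-circuit on Windows: report a clean status without scanning.
        if self.skip_windows_scan and platform.system() == "Windows":
            return True
        import clamd  # requires a reachable ClamAV daemon (e.g., via WSL or Docker)
        scanner = clamd.ClamdUnixSocket()  # or clamd.ClamdNetworkSocket(host, port)
        status, signature = scanner.instream(io.BytesIO(data))["stream"]
        return status == "OK"


# Force a real ClamAV call even on Windows (needs a daemon socket/port exposed):
security_manager = SecurityManager(skip_windows_scan=False)
```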
Version: 1.0.8
A Python package for scraping and downloading Excel datasets from NYC InfoHub using Selenium, httpx, asyncio, and virus/MIME validation is available on PyPI:

`pip install nyc_infohub_excel_api_access`

Installing this package gives you access to the CLI tool `nyc-infohub-scraper`, which launches the scraper pipeline from the terminal with a single command:

`nyc-infohub-scraper`
- Python 3.8 or higher
- ChromeDriver (installed and in your PATH for Selenium)
- ClamAV daemon (optional but recommended for virus scanning):
  - On Linux/WSL, install via `sudo apt-get install clamav clamav-daemon`.
  - On Windows, you may run ClamAV in WSL or a Docker container, or skip virus scanning.
Install required packages:
pip install -r requirements.txt
Dependencies:

- `httpx[http2]`: For performing asynchronous HTTP requests with HTTP/2 support
- `tenacity`: For retrying failed downloads
- `selenium`: For web scraping
- `pandas`: For processing Excel files (optional)
- `tqdm`: To display download progress
- `concurrent.futures`: For parallel hashing (standard library; no install required)
- `openpyxl`, `pyxlsb`, `xlrd`: For handling different Excel file types
- `pytest`, `pytest-asyncio`, `pytest-cov`: For module testing
- `clamd` or `pyclamd`: For ClamAV virus scanning (if enabled)
- `python-magic`: For MIME-type checks
This project supports two primary scheduling methods:
- WSL / Linux Cron Jobs
- Windows Task Scheduler
Depending on where you prefer to run the scraper, you can pick one or both. Each method automates running the script on a specified interval (e.g., daily, weekly, monthly).
- Why Use Cron: If you're working in a Linux-like environment (including WSL), cron allows you to schedule commands at fixed times, dates, or intervals without any additional GUIs.
- Setup:
  1. Create a WSL-based virtual environment (e.g., `venv_wsl`) and install project dependencies (including `selenium`, `httpx`, etc.).
  2. Edit your cron table: `crontab -e`
  3. Add a line for your desired schedule, for example, to run every Sunday at 2:30 AM:
     `30 2 * * 0 /path/to/project/venv_wsl/bin/python /path/to/project/src/main.py >> /path/to/logs/cron_crawl.log 2>&1`
  4. Confirm that WSL remains active or open; if WSL is shut down, cron will not run.
- Advantages:
  - Well-known, lightweight system utility.
  - Great if you're primarily in Linux/WSL and comfortable with Bash.
- Caveats:
  - WSL suspends when no console is running, so if you close WSL entirely, cron won't run.
  - If you need a 24/7 environment, consider a real Linux server or keep WSL open.
- Why Use Task Scheduler: On a Windows system, Task Scheduler is the built-in service for scheduling tasks while the machine is on, without needing WSL open.
- Setup:
  1. Create a Windows-based virtual environment (e.g., `venv_win`) in your project folder:
     - `cd C:\path\to\project`
     - `python -m venv venv_win`
     - `.\venv_win\Scripts\activate`
     - `pip install -r requirements.txt`
  2. Open Task Scheduler (`taskschd.msc` from Run or Start).
  3. Create a Basic Task:
     - Name: e.g., "NYC InfoHub Scraper (Weekly)."
     - Trigger: Weekly (pick day/time).
     - Action: "Start a program."
     - Program/Script: `C:\path\to\project\venv_win\Scripts\python.exe`
     - Arguments: `C:\path\to\project\src\main.py`
  4. Finish & test by right-clicking the task and selecting "Run."
- Advantages:
  - Runs natively in Windows, even if WSL is closed.
  - Easy to configure via a GUI.
- Caveats:
  - If your environment or script references Linux paths, adapt them to Windows.
  - For storing logs, you can rely on Python's rotating file handler or redirect output to a custom file (e.g., `task_crawl.log`).
```
project_root/
│
├── .github               # Workflow CI/CD integration
├── .gitignore            # Ignore logs, venv, data, and cache files
├── .env                  # Environment variables (excluded from version control)
├── README.md             # Project documentation
├── requirements.txt      # Project dependencies
├── setup.py              # Project packaging file
├── pyproject.toml        # Specify build system requirements
├── LICENSE               # License file
│
├── venv_wsl/             # WSL virtual environment (ignored by version control)
├── venv_win/             # Windows virtual environment (ignored by version control)
│
├── src/
│   ├── __init__.py       # Package initializer
│   ├── main.py           # Main scraper script
│   └── excel_scraper.py  # Web scraping module
│
├── logs/                 # Directory for log files
│
├── tests/                # Directory for unit, integration, and end-to-end testing
│
├── data/                 # Directory for downloaded Excel files
│   ├── graduation/
│   ├── attendance/
│   ├── demographics/
│   ├── test_results/
│   └── other_reports/
│
└── hashes/               # Directory for storing file hashes
```
The structure is well-organized for both manual execution and packaging as a Python module.
- Run the script to scrape and fetch new datasets:

  `python -m src.main`

  This treats `src` as a package and executes `main.py` accordingly (avoiding relative import errors).

- View logs for download status and debugging:

  `tail -f logs/excel_fetch.log`
- Subpage Discovery: The scraper uses a regex (`SUB_PAGE_PATTERN`) to find subpages like graduation-results, school-quality, etc.
- Filtered Excel Links: Each subpage is loaded in Selenium; `<a>` tags ending with `.xls`/`.xlsx`/`.xlsb` are collected, then further filtered out if they do not contain a relevant year (>= 2018).
- Async Streaming Download: Downloads use `httpx.AsyncClient(http2=True)` to fetch files in parallel. A progress bar (`tqdm`) shows how many files are in flight (see the sketch after this list).
- Security Checks: In-memory virus scanning (ClamAV) via `SecurityManager.scan_for_viruses`, plus MIME-type validation (ensuring a `.xls`/`.xlsx` file is truly Excel).
- Parallel Hashing: Each downloaded file's hash is computed using a `ProcessPoolExecutor` so multiple CPU cores can do the work simultaneously.
- Save if Changed: If the file's new hash differs from the previously stored one, the file is saved and the `.hash` file updated.
- Logs: The rotating log captures successes, skips, errors, etc.
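A condensed sketch of the streaming-download step referenced above. The URLs, year regex, timeout, and file naming are illustrative assumptions; the project's actual code differs in detail:

```python
import asyncio
import re

import httpx
from tqdm.asyncio import tqdm_asyncio

# Keep only links that mention a four-digit year >= 2018 (illustrative pattern)
YEAR_PATTERN = re.compile(r"20\d{2}")


async def download(client: httpx.AsyncClient, url: str, dest: str) -> None:
    """Stream one file to disk without loading it fully into memory."""
    async with client.stream("GET", url) as response:
        response.raise_for_status()
        with open(dest, "wb") as f:
            async for chunk in response.aiter_bytes():
                f.write(chunk)


async def fetch_all(urls):
    urls = [u for u in urls if any(int(y) >= 2018 for y in YEAR_PATTERN.findall(u))]
    async with httpx.AsyncClient(http2=True, timeout=60) as client:
        tasks = [download(client, u, u.rsplit("/", 1)[-1]) for u in urls]
        await tqdm_asyncio.gather(*tasks, desc="Downloading")


# asyncio.run(fetch_all(["https://example.org/reports/graduation-results-2023.xlsx"]))
```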
- By default, if ClamAV is installed and `SecurityManager` is configured, files are scanned in memory before saving.
- If you run on Windows without ClamAV, you can disable scanning by returning `"OK"` or `True` in `scan_for_viruses` or by using a stub.
- Make sure `StreamMaxLength` and other ClamAV config values (in `/etc/clamav/clamd.conf`) are large enough to handle your file sizes.
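Alongside the in-memory ClamAV scan, the MIME-type validation mentioned earlier can be pictured roughly as follows, using `python-magic`. The accepted MIME set and helper name are assumptions and may need adjusting:

```python
import magic  # python-magic

# Illustrative set of MIME types accepted as "truly Excel"
EXCEL_MIME_TYPES = {
    "application/vnd.ms-excel",                                           # .xls
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",  # .xlsx
    "application/vnd.ms-excel.sheet.binary.macroEnabled.12",              # .xlsb
}


def looks_like_excel(data: bytes) -> bool:
    """Check the file's real MIME type from its bytes, not its extension."""
    return magic.from_buffer(data, mime=True) in EXCEL_MIME_TYPES
```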
- Includes:
  - Log file: `logs/excel_fetch.log`
  - Rotating file handler: rolls over at 5 MB, retains 2 backups.
  - Console output: also mirrors log messages for convenience.
  - Progress bars: `tqdm` for both downloading and hashing steps.
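A minimal sketch of the rotating-log setup described above (5 MB rollover, 2 backups); the logger name and format string are illustrative, not the project's exact configuration:

```python
import logging
import os
from logging.handlers import RotatingFileHandler

os.makedirs("logs", exist_ok=True)

logger = logging.getLogger("excel_scraper")
logger.setLevel(logging.INFO)

formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")

# Roll over at 5 MB and keep 2 backups (excel_fetch.log.1, excel_fetch.log.2)
file_handler = RotatingFileHandler(
    "logs/excel_fetch.log", maxBytes=5 * 1024 * 1024, backupCount=2
)
file_handler.setFormatter(formatter)
logger.addHandler(file_handler)

# Mirror log messages to the console as well
console_handler = logging.StreamHandler()
console_handler.setFormatter(formatter)
logger.addHandler(console_handler)
```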
We use Pytest for our test suite, located in the `tests/` folder.
- Install dev/test dependencies (either via your `setup.py` or via `pip install -r requirements.txt` if you listed them there).
- Run tests:

  `python -m pytest tests/`

- View coverage (if you have `pytest-cov`):

  `python -m pytest tests/ --cov=src`
A GitHub Actions workflow is set up in `.github/workflows/ci-cd.yml`. It:

- Builds and tests the project on push or pull request to the `main` branch.
- If tests pass and you push a tagged release, builds a distribution and can upload it to PyPI using Twine.
- Check the Actions tab on your repo to see logs and statuses of each workflow run.
- Connection Pooling: Addressed by a persistent `httpx.AsyncClient`.
- Redundant Downloads: Prevented by storing file hashes and only updating on changes.
- Virus Scan Overhead: In-memory scanning might add overhead, but ensures security.
- Virus Scan Failures: If ClamAV is unavailable or fails (e.g., due to socket errors or size limits), the scraper falls back to MIME-type validation for Excel files instead of discarding them.
- Fallback Traceability: All skipped or MIME-only approved files are logged in `quarantine.log` with timestamp, reason, MIME type, and file size for audit and debugging.
- Size Limit Errors: If you see "INSTREAM: Size limit reached" warnings, increase `StreamMaxLength` in `clamd.conf`.
- Windows Skipping: If you can't run ClamAV natively, the skip mechanism means the scraper still works without throwing errors.
- HTTP/2 Support: If the server lacks HTTP/2, httpx falls back to HTTP/1.1 automatically.
- Depth Control: Subpage recursion is limited (depth=1 by default). Extend it if deeper crawling is needed.
- Year Parsing: Skips links that don't contain a straightforward four-digit year (>= 2018). Update the regex for more complex date formats.
- ClamAV on Windows: Typically requires WSL, Docker, or disabling the scanning code.
- Custom Retry Logic: For advanced back-off patterns, modify the `tenacity` settings or override them in your subclass (see the sketch below).
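As a rough example of such a back-off tweak, here is a hedged `tenacity` decorator; the exception types, wait bounds, and attempt limit are assumptions, not the project's defaults:

```python
import httpx
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential


@retry(
    retry=retry_if_exception_type((httpx.TimeoutException, httpx.TransportError)),
    wait=wait_exponential(multiplier=1, min=2, max=30),  # 2s, 4s, 8s, ... capped at 30s
    stop=stop_after_attempt(5),
    reraise=True,
)
async def fetch_with_retries(client: httpx.AsyncClient, url: str) -> bytes:
    """Retry transient failures with exponential back-off before giving up."""
    response = await client.get(url)
    response.raise_for_status()
    return response.content
```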
- Email Notifications: Notify users when a new dataset is fetched.
- Database Integration: Store metadata in a database for better tracking.
- More Robust Exception Handling: Log specific error types or integrate external alerting.
- Integrating BeautifulSoup: Incorporate `BeautifulSoup` instead of Selenium for a faster, lighter-weight setup for raw web scraping.
This project is licensed under the MIT License. See the LICENSE file for details.
Developed by Dylan Picart at Partnership With Children.
For questions or contributions, contact: [email protected].