Refactors codebase to enhance modularity and maintainability
Introduces an object-oriented architecture by implementing distinct classes for file management and scraping logic.
Improves code organization, making it easier to maintain and test while ensuring a clean separation of concerns.
Updates logging and error handling practices for better debugging and user feedback.
Relates to ongoing efforts for improved project structure.
README.md (+36 −18)
## Description
**Excel API Web Scraper** is a Python-based project that automates the process of web scraping, downloading, and storing Excel files from the NYC InfoHub website. It employs an **object-oriented architecture**—splitting functionality across multiple classes to improve maintainability and testability:
- **`NYCInfoHubScraper`** (subclass of `BaseScraper`) provides specialized logic for discovering and filtering NYC InfoHub Excel links.
- **`BaseScraper`** handles core scraping concerns such as Selenium setup, asynchronous HTTP downloads, concurrency, and hashing.
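A minimal sketch of how these classes might relate (only the class names come from this project; the methods and signatures shown are illustrative assumptions, not copied from the repository):

```python
# Hypothetical skeleton of the refactored architecture; method names are
# illustrative, not taken from excel_scraper.py.
class FileManager:
    """File-system concerns: save paths, stored hashes, change detection."""
    def save_if_changed(self, path: str, content: bytes, new_hash: str) -> bool: ...

class BaseScraper:
    """Shared machinery: Selenium setup, async HTTP downloads, hashing."""
    def __init__(self, file_manager: FileManager):
        self.files = file_manager
    async def download(self, url: str) -> bytes: ...
    def compute_hash(self, content: bytes) -> str: ...

class NYCInfoHubScraper(BaseScraper):
    """Site-specific logic: which InfoHub pages to crawl, which links to keep."""
    def discover_excel_links(self) -> list[str]: ...
```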
This version features:
- **Parallel CPU-bound hashing** with `ProcessPoolExecutor`
- **Detailed logging** with a rotating file handler
- **Progress tracking** via `tqdm`
- **Clean separation** of concerns thanks to the new classes (`FileManager`, `NYCInfoHubScraper`, etc.)
---
## Features
- **Web Scraping with Selenium**
  Automatically loads InfoHub pages (and sub-pages) in a headless Chrome browser to discover Excel file links (see the discovery sketch after this list).

- **Retries for Slow Connections**
  Uses `tenacity` to retry downloads when timeouts or transient errors occur.

- **Sub-Page Recursion**
  Uses a regex-based pattern to find and crawl subpages (e.g., graduation results, attendance data).

- **HTTP/2 Async Downloads**
  Downloads Excel files using `httpx` in **streaming mode**, allowing concurrent IO while efficiently handling large files (see the download sketch after this list).

- **Year Filtering**
  Only keeps Excel files that have at least one year >= 2018 in the link (skips older or irrelevant data), as sketched after this list.

- **Parallel Hashing**
  Uses `ProcessPoolExecutor` to compute SHA-256 hashes in parallel, fully utilizing multi-core CPUs without blocking the async loop.

- **Prevents Redundant Downloads**
  Compares new file hashes with stored hashes; downloads only if the file has changed (see the hashing sketch after this list).

- **Progress & Logging**
  Progress bars via `tqdm` for both downloads and hashing. Detailed logs to `logs/excel_fetch.log` (rotated at 5MB, up to 2 backups), as sketched after this list.

- **Refactored OOP Architecture**
  With the introduction of `FileManager` for file operations, `BaseScraper` for shared scraping logic, and `NYCInfoHubScraper` for specialized InfoHub routines, the code is more modular and maintainable.
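The sketches below illustrate how several of these mechanisms could look in code. They are hedged examples: function names, regex patterns, and file layouts are assumptions unless they appear elsewhere in this README. First, headless-Chrome page loading with a regex-driven subpage crawl:

```python
# Illustrative sketch of headless-Chrome link discovery; the helper name and
# the exact regex are assumptions, not copied from the scraper.
import re
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

SUBPAGE_PATTERN = re.compile(r"(graduation|attendance|demographics|test-results)", re.I)

def discover_links(start_url: str) -> list[str]:
    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(start_url)
        hrefs = [a.get_attribute("href") or "" for a in driver.find_elements(By.TAG_NAME, "a")]
        excel_links = [h for h in hrefs if h.lower().endswith((".xlsx", ".xls", ".xlsb"))]
        subpages = [h for h in hrefs if SUBPAGE_PATTERN.search(h)]
        return excel_links + subpages  # caller can recurse into the subpages
    finally:
        driver.quit()
```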
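Next, an asynchronous streaming download wrapped in a `tenacity` retry policy (the retry parameters here are placeholders, not the project's actual settings):

```python
# Minimal download sketch assuming httpx with tenacity retries; the function
# name and parameters are illustrative.
import httpx
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, max=30))
async def download_file(client: httpx.AsyncClient, url: str, dest_path: str) -> None:
    # Stream the response so large workbooks never sit fully in memory.
    async with client.stream("GET", url, timeout=30.0) as response:
        response.raise_for_status()
        with open(dest_path, "wb") as fh:
            async for chunk in response.aiter_bytes():
                fh.write(chunk)
```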
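A year filter along the lines described above can be a small regex check (the real pattern in the scraper may differ):

```python
# Illustrative year filter: keep a link only if it mentions a year >= 2018.
import re

YEAR_RE = re.compile(r"(20\d{2})")

def is_recent(link: str, cutoff: int = 2018) -> bool:
    years = [int(y) for y in YEAR_RE.findall(link)]
    return any(y >= cutoff for y in years)

# is_recent(".../graduation-results-2019-2020.xlsx") -> True
# is_recent(".../attendance-2016-2017.xlsx")         -> False
```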
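Parallel hashing and change detection could combine `ProcessPoolExecutor` with a per-file hash record under `hashes/` (the record format and helper names are assumptions):

```python
# Sketch of multi-core SHA-256 hashing plus a stored-hash comparison.
import hashlib
import os
from concurrent.futures import ProcessPoolExecutor

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def hash_many(paths: list[str]) -> dict[str, str]:
    # ProcessPoolExecutor sidesteps the GIL, so hashing scales with CPU cores.
    with ProcessPoolExecutor() as pool:
        return dict(zip(paths, pool.map(sha256_of, paths)))

def has_changed(path: str, new_hash: str, hash_dir: str = "hashes") -> bool:
    record = os.path.join(hash_dir, os.path.basename(path) + ".sha256")
    old_hash = open(record).read().strip() if os.path.exists(record) else None
    return new_hash != old_hash
```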
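Finally, the rotating log handler described above (5 MB per file, two backups) needs only the standard library; the formatter and level shown are assumptions:

```python
# Sketch of the logging setup matching the figures quoted above.
import logging
import os
from logging.handlers import RotatingFileHandler

os.makedirs("logs", exist_ok=True)
handler = RotatingFileHandler("logs/excel_fetch.log", maxBytes=5 * 1024 * 1024, backupCount=2)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
logging.getLogger().addHandler(handler)
logging.getLogger().setLevel(logging.INFO)
# Download and hashing loops can then be wrapped with tqdm(...) for progress bars.
```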
---
To install required packages:

```bash
pip install -r requirements.txt
```
**Dependencies**:

- `httpx[http2]`: For performing asynchronous HTTP requests and HTTP/2 support
- `tenacity`: For retrying downloads after timeouts or transient errors
- `selenium`: For web scraping
- `pandas`: For processing Excel files (optional)
- `tqdm`: To display download progress
- `concurrent.futures`: For parallel hashing with `ProcessPoolExecutor` (standard library)
- `openpyxl`, `pyxlsb`, `xlrd`: For handling different Excel file types
- `pytest`, `pytest-asyncio`, `pytest-cov`: For module testing
---
## Directory Structure
```text
project_root/
│
├── __init__.py              # Package initializer
├── .github                  # Workflow CI/CD integration
├── .gitignore               # Ignore logs, venv, data, and cache files
├── .env                     # Environment variables (excluded from version control)
│
├── venv_wsl/                # WSL Virtual Environment (ignored by version control)
├── venv_win/                # Windows Virtual Environment (ignored by version control)
│
├── src/
│   ├── main.py              # Main scraper script
│   └── excel_scraper.py     # Web scraping module
│
├── logs/                    # Directory for log files
│
├── tests/                   # Directory for unit, integration, and end-to-end testing
│
├── data/                    # Directory for downloaded Excel files
│   ├── graduation/
│   ├── attendance/
│   ├── demographics/
│   ├── test_results/
│   └── other_reports/
│
└── hashes/                  # Directory for storing file hashes
```

This structure keeps the project well-organized for both manual execution and packaging as a Python module.
---
2. Persistent HTTP Sessions: Using `httpx.AsyncClient` ensures that HTTP connections are reused, reducing overhead (see the sketch after this list).
3. Efficient Hashing: Files are saved only if they have changed, determined by a computed hash. This ensures no unnecessary downloads.
4. Excluded older datasets by adding `re` filtering logic to scrape only the latest available data.
5. Clearer Architecture: Splitting logic into `FileManager`, `BaseScraper`, and `NYCInfoHubScraper` has improved modularity and test coverage.
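A sketch of the session-reuse pattern from point 2, assuming one shared client drives all downloads (helper names are illustrative):

```python
# One shared AsyncClient reuses TCP/TLS connections and HTTP/2 streams across
# every request instead of reconnecting per file.
import asyncio
import httpx

async def fetch(client: httpx.AsyncClient, url: str) -> bytes:
    response = await client.get(url)
    response.raise_for_status()
    return response.content

async def download_all(urls: list[str]) -> list[bytes]:
    async with httpx.AsyncClient(http2=True, timeout=30.0) as client:
        return await asyncio.gather(*(fetch(client, u) for u in urls))
```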
---
- **Year Parsing**: If year formats differ (e.g., “19-20” instead of “2019-2020”), the regex must be adjusted or enhanced.
- **Retries**: Now incorporates an automatic retry strategy via `tenacity`. For highly customized or advanced retry logic, the code can be extended further.
- **Dual Virtual Environments**: Separate venvs are maintained (one for WSL and one for Windows). Both can run the script successfully when properly configured—via cron on WSL or Task Scheduler on Windows.
- **File Download Security**: Currently relies on chunked streaming and SHA-256 hashing for change detection, but does not verify the authenticity of the files. For higher security:
  - Integrate virus scanning or malware checks after download.
  - Validate MIME type or file signatures to confirm it’s an Excel file (a minimal signature check is sketched below).
  - Use TLS certificate pinning if the host supports it.
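As an example of the second bullet, a minimal signature check might look like this (the helper name is hypothetical, and it inspects only magic bytes, not full file validity):

```python
# .xlsx/.xlsb files are ZIP containers ("PK\x03\x04"); legacy .xls files are
# OLE2 compound documents (D0 CF 11 E0 A1 B1 1A E1).
def looks_like_excel(first_bytes: bytes) -> bool:
    zip_magic = b"PK\x03\x04"
    ole2_magic = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"
    return first_bytes.startswith((zip_magic, ole2_magic))
```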
---
## Other Potential Improvements
- **Email Notifications**: Notify users when a new dataset is fetched.
- **Database Integration**: Store metadata in a database for better tracking.
- **More Robust Exception Handling**: Log specific error types or integrate external alerting.