README.md: 22 additions, 6 deletions
@@ -2,7 +2,7 @@
 
 ## Description
 
-**Excel API Web Scraper** is a Python-based project that automates the process of web scraping, downloading, and storing Excel files from the NYC InfoHub website. It features a **modular, object-oriented** design with built-in **security checks** (virus scanning and MIME-type validation) for downloaded Excel files.
+**Excel API Web Scraper** is a Python-based project that automates the process of web scraping, downloading, and storing Excel files from NYC InfoHub. It features a modular, object-oriented design with built-in **security checks** (virus scanning and MIME-type validation) for downloaded Excel files, **with an option to skip antivirus scans on Windows** if ClamAV isn’t readily available.
 
 ### Highlights
 
@@ -11,8 +11,8 @@
 - **Parallel CPU-bound hashing** with `ProcessPoolExecutor`
 - **Detailed logging** with a rotating file handler
 - **Progress tracking** via `tqdm`
-- **SecurityManager** for virus scanning (ClamAV) and file-type checks
-- **Refined “only Excel files”** approach, skipping malicious or non-Excel data
+- **SecurityManager** for optional virus scanning (ClamAV) and file-type checks
+- **Skips ClamAV scanning on Windows** by default, avoiding setup complexities while still functioning seamlessly
 
 ---
 
@@ -37,9 +37,9 @@
    - Uses `ProcessPoolExecutor` to compute SHA-256 hashes in parallel, fully utilizing multi-core CPUs without blocking the async loop.
 
 7. **Security Checks**
-   - A **SecurityManager** class provides:
-     - **Virus scanning** with ClamAV (in-memory scanning before saving).
-     - **MIME-type validation** with `python-magic`, ensuring files truly are Excel.
+   - In-memory **virus scanning** with ClamAV (via `pyclamd` or `clamd`)
+   - **MIME-type validation** with `python-magic`, ensuring files are truly Excel
+   - **Skip scanning on Windows** by default (see below)
 
 8. **Prevents Redundant Downloads**
    - Compares new file hashes with stored hashes; downloads only if the file has changed.
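The hashing and change-detection behavior described in this hunk could look roughly like the sketch below. It is not taken from the project: `sha256_hex`, `changed_files`, and the dictionaries of file bytes and stored hashes are hypothetical names chosen for illustration. The only parts drawn from the README are the idea of offloading SHA-256 to an executor so the event loop is not blocked and of keeping only files whose hash differs from the stored one.

```python
import asyncio
import hashlib
from concurrent.futures import Executor
from typing import Dict, Optional


def sha256_hex(data: bytes) -> str:
    """CPU-bound step: compute the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(data).hexdigest()


async def changed_files(
    files: Dict[str, bytes],
    stored_hashes: Dict[str, str],
    executor: Optional[Executor] = None,
) -> Dict[str, str]:
    """Return {name: new_hash} for files whose hash differs from the stored one.

    Hypothetical helper: hashing is handed to `executor`, so the event loop
    stays free while the digests are computed. Passing None uses the loop's
    default thread pool.
    """
    loop = asyncio.get_running_loop()
    names = list(files)
    digests = await asyncio.gather(
        *(loop.run_in_executor(executor, sha256_hex, files[name]) for name in names)
    )
    # Keep only new files or files whose content hash has changed.
    return {
        name: digest
        for name, digest in zip(names, digests)
        if stored_hashes.get(name) != digest
    }
```

Passing a `ProcessPoolExecutor()` as `executor` (created under an `if __name__ == "__main__":` guard) would give the multi-process hashing the README mentions. The worker function is kept at module level so it can be pickled and sent to worker processes.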
@@ -50,6 +50,21 @@
 
 ---
 
+## Windows Antivirus Skipping
+
+By default, **ClamAV scanning** is not performed on **Windows** to avoid environment complexities, since ClamAV is primarily a Linux/UNIX daemon. The `SecurityManager` class checks `platform.system()`, and if it’s `Windows`, it **short-circuits** scanning and returns a **clean** status. This behavior can be **overridden** by setting:
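As a rough illustration of the platform check described in the added paragraph, here is a minimal sketch. It is an assumption-laden stand-in, not the project's `SecurityManager`: `should_scan`, `scan_bytes`, and the `scanner` callable are hypothetical, and `scanner` merely takes the place of a real ClamAV client. It leaves out the override, whose setting is not shown in this excerpt.

```python
import platform
from typing import Callable, Optional


def should_scan(system: Optional[str] = None) -> bool:
    """Return False on Windows, True elsewhere.

    `system` is a hypothetical test hook; by default the real
    platform.system() value is used.
    """
    name = system if system is not None else platform.system()
    return name != "Windows"


def scan_bytes(
    data: bytes,
    scanner: Callable[[bytes], bool],
    system: Optional[str] = None,
) -> bool:
    """Return True if `data` is considered clean.

    `scanner` is a placeholder for an antivirus call that returns True
    for clean data; it is never invoked on Windows.
    """
    if not should_scan(system):
        # Short-circuit: report a clean status without scanning.
        return True
    return scanner(data)
```

Note that under this design a "clean" result on Windows means "not scanned", so the MIME-type validation remains the only check on that platform.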