Commit 6a0ae06

Update README and enhance SecurityManager to skip ClamAV scans on Windows by default
1 parent 0f75c63 commit 6a0ae06

3 files changed: +60 -11 lines changed
README.md (+22 -6)
@@ -2,7 +2,7 @@

 ## Description

-**Excel API Web Scraper** is a Python-based project that automates the process of web scraping, downloading, and storing Excel files from the NYC InfoHub website. It features a **modular, object-oriented** design with built-in **security checks** (virus scanning and MIME-type validation) for downloaded Excel files.
+**Excel API Web Scraper** is a Python-based project that automates the process of web scraping, downloading, and storing Excel files from NYC InfoHub. It features a modular, object-oriented design with built-in **security checks** (virus scanning and MIME-type validation) for downloaded Excel files, **with an option to skip antivirus scans on Windows** if ClamAV isn’t readily available.

 ### Highlights

@@ -11,8 +11,8 @@
 - **Parallel CPU-bound hashing** with `ProcessPoolExecutor`
 - **Detailed logging** with a rotating file handler
 - **Progress tracking** via `tqdm`
-- **SecurityManager** for virus scanning (ClamAV) and file-type checks
-- **Refined “only Excel files”** approach, skipping malicious or non-Excel data
+- **SecurityManager** for optional virus scanning (ClamAV) and file-type checks
+- **Skips ClamAV scanning on Windows** by default, avoiding setup complexities while still functioning seamlessly

 ---

@@ -37,9 +37,9 @@
    - Uses `ProcessPoolExecutor` to compute SHA-256 hashes in parallel, fully utilizing multi-core CPUs without blocking the async loop.

 7. **Security Checks**
-   - A **SecurityManager** class provides:
-     - **Virus scanning** with ClamAV (in-memory scanning before saving).
-     - **MIME-type validation** with `python-magic`, ensuring files truly are Excel.
+   - In-memory **virus scanning** with ClamAV (via `pyclamd` or `clamd`)
+   - **MIME-type validation** with `python-magic`, ensuring files are truly Excel
+   - **Skip scanning on Windows** by default (see below)

 8. **Prevents Redundant Downloads**
    - Compares new file hashes with stored hashes; downloads only if the file has changed.
@@ -50,6 +50,21 @@

 ---

+## Windows Antivirus Skipping
+
+By default, **ClamAV scanning** is not performed on **Windows** to avoid environment complexities—ClamAV is primarily a Linux/UNIX daemon. The `SecurityManager` class checks `platform.system()`, and if it’s `Windows`, it **short-circuits** scanning and returns a **clean** status. This behavior can be **overridden** by setting:
+
+```python
+security_manager = SecurityManager(skip_windows_scan=False)
+```
+
+…in which case the code will attempt a normal ClamAV call. To make this work on Windows, you’d typically need:
+
+- **WSL or Docker** to run the ClamAV daemon, or
+- Another setup that exposes a ClamAV socket/port for Python to connect to.
+
+---
+
 ## Requirements

 ### System Requirements
@@ -306,6 +321,7 @@ A GitHub Actions workflow is set up in `.github/workflows/ci-cd.yml`. It:
 - **Redundant Downloads**: Prevented by storing file hashes and only updating on changes.
 - **Virus Scan Overhead**: In-memory scanning might add overhead, but ensures security.
 - **Size Limit Errors**: If you see “INSTREAM: Size limit reached” warnings, increase `StreamMaxLength` in `clamd.conf`.
+- **Windows Skipping**: If you can’t run ClamAV natively, the skip mechanism means the scraper still works without throwing errors.

 ---
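
Reviewer sketch (not part of the commit): the README change says a Windows user who disables the skip needs "another setup that exposes a ClamAV socket/port". One way that could look is talking to a clamd daemon over TCP with clamd's `INSTREAM` command. The code below is a minimal standard-library sketch, not the project's `pyclamd`-based implementation. The function names, the `127.0.0.1` host, and the chunk size are assumptions for illustration; `3310` is the conventional clamd TCP port and only works if the daemon is configured to listen on it.

```python
import socket
import struct


def build_instream_payload(data: bytes, chunk_size: int = 8192) -> bytes:
    """Frame `data` for clamd's INSTREAM command.

    Layout: the null-terminated command, then each chunk prefixed by its
    length as a 4-byte big-endian unsigned integer, then a zero-length
    chunk that marks the end of the stream.
    """
    parts = [b"zINSTREAM\0"]
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        parts.append(struct.pack("!I", len(chunk)) + chunk)
    parts.append(struct.pack("!I", 0))  # zero-length terminator chunk
    return b"".join(parts)


def parse_clamd_reply(reply: bytes) -> tuple:
    """Map a raw clamd reply onto a (status_code, detail) tuple."""
    text = reply.rstrip(b"\0\n").decode("utf-8", errors="replace")
    if text.endswith(" FOUND"):
        body = text[:-len(" FOUND")]
        if body.startswith("stream: "):
            body = body[len("stream: "):]
        return ("FOUND", body)
    if text.endswith("OK"):
        return ("OK", "No virus found")
    # Anything else (for example a size-limit message) is treated as an error.
    return ("ERROR", text)


def scan_over_tcp(data: bytes, host: str = "127.0.0.1",
                  port: int = 3310, timeout: float = 10.0) -> tuple:
    """Hypothetical helper: scan `data` via a clamd TCP listener."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(build_instream_payload(data))
            reply = b""
            # With the "z" command prefix the reply is null-terminated.
            while not reply.endswith(b"\0"):
                block = conn.recv(4096)
                if not block:
                    break
                reply += block
    except OSError as exc:  # connection refused, timeout, and similar
        return ("ERROR", str(exc))
    return parse_clamd_reply(reply)
```

The status codes mirror the `"OK"` / `"FOUND"` / `"ERROR"` values documented in `scan_for_viruses`, so a helper like this could in principle sit behind the same interface. Very large files may still hit the daemon's `StreamMaxLength` limit mentioned in the README.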

src/excel_scraper.py (+13 -4)
@@ -16,6 +16,7 @@
 from tqdm import tqdm
 import pyclamd
 import magic
+import platform

 load_dotenv()
 logging.info(f"CHROMEDRIVER_PATH: {os.environ.get('CHROMEDRIVER_PATH')}")
@@ -94,14 +95,18 @@ class SecurityManager:
     """
     Encapsulates file validation logic, including virus scanning
     (ClamAV via python-clamd) and MIME-type checks (python-magic).
+    Skips ClamAV scan on Windows by default.
     """

-    def __init__(self, clam_socket="/var/run/clamav/clamd.ctl"):
+    def __init__(self, clam_socket="/var/run/clamav/clamd.ctl", skip_windows_scan=True):
         """
         Pass the path to the ClamAV Unix socket file.
         By default, it's often /var/run/clamav/clamd.ctl on Debian/Ubuntu.
+        If using Windows, automatically skips ClamAV scan.
         """
         self._clam_socket = clam_socket
+        self._skip_windows_scan = skip_windows_scan
+

     def scan_for_viruses(self, file_bytes: bytes) -> tuple:
         """
@@ -110,8 +115,12 @@ def scan_for_viruses(self, file_bytes: bytes) -> tuple:
         Possible status_code values:
           - "ERROR": an exception or connection failure occurred
           - "FOUND": virus/malware was detected
-          - "OK": no virus was found
+          - "OK": file is clean or scanning is skipped on Windows
         """
+        if self._skip_windows_scan and platform.system().lower() == "windows":
+            logging.info("Skipping virus scan on Windows.")
+            return ("OK", "Skipping AV check on Windows")
+
         try:
             cd = pyclamd.ClamdUnixSocket(filename=self._clam_socket)
             result = cd.scan_stream(file_bytes)
@@ -329,8 +338,8 @@ class NYCInfoHubScraper(BaseScraper):
         "other_reports": []
     }

-    def __init__(self, base_dir=None, data_dir=None, hash_dir=None, log_dir=None):
-        super().__init__(security_manager=SecurityManager("/var/run/clamav/clamd.ctl"))
+    def __init__(self, base_dir=None, data_dir=None, hash_dir=None, log_dir=None, skip_win_scan=True):
+        super().__init__(security_manager=SecurityManager("/var/run/clamav/clamd.ctl", skip_windows_scan=skip_win_scan))
         script_dir = os.path.abspath(os.path.dirname(__file__)) if "__file__" in globals() else os.getcwd()
         self._base_dir = base_dir or os.path.join(script_dir, "..")
         self._data_dir = data_dir or os.path.join(self._base_dir, "data")
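
Reviewer sketch (not part of the commit): the control flow this change adds to `scan_for_viruses` can be isolated from `pyclamd` for illustration. The class below is a hypothetical stand-in, not the real `SecurityManager`. The name `SkipAwareScanner` and the injectable `scan_backend` callable are assumptions, and the `except` branch only approximates the error handling the diff does not show.

```python
import logging
import platform
from typing import Callable, Tuple

ScanResult = Tuple[str, str]  # (status_code, message)


class SkipAwareScanner:
    """Hypothetical stand-in mirroring the skip-on-Windows check in the diff."""

    def __init__(self, scan_backend: Callable[[bytes], ScanResult],
                 skip_windows_scan: bool = True):
        # scan_backend stands in for the real ClamAV call (an assumption).
        self._scan_backend = scan_backend
        self._skip_windows_scan = skip_windows_scan

    def scan_for_viruses(self, file_bytes: bytes) -> ScanResult:
        # Same condition as the commit: flag set AND running on Windows.
        if self._skip_windows_scan and platform.system().lower() == "windows":
            logging.info("Skipping virus scan on Windows.")
            return ("OK", "Skipping AV check on Windows")
        try:
            return self._scan_backend(file_bytes)
        except Exception as exc:  # assumed: failures surface as "ERROR"
            return ("ERROR", str(exc))


if __name__ == "__main__":
    scanner = SkipAwareScanner(lambda data: ("OK", "No virus found"))
    print(scanner.scan_for_viruses(b"FakeExcelData"))
```

With this shape, a skipped scan and a genuinely clean scan both report `"OK"`; only the message distinguishes them, which matches the updated docstring. On any non-Windows platform the flag has no effect and the backend is always called.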

tests/test_excel_scraper.py (+25 -1)
@@ -3,7 +3,8 @@
 import pytest
 import hashlib
 from unittest.mock import patch, MagicMock
-from src.excel_scraper import NYCInfoHubScraper
+from src.excel_scraper import NYCInfoHubScraper, SecurityManager
+import platform

 def test_compute_file_hash():
     """
@@ -94,6 +95,29 @@ def mock_stream(method, _url, timeout=10):
     mock_stream_call.assert_called_once()
     mock_scan.assert_called_once_with(fake_excel_content)
     mock_mime.assert_called_once_with(fake_excel_content)
+
+def test_skip_windows_scan_true():
+    manager = SecurityManager(skip_windows_scan=True)
+
+    # Mock platform.system() to return 'Windows'
+    with patch.object(platform, 'system', return_value='Windows'):
+        status, message = manager.scan_for_viruses(b"FakeExcelData")
+        assert status == "OK"
+        assert "Skipping AV check on Windows" in message
+
+def test_skip_windows_scan_false():
+    manager = SecurityManager(skip_windows_scan=False)
+
+    # Mock platform.system() to return 'Windows'
+    # We confirm it *doesn't* short-circuit and tries to do normal scanning
+    # Since there's no real ClamAV, you might see an 'ERROR' or something
+    # unless you mock the clamd calls as well. For illustration:
+    with patch.object(platform, 'system', return_value='Windows'), \
+         patch("pyclamd.ClamdUnixSocket", side_effect=Exception("No daemon")):
+
+        status, message = manager.scan_for_viruses(b"FakeExcelData")
+        assert status == "ERROR"
+        assert "No daemon" in message

 @pytest.mark.asyncio
 async def test_download_excel_virus_found(test_scraper):
