README.md (+42 −11)
@@ -1,8 +1,8 @@
-# NYC InfoHub Excel Data Scraper
+# Excel API Web Scraper

## Description

-**NYC InfoHub Excel Data Scraper** is a Python-based project that automates the process of web scraping, downloading, and storing Excel files from the NYC InfoHub website. The scraper dynamically discovers subpages, detects relevant Excel links (filtered by year), downloads them asynchronously, and ensures that only new or changed files are saved.
+**Excel API Web Scraper** is a Python-based project that automates the process of web scraping, downloading, and storing Excel files from the NYC InfoHub website. The scraper dynamically discovers subpages, detects relevant Excel links (filtered by year), downloads them asynchronously, and ensures that only new or changed files are saved.

This version features:
- **Asynchronous HTTP/2 downloads** via `httpx.AsyncClient`
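
As a rough illustration of the bullet above, asynchronous HTTP/2 downloads with `httpx.AsyncClient` might be structured along these lines; the URL, output directory, and function names are placeholders, not the project's actual code.

```python
# Minimal sketch (assumed names and paths): concurrent downloads over HTTP/2 with httpx.
import asyncio
from pathlib import Path

import httpx


async def download_one(client: httpx.AsyncClient, url: str, out_dir: Path) -> None:
    resp = await client.get(url)
    resp.raise_for_status()
    # Save the payload under the last path segment of the URL.
    (out_dir / url.rsplit("/", 1)[-1]).write_bytes(resp.content)


async def download_all(urls: list[str], out_dir: str = "data") -> None:
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    async with httpx.AsyncClient(http2=True, timeout=30.0) as client:
        await asyncio.gather(*(download_one(client, u, target) for u in urls))


if __name__ == "__main__":
    asyncio.run(download_all(["https://example.org/reports/graduation_2024.xlsx"]))
```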
@@ -13,8 +13,6 @@ This version features:

---

-**Important Note**: The previous iteration was fully functional and efficient, however relied too heavily on hardcoding. In this version, I use RegEx patterns to parse through sub-pages.
-
## Features

- **Web Scraping with Selenium**
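
As a loose sketch of the Selenium-plus-regex approach described above, link discovery and year filtering could look roughly like this; the page URL and year cutoff are invented for illustration and are not the project's real values.

```python
# Hedged sketch: collect Excel links from a page with Selenium, keep only recent years via regex.
import re

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

EXCEL_LINK = re.compile(r"\.xls[xbm]?$", re.IGNORECASE)   # .xls, .xlsx, .xlsb, .xlsm
RECENT_YEAR = re.compile(r"20(2[3-9])")                   # placeholder cutoff: 2023 and later

options = Options()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.org/reports")  # placeholder page, not the real InfoHub URL
    hrefs = [a.get_attribute("href") or "" for a in driver.find_elements(By.TAG_NAME, "a")]
    excel_links = [h for h in hrefs if EXCEL_LINK.search(h) and RECENT_YEAR.search(h)]
    print(excel_links)
finally:
    driver.quit()
```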
@@ -59,10 +57,10 @@ Dependencies:
- `httpx[http2]`: For performing asynchronous HTTP requests and HTTP/2 support
- `selenium`: For web scraping
- `pandas`: For processing Excel files
-- `requests`: For downloading files
- `tqdm`: To display download progress
- `concurrent.futures`: For multithreading
- `openpyxl`, `pyxlsb`, `xlrd`: For handling different Excel file types
+- `pytest`, `pytest-asyncio`, `pytest-cov`: For module testing
```

---
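
Since several Excel engines are listed, here is a small hedged example of how pandas might be pointed at each format; the file paths are placeholders, not files shipped with the project.

```python
# Minimal sketch: pandas selects a different engine per Excel format.
# The paths below are illustrative only.
import pandas as pd

xlsx_df = pd.read_excel("data/graduation/results_2024.xlsx", engine="openpyxl")  # modern .xlsx
xlsb_df = pd.read_excel("data/attendance/results_2024.xlsb", engine="pyxlsb")    # binary .xlsb
xls_df = pd.read_excel("data/demographics/results_2005.xls", engine="xlrd")      # legacy .xls

print(xlsx_df.shape, xlsb_df.shape, xls_df.shape)
```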
@@ -73,20 +71,24 @@ Dependencies:
project_root/
│
├── __init__.py            # Package initializer
+├── .github                # Workflow CI/CD integration
├── .gitignore             # Ignore logs, venv, data, and cache files
├── .env                   # Environment variables (excluded from version control)
├── README.md              # Project documentation
├── requirements.txt       # Project dependencies
├── setup.py               # Project packaging file
+├── pyproject.toml         # Specify build system requirements
├── LICENSE                # License file
│
├── venv/                  # Virtual environment (ignored by version control)
-│
-├── nyc_infohub.py         # Main scraper script
-├── url_scraper.py         # Web scraping module
+│
+├── src/
+│   ├── main.py            # Main scraper script
+│   └── excel_scraper.py   # Web scraping module
│
├── logs/                  # Directory for log files
-│   └── excel_fetch.log
+│
+├── tests/                 # Directory for unit, integration, and end-to-end testing
│
├── data/                  # Directory for downloaded Excel files
│   ├── graduation/
@@ -106,7 +108,7 @@ This structure ensures that the project is well-organized for both manual execut
### **Running the Scraper Manually**
1. **Run the script to scrape and fetch new datasets:**
```bash
-python nyc_infohub.py
+python main.py
```
2. **View logs for download status and debugging:**
```bash
@@ -145,6 +147,34 @@ This structure ensures that the project is well-organized for both manual execut

---

+## Testing
+
+We use **Pytest** for our test suite, located in the `tests/` folder.
+
+1. **Install dev/test dependencies** (either in your `setup.py` or via `pip install -r requirements.txt` if you listed them there).
+
+2. **Run tests**:
+```bash
+python -m pytest tests/
+```
+
+3. **View Coverage** (if you have `pytest-cov`):
+```bash
+python -m pytest tests/ --cov=src
+```
+
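
To give a flavor of what a test under `tests/` might look like, here is a small hedged example using `pytest-asyncio`; the coroutine under test is defined inline and only mimics the project's year-filtering idea, it is not the real implementation.

```python
# Hedged sketch: an asynchronous test written with pytest-asyncio.
import asyncio
import re

import pytest


async def latest_year(links: list[str]) -> int:
    """Return the most recent four-digit year found in the given links."""
    await asyncio.sleep(0)  # stand-in for awaited network I/O
    return max(int(m.group()) for link in links if (m := re.search(r"20\d{2}", link)))


@pytest.mark.asyncio
async def test_latest_year_picks_most_recent():
    links = ["reports/graduation_2021.xlsx", "reports/graduation_2023.xlsx"]
    assert await latest_year(links) == 2023
```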
+---
+
+## CI/CD Pipeline
+
+A GitHub Actions workflow is set up in `.github/workflows/ci-cd.yml`. It:
+
+1. **Builds and tests** the project on push or pull request to the `main` branch.
+2. If tests pass and you push a **tagged release**, it **builds a distribution** and can **upload** it to PyPI using **Twine** (when secrets are configured).
+3. Check the **Actions** tab on your repo to see logs and statuses of each workflow run.
+
+---
+
## **Previous Limitations and Solutions**
***Bottlenecks***:

@@ -157,6 +187,7 @@ This structure ensures that the project is well-organized for both manual execut
1. Optimized Downloading: Parallel downloads using asyncio and ThreadPoolExecutor allow multiple downloads to happen concurrently, improving speed.
2. Persistent HTTP Sessions: Using httpx.AsyncClient ensures that HTTP connections are reused, reducing overhead.
3. Efficient Hashing: Files are saved only if they have changed, as determined by a computed hash. This ensures no unnecessary downloads.
+4. Excluded older datasets by adding `re` filtering logic to scrape only the latest available data.

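A rough sketch of the hash-based change check mentioned in point 3; the helper name and storage layout are assumptions for illustration, not the project's exact code.

```python
# Hedged sketch: skip re-saving a file whose content hash is unchanged.
import hashlib
from pathlib import Path


def save_if_changed(content: bytes, dest: Path) -> bool:
    """Write `content` to `dest` only if its SHA-256 differs from what is already on disk."""
    new_hash = hashlib.sha256(content).hexdigest()
    if dest.exists():
        old_hash = hashlib.sha256(dest.read_bytes()).hexdigest()
        if old_hash == new_hash:
            return False  # identical file already stored; skip the write
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(content)
    return True
```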
---

@@ -169,7 +200,7 @@ This structure ensures that the project is well-organized for both manual execut
---

## **Other Potential Improvements**
-- **Exclude older datasets**: Add filtering logic to scrape only the latest available data.
+- **Add NYSed Website**: Scrape data from NYSed.
- **Email Notifications**: Notify users when a new dataset is fetched.
- **Database Integration**: Store metadata in a database for better tracking.
- **Better Exception Handling**: Improve error logging for specific failures.