
Commit 814f146 (parent 6401f24)

Refactor project structure: rename package, update dependencies, include testing, and add CI/CD workflow

11 files changed: +405 −52 lines

.github/workflows/ci-cd.yml  (new file, +66)

```yaml
name: CI-CD  # CI/CD workflow

on:
  push:
    branches: [ "main" ]        # Trigger on pushes to 'main'
  pull_request:
    branches: [ "main" ]        # Trigger on PRs targeting 'main'

jobs:
  build-test:                   # Our first job for building and testing
    runs-on: ubuntu-latest

    steps:
      - name: Check out code    # Check out your repository
        uses: actions/checkout@v3

      - name: Set up Python     # Install desired Python version
        uses: actions/setup-python@v4
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          # Editable install of your package so your code in src/
          # is recognized as a Python package
          pip install -e .
          # If you have additional dev/test dependencies,
          # either put them in setup.py or:
          # pip install -r requirements.txt

      - name: Run tests
        run: |
          # Use python -m pytest to ensure we use the same Python interpreter
          python -m pytest tests/

  deploy:                       # Second job for "CD" or deployment
    needs: [ build-test ]       # Only run if 'build-test' succeeds
    runs-on: ubuntu-latest

    steps:
      - name: Check out code
        uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.11"

      - name: Build distribution
        run: |
          python -m pip install --upgrade pip
          pip install build twine  # Tools needed to build & upload your package
          python -m build          # Creates dist/*.whl and dist/*.tar.gz

      - name: Publish to PyPI
        # Only runs if the push is a tagged release.
        if: startsWith(github.ref, 'refs/tags/')
        env:
          TWINE_USERNAME: ${{ secrets.TWINE_USERNAME }}
          TWINE_PASSWORD: ${{ secrets.TWINE_PASSWORD }}
        run: |
          # By default, this uploads to PyPI.
          # For TestPyPI, pass --repository-url:
          # python -m twine upload --repository-url https://test.pypi.org/legacy/ dist/*
          python -m twine upload dist/*
```
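Note: the `Run tests` step assumes at least one test exists under `tests/`. A minimal sketch of such a smoke test follows; the file name `tests/test_import.py` is hypothetical and not part of this commit, and the import mirrors the style used in `tests/conftest.py`.

```python
# tests/test_import.py -- hypothetical smoke test; not part of this commit.
# Mirrors the import style used in tests/conftest.py.
from src.excel_scraper import NYCInfoHubScraper


def test_scraper_class_is_importable():
    # A trivial check that the package layout and imports resolve
    # under "python -m pytest tests/", exactly as the CI job runs them.
    assert NYCInfoHubScraper.__name__ == "NYCInfoHubScraper"
```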

README.md  (+42 −11)

````diff
@@ -1,8 +1,8 @@
-# NYC InfoHub Excel Data Scraper
+# Excel API Web Scraper
 
 ## Description
 
-**NYC InfoHub Excel Data Scraper** is a Python-based project that automates the process of web scraping, downloading, and storing Excel files from the NYC InfoHub website. The scraper dynamically discovers subpages, detects relevant Excel links (filtered by year), downloads them asynchronously, and ensures that only new or changed files are saved.
+**Excel API Web Scraper** is a Python-based project that automates the process of web scraping, downloading, and storing Excel files from the NYC InfoHub website. The scraper dynamically discovers subpages, detects relevant Excel links (filtered by year), downloads them asynchronously, and ensures that only new or changed files are saved.
 
 This version features:
 - **Asynchronous HTTP/2 downloads** via `httpx.AsyncClient`
@@ -13,8 +13,6 @@ This version features:
 
 ---
 
-**Important Note**: The previous iteration was fully functional and efficient, however relied too heavily on hardcoding. In this version, I use RegEx patterns to parse through sub-pages.
-
 ## Features
 
 - **Web Scraping with Selenium**
@@ -59,10 +57,10 @@ Dependencies:
 - `httpx[http2]`: For performing asynchronous HTTP requests and HTTP/2 support
 - `selenium`: For web scraping
 - `pandas`: For processing Excel files
-- `requests`: For downloading files
 - `tqdm`: To display download progress
 - `concurrent.futures`: For multithreading
 - `openpyxl`, `pyxlsb`, `xlrd`: For handling different Excel file types
+- `pytest`, `pytest-asyncio`, `pytest-cov`: For module testing
 ```
 
 ---
@@ -73,20 +71,24 @@ Dependencies:
 project_root/
 
 ├── __init__.py               # Package initializer
+├── .github                   # Workflow CI/CD integration
 ├── .gitignore                # Ignore logs, venv, data, and cache files
 ├── .env                      # Environment variables (excluded from version control)
 ├── README.md                 # Project documentation
 ├── requirements.txt          # Project dependencies
 ├── setup.py                  # Project packaging file
+├── pyproject.toml            # Specify build system requirements
 ├── LICENSE                   # License file
 
 ├── venv/                     # Virtual environment (ignored by version control)
-
-├── nyc_infohub.py            # Main scraper script
-├── url_scraper.py            # Web scraping module
+
+├── src/
+│   ├── main.py               # Main scraper script
+│   └── excel_scraper.py      # Web scraping module
 
 ├── logs/                     # Directory for log files
-│   └── excel_fetch.log
+
+├── tests/                    # Directory for unit, integration, and end-to-end testing
 
 ├── data/                     # Directory for downloaded Excel files
 │   ├── graduation/
@@ -106,7 +108,7 @@ This structure ensures that the project is well-organized for both manual execution
 ### **Running the Scraper Manually**
 1. **Run the script to scrape and fetch new datasets:**
    ```bash
-   python nyc_infohub.py
+   python main.py
    ```
 2. **View logs for download status and debugging:**
    ```bash
@@ -145,6 +147,34 @@ This structure ensures that the project is well-organized for both manual execution
 
 ---
 
+## Testing
+
+We use **Pytest** for our test suite, located in the `tests/` folder.
+
+1. **Install dev/test dependencies** (either in your `setup.py` or via `pip install -r requirements.txt` if you listed them there).
+
+2. **Run tests**:
+   ```bash
+   python -m pytest tests/
+   ```
+
+3. **View coverage** (if you have `pytest-cov`):
+   ```bash
+   python -m pytest tests/ --cov=src
+   ```
+
+---
+
+## CI/CD Pipeline
+
+A GitHub Actions workflow is set up in `.github/workflows/ci-cd.yml`. It:
+
+1. **Builds and tests** the project on push or pull request to the `main` branch.
+2. If tests pass and you push a **tagged release**, it **builds a distribution** and can **upload** it to PyPI using **Twine** (when secrets are configured).
+3. Check the **Actions** tab on your repo to see logs and statuses of each workflow run.
+
+---
+
 ## **Previous Limitations and Solutions**
 ***Bottlenecks***:
 
@@ -157,6 +187,7 @@ This structure ensures that the project is well-organized for both manual execution
 1. Optimized Downloading: Parallel downloads using asyncio and ThreadPoolExecutor allow multiple downloads to happen concurrently, improving speed.
 2. Persistent HTTP Sessions: Using httpx.AsyncClient ensures that HTTP connections are reused, reducing overhead.
 3. Efficient Hashing: Files are saved only if they have changed, determined by a computed hash. This ensures no unnecessary downloads.
+4. Excluded older datasets by adding `re` filtering logic to scrape only the latest available data.
 
 ---
 
@@ -169,7 +200,7 @@ This structure ensures that the project is well-organized for both manual execution
 ---
 
 ## **Other Potential Improvements**
-- **Exclude older datasets**: Add filtering logic to scrape only the latest available data.
+- **Add NYSed Website**: Scrape data from NYSed.
 - **Email Notifications**: Notify users when a new dataset is fetched.
 - **Database Integration**: Store metadata in a database for better tracking.
 - **Better Exception Handling**: Improve error logging for specific failures.
````
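The "Efficient Hashing" point added to the README (files are saved only if a computed hash changes) can be pictured with a minimal sketch. The helper names below are illustrative rather than the project's actual API, and SHA-256 is assumed as the digest.

```python
import hashlib
import os


def compute_hash(content: bytes) -> str:
    # Digest of the downloaded bytes; SHA-256 is assumed here.
    return hashlib.sha256(content).hexdigest()


def needs_saving(content: bytes, hash_path: str) -> bool:
    # Compare the new digest with the one recorded on the previous run.
    new_hash = compute_hash(content)
    if os.path.exists(hash_path):
        with open(hash_path, "r", encoding="utf-8") as f:
            if f.read().strip() == new_hash:
                return False  # unchanged -> skip re-saving the Excel file
    with open(hash_path, "w", encoding="utf-8") as f:
        f.write(new_hash)
    return True  # new or changed -> save the download
```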

pyproject.toml  (new file, +3)

```toml
[build-system]
requires = ["setuptools", "wheel"]
build-backend = "setuptools.build_meta"
```

requirements.txt  (192 Bytes, binary file not shown)

setup.py  (+10 −6)

```diff
@@ -1,24 +1,28 @@
 from setuptools import setup, find_packages
 
 setup(
-    name="nyc_infohub_scraper",
-    version="1.0.0",
+    name="excel_api_access",
+    version="1.0.3",
     author="Dylan Picart",
     author_email="[email protected]",
     description="A Python scraper for downloading Excel datasets from NYC InfoHub.",
     long_description=open("README.md").read(),
     long_description_content_type="text/markdown",
-    url="https://github.com/dylanpicart/nyc_infohub_scraper",
-    packages=find_packages(),
+    url="https://github.com/dylanpicart/excel_api_access",
+    packages=find_packages(where="src"),
+    package_dir={"": "src"},
     install_requires=[
+        "httpx[http2]>=0.28.1",
         "selenium>=4.10.0",
        "pandas>=1.3.0",
-        "requests>=2.26.0",
         "tqdm>=4.62.0",
         "openpyxl>=3.0.9",
         "pyxlsb>=1.0.10",
         "xlrd>=2.0.1",
-        "python-dotenv>=1.0.0"
+        "python-dotenv>=1.0.0",
+        "pytest>=7.0, <8.0",
+        "pytest-asyncio",
+        "pytest-cov"
     ],
     classifiers=[
         "Programming Language :: Python :: 3",
```
File renamed without changes.

url_scraper.py renamed to src/excel_scraper.py  (+12 −10)

```diff
@@ -32,10 +32,6 @@
     "information-and-data-overview"
 }
 
-BASE_DIR = os.path.dirname(os.path.abspath(__file__))
-DATA_DIR = os.path.join(BASE_DIR, "data")
-HASH_DIR = os.path.join(BASE_DIR, "hashes")
-LOG_DIR = os.path.join(BASE_DIR, "logs")
 
 # -------------------- NYCInfoHubScraper Class --------------------
 class NYCInfoHubScraper:
@@ -47,11 +43,17 @@ class NYCInfoHubScraper:
         "other_reports": []  # Default category for uncategorized files
     }
 
-    def __init__(self):
+    def __init__(self, base_dir=None, data_dir=None, hash_dir=None, log_dir=None):
+        # Initialize directories
+        self.base_dir = base_dir or os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
+        self.data_dir = data_dir or os.path.join(self.base_dir, "data")
+        self.hash_dir = hash_dir or os.path.join(self.base_dir, "hashes")
+        self.log_dir = log_dir or os.path.join(self.base_dir, "logs")
+
         # Re-create directories if needed
-        os.makedirs(DATA_DIR, exist_ok=True)
-        os.makedirs(HASH_DIR, exist_ok=True)
-        os.makedirs(LOG_DIR, exist_ok=True)
+        os.makedirs(self.data_dir, exist_ok=True)
+        os.makedirs(self.hash_dir, exist_ok=True)
+        os.makedirs(self.log_dir, exist_ok=True)
 
         # Configure Selenium driver
         self.driver = self.configure_driver()
@@ -266,10 +268,10 @@ def save_file(self, url, content, new_hash):
         file_name = os.path.basename(url)
         category = self.categorize_file(file_name)
 
-        save_path = os.path.join(DATA_DIR, category, file_name)
+        save_path = os.path.join(self.data_dir, category, file_name)
         os.makedirs(os.path.dirname(save_path), exist_ok=True)
 
-        hash_path = os.path.join(HASH_DIR, category, f"{file_name}.hash")
+        hash_path = os.path.join(self.hash_dir, category, f"{file_name}.hash")
         os.makedirs(os.path.dirname(hash_path), exist_ok=True)
 
         old_hash = None
```
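The new constructor parameters make the directory layout injectable, which is what lets tests point the scraper at a throwaway location instead of the repository's `data/` and `hashes/`. A minimal sketch of that usage follows; the paths are illustrative, and note that `__init__` also starts a Selenium driver, so a WebDriver must be available wherever this runs.

```python
# Illustrative only: point the scraper at a temporary directory tree,
# e.g. inside a pytest tmp_path, instead of the repository's data/ and hashes/.
import os
import tempfile

from src.excel_scraper import NYCInfoHubScraper

with tempfile.TemporaryDirectory() as tmp:
    scraper = NYCInfoHubScraper(
        base_dir=tmp,
        data_dir=os.path.join(tmp, "data"),
        hash_dir=os.path.join(tmp, "hashes"),
        log_dir=os.path.join(tmp, "logs"),
    )
    # __init__ creates these directories via os.makedirs(..., exist_ok=True)
    assert os.path.isdir(scraper.data_dir)
```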

nyc_infohub.py renamed to src/main.py  (+16 −25)

```diff
@@ -1,32 +1,9 @@
 import os
 import logging
 import asyncio
-from url_scraper import NYCInfoHubScraper
+from excel_scraper import NYCInfoHubScraper
 from logging.handlers import RotatingFileHandler
 
-# -------------------- CONFIGURATION --------------------
-BASE_DIR = os.path.dirname(os.path.abspath(__file__))
-DATA_DIR = os.path.join(BASE_DIR, "data")
-HASH_DIR = os.path.join(BASE_DIR, "hashes")
-LOG_DIR = os.path.join(BASE_DIR, "logs")
-LOG_FILE_PATH = os.path.join(LOG_DIR, "excel_fetch.log")
-
-# Ensure necessary directories exist
-os.makedirs(DATA_DIR, exist_ok=True)
-os.makedirs(HASH_DIR, exist_ok=True)
-os.makedirs(LOG_DIR, exist_ok=True)
-
-# Set up Rotating Log Handler
-log_formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
-rotating_handler = RotatingFileHandler(LOG_FILE_PATH, maxBytes=5 * 1024 * 1024, backupCount=2)
-rotating_handler.setFormatter(log_formatter)
-
-logging.basicConfig(
-    level=logging.INFO,
-    format="%(asctime)s - %(levelname)s - %(message)s",
-    handlers=[rotating_handler, logging.StreamHandler()]
-)
-
 
 # -------------------- SCRAPER EXECUTION --------------------
 async def main():
@@ -61,5 +38,19 @@ async def main():
 
 # Run the scraper process
 if __name__ == "__main__":
-    logging.basicConfig(level=logging.INFO)
+    base_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
+    logs_dir = os.path.join(base_dir, "logs")
+    os.makedirs(logs_dir, exist_ok=True)
+
+    # Create the rotating log handler
+    log_file_path = os.path.join(logs_dir, "excel_fetch.log")
+    rotating_handler = RotatingFileHandler(log_file_path, maxBytes=5_242_880, backupCount=2)
+    rotating_handler.setFormatter(logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))
+
+    # Call basicConfig once, referencing the rotating handler
+    logging.basicConfig(
+        level=logging.INFO,
+        format="%(asctime)s - %(levelname)s - %(message)s",
+        handlers=[rotating_handler, logging.StreamHandler()]
+    )
     asyncio.run(main())
```
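Because all logging setup now lives under the `if __name__ == "__main__":` guard, importing `main` elsewhere (for example from tests) no longer creates handlers or log files. If the same configuration were ever needed outside the entry point, one option is a small helper like the sketch below; `configure_logging` is hypothetical and not part of this commit.

```python
import logging
import os
from logging.handlers import RotatingFileHandler


def configure_logging(logs_dir: str) -> None:
    # Hypothetical helper mirroring the __main__ block above:
    # one rotating file handler plus console output, configured once.
    os.makedirs(logs_dir, exist_ok=True)
    handler = RotatingFileHandler(
        os.path.join(logs_dir, "excel_fetch.log"),
        maxBytes=5_242_880,
        backupCount=2,
    )
    handler.setFormatter(logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s - %(levelname)s - %(message)s",
        handlers=[handler, logging.StreamHandler()],
    )
```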

tests/conftest.py  (new file, +27)

```python
# conftest.py

import pytest
import logging
import asyncio
from src.excel_scraper import NYCInfoHubScraper


@pytest.fixture(scope="session")
def test_scraper():
    """
    A session-scoped fixture that returns a NYCInfoHubScraper instance.
    Any test can use 'test_scraper' as a parameter, and it will share
    the same instance if scope="session".
    """
    logging.info("Setting up NYCInfoHubScraper for tests.")
    scraper = NYCInfoHubScraper()
    yield scraper  # run tests using this instance

    # Teardown code after tests finish
    logging.info("Tearing down NYCInfoHubScraper after tests.")
    # Safely close the scraper's resources:
    try:
        asyncio.run(scraper.close())
    except Exception as e:
        logging.error(f"Error closing scraper during teardown: {e}")
```
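A test consumes this fixture simply by naming it as a parameter. A minimal sketch follows; the file name and the expected behaviour of `categorize_file` are assumptions, not taken from this commit.

```python
# tests/test_categorize.py -- hypothetical example; not part of this commit.
def test_categorize_file_returns_a_category(test_scraper):
    # categorize_file() is the method save_file() calls in src/excel_scraper.py.
    # We only assume it maps any file name to some non-empty category string.
    category = test_scraper.categorize_file("some_report_2024.xlsx")
    assert isinstance(category, str) and category
```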
