This is a utility for running audits on scraper output for legislative data entities (`bill` or `event`) using SQLMesh. The script processes JSON data into a DuckDB database and runs a SQLMesh plan, returning any audit errors.
- Merges entity-level JSON files into one dataset.
- Initializes a DuckDB database with merged data.
- Runs `sqlmesh plan` on the staged models (`staged.bill` or `staged.event`).
- Extracts and prints any audit-related warnings or errors.
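Conceptually, the pipeline is small enough to sketch in a few lines of Python. This is only a hedged illustration, not the actual `main.py`: the database filename (`audit.duckdb`), the staging table name (`raw_<entity>`), and the exact `sqlmesh plan` flags are assumptions, while the merge pattern and `merged_entities.json` come from the log output shown later.

```python
# Hedged sketch of the audit pipeline -- not the real main.py.
# File/table names and the sqlmesh invocation are assumptions.
import glob
import json
import subprocess

import duckdb


def audit(entity: str = "bill") -> None:
    # 1. Merge entity-level JSON files into one dataset.
    records = []
    for path in glob.glob(f"./*/*/{entity}*.json"):
        with open(path) as f:
            records.append(json.load(f))
    with open("merged_entities.json", "w") as f:
        json.dump(records, f)

    # 2. Initialize a DuckDB database with the merged data.
    con = duckdb.connect("audit.duckdb")
    con.execute(
        f"CREATE OR REPLACE TABLE raw_{entity} AS "
        "SELECT * FROM read_json_auto('merged_entities.json')"
    )
    con.close()

    # 3. Run `sqlmesh plan` in a subprocess and capture its output.
    result = subprocess.run(
        ["sqlmesh", "plan", "--auto-apply"],
        capture_output=True,
        text=True,
    )

    # 4. Extract and print audit-related warnings or errors.
    for line in (result.stdout + result.stderr).splitlines():
        if "audit error" in line:
            print(line)
```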
- Python 3.9+
- `poetry`
Clone the repository and install dependencies using Poetry:
```bash
git clone [email protected]:openstates/scraper-audit.git
cd scraper-audit
poetry install
```
Ensure that the main directory contains a data folder (e.g. `_data`) with the JSON output files to audit.
These files are typically generated by the OpenStates Scraper and should follow the naming pattern `*/*/<entity>*.json`.
For example:
```
_data/or/bill_0a3faf9c-1969-11f0-aaa5-4ef1b5972379.json
_data/or/bill_0a21cf0c-196b-11f0-aaa5-4ef1b5972379.json
```
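Each file contains one scraped entity serialized as JSON. Purely for illustration, an abridged `bill` record might look like the following; the field names follow the OpenStates bill schema, but the values here are invented:

```json
{
  "legislative_session": "2025",
  "identifier": "HB 1001",
  "title": "Example appropriations act",
  "classification": ["bill"],
  "sponsorships": [
    {"name": "Jane Doe", "classification": "primary", "primary": true}
  ],
  "sources": [{"url": "https://example.org/bills/hb1001"}]
}
```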
Run `poetry run python main.py --entity <entity name>`, for example `poetry run python main.py --entity bill`. This should produce output similar to:
```
INFO:openstates:Initializing data with arguments: entity=bill, jurisdiction=None
INFO:openstates:Merging JSON files matching pattern: ./*/*/bill*.json
INFO:openstates:Merged 1179 records into merged_entities.json
INFO:openstates:Creating DuckDB schema and loading data...
INFO:openstates:bill: initialized successfully
INFO:openstates:Running SQLMesh plan via subprocess...
INFO:openstates:SQLMesh plan output:
`prod` environment will be initialized

Models:
└── Added:
    └── staged.bill

Models needing backfill:
└── staged.bill: [2025-04-24 - 2025-05-05]

staged.bill created
Updating physical layer ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% • 1/1 • 0:00:00
✔ Physical layer updated

[WARNING] staged.bill: 'assert_bills_have_sponsor' audit error: 1179 rows failed. Learn more in logs: ~/scraper-audit/logs/sqlmesh_2025_05_06_20_58_30.log
[1/1] staged.bill [insert/update rows, audits ❌1] 0.03s
Executing model batches ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% • 1/1 • 0:00:00
✔ Model batches executed

staged.bill created
Updating virtual layer ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% • 1/1 • 0:00:00
✔ Virtual layer updated

Audit failed:
[WARNING] staged.bill: 'assert_bills_have_sponsor' audit error: 1179 rows failed. Learn more in logs: ~/scraper-audit/logs/sqlmesh_2025_05_06_20_58_30.log
```
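When an audit such as `assert_bills_have_sponsor` fails, the staged rows live in the DuckDB database and can be inspected directly. A minimal sketch, assuming the database file is named `audit.duckdb` and that `staged.bill` keeps `sponsorships` as a list column (both assumptions, not confirmed here):

```python
# Hedged sketch: peek at rows that would fail a "bills must have a sponsor"
# style audit. "audit.duckdb" and the column layout are assumptions.
import duckdb

con = duckdb.connect("audit.duckdb", read_only=True)
rows = con.execute(
    """
    SELECT identifier, title
    FROM staged.bill
    WHERE sponsorships IS NULL OR len(sponsorships) = 0
    LIMIT 10
    """
).fetchall()
for identifier, title in rows:
    print(identifier, title)
con.close()
```

The full failure details are also written to the log file referenced in the warning.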
To run scraper-audit using Docker, build the image locally and execute it.
Ensure your data directory is accessible to Docker; the commands below assume your JSON files are in a local `_data` folder.
From the root of the project, build the image:

```bash
docker build -t scraper-audit .
```

Then run the container, mounting your data directory:

```bash
docker run --rm -v ./_data:/app/_data scraper-audit --entity "bill"
```
Note: The `--entity` flag is optional.
If provided, it can be set to `bill`, `event`, or `vote_event` to audit a specific entity.
If omitted, audits will run for all entities.
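For example, to audit everything in one run, omit the flag:

```bash
docker run --rm -v ./_data:/app/_data scraper-audit
```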