
Scraper Audit

This is a utility for running audits on scraper output for legislative data entities (e.g. bill, event, or vote_event) using SQLMesh. The script loads JSON data into a DuckDB database and runs a SQLMesh plan, reporting any audit errors.

Features

  • Merges entity-level JSON files into one dataset.
  • Initializes a DuckDB database with merged data.
  • Runs sqlmesh plan on the staged models (staged.bill or staged.event).
  • Extracts and prints any audit-related warnings or errors.
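The merge step in the list above can be sketched as follows. This is a stdlib-only illustration; `merge_entity_files` is a hypothetical name, not the script's actual API:

```python
import glob
import json

def merge_entity_files(pattern: str, out_path: str) -> int:
    """Merge every JSON file matching `pattern` into one JSON array file.

    A simplified, hypothetical sketch of the merge step; main.py's
    actual implementation may differ (e.g. streaming or validation).
    """
    records = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as fh:
            records.append(json.load(fh))
    with open(out_path, "w") as fh:
        json.dump(records, fh)
    return len(records)
```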

Requirements

  • Python
  • Poetry (for dependency management)

Installation

Clone the repository and install dependencies using Poetry:

git clone [email protected]:openstates/scraper-audit.git
cd scraper-audit
poetry install

Usage

Ensure that the main directory contains a data folder (e.g. _data) with the JSON output files to audit. These files are typically generated by the OpenStates Scraper and should follow the naming pattern */*/<entity>*.json. For example:

_data/or/bill_0a3faf9c-1969-11f0-aaa5-4ef1b5972379.json
_data/or/bill_0a21cf0c-196b-11f0-aaa5-4ef1b5972379.json

Run poetry run python main.py --entity <entity name>, for example poetry run python main.py --entity bill. This should produce output similar to:

INFO:openstates:Initializing data with arguments: entity=bill, jurisdiction=None
INFO:openstates:Merging JSON files matching pattern: ./*/*/bill*.json
INFO:openstates:Merged 1179 records into merged_entities.json
INFO:openstates:Creating DuckDB schema and loading data...
INFO:openstates:bill: initialized successfully
INFO:openstates:Running SQLMesh plan via subprocess...
INFO:openstates:SQLMesh plan output:

`prod` environment will be initialized

Models:
└── Added:
    └── staged.bill
Models needing backfill:
└── staged.bill: [2025-04-24 - 2025-05-05]

staged.bill  created
Updating physical layer ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% • 1/1 • 0:00:00

✔ Physical layer updated


[WARNING] staged.bill: 'assert_bills_have_sponsor' audit error: 1179 rows failed. Learn more in logs: ~/scraper-audit/logs/sqlmesh_2025_05_06_20_58_30.log

[1/1] staged.bill  [insert/update rows, audits ❌1] 0.03s
Executing model batches ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% • 1/1 • 0:00:00

✔ Model batches executed

staged.bill  created
Updating virtual layer  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% • 1/1 • 0:00:00

✔ Virtual layer updated


Audit failed:
 [WARNING] staged.bill: 'assert_bills_have_sponsor' audit error: 1179 rows failed. Learn more in logs: ~/scraper-audit/logs/sqlmesh_2025_05_06_20_58_30.log
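The audit extraction described under Features can be approximated by scanning the plan output for warning lines like the ones above. A minimal sketch; the regex and function name are illustrative, not the script's actual code:

```python
import re

# Hypothetical sketch of pulling audit failures out of SQLMesh plan
# output; the real script's parsing may differ.
AUDIT_ERROR = re.compile(
    r"\[WARNING\]\s+(?P<model>\S+):\s+'(?P<audit>[^']+)'"
    r" audit error: (?P<rows>\d+) rows failed"
)

def extract_audit_errors(plan_output: str):
    """Return (model, audit_name, failed_rows) for each audit warning."""
    return [
        (m["model"], m["audit"], int(m["rows"]))
        for m in AUDIT_ERROR.finditer(plan_output)
    ]
```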

Docker Usage

To run scraper-audit with Docker, build the image locally and run it. Ensure your data directory is accessible to Docker; for example, if your JSON files are in a local _data folder:

Build the Docker Image

From the root of the project, run:

docker build -t scraper-audit .

Run the Container

Assuming your JSON data is in a local _data directory, run:

docker run --rm -v ./_data:/app/_data scraper-audit --entity "bill"

Note: The --entity flag is optional. If provided, it can be set to bill, event, or vote_event to audit a specific entity. If omitted, audits will run for all entities.
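The run-all-entities behavior when --entity is omitted could be sketched as a loop over the entity types. A hypothetical helper; `audit_all` and the injectable `runner` parameter are illustrative, not part of the project:

```python
import subprocess

# The three entity types the CLI accepts for --entity.
ENTITIES = ["bill", "event", "vote_event"]

def audit_all(runner=subprocess.run):
    """Run the audit CLI once per entity type and collect return codes.

    `runner` is injectable so the loop can be exercised without actually
    invoking Poetry or Docker.
    """
    results = {}
    for entity in ENTITIES:
        proc = runner(
            ["poetry", "run", "python", "main.py", "--entity", entity],
            capture_output=True,
            text=True,
        )
        results[entity] = proc.returncode
    return results
```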
