Public Detective

Open Source Data Investigation

An AI-powered tool for enhancing transparency and accountability in Brazilian public procurement.

🚀 See the live platform! 🚀

🕵️‍♂️ What's This All About?

Ever feel like public spending is a black box? In Brazil, billions are spent on public contracts, but keeping an eye on all of it is a Herculean task. Mistakes, inefficiencies, and even fraud can hide in mountains of documents.

Public Detective is here to change the game. We're an AI-powered watchdog that sniffs out irregularities in public tenders. Think of it as a digital detective, working 24/7 to help journalists, activists, and you demand transparency.

This isn't just code; it's a mission. Developed at PUCPR with the help of the amazing folks at Transparência Brasil, this project puts cutting-edge tech in the hands of the people.

🌟 Core Features

  • 🤖 Automated Data Retrieval: Fetches procurement data directly from the official PNCP APIs.
  • 💡 AI-Powered Analysis: Uses a generative AI model to surface potential irregularities and produce a detailed risk score with a rationale.
  • 🗃️ Full Traceability: Archives both original and processed documents in Google Cloud Storage for every analysis.
  • 🛡️ Idempotent by Design: Avoids re-analyzing unchanged documents by checking a content hash.
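
The content-hash check behind the last feature can be sketched in a few lines of Python. This is a minimal illustration under assumed names (`seen_hashes` stands in for hashes persisted in the project's real database); it is not the actual implementation.

```python
import hashlib

# Stand-in for the content hashes already persisted in the database (hypothetical).
seen_hashes: set = set()

def needs_analysis(document: bytes) -> bool:
    """Return True only if this exact content has not been analyzed before."""
    digest = hashlib.sha256(document).hexdigest()
    if digest in seen_hashes:
        return False  # Unchanged content: skip the expensive AI call.
    seen_hashes.add(digest)
    return True

print(needs_analysis(b"edital.pdf bytes"))  # True: first time seen, analyze it
print(needs_analysis(b"edital.pdf bytes"))  # False: unchanged, skip
```

Because the key is a hash of the content rather than a filename or timestamp, re-uploaded but unmodified documents are skipped automatically.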

βš™οΈ How the Magic Happens

The application operates in a two-stage pipeline: a lightweight Pre-analysis stage to discover and prepare data, followed by an on-demand, AI-powered Analysis stage. This decoupled architecture ensures efficiency and cost-effectiveness.

Here’s a simplified look at how it works:

graph LR
    subgraph "Input"
        A[Public Procurement Data]
    end

    subgraph "Public Detective's Magic"
        B(Automated Analysis)
        C(AI-Powered Insights)
        D(Risk Scoring)
    end

    subgraph "Output"
        E[Transparency Reports]
        F[Actionable Insights for Journalists & Activists]
    end

    A --> B;
    B --> C;
    C --> D;
    D --> E;
    D --> F;

🛠️ Built With

  • Language: Python 3.12+

  • AI / NLP: Google Gemini API

  • CLI Framework: Click

  • Database & Migrations: PostgreSQL, managed with Alembic

  • Core Toolkit:

    • SQLAlchemy Core: For writing safe, raw SQL queries.
    • Pydantic: For data validation and settings management.
    • Tenacity: For robust HTTP request retries.
    • LibreOffice Headless: For office document conversion.
  • Infrastructure: Docker, Google Cloud Storage, Google Cloud Pub/Sub

🏁 Get Started

To get a local copy up and running, follow these simple steps.

Prerequisites

  • Python 3.12
  • Poetry
  • Docker
  • LibreOffice Headless

⚙️ Installation

  1. Clone the repository:

    git clone https://github.com/hunsche/public-detective.git
    cd public-detective
  2. Install dependencies:

    poetry install
  3. Set up environment variables: Create a .env file from the example. This is primarily used to configure local emulators.

    cp .env.example .env

    Authentication with Google Cloud is handled automatically. See the Authentication section for more details.

  4. Start services:

    docker compose up -d
  5. Apply database migrations:

    poetry run alembic upgrade head

πŸ” Authentication

This project uses the Vertex AI backend for the Google Gemini API and authenticates using a standard Google Cloud pattern called Application Default Credentials (ADC). This provides a secure and flexible mechanism that works across different environments.

The application attempts to find credentials in the following order:

  1. GOOGLE_APPLICATION_CREDENTIALS Environment Variable:

    • Use Case: This is the standard Google Cloud method to force the application to use a specific service account. It's useful for local development or CI/CD.
    • To Use: Set the environment variable to the absolute path of your service account's JSON key file.
    • ⭐ E2E Test Convention: To make running E2E tests easier, this project uses the GCP_SERVICE_ACCOUNT_CREDENTIALS variable (defined in .env.example). You should paste the full JSON content of your key there. The test suite will automatically handle creating a temporary file and setting the GOOGLE_APPLICATION_CREDENTIALS path for you during the test run.
  2. gcloud CLI Credentials (for Local Development):

    • Use Case: The most common method for local development.
    • To Use: If the GCP_SERVICE_ACCOUNT_CREDENTIALS variable is not set, the application will use the credentials of the user logged into the gcloud CLI. To set this up, run:
      gcloud auth application-default login
  3. Attached Service Account (Recommended for Production on GCP):

    • Use Case: When running the application on Google Cloud infrastructure (e.g., Cloud Run, GKE, Compute Engine).
    • How it Works: The application automatically detects and uses the service account attached to the host resource. This is the most secure method for production as it eliminates the need to manage and store credential files.
    • To Use: Ensure the GCP_SERVICE_ACCOUNT_CREDENTIALS environment variable is unset, and the host's service account has the necessary IAM permissions (e.g., "Vertex AI User"). Also, ensure any emulator-specific environment variables (like GCP_GEMINI_HOST) are cleared so the application connects to the live Google Cloud APIs.
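
The lookup order above can be mirrored in a small sketch. This is documentation-only pseudologic, not how the google-auth library is actually implemented, and the gcloud credentials path shown is the Linux/macOS default.

```python
import os

def resolve_credentials_source() -> str:
    """Mirror the ADC resolution order described above (illustrative only)."""
    # 1. Explicit service-account key file.
    if os.environ.get("GOOGLE_APPLICATION_CREDENTIALS"):
        return "service account key from GOOGLE_APPLICATION_CREDENTIALS"
    # 2. Credentials created by `gcloud auth application-default login`.
    adc_file = os.path.expanduser(
        "~/.config/gcloud/application_default_credentials.json")
    if os.path.exists(adc_file):
        return "gcloud CLI user credentials"
    # 3. Service account attached to the GCP host (via the metadata server).
    return "attached service account"

print(resolve_credentials_source())
```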

💻 How to Use

The application is controlled via a unified Command-Line Interface (CLI) accessible through the pd alias. This provides a structured and intuitive way to manage the application's lifecycle, from database migrations to data analysis.

Core Commands

The CLI is organized into logical groups:

  • analysis: Commands for running the different stages of the procurement analysis pipeline.
  • config: Tools for managing the application's configuration.
  • db: Utilities for database management, including migrations.
  • worker: Commands to control the background worker responsible for processing analysis tasks.

To see all available commands, you can run:

pd --help
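
The grouped structure above maps naturally onto Click's command groups. The skeleton below is a hypothetical sketch of that shape (only the `analysis prepare` command is fleshed out, with a placeholder body); it is not the project's actual CLI code.

```python
import click

@click.group()
def pd() -> None:
    """Public Detective CLI (illustrative skeleton)."""

@pd.group()
def analysis() -> None:
    """Commands for the procurement analysis pipeline."""

@analysis.command()
@click.option("--start-date", required=True, help="First day to scan (YYYY-MM-DD).")
@click.option("--end-date", required=True, help="Last day to scan (YYYY-MM-DD).")
def prepare(start_date: str, end_date: str) -> None:
    """Scan for new procurements in the given date range."""
    click.echo(f"Preparing procurements from {start_date} to {end_date}")
```

With this skeleton, `pd analysis prepare --start-date 2025-01-01 --end-date 2025-01-05` would echo the chosen range.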

analysis Group

This group contains the core logic for the analysis pipeline.

  • pd analysis prepare: Scans for new procurements within a given date range and prepares them for analysis.

    # Prepare procurements from a specific date range
    pd analysis prepare --start-date 2025-01-01 --end-date 2025-01-05
  • pd analysis run: Triggers a specific analysis by its ID.

    # Run analysis for a specific ID
    pd analysis run --analysis-id "a1b2c3d4-..."
  • pd analysis rank: Ranks pending analyses based on a budget and triggers them.

    # Trigger ranked analysis with a manual budget
    pd analysis rank --budget 100.00
  • pd analysis retry: Retries failed or stale analyses.

    # Retry analyses that have been stuck for 1 hour
    pd analysis retry --timeout-hours 1

config Group

Manage your application's environment settings.

  • pd config list: Lists all configuration key-value pairs.

    # List all configurations
    pd config list
    
    # Show secret values without masking
    pd config list --show-secrets
  • pd config get: Retrieves a specific configuration value.

    # Get the value of a specific key
    pd config get POSTGRES_USER
  • pd config set: Sets or unsets a configuration value.

    # Set a new value
    pd config set LOG_LEVEL "DEBUG"
    
    # Unset a value
    pd config set LOG_LEVEL --unset

db Group

Handle database operations.

  • pd db migrate: Applies all pending database migrations.

    pd db migrate
  • pd db downgrade: Reverts the last database migration.

    pd db downgrade
  • pd db reset: (Destructive) Resets the database to its initial state.

    pd db reset

worker Group

Control the background worker.

  • pd worker start: Starts the worker to listen for and process analysis tasks from the queue.

    # Start the worker
    pd worker start

🙌 Join the Mission!

Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated. Please refer to the CONTRIBUTING.md file for details.

📄 License

Distributed under the Creative Commons Attribution-NonCommercial 4.0 International License. See LICENSE for more information.

📬 Get In Touch

Matheus Aoki Hunsche
