Tensorlake MotherDuck AI Risk Analysis Application

This application automatically processes financial documents (e.g., SEC filings) to extract AI-related risks and stores them in a MotherDuck database using DuckDB.

Overview

This Tensorlake application performs the following workflow:

  1. Document Ingestion: Accepts document URLs (e.g., SEC filings in PDF format)
  2. Page Classification: Uses Tensorlake's DocumentAI to classify pages containing AI risk factors
  3. Structured Data Extraction: Extracts structured AI risk information using LLM-based extraction
  4. Data Storage: Writes the extracted data to MotherDuck's cloud-based DuckDB database

The application creates two related tables in MotherDuck:

  • ai_risk_filings: Contains top-level filing information (company, filing date, etc.)
  • ai_risk_mentions: Contains individual AI risk mentions with denormalized join keys

Architecture

The application is built using Tensorlake's serverless framework with three main functions:

┌─────────────────────────────────────────────────────────────────┐
│                     Document URLs Input                          │
│              (SEC Filings, Financial Reports, etc.)              │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│              document_ingestion (Entry Point)                    │
│  • Accepts list of document URLs                                 │
│  • Classifies pages using DocumentAI                             │
│  • Identifies pages with "risk_factors"                          │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│          extract_structured_data (Parallel Processing)           │
│  • Processes each document in parallel                           │
│  • Extracts structured AI risk data using LLM                    │
│  • Follows AIRiskExtraction schema                               │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│           write_to_duckdb (Data Persistence)                     │
│  • Connects to MotherDuck: md:ai_risk_factors                    │
│  • Creates/updates ai_risk_filings table                         │
│  • Creates/updates ai_risk_mentions table                        │
│  • Denormalizes data for easy querying                           │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                    MotherDuck Database                           │
│  Tables:                                                         │
│  • ai_risk_filings (parent table)                                │
│  • ai_risk_mentions (child table with risk details)              │
└─────────────────────────────────────────────────────────────────┘
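The three functions above form a simple linear pipeline. The following is a conceptual sketch of that data flow only; the function bodies are placeholders (the real functions call Tensorlake's DocumentAI and MotherDuck), and the return shapes are assumptions for illustration:

```python
# Conceptual sketch of the pipeline's data flow. The real implementations
# call Tensorlake's DocumentAI and MotherDuck; they are stubbed out here.

def document_ingestion(urls):
    # Classify each document and return a mapping of URL -> parse ID.
    return {url: f"parse-{i}" for i, url in enumerate(urls)}

def extract_structured_data(parse_ids):
    # Produce one AIRiskExtraction-shaped record per document.
    return [
        {"source": url, "parse_id": pid, "ai_risk_mentioned": True}
        for url, pid in parse_ids.items()
    ]

def write_to_duckdb(records):
    # Persist the records; here we just report how many were written.
    return len(records)

urls = ["https://example.com/filing.pdf"]
written = write_to_duckdb(extract_structured_data(document_ingestion(urls)))
print(written)  # 1
```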

Prerequisites

  1. Tensorlake Account: Sign up at tensorlake.ai
  2. MotherDuck Account: Sign up at motherduck.com
  3. API Keys:
    • TENSORLAKE_API_KEY: Your Tensorlake API key
    • motherduck_token: Your MotherDuck access token

Installation

  1. Clone this repository:
git clone https://github.com/tensorlakeai/motherduck.git
cd motherduck
  2. Install dependencies:
pip install -r requirements.txt

Configuration

Environment Variables

The application requires the following secrets to be configured:

  • TENSORLAKE_API_KEY: Used for DocumentAI operations (classification and extraction)
  • MOTHERDUCK_API_KEY: Optional; passed to the extraction step if it needs to reach MotherDuck directly
  • motherduck_token: Used to connect to MotherDuck database

How It Works

1. Document Ingestion (document_ingestion)

Function: Entry point that accepts document URLs

Input:

DocumentURLs(urls=[
    "https://example.com/filing.pdf"
])

Process:

  • Creates a DocumentAI client
  • Defines page classification config for "risk_factors"
  • Classifies each document URL to identify pages with AI risk content
  • Returns a mapping of document URLs to parse IDs

2. Structured Data Extraction (extract_structured_data)

Function: Extracts structured AI risk data from classified documents

Process:

  • Waits for page classification to complete
  • Identifies pages classified as "risk_factors"
  • Extracts structured data using the AIRiskExtraction schema
  • Passes the extraction result to the database writer
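The page-selection step can be illustrated in plain Python. The shape of the classification result here, a list of (page_number, class_name) pairs, is an assumption for illustration and not the exact DocumentAI response format:

```python
def select_risk_pages(page_classes):
    """Return page numbers whose class is 'risk_factors'.

    `page_classes` is assumed to be an iterable of
    (page_number, class_name) pairs, for illustration only.
    """
    return [page for page, cls in page_classes if cls == "risk_factors"]

classified = [
    (1, "cover"),
    (2, "risk_factors"),
    (3, "risk_factors"),
    (4, "financial_statements"),
]
print(select_risk_pages(classified))  # [2, 3]
```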

Schema: The extraction follows a comprehensive schema including:

  • Company information (name, ticker, filing details)
  • AI risk indicators (mentioned, strategy, investment, competition)
  • Individual risk mentions with categories and severity

3. Database Writing (write_to_duckdb)

Function: Writes extracted data to MotherDuck

Process:

  • Connects to MotherDuck using md:ai_risk_factors database
  • Creates two tables if they don't exist:
    • ai_risk_filings: Parent table with filing-level data
    • ai_risk_mentions: Child table with individual risk mentions
  • Inserts data using pandas DataFrames

Database Schema:

ai_risk_filings table:

  • company_name, ticker, filing_type, filing_date
  • fiscal_year, fiscal_quarter
  • ai_risk_mentioned, num_ai_risk_mentions
  • ai_strategy_mentioned, ai_investment_mentioned, ai_competition_mentioned
  • regulatory_ai_risk

ai_risk_mentions table:

  • risk_category, risk_description, severity_indicator, citation
  • Denormalized columns: company_name, ticker, filing_type, filing_date, fiscal_year, fiscal_quarter
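The denormalization performed by the write step can be sketched in plain Python: each mention row copies the filing-level join keys so ai_risk_mentions can be queried without a join. This mirrors the logic described above (the real function builds pandas DataFrames and inserts them via DuckDB); the sample values are illustrative, not real filing data:

```python
# Filing-level join keys copied onto every mention row.
FILING_KEYS = ["company_name", "ticker", "filing_type", "filing_date",
               "fiscal_year", "fiscal_quarter"]

def denormalize(extraction):
    """Split one AIRiskExtraction dict into a filing row and mention rows."""
    filing = {k: v for k, v in extraction.items() if k != "ai_risk_mentions"}
    keys = {k: extraction[k] for k in FILING_KEYS}
    mentions = [{**mention, **keys} for mention in extraction["ai_risk_mentions"]]
    return filing, mentions

# Illustrative input only; values are not taken from a real filing.
extraction = {
    "company_name": "Confluent", "ticker": "CFLT", "filing_type": "10-K",
    "filing_date": "2024-02-01", "fiscal_year": "2023", "fiscal_quarter": None,
    "ai_risk_mentioned": True, "num_ai_risk_mentions": 1,
    "ai_strategy_mentioned": True, "ai_investment_mentioned": False,
    "ai_competition_mentioned": True, "regulatory_ai_risk": False,
    "ai_risk_mentions": [{"risk_category": "Competitive",
                          "risk_description": "Rivals may deploy AI faster.",
                          "severity_indicator": None, "citation": "p. 14"}],
}

filing, mentions = denormalize(extraction)
print(mentions[0]["ticker"])  # CFLT
```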

Local Testing

Run the application locally with the example:

export TENSORLAKE_API_KEY="your_api_key"
export motherduck_token="your_motherduck_token"

python app.py

The default example processes a Confluent SEC filing.

Deployment to Tensorlake

1. Install Tensorlake CLI

pip install tensorlake

2. Login to Tensorlake

tensorlake login

3. Configure Secrets

Add your API keys as secrets in Tensorlake:

tensorlake secret create TENSORLAKE_API_KEY "your_tensorlake_api_key"
tensorlake secret create MOTHERDUCK_API_KEY "your_motherduck_api_key"
tensorlake secret create motherduck_token "your_motherduck_token"

4. Deploy the Application

tensorlake deploy app.py

5. Invoke the Application

After deployment, invoke the application with document URLs:

tensorlake invoke document_ingestion \
  --input '{"urls": ["https://example.com/sec-filing.pdf"]}'

Or use the Tensorlake Python SDK:

from tensorlake.client import TensorlakeClient

client = TensorlakeClient(api_key="your_api_key")
response = client.invoke(
    "document_ingestion",
    {"urls": ["https://example.com/sec-filing.pdf"]}
)

Data Schema

AIRiskMention

Individual AI-related risk mention extracted from documents:

{
    "risk_category": str,  # Operational, Regulatory, Competitive, etc.
    "risk_description": str,
    "severity_indicator": Optional[str],
    "citation": str  # Page reference
}

AIRiskExtraction

Complete AI risk data structure:

{
    "company_name": str,
    "ticker": str,
    "filing_type": str,
    "filing_date": str,
    "fiscal_year": str,
    "fiscal_quarter": Optional[str],
    "ai_risk_mentioned": bool,
    "ai_risk_mentions": List[AIRiskMention],
    "num_ai_risk_mentions": int,
    "ai_strategy_mentioned": bool,
    "ai_investment_mentioned": bool,
    "ai_competition_mentioned": bool,
    "regulatory_ai_risk": bool
}
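For reference, the two structures above map naturally onto Python dataclasses. This is a sketch, not the application's actual definitions (which may be Pydantic models for LLM extraction); fields with defaults are moved last, as dataclasses require:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AIRiskMention:
    risk_category: str        # Operational, Regulatory, Competitive, etc.
    risk_description: str
    citation: str             # Page reference
    severity_indicator: Optional[str] = None

@dataclass
class AIRiskExtraction:
    company_name: str
    ticker: str
    filing_type: str
    filing_date: str
    fiscal_year: str
    ai_risk_mentioned: bool
    num_ai_risk_mentions: int
    ai_strategy_mentioned: bool
    ai_investment_mentioned: bool
    ai_competition_mentioned: bool
    regulatory_ai_risk: bool
    fiscal_quarter: Optional[str] = None
    ai_risk_mentions: List[AIRiskMention] = field(default_factory=list)
```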

Querying the Data

Once data is in MotherDuck, you can query it using SQL:

-- Connect to your MotherDuck database
SELECT * FROM ai_risk_filings ORDER BY filing_date DESC;

-- Get risk mentions for a specific company
SELECT * FROM ai_risk_mentions 
WHERE company_name = 'Confluent' 
ORDER BY filing_date DESC;

-- Analyze risk categories
SELECT risk_category, COUNT(*) as count 
FROM ai_risk_mentions 
GROUP BY risk_category 
ORDER BY count DESC;
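Although ai_risk_mentions is denormalized, the two tables can still be combined, for example to compare each filing's reported mention count with the mention rows actually stored:

```sql
-- Compare reported vs. stored mention counts per filing
SELECT f.company_name,
       f.filing_date,
       f.num_ai_risk_mentions,
       COUNT(m.risk_description) AS stored_mentions
FROM ai_risk_filings f
LEFT JOIN ai_risk_mentions m
  ON m.company_name = f.company_name
 AND m.filing_date = f.filing_date
GROUP BY f.company_name, f.filing_date, f.num_ai_risk_mentions
ORDER BY f.filing_date DESC;
```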

Advanced Usage

Processing Multiple Documents

document_urls = DocumentURLs(urls=[
    "https://example.com/filing1.pdf",
    "https://example.com/filing2.pdf",
    "https://example.com/filing3.pdf"
])

response = run_local_application(
    document_ingestion, 
    document_urls.model_dump_json()
)

Custom Page Classification

Modify the page_classifications in document_ingestion to classify different types of pages:

page_classifications = [
    PageClassConfig(
        name="risk_factors",
        description="Pages that contain risk factors related to AI."
    ),
    PageClassConfig(
        name="financial_statements",
        description="Pages containing financial data and statements."
    )
]

Troubleshooting

Connection Issues

If you encounter connection issues with MotherDuck:

  1. Verify your motherduck_token is correct
  2. Check that the database ai_risk_factors exists or can be created
  3. Ensure network connectivity to MotherDuck services

Classification Issues

If documents aren't being classified correctly:

  1. Verify the document URL is accessible
  2. Check that the PDF is not password-protected
  3. Review the page classification description for clarity

Extraction Issues

If structured data extraction fails:

  1. Ensure pages were correctly classified as "risk_factors"
  2. Verify the document contains the expected content
  3. Check the TENSORLAKE_API_KEY has sufficient quota

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

See LICENSE file for details.
