Tensorlake MotherDuck AI Risk Analysis Application

This application automatically processes financial documents (e.g., SEC filings) to extract AI-related risks and stores them in a MotherDuck database using DuckDB.

Overview

This Tensorlake application performs the following workflow:

  1. Document Ingestion: Accepts document URLs (e.g., SEC filings in PDF format)
  2. Page Classification: Uses Tensorlake's DocumentAI to classify pages containing AI risk factors
  3. Structured Data Extraction: Extracts structured AI risk information using LLM-based extraction
  4. Data Storage: Writes the extracted data to MotherDuck's cloud-based DuckDB database

The application creates two related tables in MotherDuck:

  • ai_risk_filings: Contains top-level filing information (company, filing date, etc.)
  • ai_risk_mentions: Contains individual AI risk mentions with denormalized join keys

Architecture

The application is built using Tensorlake's serverless framework with three main functions:

┌─────────────────────────────────────────────────────────────────┐
│                     Document URLs Input                          │
│              (SEC Filings, Financial Reports, etc.)              │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│              document_ingestion (Entry Point)                    │
│  • Accepts list of document URLs                                 │
│  • Classifies pages using DocumentAI                             │
│  • Identifies pages with "risk_factors"                          │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│          extract_structured_data (Parallel Processing)           │
│  • Processes each document in parallel                           │
│  • Extracts structured AI risk data using LLM                    │
│  • Follows AIRiskExtraction schema                               │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│           write_to_duckdb (Data Persistence)                     │
│  • Connects to MotherDuck: md:ai_risk_factors                    │
│  • Creates/updates ai_risk_filings table                         │
│  • Creates/updates ai_risk_mentions table                        │
│  • Denormalizes data for easy querying                           │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                    MotherDuck Database                           │
│  Tables:                                                         │
│  • ai_risk_filings (parent table)                                │
│  • ai_risk_mentions (child table with risk details)              │
└─────────────────────────────────────────────────────────────────┘
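The three functions above form a simple linear pipeline. The following is a conceptual sketch of that data flow only; the function bodies are placeholders (the real functions call Tensorlake's DocumentAI and MotherDuck), and the return shapes are assumptions for illustration:

```python
# Conceptual sketch of the pipeline's data flow. The real implementations
# call Tensorlake's DocumentAI and MotherDuck; they are stubbed out here.

def document_ingestion(urls):
    # Classify each document and return a mapping of URL -> parse ID.
    return {url: f"parse-{i}" for i, url in enumerate(urls)}

def extract_structured_data(parse_ids):
    # Produce one AIRiskExtraction-shaped record per document.
    return [
        {"source": url, "parse_id": pid, "ai_risk_mentioned": True}
        for url, pid in parse_ids.items()
    ]

def write_to_duckdb(records):
    # Persist the records; here we just report how many were written.
    return len(records)

urls = ["https://example.com/filing.pdf"]
written = write_to_duckdb(extract_structured_data(document_ingestion(urls)))
print(written)  # 1
```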

Prerequisites

  1. Tensorlake Account: Sign up at tensorlake.ai
  2. MotherDuck Account: Sign up at motherduck.com
  3. API Keys:
    • TENSORLAKE_API_KEY: Your Tensorlake API key
    • motherduck_token: Your MotherDuck access token

Installation

  1. Clone this repository:
git clone https://github.com/tensorlakeai/motherduck.git
cd motherduck
  2. Install dependencies:
pip install -r requirements.txt

Configuration

Environment Variables

The application requires the following secrets to be configured:

  • TENSORLAKE_API_KEY: Used for DocumentAI operations (classification and extraction)
  • MOTHERDUCK_API_KEY: Optional; passed to the extraction step if it needs to reach MotherDuck directly
  • motherduck_token: Used to connect to MotherDuck database

How It Works

1. Document Ingestion (document_ingestion)

Function: Entry point that accepts document URLs

Input:

DocumentURLs(urls=[
    "https://example.com/filing.pdf"
])

Process:

  • Creates a DocumentAI client
  • Defines page classification config for "risk_factors"
  • Classifies each document URL to identify pages with AI risk content
  • Returns a mapping of document URLs to parse IDs

2. Structured Data Extraction (extract_structured_data)

Function: Extracts structured AI risk data from classified documents

Process:

  • Waits for page classification to complete
  • Identifies pages classified as "risk_factors"
  • Extracts structured data using the AIRiskExtraction schema
  • Passes the extraction result to the database writer
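The page-selection step can be illustrated in plain Python. The shape of the classification result here, a list of (page_number, class_name) pairs, is an assumption for illustration and not the exact DocumentAI response format:

```python
def select_risk_pages(page_classes):
    """Return page numbers whose class is 'risk_factors'.

    `page_classes` is assumed to be an iterable of
    (page_number, class_name) pairs, for illustration only.
    """
    return [page for page, cls in page_classes if cls == "risk_factors"]

classified = [
    (1, "cover"),
    (2, "risk_factors"),
    (3, "risk_factors"),
    (4, "financial_statements"),
]
print(select_risk_pages(classified))  # [2, 3]
```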

Schema: The extraction follows a comprehensive schema including:

  • Company information (name, ticker, filing details)
  • AI risk indicators (mentioned, strategy, investment, competition)
  • Individual risk mentions with categories and severity

3. Database Writing (write_to_duckdb)

Function: Writes extracted data to MotherDuck

Process:

  • Connects to MotherDuck using md:ai_risk_factors database
  • Creates two tables if they don't exist:
    • ai_risk_filings: Parent table with filing-level data
    • ai_risk_mentions: Child table with individual risk mentions
  • Inserts data using pandas DataFrames

Database Schema:

ai_risk_filings table:

  • company_name, ticker, filing_type, filing_date
  • fiscal_year, fiscal_quarter
  • ai_risk_mentioned, num_ai_risk_mentions
  • ai_strategy_mentioned, ai_investment_mentioned, ai_competition_mentioned
  • regulatory_ai_risk

ai_risk_mentions table:

  • risk_category, risk_description, severity_indicator, citation
  • Denormalized columns: company_name, ticker, filing_type, filing_date, fiscal_year, fiscal_quarter
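The denormalization performed by the write step can be sketched in plain Python: each mention row copies the filing-level join keys so ai_risk_mentions can be queried without a join. This mirrors the logic described above (the real function builds pandas DataFrames and inserts them via DuckDB); the sample values are illustrative, not real filing data:

```python
# Filing-level join keys copied onto every mention row.
FILING_KEYS = ["company_name", "ticker", "filing_type", "filing_date",
               "fiscal_year", "fiscal_quarter"]

def denormalize(extraction):
    """Split one AIRiskExtraction dict into a filing row and mention rows."""
    filing = {k: v for k, v in extraction.items() if k != "ai_risk_mentions"}
    keys = {k: extraction[k] for k in FILING_KEYS}
    mentions = [{**mention, **keys} for mention in extraction["ai_risk_mentions"]]
    return filing, mentions

# Illustrative input only; values are not taken from a real filing.
extraction = {
    "company_name": "Confluent", "ticker": "CFLT", "filing_type": "10-K",
    "filing_date": "2024-02-01", "fiscal_year": "2023", "fiscal_quarter": None,
    "ai_risk_mentioned": True, "num_ai_risk_mentions": 1,
    "ai_strategy_mentioned": True, "ai_investment_mentioned": False,
    "ai_competition_mentioned": True, "regulatory_ai_risk": False,
    "ai_risk_mentions": [{"risk_category": "Competitive",
                          "risk_description": "Rivals may deploy AI faster.",
                          "severity_indicator": None, "citation": "p. 14"}],
}

filing, mentions = denormalize(extraction)
print(mentions[0]["ticker"])  # CFLT
```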

Local Testing

Run the application locally with the example:

export TENSORLAKE_API_KEY="your_api_key"
export motherduck_token="your_motherduck_token"

python app.py

The default example processes a Confluent SEC filing.

Deployment to Tensorlake

1. Install Tensorlake CLI

pip install tensorlake

2. Login to Tensorlake

tensorlake login

3. Configure Secrets

Add your API keys as secrets in Tensorlake:

tensorlake secret create TENSORLAKE_API_KEY "your_tensorlake_api_key"
tensorlake secret create MOTHERDUCK_API_KEY "your_motherduck_api_key"
tensorlake secret create motherduck_token "your_motherduck_token"

4. Deploy the Application

tensorlake deploy app.py

5. Invoke the Application

After deployment, invoke the application with document URLs:

tensorlake invoke document_ingestion \
  --input '{"urls": ["https://example.com/sec-filing.pdf"]}'

Or use the Tensorlake Python SDK:

from tensorlake.client import TensorlakeClient

client = TensorlakeClient(api_key="your_api_key")
response = client.invoke(
    "document_ingestion",
    {"urls": ["https://example.com/sec-filing.pdf"]}
)

Data Schema

AIRiskMention

Individual AI-related risk mention extracted from documents:

{
    "risk_category": str,  # Operational, Regulatory, Competitive, etc.
    "risk_description": str,
    "severity_indicator": Optional[str],
    "citation": str  # Page reference
}

AIRiskExtraction

Complete AI risk data structure:

{
    "company_name": str,
    "ticker": str,
    "filing_type": str,
    "filing_date": str,
    "fiscal_year": str,
    "fiscal_quarter": Optional[str],
    "ai_risk_mentioned": bool,
    "ai_risk_mentions": List[AIRiskMention],
    "num_ai_risk_mentions": int,
    "ai_strategy_mentioned": bool,
    "ai_investment_mentioned": bool,
    "ai_competition_mentioned": bool,
    "regulatory_ai_risk": bool
}
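For reference, the two structures above map naturally onto Python dataclasses. This is a sketch, not the application's actual definitions (which may be Pydantic models for LLM extraction); fields with defaults are moved last, as dataclasses require:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AIRiskMention:
    risk_category: str        # Operational, Regulatory, Competitive, etc.
    risk_description: str
    citation: str             # Page reference
    severity_indicator: Optional[str] = None

@dataclass
class AIRiskExtraction:
    company_name: str
    ticker: str
    filing_type: str
    filing_date: str
    fiscal_year: str
    ai_risk_mentioned: bool
    num_ai_risk_mentions: int
    ai_strategy_mentioned: bool
    ai_investment_mentioned: bool
    ai_competition_mentioned: bool
    regulatory_ai_risk: bool
    fiscal_quarter: Optional[str] = None
    ai_risk_mentions: List[AIRiskMention] = field(default_factory=list)
```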

Querying the Data

Once data is in MotherDuck, you can query it using SQL:

-- Connect to your MotherDuck database
SELECT * FROM ai_risk_filings ORDER BY filing_date DESC;

-- Get risk mentions for a specific company
SELECT * FROM ai_risk_mentions 
WHERE company_name = 'Confluent' 
ORDER BY filing_date DESC;

-- Analyze risk categories
SELECT risk_category, COUNT(*) as count 
FROM ai_risk_mentions 
GROUP BY risk_category 
ORDER BY count DESC;
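Although ai_risk_mentions is denormalized, the two tables can still be combined, for example to compare each filing's reported mention count with the mention rows actually stored:

```sql
-- Compare reported vs. stored mention counts per filing
SELECT f.company_name,
       f.filing_date,
       f.num_ai_risk_mentions,
       COUNT(m.risk_description) AS stored_mentions
FROM ai_risk_filings f
LEFT JOIN ai_risk_mentions m
  ON m.company_name = f.company_name
 AND m.filing_date = f.filing_date
GROUP BY f.company_name, f.filing_date, f.num_ai_risk_mentions
ORDER BY f.filing_date DESC;
```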

Advanced Usage

Processing Multiple Documents

document_urls = DocumentURLs(urls=[
    "https://example.com/filing1.pdf",
    "https://example.com/filing2.pdf",
    "https://example.com/filing3.pdf"
])

response = run_local_application(
    document_ingestion, 
    document_urls.model_dump_json()
)

Custom Page Classification

Modify the page_classifications in document_ingestion to classify different types of pages:

page_classifications = [
    PageClassConfig(
        name="risk_factors",
        description="Pages that contain risk factors related to AI."
    ),
    PageClassConfig(
        name="financial_statements",
        description="Pages containing financial data and statements."
    )
]

Troubleshooting

Connection Issues

If you encounter connection issues with MotherDuck:

  1. Verify your motherduck_token is correct
  2. Check that the database ai_risk_factors exists or can be created
  3. Ensure network connectivity to MotherDuck services

Classification Issues

If documents aren't being classified correctly:

  1. Verify the document URL is accessible
  2. Check that the PDF is not password-protected
  3. Review the page classification description for clarity

Extraction Issues

If structured data extraction fails:

  1. Ensure pages were correctly classified as "risk_factors"
  2. Verify the document contains the expected content
  3. Check the TENSORLAKE_API_KEY has sufficient quota

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

See LICENSE file for details.
