This application automatically processes financial documents (e.g., SEC filings) to extract AI-related risks and stores them in a MotherDuck database using DuckDB.
This Tensorlake application performs the following workflow:
- Document Ingestion: Accepts document URLs (e.g., SEC filings in PDF format)
- Page Classification: Uses Tensorlake's DocumentAI to classify pages containing AI risk factors
- Structured Data Extraction: Extracts structured AI risk information using LLM-based extraction
- Data Storage: Writes the extracted data to MotherDuck's cloud-based DuckDB database
The application creates two related tables in MotherDuck:
- ai_risk_filings: Contains top-level filing information (company, filing date, etc.)
- ai_risk_mentions: Contains individual AI risk mentions with denormalized join keys
The application is built using Tensorlake's serverless framework with three main functions:
```
┌────────────────────────────────────────────────────────┐
│  Document URLs Input                                   │
│  (SEC Filings, Financial Reports, etc.)                │
└────────────────────────────┬───────────────────────────┘
                             │
                             ▼
┌────────────────────────────────────────────────────────┐
│  document_ingestion (Entry Point)                      │
│  • Accepts list of document URLs                       │
│  • Classifies pages using DocumentAI                   │
│  • Identifies pages with "risk_factors"                │
└────────────────────────────┬───────────────────────────┘
                             │
                             ▼
┌────────────────────────────────────────────────────────┐
│  extract_structured_data (Parallel Processing)         │
│  • Processes each document in parallel                 │
│  • Extracts structured AI risk data using LLM          │
│  • Follows AIRiskExtraction schema                     │
└────────────────────────────┬───────────────────────────┘
                             │
                             ▼
┌────────────────────────────────────────────────────────┐
│  write_to_duckdb (Data Persistence)                    │
│  • Connects to MotherDuck: md:ai_risk_factors          │
│  • Creates/updates ai_risk_filings table               │
│  • Creates/updates ai_risk_mentions table              │
│  • Denormalizes data for easy querying                 │
└────────────────────────────┬───────────────────────────┘
                             │
                             ▼
┌────────────────────────────────────────────────────────┐
│  MotherDuck Database                                   │
│  Tables:                                               │
│  • ai_risk_filings (parent table)                      │
│  • ai_risk_mentions (child table with risk details)    │
└────────────────────────────────────────────────────────┘
```
- Tensorlake Account: Sign up at tensorlake.ai
- MotherDuck Account: Sign up at motherduck.com
- API Keys:
  - TENSORLAKE_API_KEY: Your Tensorlake API key
  - motherduck_token: Your MotherDuck access token
- Clone this repository:

  ```bash
  git clone https://github.com/tensorlakeai/motherduck.git
  cd motherduck
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

The application requires the following secrets to be configured:

- TENSORLAKE_API_KEY: Used for DocumentAI operations (classification and extraction)
- MOTHERDUCK_API_KEY: Used in the extraction step (if needed)
- motherduck_token: Used to connect to the MotherDuck database
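For reference, a minimal sketch of how these secrets can be picked up at runtime, assuming the app reads them from environment variables (exported locally, or injected by Tensorlake as secrets when deployed):

```python
import os

# Assumption: secrets are exposed to the app as environment variables.
TENSORLAKE_API_KEY = os.environ["TENSORLAKE_API_KEY"]
# motherduck_token is also read automatically by duckdb when opening an
# "md:" connection, so setting the variable is usually all that is needed.
MOTHERDUCK_TOKEN = os.environ["motherduck_token"]
```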
Function: document_ingestion, the entry point that accepts document URLs
Input:

```python
DocumentURLs(urls=[
    "https://example.com/filing.pdf"
])
```

Process:
- Creates a DocumentAI client
- Defines page classification config for "risk_factors"
- Classifies each document URL to identify pages with AI risk content
- Returns a mapping of document URLs to parse IDs
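A minimal sketch of these steps, assuming the DocumentAI client reads TENSORLAKE_API_KEY from the environment and exposes a parse call that accepts page_classifications and returns a parse ID; import paths and the exact method signature may differ in the installed tensorlake SDK version:

```python
# Import paths assumed; adjust to match the installed tensorlake SDK.
from tensorlake.documentai import DocumentAI, PageClassConfig

doc_ai = DocumentAI()  # assumes TENSORLAKE_API_KEY is set in the environment

page_classifications = [
    PageClassConfig(
        name="risk_factors",
        description="Pages that contain risk factors related to AI.",
    )
]

def classify_documents(urls: list[str]) -> dict[str, str]:
    """Return a mapping of document URL to the parse ID reported by DocumentAI."""
    parse_ids = {}
    for url in urls:
        # Assumed call: parse the document and classify its pages in one request.
        parse_id = doc_ai.parse(url, page_classifications=page_classifications)
        parse_ids[url] = parse_id
    return parse_ids
```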
Function: extract_structured_data, which extracts structured AI risk data from classified documents
Process:
- Waits for page classification to complete
- Identifies pages classified as "risk_factors"
- Extracts structured data using the AIRiskExtraction schema
- Passes the extraction result to the database writer
Schema: The extraction follows a comprehensive schema (sketched below), including:
- Company information (name, ticker, filing details)
- AI risk indicators (mentioned, strategy, investment, competition)
- Individual risk mentions with categories and severity
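These fields mirror the data structures listed later in this README. As a sketch, they map naturally onto Pydantic models that an LLM extraction step can target; the field names come from this README, while the Pydantic layout itself is an assumption:

```python
from typing import List, Optional
from pydantic import BaseModel

class AIRiskMention(BaseModel):
    risk_category: str                      # Operational, Regulatory, Competitive, etc.
    risk_description: str
    severity_indicator: Optional[str] = None
    citation: str                           # Page reference

class AIRiskExtraction(BaseModel):
    company_name: str
    ticker: str
    filing_type: str
    filing_date: str
    fiscal_year: str
    fiscal_quarter: Optional[str] = None
    ai_risk_mentioned: bool
    ai_risk_mentions: List[AIRiskMention]
    num_ai_risk_mentions: int
    ai_strategy_mentioned: bool
    ai_investment_mentioned: bool
    ai_competition_mentioned: bool
    regulatory_ai_risk: bool
```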
Function: write_to_duckdb, which writes extracted data to MotherDuck
Process:
- Connects to the MotherDuck database md:ai_risk_factors
- Creates two tables if they don't exist:
  - ai_risk_filings: Parent table with filing-level data
  - ai_risk_mentions: Child table with individual risk mentions
- Inserts data using pandas DataFrames
Database Schema:
ai_risk_filings table:
- company_name, ticker, filing_type, filing_date
- fiscal_year, fiscal_quarter
- ai_risk_mentioned, num_ai_risk_mentions
- ai_strategy_mentioned, ai_investment_mentioned, ai_competition_mentioned
- regulatory_ai_risk
ai_risk_mentions table:
- risk_category, risk_description, severity_indicator, citation
- Denormalized columns: company_name, ticker, filing_type, filing_date, fiscal_year, fiscal_quarter
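A sketch of this step under the table layouts above, assuming the extraction result arrives as a plain dict. It uses the duckdb Python package's MotherDuck connection string (md:ai_risk_factors, authenticated via the motherduck_token environment variable) and DuckDB's ability to query an in-scope pandas DataFrame by its variable name; the app's actual DDL and column types may differ:

```python
import duckdb
import pandas as pd

FILING_COLUMNS = [
    "company_name", "ticker", "filing_type", "filing_date",
    "fiscal_year", "fiscal_quarter", "ai_risk_mentioned",
    "num_ai_risk_mentions", "ai_strategy_mentioned",
    "ai_investment_mentioned", "ai_competition_mentioned",
    "regulatory_ai_risk",
]
JOIN_KEYS = ["company_name", "ticker", "filing_type",
             "filing_date", "fiscal_year", "fiscal_quarter"]

def write_to_motherduck(extraction: dict) -> None:
    # "md:ai_risk_factors" opens the MotherDuck database; authentication
    # comes from the motherduck_token environment variable.
    con = duckdb.connect("md:ai_risk_factors")

    # Parent row: filing-level fields.
    filings_df = pd.DataFrame([{c: extraction[c] for c in FILING_COLUMNS}])

    # Child rows: one per mention, denormalized with the filing join keys.
    join_keys = {c: extraction[c] for c in JOIN_KEYS}
    mentions_df = pd.DataFrame(
        [{**mention, **join_keys} for mention in extraction["ai_risk_mentions"]]
    )

    # DuckDB resolves filings_df / mentions_df from the local Python scope.
    con.execute("CREATE TABLE IF NOT EXISTS ai_risk_filings AS SELECT * FROM filings_df LIMIT 0")
    con.execute("INSERT INTO ai_risk_filings SELECT * FROM filings_df")
    if not mentions_df.empty:
        con.execute("CREATE TABLE IF NOT EXISTS ai_risk_mentions AS SELECT * FROM mentions_df LIMIT 0")
        con.execute("INSERT INTO ai_risk_mentions SELECT * FROM mentions_df")
    con.close()
```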
Run the application locally with the example:
```bash
export TENSORLAKE_API_KEY="your_api_key"
export motherduck_token="your_motherduck_token"
python app.py
```

The default example processes a Confluent SEC filing.
To deploy to Tensorlake, first install the CLI and log in:

```bash
pip install tensorlake
tensorlake login
```

Add your API keys as secrets in Tensorlake:

```bash
tensorlake secret create TENSORLAKE_API_KEY "your_tensorlake_api_key"
tensorlake secret create MOTHERDUCK_API_KEY "your_motherduck_api_key"
tensorlake secret create motherduck_token "your_motherduck_token"
```

Then deploy the application:

```bash
tensorlake deploy app.py
```

After deployment, invoke the application with document URLs:
```bash
tensorlake invoke document_ingestion \
  --input '{"urls": ["https://example.com/sec-filing.pdf"]}'
```

Or use the Tensorlake Python SDK:
```python
from tensorlake.client import TensorlakeClient

client = TensorlakeClient(api_key="your_api_key")
response = client.invoke(
    "document_ingestion",
    {"urls": ["https://example.com/sec-filing.pdf"]}
)
```

Individual AI-related risk mention (AIRiskMention) extracted from documents:
```python
{
    "risk_category": str,              # Operational, Regulatory, Competitive, etc.
    "risk_description": str,
    "severity_indicator": Optional[str],
    "citation": str                    # Page reference
}
```

Complete AI risk data structure (AIRiskExtraction):
```python
{
    "company_name": str,
    "ticker": str,
    "filing_type": str,
    "filing_date": str,
    "fiscal_year": str,
    "fiscal_quarter": Optional[str],
    "ai_risk_mentioned": bool,
    "ai_risk_mentions": List[AIRiskMention],
    "num_ai_risk_mentions": int,
    "ai_strategy_mentioned": bool,
    "ai_investment_mentioned": bool,
    "ai_competition_mentioned": bool,
    "regulatory_ai_risk": bool
}
```

Once data is in MotherDuck, you can query it using SQL:
```sql
-- Connect to your MotherDuck database
SELECT * FROM ai_risk_filings ORDER BY filing_date DESC;

-- Get risk mentions for a specific company
SELECT * FROM ai_risk_mentions
WHERE company_name = 'Confluent'
ORDER BY filing_date DESC;

-- Analyze risk categories
SELECT risk_category, COUNT(*) AS count
FROM ai_risk_mentions
GROUP BY risk_category
ORDER BY count DESC;
```

To process multiple documents, pass several URLs to the entry point at once:

```python
document_urls = DocumentURLs(urls=[
    "https://example.com/filing1.pdf",
    "https://example.com/filing2.pdf",
    "https://example.com/filing3.pdf"
])

response = run_local_application(
    document_ingestion,
    document_urls.model_dump_json()
)
```

Modify the page_classifications in document_ingestion to classify different types of pages:
```python
page_classifications = [
    PageClassConfig(
        name="risk_factors",
        description="Pages that contain risk factors related to AI."
    ),
    PageClassConfig(
        name="financial_statements",
        description="Pages containing financial data and statements."
    )
]
```

If you encounter connection issues with MotherDuck:
- Verify your motherduck_token is correct
- Check that the database ai_risk_factors exists or can be created
- Ensure network connectivity to MotherDuck services
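A quick way to verify the token and connectivity from Python, assuming the duckdb package is installed and motherduck_token is exported:

```python
import duckdb

# "md:" authenticates against MotherDuck using the motherduck_token
# environment variable; listing databases confirms the token works.
con = duckdb.connect("md:")
print(con.sql("SHOW DATABASES"))
```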
If documents aren't being classified correctly:
- Verify the document URL is accessible
- Check that the PDF is not password-protected
- Review the page classification description for clarity
If structured data extraction fails:
- Ensure pages were correctly classified as "risk_factors"
- Verify the document contains the expected content
- Check the TENSORLAKE_API_KEY has sufficient quota
Contributions are welcome! Please feel free to submit a Pull Request.
See LICENSE file for details.
For issues related to:
- Tensorlake: Contact Tensorlake support
- MotherDuck: Contact MotherDuck support
- This Application: Open an issue on GitHub