Multi-provider LLM microservice and data pipeline for practical information intelligence over social data (4chan, X). It provides a clean API, a robust ingestion/stratification/embedding workflow, and optional natural-language-to-SQL queries when a database is available.
- Natural language to SQL (NL→SQL) queries when PostgreSQL is available (default in Replit).
- Ingestion → stratification → embedding generation with environment-aware storage.
- Multi‑provider LLM support: OpenAI (required), Grok (optional), Venice (optional).
- FastAPI API with background processing and task management.
- Smarter S3 selection with pagination and filename date-range parsing. Prefers the latest snapshot file that overlaps your retention window and falls back to `LastModified` when needed. Implemented in `knowledge_agents/data_processing/cloud_handler.py`.
- Interactive NL query improvements in `scripts/nl_query.py` (see the URL-normalization sketch after this list):
  - Accepts a base host, `/api`, `/api/v1`, or the full `/api/v1/nl_query` URL and normalizes it
  - Better error messages and dynamic table rendering
  - Respects `API_BASE_URL`; adds the `tabulate` dependency
  - Note: the `/api/v1/nl_query` endpoint requires PostgreSQL (e.g., Replit)
- Data wipe utility now supports production and object storage:
  - `scripts/wipe_all_data.py --yes [--database-url ... | --pg-host ... --pg-user ... --pg-password ...]`
  - Optional skips: `--no-kv`, `--no-objects`, `--no-files`
  - Clears PostgreSQL tables, Replit KV, Replit Object Storage artifacts, and file-based artifacts
- Consolidated environment checks into Python: use `python scripts/process_data.py --check`. `scripts/replit_setup.sh` now does lightweight app verification only. Removed `scripts/check_replit_db.py`.
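The URL handling mentioned in the NL query bullet above can be pictured as follows. This is a minimal sketch under the stated acceptance rules; the helper name and exact normalization logic are illustrative, not the actual code in `scripts/nl_query.py`.

```python
from urllib.parse import urlparse

# Hypothetical helper: turn whatever the user passes (base host, /api, /api/v1,
# or the full /api/v1/nl_query URL) into the full endpoint URL.
def normalize_nl_query_url(raw: str) -> str:
    raw = raw.rstrip("/")
    if not urlparse(raw).scheme:
        raw = f"http://{raw}"            # assume http when no scheme is given
    if raw.endswith("/nl_query"):
        return raw                        # already the full endpoint
    if raw.endswith("/api/v1"):
        return f"{raw}/nl_query"
    if raw.endswith("/api"):
        return f"{raw}/v1/nl_query"
    return f"{raw}/api/v1/nl_query"       # bare host: append the default path

# Example: each of these resolves to http://localhost/api/v1/nl_query
for candidate in ("localhost", "http://localhost/api", "http://localhost/api/v1"):
    print(normalize_nl_query_url(candidate))
```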
- Core orchestrator: `ChanScopeDataManager` controls ingestion, stratified sampling, and embedding generation (a rough orchestration sketch follows this list).
- Storage backends by environment:
  - Replit: PostgreSQL (complete data), Replit Key‑Value (stratified sample), Replit Object Storage (embeddings), Object Storage (process locks).
  - Docker/Local: File‑based CSV/NPZ/JSON with file locks.
- API: FastAPI app (`api.app`) with health, data ops, and NL→SQL endpoints (NL→SQL requires PostgreSQL).
- Scheduling: Optional updates via `scripts/scheduled_update.py` with interval control.
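The orchestration referenced in the first bullet can be pictured roughly as below; the class body, method names, and storage interfaces are illustrative assumptions, not the actual `ChanScopeDataManager` API.

```python
def embed_texts(texts):
    """Placeholder for the embedding call (OpenAI embeddings in this project)."""
    raise NotImplementedError

class ChanScopeDataManagerSketch:
    """Illustrative orchestration only; not the real ChanScopeDataManager API."""

    def __init__(self, complete_store, sample_store, embedding_store):
        self.complete_store = complete_store      # PostgreSQL (Replit) or CSV (Docker/local)
        self.sample_store = sample_store          # Key-Value store or CSV
        self.embedding_store = embedding_store    # Object Storage or local .npz

    def ensure_data_ready(self, force_refresh: bool = False) -> None:
        # 1. Ingestion: pull new rows from S3 into the complete-data backend.
        if force_refresh or not self.complete_store.is_fresh():
            self.complete_store.ingest_from_s3()
        # 2. Stratification: draw a representative sample of the complete data.
        if force_refresh or not self.sample_store.exists():
            self.sample_store.save(self.complete_store.stratified_sample())
        # 3. Embeddings: embed the stratified sample and persist the vectors.
        if force_refresh or not self.embedding_store.exists():
            self.embedding_store.save(embed_texts(self.sample_store.load()))
```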
Chanscope's architecture follows a biologically-inspired pattern with distinct yet interconnected processing stages:

```
┌─────────────────┐ ┌──────────────────────────┐ ┌─────────────────┐
│ Data Sources │ │ Processing Core │ │ Query System │
│ ┌────────────┐ │ │ ┌────────────────────┐ │ │ ┌────────────┐ │
│ │ S3 │◄├─┐ │ │ ChanScopeDataMgr │ │ ┌───┼►│ Query │ │
│ │ Storage │ │ │ │ │ ┌────────────────┐ │ │ │ │ │ Processing │ │
│ └────────────┘ │ │ │ │ │ Stratified │ │ │ │ │ └─────┬──────┘ │
└─────────────────┘ │ │ │ │ Sampling │ │ │ │ │ │ │
│ │ │ └────────┬───────┘ │ │ │ │ ┌─────▼──────┐ │
┌─────────────────┐ │ │ │ │ │ │ │ │ │ Chunk │ │
│ Memory System │ │ │ │ ┌────────▼───────┐ │ │ │ │ │ Processing │ │
│ ┌────────────┐ │ │ │ │ │ Embedding │ │ │ │ │ └─────┬──────┘ │
│ │ Complete │◄┼─┘ │ │ │ Generation │ │ │ │ │ │ │
│ │ Data │ │ │ │ └────────────────┘ │ │ │ │ ┌─────▼──────┐ │
│ └────────────┘ │ │ └────────────────────┘ │ │ │ │ Final │ │
│ ┌────────────┐ │ │ │ │ │ │ │ Summarizer │ │
│ │ Stratified │◄├─────────┼───────────┘ │ │ │ └────────────┘ │
│ │ Sample │ │ │ │ │ │ │
│ └────────────┘ │ │ ┌────────────────────┐ │ │ └─────────────────┘
│ ┌────────────┐ │ │ │ Chanscope │ │ │
│ │ Embeddings │◄├─────────┼──┤ (Singleton LLM) ├──┼─────┘
│ │ (.npz) │ │ │ └────────────────────┘ │
│ └────────────┘ │ └──────────────────────────┘
└─────────────────┘
▲
│
┌──────┴──────┐
│ Storage ABCs │
└─────────────┘
```
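The "Stratified Sampling" stage in the diagram reduces the complete dataset to a representative slice before embeddings are generated. A minimal sketch of date-and-board stratification with pandas follows; the column names (`board`, `posted_at`) and the per-stratum quota are illustrative assumptions, not the project's actual schema.

```python
import pandas as pd

def stratified_sample(df: pd.DataFrame, per_stratum: int = 100,
                      seed: int = 42) -> pd.DataFrame:
    """Sample up to `per_stratum` rows from each (board, day) stratum."""
    df = df.copy()
    df["day"] = pd.to_datetime(df["posted_at"]).dt.date
    return (
        df.groupby(["board", "day"], group_keys=False)
          .apply(lambda g: g.sample(n=min(per_stratum, len(g)), random_state=seed))
          .drop(columns="day")
          .reset_index(drop=True)
    )
```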
- Docker/Local (file‑based):
  - Complete data: `data/complete_data.csv`
  - Stratified sample: `data/stratified/stratified_sample.csv`
  - Embeddings: `data/stratified/embeddings.npz`
  - Locks: file-based
- Replit (database‑backed):
  - Complete data: PostgreSQL tables (`complete_data`, `metadata`)
  - Stratified sample: Replit Key‑Value store
  - Embeddings: Replit Object Storage (`.npz`)
  - Locks: Object Storage
Environment detection comes from `config/env_loader.detect_environment()` and is used consistently across the codebase.
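A simplified view of how that detection might steer backend selection is sketched below; `detect_environment()` is the entry point named above, while the factory function, return values, and mapping are illustrative assumptions.

```python
from config.env_loader import detect_environment

# Hypothetical factory: map the detected environment to storage backends.
# The return values compared against ("replit") are assumptions for illustration.
def build_storage_backends() -> dict:
    if detect_environment() == "replit":
        return {
            "complete_data": "postgresql",        # complete_data / metadata tables
            "stratified_sample": "replit_kv",     # Key-Value store
            "embeddings": "replit_object_store",  # .npz artifacts
            "locks": "replit_object_store",
        }
    # Docker / local: file-based artifacts under data/
    return {
        "complete_data": "data/complete_data.csv",
        "stratified_sample": "data/stratified/stratified_sample.csv",
        "embeddings": "data/stratified/embeddings.npz",
        "locks": "file",
    }
```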
```bash
# Run the API locally
python -m uvicorn api.app:app --host 0.0.0.0 --port 80
```

For comprehensive deployment instructions covering Docker and Replit environments, see `deployment/DEPLOYMENT.md`.
Quick Links:
Note: NL→SQL queries (`/api/v1/nl_query`) require PostgreSQL and are only available in Replit environments.
```bash
# Process all stages (ingestion, stratification, embeddings)
python scripts/process_data.py

# Status only (non‑invasive)
python scripts/process_data.py --check
```

Wipe dev (Replit development) completely with `scripts/wipe_all_data.py --yes` (see the data wipe utility above).

```bash
# Force refresh all data
python scripts/process_data.py --force-refresh

# Regenerate from existing data
python scripts/process_data.py --regenerate --stratified-only
python scripts/process_data.py --regenerate --embeddings-only

# Bypass process locks (use with caution)
python scripts/process_data.py --ignore-lock
```

```bash
# Run the API locally (if not using Replit)
python -m uvicorn api.app:app --host 0.0.0.0 --port 80
```
```bash
# Example NL→SQL (requires PostgreSQL)
curl -X POST "http://localhost/api/v1/nl_query" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Show posts about Bitcoin from last week",
    "limit": 20
  }'
```

For detailed API routes and request bodies, see `api/README_REQUESTS.md`.
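The same request can also be issued from Python; a minimal sketch using `requests` is shown below (the response fields are whatever the endpoint returns per `api/README_REQUESTS.md`).

```python
import requests

payload = {"query": "Show posts about Bitcoin from last week", "limit": 20}
resp = requests.post("http://localhost/api/v1/nl_query", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json())  # inspect the generated SQL and returned rows
```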
The Knowledge Agent includes a web-based refresh dashboard for monitoring and controlling automated data refreshes.
- UI: Open `http://localhost/refresh` to monitor status, current row count, and control auto-refresh
- API: The dashboard exposes endpoints under `/refresh/api` (e.g., `/refresh/api/status`)
- CLI: `python scripts/refresh_control.py status|start|stop|run-once [--interval SEC] [--base http://host/refresh/api]`
For detailed dashboard usage and configuration, see `deployment/DEPLOYMENT.md#refresh-dashboard`.
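For programmatic monitoring outside the UI, the dashboard's status endpoint can be polled directly. A minimal sketch follows; the polling interval is arbitrary and the response shape is whatever `/refresh/api/status` actually returns.

```python
import time
import requests

BASE = "http://localhost/refresh/api"

# Poll the refresh status every 30 seconds and print it.
while True:
    status = requests.get(f"{BASE}/status", timeout=10).json()
    print(status)
    time.sleep(30)
```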
- Paginates via `ListObjectsV2` and parses filename date ranges like `*_YYYY-MM-DD_YYYY-MM-DD_*.csv`.
- Prefers the latest snapshot whose end date overlaps the requested window.
- Applies board filters from `SELECT_BOARD`.
- Falls back to `LastModified` filtering if date ranges are absent.
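A condensed sketch of the selection heuristic above, assuming boto3 and the filename convention shown; it is illustrative only and not the code in `knowledge_agents/data_processing/cloud_handler.py`.

```python
import re
from datetime import date, datetime

import boto3

DATE_RANGE = re.compile(r"_(\d{4}-\d{2}-\d{2})_(\d{4}-\d{2}-\d{2})_.*\.csv$")

def pick_snapshot(bucket: str, prefix: str, window_start: date, window_end: date):
    """Return the key of the latest snapshot overlapping the retention window."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    best_key, best_end, fallback = None, None, None
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            match = DATE_RANGE.search(obj["Key"])
            if match:
                start = datetime.strptime(match.group(1), "%Y-%m-%d").date()
                end = datetime.strptime(match.group(2), "%Y-%m-%d").date()
                # Keep the snapshot with the latest end date that overlaps the window.
                if start <= window_end and end >= window_start:
                    if best_end is None or end > best_end:
                        best_key, best_end = obj["Key"], end
            elif fallback is None or obj["LastModified"] > fallback[1]:
                fallback = (obj["Key"], obj["LastModified"])  # LastModified fallback
    return best_key or (fallback[0] if fallback else None)
```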
- Replit: Object Storage locks ensure single‑instance processing across restarts.
- Docker/Local: File‑based locks with stale lock cleanup.
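For the Docker/local case, a file-based process lock with stale-lock cleanup might look roughly like the sketch below; the lock path, timeout, and helper name are illustrative assumptions, not the project's actual implementation.

```python
import os
import time
from contextlib import contextmanager

LOCK_PATH = "data/.process.lock"   # hypothetical lock file location
STALE_AFTER_SECONDS = 3600         # treat locks older than an hour as stale

@contextmanager
def process_lock(path: str = LOCK_PATH):
    # Remove a stale lock left behind by a crashed run.
    if os.path.exists(path) and time.time() - os.path.getmtime(path) > STALE_AFTER_SECONDS:
        os.remove(path)
    # O_EXCL makes creation atomic: a second process fails instead of proceeding.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        os.write(fd, str(os.getpid()).encode())
        yield
    finally:
        os.close(fd)
        os.remove(path)
```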
- Deployment Guide: Comprehensive deployment instructions for Docker and Replit
- API Reference: Complete API endpoint documentation with examples
- Testing Guide: Testing instructions and guidance
- Tests cover ingestion, embeddings, API endpoints, and the end‑to‑end pipeline.
- See `tests/README_TESTING.md` for guidance on running the suite.
- OpenAI (required): completions and embeddings
- Grok (optional): completions and chunking
- Venice (optional): completions and chunking
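How an optional provider might be preferred for chunking, falling back to OpenAI when no optional keys are configured, is sketched below; the environment variable names and selection order are assumptions, not the project's actual configuration.

```python
import os

# Hypothetical provider selection: OpenAI is required, Grok and Venice are
# used only when their API keys are configured.
def pick_chunking_provider() -> str:
    if os.getenv("GROK_API_KEY"):
        return "grok"
    if os.getenv("VENICE_API_KEY"):
        return "venice"
    return "openai"   # always available since an OpenAI key is required
```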
- Data Gathering Lambda: https://github.com/joelwk/chanscope-lambda
- Original Chanscope R&D: https://github.com/joelwk/chanscope
- R&D Sandbox: https://github.com/joelwk/knowledge-agents
- Providers: OpenAI, Grok (x.ai), Venice