Chanscope Retrieval

A containerized three-stage pipeline for text analysis using multiple LLM providers (OpenAI, Grok, Venice). Features Docker deployment, asynchronous processing, and configurable model selection for embeddings, chunk analysis, and summarization tasks.

Multi-provider LLM microservice and data pipeline for practical information intelligence over social data (4chan, X). It provides a clean API, a robust ingestion/stratification/embedding workflow, and optional natural-language-to-SQL queries when a database is available.

Highlights

  • Natural language to SQL (NL→SQL) queries when PostgreSQL is available (default in Replit).
  • Ingestion → stratification → embedding generation with environment-aware storage.
  • Multi‑provider LLM support: OpenAI (required), Grok (optional), Venice (optional).
  • FastAPI API with background processing and task management.

What’s New

  • Smarter S3 selection with pagination and filename date-range parsing. Prefers the latest snapshot file that overlaps your retention window and falls back to LastModified when needed. Implemented in knowledge_agents/data_processing/cloud_handler.py.
  • Interactive NL Query improvements in scripts/nl_query.py:
    • Accepts a base host, /api, /api/v1, or a full /api/v1/nl_query URL and normalizes it (see the sketch after this list)
    • Better error messages and dynamic table rendering
    • Respects API_BASE_URL; adds tabulate dependency
    • Note: The /api/v1/nl_query endpoint requires PostgreSQL (e.g., Replit)
  • Data wipe utility now supports production and object storage:
    • scripts/wipe_all_data.py --yes [--database-url ... | --pg-host ... --pg-user ... --pg-password ...]
    • Optional skips: --no-kv, --no-objects, --no-files
    • Clears PostgreSQL tables, Replit KV, Replit Object Storage artifacts, and file-based artifacts
  • Consolidated environment checks into Python: use python scripts/process_data.py --check. scripts/replit_setup.sh now does lightweight app verification only. Removed scripts/check_replit_db.py.
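
The URL normalization described above can be pictured roughly as follows. This is a minimal sketch of the behavior, not the script's actual code, and normalize_nl_query_url is a hypothetical helper name.

from urllib.parse import urlparse

# Hypothetical sketch: accept a bare host, /api, /api/v1, or the full
# /api/v1/nl_query path and return the full endpoint URL.
def normalize_nl_query_url(base: str) -> str:
    base = base.rstrip("/")
    if not urlparse(base).scheme:
        base = f"http://{base}"
    path = urlparse(base).path
    if path.endswith("/nl_query"):
        return base
    if path.endswith("/api/v1"):
        return f"{base}/nl_query"
    if path.endswith("/api"):
        return f"{base}/v1/nl_query"
    return f"{base}/api/v1/nl_query"

# All of these resolve to the same endpoint:
for candidate in ("localhost", "http://localhost/api", "http://localhost/api/v1/nl_query"):
    print(normalize_nl_query_url(candidate))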

Architecture (Concise)

  • Core orchestrator: ChanScopeDataManager controls ingestion, stratified sampling, and embedding generation.
  • Storage backends by environment:
    • Replit: PostgreSQL (complete data), Replit Key‑Value (stratified sample), Replit Object Storage (embeddings), Object Storage (process locks).
    • Docker/Local: File‑based CSV/NPZ/JSON with file locks.
  • API: FastAPI app (api.app) with health, data ops, and NL→SQL endpoints (NL→SQL requires PostgreSQL).
  • Scheduling: Optional updates via scripts/scheduled_update.py with interval control.
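
As a generic illustration of the background-processing pattern the API relies on (the route and task function below are hypothetical, not the actual endpoints in api.app):

from fastapi import BackgroundTasks, FastAPI

app = FastAPI()

def run_pipeline_stage(name: str) -> None:
    # Placeholder for a long-running job such as ingestion or embedding generation.
    ...

# Hypothetical route; the real task-management endpoints live in api.app.
@app.post("/tasks/{name}")
async def start_task(name: str, background_tasks: BackgroundTasks):
    background_tasks.add_task(run_pipeline_stage, name)
    return {"status": "scheduled", "task": name}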

System Architecture

Chanscope's architecture follows a biologically-inspired pattern with distinct yet interconnected processing stages:

┌─────────────────┐         ┌──────────────────────────┐         ┌─────────────────┐
│   Data Sources  │         │    Processing Core       │         │  Query System   │
│  ┌────────────┐ │         │  ┌────────────────────┐  │         │ ┌────────────┐  │
│  │    S3      │◄├─┐       │  │ ChanScopeDataMgr   │  │     ┌───┼►│   Query    │  │
│  │  Storage   │ │ │       │  │ ┌────────────────┐ │  │     │   │ │ Processing │  │
│  └────────────┘ │ │       │  │ │   Stratified   │ │  │     │   │ └─────┬──────┘  │
└─────────────────┘ │       │  │ │    Sampling    │ │  │     │   │       │         │
                    │       │  │ └────────┬───────┘ │  │     │   │ ┌─────▼──────┐  │
┌─────────────────┐ │       │  │          │         │  │     │   │ │   Chunk    │  │
│  Memory System  │ │       │  │ ┌────────▼───────┐ │  │     │   │ │ Processing │  │
│  ┌────────────┐ │ │       │  │ │   Embedding    │ │  │     │   │ └─────┬──────┘  │
│  │ Complete   │◄┼─┘       │  │ │   Generation   │ │  │     │   │       │         │
│  │    Data    │ │         │  │ └────────────────┘ │  │     │   │ ┌─────▼──────┐  │
│  └────────────┘ │         │  └────────────────────┘  │     │   │ │   Final    │  │
│  ┌────────────┐ │         │           │              │     │   │ │ Summarizer │  │
│  │ Stratified │◄├─────────┼───────────┘              │     │   │ └────────────┘  │
│  │   Sample   │ │         │                          │     │   │                 │
│  └────────────┘ │         │  ┌────────────────────┐  │     │   └─────────────────┘
│  ┌────────────┐ │         │  │     Chanscope      │  │     │
│  │ Embeddings │◄├─────────┼──┤  (Singleton LLM)   ├──┼─────┘
│  │   (.npz)   │ │         │  └────────────────────┘  │
│  └────────────┘ │         └──────────────────────────┘
└─────────────────┘
           ▲
           │
    ┌──────┴───────┐
    │ Storage ABCs │
    └──────────────┘

Storage & Environments

  • Docker/Local (file‑based):
    • Complete data: data/complete_data.csv
    • Stratified sample: data/stratified/stratified_sample.csv
    • Embeddings: data/stratified/embeddings.npz
    • Locks: file-based
  • Replit (database‑backed):
    • Complete data: PostgreSQL tables (complete_data, metadata)
    • Stratified sample: Replit Key‑Value store
    • Embeddings: Replit Object Storage (.npz)
    • Locks: Object Storage

Environment detection comes from config/env_loader.detect_environment() and is used consistently across the codebase.
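
A minimal sketch of how that flag is typically consumed follows; the return values ("replit" vs. "docker"/"local") are assumptions here, so check config/env_loader.py for the authoritative names.

from config.env_loader import detect_environment

# Assumed return values; see config/env_loader.py for the real ones.
env = detect_environment()

if env == "replit":
    # PostgreSQL for complete data, Key-Value store for the stratified sample,
    # Object Storage for embeddings and process locks.
    print("Using database-backed storage")
else:
    # File-based CSV/NPZ artifacts under data/ with file locks.
    print("Using file-based storage under data/")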

Quick Start

Local Development

# Run the API locally
python -m uvicorn api.app:app --host 0.0.0.0 --port 80

Deployment

For comprehensive deployment instructions covering Docker and Replit environments, see deployment/DEPLOYMENT.md.

Note: NL→SQL queries (/api/v1/nl_query) require PostgreSQL, which is available by default only in Replit deployments.

Data Processing CLI

# Process all stages (ingestion, stratification, embeddings)
python scripts/process_data.py

# Status only (non‑invasive)
python scripts/process_data.py --check

# Wipe dev (Replit development) data completely
python scripts/wipe_all_data.py --yes

# Force refresh all data
python scripts/process_data.py --force-refresh

# Regenerate from existing data
python scripts/process_data.py --regenerate --stratified-only
python scripts/process_data.py --regenerate --embeddings-only

# Bypass process locks (use with caution)
python scripts/process_data.py --ignore-lock

API Quick Start

# Run the API locally (if not using Replit)
python -m uvicorn api.app:app --host 0.0.0.0 --port 80

# Example NL→SQL (requires PostgreSQL)
curl -X POST "http://localhost/api/v1/nl_query" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Show posts about Bitcoin from last week",
    "limit": 20
  }'
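
The same request can be issued from Python with the third-party requests package; the response schema is not documented in this README, so the sketch simply prints the JSON payload.

import requests

# Equivalent of the curl example above; requires PostgreSQL (e.g., Replit).
response = requests.post(
    "http://localhost/api/v1/nl_query",
    json={"query": "Show posts about Bitcoin from last week", "limit": 20},
    timeout=60,
)
response.raise_for_status()
print(response.json())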

For detailed API routes and request bodies, see api/README_REQUESTS.md.

Refresh Dashboard

The Knowledge Agent includes a web-based refresh dashboard for monitoring and controlling automated data refreshes.

  • UI: Open http://localhost/refresh to monitor status, current row count, and control auto-refresh
  • API: Dashboard exposes endpoints under /refresh/api (e.g., /refresh/api/status)
  • CLI: python scripts/refresh_control.py status|start|stop|run-once [--interval SEC] [--base http://host/refresh/api]

For detailed dashboard usage and configuration, see deployment/DEPLOYMENT.md#refresh-dashboard.
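
The status endpoint listed above can also be polled programmatically; the fields in the response are not documented here, so this sketch just prints the payload.

import requests

# Poll the refresh dashboard's status endpoint (path from the list above).
status = requests.get("http://localhost/refresh/api/status", timeout=10)
status.raise_for_status()
print(status.json())  # e.g., refresh state and current row count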

S3 Ingestion Behavior

  • Paginates via ListObjectsV2 and parses filename date ranges like *_YYYY-MM-DD_YYYY-MM-DD_*.csv.
  • Prefers the latest snapshot whose end date overlaps the requested window.
  • Applies board filters from SELECT_BOARD.
  • Falls back to LastModified filtering if date ranges are absent.
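
A simplified sketch of that selection logic (the real implementation lives in knowledge_agents/data_processing/cloud_handler.py; the regex and helper below are illustrative only):

import re
from datetime import date

# Illustrative only: parse *_YYYY-MM-DD_YYYY-MM-DD_*.csv keys and pick the latest
# snapshot whose date range overlaps the requested retention window.
DATE_RANGE = re.compile(r"_(\d{4}-\d{2}-\d{2})_(\d{4}-\d{2}-\d{2})_.*\.csv$")

def pick_latest_overlapping(keys, window_start: date, window_end: date):
    candidates = []
    for key in keys:
        match = DATE_RANGE.search(key)
        if not match:
            continue  # the real code falls back to LastModified for such keys
        start, end = (date.fromisoformat(d) for d in match.groups())
        if end >= window_start and start <= window_end:
            candidates.append((end, key))
    return max(candidates)[1] if candidates else None

print(pick_latest_overlapping(
    ["board_2024-01-01_2024-02-01_a.csv", "board_2024-02-01_2024-03-01_b.csv"],
    date(2024, 2, 15), date(2024, 3, 15),
))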

Process Locks

  • Replit: Object Storage locks ensure single‑instance processing across restarts.
  • Docker/Local: File‑based locks with stale lock cleanup.
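
The file-based variant can be pictured roughly like this; it is a sketch of the general pattern rather than the project's implementation, and the lock path and staleness threshold are assumptions.

import os
import time
from pathlib import Path

LOCK_PATH = Path("data/.process.lock")   # assumed location
STALE_AFTER_SECONDS = 3600               # assumed staleness threshold

def acquire_lock() -> bool:
    if LOCK_PATH.exists():
        age = time.time() - LOCK_PATH.stat().st_mtime
        if age < STALE_AFTER_SECONDS:
            return False                 # another process appears to be running
        LOCK_PATH.unlink()               # stale lock: clean up and continue
    LOCK_PATH.parent.mkdir(parents=True, exist_ok=True)
    LOCK_PATH.write_text(str(os.getpid()))
    return True

def release_lock() -> None:
    LOCK_PATH.unlink(missing_ok=True)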

Documentation

  • Deployment guide: deployment/DEPLOYMENT.md
  • API requests reference: api/README_REQUESTS.md
  • Testing guide: tests/README_TESTING.md

Testing

  • Tests cover ingestion, embeddings, API endpoints, and the end‑to‑end pipeline.
  • See tests/README_TESTING.md for running guidance.

Supported Models

  • OpenAI (required): completions and embeddings
  • Grok (optional): completions and chunking
  • Venice (optional): completions and chunking
