Experimental Git-Native Document Intelligence System
Caia Library is an experimental document management system that leverages Git's immutable history and cryptographic integrity to create auditable, versioned document intelligence pipelines. Built with Temporal workflows and designed for human-in-the-loop operations, it provides a foundation for building trustworthy AI data systems.
Focus on Ethical Academic Sources: Caia Library specializes in collecting from academic sources that explicitly allow programmatic access, with proper attribution to Caia Tech and full compliance with terms of service.
⚠️ Experimental Software: This project is under active development and APIs may change. Use in production at your own risk.
- Every document ingestion creates an immutable Git commit
- Complete audit trail of when, how, and from where data was collected
- Integrity checking via Git's content-addressed SHA-1 object IDs
- Corruption or tampering changes an object's hash, so it cannot pass silently
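The integrity bullets above rest on how Git names objects: a blob's ID is the SHA-1 of a short header followed by the file's bytes, so any change to the content yields a different ID. The sketch below reproduces that calculation for a blob in a default (SHA-1) repository. The function name `gitBlobHash` is illustrative and not part of Caia Library.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// gitBlobHash returns the object ID Git assigns to a blob with the given
// content: SHA-1 over the header "blob <size>\x00" followed by the bytes.
// (Illustrative helper, not taken from the Caia Library code base.)
func gitBlobHash(content []byte) string {
	h := sha1.New()
	fmt.Fprintf(h, "blob %d\x00", len(content))
	h.Write(content)
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	original := []byte("hello\n")
	tampered := []byte("hellO\n")

	// Matches `printf 'hello\n' | git hash-object --stdin`.
	fmt.Println(gitBlobHash(original))

	// A one-byte change produces a different object ID.
	fmt.Println(gitBlobHash(original) == gitBlobHash(tampered)) // false
}
```

Because commits reference trees and trees reference blobs by these IDs, a modified document changes every ID above it in the history.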
- Temporal-based workflow orchestration for reliable processing
- Parallel text extraction and embedding generation
- Automatic retry logic for transient failures
- Extensible architecture for adding ML models and processors
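In Caia Library, retries are presumably delegated to Temporal's activity retry policies. The standard-library sketch below only illustrates the underlying idea of retrying transient failures with exponential backoff. The names `retry` and `errTransient` are assumptions made for this example.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient marks failures worth retrying (hypothetical sentinel).
var errTransient = errors.New("transient failure")

// retry calls fn up to maxAttempts times, doubling the wait after each
// transient failure. Other errors are returned immediately.
// It returns the number of attempts made and the final error.
func retry(maxAttempts int, initial time.Duration, fn func() error) (int, error) {
	wait := initial
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = fn()
		if err == nil || !errors.Is(err, errTransient) {
			return attempt, err
		}
		if attempt < maxAttempts {
			time.Sleep(wait)
			wait *= 2
		}
	}
	return maxAttempts, err
}

func main() {
	calls := 0
	attempts, err := retry(5, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return fmt.Errorf("fetch: %w", errTransient)
		}
		return nil
	})
	fmt.Println(attempts, err) // 3 <nil>
}
```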
- Only sources that allow programmatic access: arXiv, PubMed Central, DOAJ, PLOS
- Strict rate limiting: Respects each source's API limits
- Full attribution: Every document credits both the source and Caia Tech
- Transparent identification: Clear User-Agent with contact information
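The rate-limiting and identification bullets above can be sketched as a small HTTP client that spaces out requests and sends a descriptive User-Agent. This is a minimal illustration, not the project's collector: `politeClient`, the 200 ms interval, and the User-Agent string are placeholders, and a local test server stands in for a real source so the example runs offline.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// politeClient spaces requests at least minInterval apart and identifies
// itself on every request. Not safe for concurrent use.
type politeClient struct {
	userAgent   string
	minInterval time.Duration
	last        time.Time // start of the previous request
	client      *http.Client
}

func newPoliteClient(userAgent string, minInterval time.Duration) *politeClient {
	return &politeClient{
		userAgent:   userAgent,
		minInterval: minInterval,
		client:      &http.Client{Timeout: 10 * time.Second},
	}
}

// getString waits until minInterval has passed since the previous request,
// then performs a GET with the User-Agent header set and returns the body.
func (c *politeClient) getString(url string) (string, error) {
	if !c.last.IsZero() {
		if wait := c.minInterval - time.Since(c.last); wait > 0 {
			time.Sleep(wait)
		}
	}
	c.last = time.Now()
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("User-Agent", c.userAgent)
	resp, err := c.client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

// echoUserAgentServer replies with the User-Agent it received.
func echoUserAgentServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, r.UserAgent())
	}))
}

func main() {
	srv := echoUserAgentServer()
	defer srv.Close()

	c := newPoliteClient("ExampleCollector/0.1 (mailto:contact@example.org)", 200*time.Millisecond)
	start := time.Now()
	for i := 0; i < 3; i++ {
		ua, err := c.getString(srv.URL)
		if err != nil {
			panic(err)
		}
		fmt.Println(ua)
	}
	fmt.Println("elapsed at least 400ms:", time.Since(start) >= 400*time.Millisecond)
}
```

The actual interval should come from each source's published limits rather than a fixed constant.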
- Cron-based scheduled collection from academic sources
- Batch processing for importing multiple documents efficiently
- Automatic deduplication to prevent redundant processing
- Configurable filters for targeted data collection
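The README does not say how deduplication is keyed (by URL, by content, or both). As one possible approach, the sketch below skips documents whose bytes have already been seen by remembering a SHA-256 digest of each. The type `seenSet` and its method are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// seenSet remembers content digests so identical bytes are processed once.
type seenSet map[[sha256.Size]byte]struct{}

// firstTime reports whether content has not been seen before, and records it.
func (s seenSet) firstTime(content []byte) bool {
	sum := sha256.Sum256(content)
	if _, dup := s[sum]; dup {
		return false
	}
	s[sum] = struct{}{}
	return true
}

func main() {
	seen := seenSet{}
	for _, doc := range []string{"paper A", "paper B", "paper A"} {
		fmt.Println(doc, seen.firstTime([]byte(doc)))
	}
	// paper A true
	// paper B true
	// paper A false
}
```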
- Production-ready Docker Compose configuration
- Kubernetes manifests for cloud deployments
- Development mode with hot reload
- Built-in health checks and monitoring
- Git branches allow review before merging to main
- Clear commit messages document each ingestion
- Manual intervention points for quality control
- Transparent processing history
- PDF text extraction (OCR support planned)
- HTML content cleaning and metadata extraction
- 384-dimensional embeddings without external dependencies
- Extensible extractor and embedder interfaces
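The README does not describe how the 384-dimensional vectors are computed or what the embedder interface looks like. To make the idea of a dependency-free, pluggable embedder concrete, the sketch below defines a hypothetical `Embedder` interface and a stand-in implementation based on feature hashing: each token is hashed into one of 384 buckets and the result is L2-normalised. It is not the project's model.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math"
	"strings"
)

const dims = 384

// Embedder is a hypothetical interface; the project's real one may differ.
type Embedder interface {
	Embed(text string) []float32
}

// hashEmbedder counts tokens in hashed buckets and normalises the vector.
type hashEmbedder struct{}

func (hashEmbedder) Embed(text string) []float32 {
	vec := make([]float32, dims)
	for _, tok := range strings.Fields(strings.ToLower(text)) {
		h := fnv.New32a()
		h.Write([]byte(tok))
		vec[h.Sum32()%dims]++
	}
	var norm float64
	for _, v := range vec {
		norm += float64(v) * float64(v)
	}
	if norm > 0 {
		scale := float32(1 / math.Sqrt(norm))
		for i := range vec {
			vec[i] *= scale
		}
	}
	return vec
}

// dot returns the dot product, which equals cosine similarity for unit vectors.
func dot(a, b []float32) float64 {
	var s float64
	for i := range a {
		s += float64(a[i]) * float64(b[i])
	}
	return s
}

func main() {
	var e Embedder = hashEmbedder{}
	a := e.Embed("git native document intelligence")
	b := e.Embed("Git Native Document Intelligence")
	fmt.Println(len(a))    // 384
	fmt.Println(dot(a, b)) // close to 1: case is ignored, so the vectors match
}
```

Any type with an `Embed` method could be swapped in, for example one backed by ONNX Runtime.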
- SQL-like syntax for querying documents in Git
- Attribution tracking queries to ensure compliance
- Time-travel queries through Git history
- Query performance that builds on Git's compressed, content-addressed object storage
┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│  REST API   │────▶│   Temporal   │────▶│     Git     │
│   (Fiber)   │     │  Workflows   │     │ Repository  │
└─────────────┘     └──────────────┘     └─────────────┘
                            │
                    ┌───────┴────────┐
                    │                │
              ┌─────▼─────┐    ┌─────▼─────┐
              │   Text    │    │ Embedding │
              │ Extractor │    │ Generator │
              └───────────┘    └───────────┘
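The fan-out in the diagram, where text extraction and embedding generation run side by side, can be illustrated with plain goroutines. In a Temporal workflow the same shape would typically be expressed by starting two activities and waiting on both. The sketch below uses only the standard library, and `process`, `trimExtract`, and `lengthEmbed` are stand-ins invented for this example.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

type result struct {
	text      string
	embedding []float32
}

// process runs both stages concurrently on the same raw input and waits
// for both to finish. Each goroutine writes to its own field.
func process(raw string, extract func(string) string, embed func(string) []float32) result {
	var (
		wg  sync.WaitGroup
		res result
	)
	wg.Add(2)
	go func() { defer wg.Done(); res.text = extract(raw) }()
	go func() { defer wg.Done(); res.embedding = embed(raw) }()
	wg.Wait()
	return res
}

// trimExtract is a stand-in extractor that strips surrounding whitespace.
func trimExtract(raw string) string { return strings.TrimSpace(raw) }

// lengthEmbed is a stand-in embedder returning a one-element vector.
func lengthEmbed(raw string) []float32 { return []float32{float32(len(raw))} }

func main() {
	res := process("  hello  ", trimExtract, lengthEmbed)
	fmt.Printf("%q %v\n", res.text, res.embedding) // "hello" [9]
}
```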
- Git as Primary Database: Leverages Git's distributed, immutable design
- Document Versioning: Full history of all document changes
- Parallel Processing: Simultaneous text extraction and embedding generation
- Multiple Format Support: Text, HTML, PDF
- Scheduled Ingestion: Automated collection from RSS, APIs, and websites
- Batch Processing: Import multiple documents efficiently
- RESTful API: Comprehensive HTTP interface for all operations
- Workflow Tracking: Monitor processing status via Temporal
- Docker & Kubernetes: Production-ready deployment options
# Clone the repository
git clone https://github.com/caiatech/caia-library
cd caia-library
# Start all services
docker-compose up -d
# Check service health
curl http://localhost:8080/health
# Run with hot reload
docker-compose -f docker-compose.yml -f docker-compose.dev.yml up
# Install dependencies
go mod download
# Start Temporal
temporal server start-dev
# Run the server
go run ./cmd/server
# Collect an arXiv paper with proper attribution
curl -X POST http://localhost:8080/api/v1/documents \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://arxiv.org/pdf/2301.00001.pdf",
    "type": "pdf",
    "metadata": {
      "source": "arXiv",
      "attribution": "Content from arXiv.org, collected by Caia Tech",
      "ethical_compliance": "true"
    }
  }'
# Set up scheduled arXiv collection (daily)
curl -X POST http://localhost:8080/api/v1/ingestion/scheduled \
  -H "Content-Type: application/json" \
  -d '{
    "name": "arxiv",
    "type": "arxiv",
    "url": "http://export.arxiv.org/api/query",
    "schedule": "0 2 * * *",
    "filters": ["cs.AI", "cs.LG"],
    "metadata": {
      "attribution": "Caia Tech"
    }
  }'
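For callers building the scheduled-collection request from Go, the payload above could be modelled roughly as follows. The struct mirrors the JSON fields in the curl example but is an assumption, not the server's actual type. The `schedule` value `0 2 * * *` is a five-field cron expression (minute, hour, day of month, month, day of week) meaning every day at 02:00 in the scheduler's time zone. The `validate` helper is hypothetical and checks only the field count.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// scheduledSource mirrors the JSON fields of the request shown above.
type scheduledSource struct {
	Name     string            `json:"name"`
	Type     string            `json:"type"`
	URL      string            `json:"url"`
	Schedule string            `json:"schedule"`
	Filters  []string          `json:"filters"`
	Metadata map[string]string `json:"metadata"`
}

// validate checks only that the schedule has the five whitespace-separated
// fields of a standard cron expression; it does not check field ranges.
func (s scheduledSource) validate() error {
	if n := len(strings.Fields(s.Schedule)); n != 5 {
		return fmt.Errorf("schedule %q has %d fields, want 5", s.Schedule, n)
	}
	return nil
}

// arxivDaily returns the same values as the curl example.
func arxivDaily() scheduledSource {
	return scheduledSource{
		Name:     "arxiv",
		Type:     "arxiv",
		URL:      "http://export.arxiv.org/api/query",
		Schedule: "0 2 * * *", // minute 0, hour 2, every day
		Filters:  []string{"cs.AI", "cs.LG"},
		Metadata: map[string]string{"attribution": "Caia Tech"},
	}
}

func main() {
	src := arxivDaily()
	if err := src.validate(); err != nil {
		panic(err)
	}
	body, err := json.MarshalIndent(src, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```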
# Check the status of a workflow (substitute a real workflow ID)
curl http://localhost:8080/api/v1/workflows/{workflow_id}
# Find all arXiv papers
curl -X POST http://localhost:8080/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{"query": "SELECT FROM documents WHERE source = \"arXiv\" ORDER BY created_at DESC"}'
# Check attribution compliance
curl http://localhost:8080/api/v1/stats/attribution
# Search by content
curl -X POST http://localhost:8080/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{"query": "SELECT FROM documents WHERE title ~ \"machine learning\" LIMIT 20"}'
Every document in Caia Library maintains:
- Source Attribution: Original URL or path
- Processing Timeline: Timestamps for each stage
- Transformation History: What was extracted/generated
- Error Documentation: Any failures during processing
- Human Annotations: Review notes and quality markers
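One way such a record could travel with each ingestion is as "Key: value" trailers in the commit message, the same convention Git uses for lines like `Signed-off-by`. The sketch below is illustrative only: the `provenance` struct, its fields, the trailer names, and the timestamps are all invented for this example and do not describe the format Caia Library actually writes.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// provenance holds the per-document facts listed above (hypothetical shape).
type provenance struct {
	SourceURL   string
	Attribution string
	FetchedAt   time.Time // zero if the stage has not run
	ExtractedAt time.Time // zero if the stage has not run
	Error       string    // empty when processing succeeded
	ReviewNote  string    // optional human annotation
}

// commitMessage renders a subject line followed by one trailer per
// non-empty field.
func (p provenance) commitMessage() string {
	var b strings.Builder
	fmt.Fprintf(&b, "Ingest %s\n\n", p.SourceURL)
	trailer := func(key, value string) {
		if value != "" {
			fmt.Fprintf(&b, "%s: %s\n", key, value)
		}
	}
	stamp := func(t time.Time) string {
		if t.IsZero() {
			return ""
		}
		return t.UTC().Format(time.RFC3339)
	}
	trailer("Source", p.SourceURL)
	trailer("Attribution", p.Attribution)
	trailer("Fetched-At", stamp(p.FetchedAt))
	trailer("Extracted-At", stamp(p.ExtractedAt))
	trailer("Processing-Error", p.Error)
	trailer("Review-Note", p.ReviewNote)
	return b.String()
}

func main() {
	fetched := time.Date(2024, 1, 15, 2, 0, 0, 0, time.UTC) // made-up timestamp
	p := provenance{
		SourceURL:   "https://arxiv.org/pdf/2301.00001.pdf",
		Attribution: "Content from arXiv.org, collected by Caia Tech",
		FetchedAt:   fetched,
		ExtractedAt: fetched.Add(3 * time.Second),
	}
	fmt.Print(p.commitMessage())
}
```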
- API Reference - Complete API documentation
- Git Query Language - SQL-like queries for document discovery
- Deployment Guide - Production deployment instructions
- Ethical Scraping - Academic source compliance guide
- Development Roadmap - 10-week feature roadmap
- Automated Collection - Setting up data pipelines
- ✅ PDF support with basic detection
- ✅ Advanced embeddings (384-dimensional)
- ✅ Docker Compose deployment
- ✅ Scheduled ingestion workflows
- ✅ Git Query Language for document discovery
- ✅ ONNX Runtime integration
- ✅ Full PDF text extraction with ledongthuc/pdf
- ✅ Git merge functionality with fast-forward support
- ✅ Concurrent operation safety with mutex protection
- Authentication & rate limiting
- Input validation and SSRF protection
- Monitoring and alerting integration
- Production hardening and optimization
- Semantic search capabilities
- Multi-modal embeddings
- Differential privacy options
- Federated learning support
This is experimental software. Contributions welcome, but expect breaking changes.
- All documents stored in plaintext in Git
- No built-in encryption (use git-crypt if needed)
- Authentication not yet implemented
- API rate limiting not yet implemented
- SSRF protection not yet implemented
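Until SSRF protection lands, operators who expose the ingestion endpoint may want to vet URLs themselves. The sketch below shows one possible shape for such a check: allow only `http` and `https`, resolve the host, and reject loopback, private, link-local, and similar addresses. The names `checkURL`, `isPublicIP`, and `staticResolver` are hypothetical. The resolver is injected so the example runs offline; in real use `net.LookupIP` has the matching signature.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/url"
)

// isPublicIP reports whether ip is outside loopback, private, link-local,
// multicast, and unspecified ranges.
func isPublicIP(ip net.IP) bool {
	return !(ip.IsLoopback() || ip.IsPrivate() || ip.IsLinkLocalUnicast() ||
		ip.IsLinkLocalMulticast() || ip.IsMulticast() || ip.IsUnspecified())
}

// checkURL rejects non-HTTP(S) schemes and hosts that resolve to any
// non-public address.
func checkURL(raw string, resolve func(host string) ([]net.IP, error)) error {
	u, err := url.Parse(raw)
	if err != nil {
		return err
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return fmt.Errorf("scheme %q not allowed", u.Scheme)
	}
	host := u.Hostname()
	if host == "" {
		return errors.New("missing host")
	}
	ips, err := resolve(host)
	if err != nil {
		return err
	}
	if len(ips) == 0 {
		return fmt.Errorf("host %s has no addresses", host)
	}
	for _, ip := range ips {
		if !isPublicIP(ip) {
			return fmt.Errorf("host %s resolves to non-public address %s", host, ip)
		}
	}
	return nil
}

// staticResolver returns a resolver backed by a fixed host-to-IP table.
func staticResolver(table map[string]string) func(string) ([]net.IP, error) {
	return func(host string) ([]net.IP, error) {
		ip := net.ParseIP(table[host])
		if ip == nil {
			return nil, fmt.Errorf("no such host %s", host)
		}
		return []net.IP{ip}, nil
	}
}

func main() {
	resolve := staticResolver(map[string]string{
		"papers.example":   "93.184.216.34",
		"metadata.example": "169.254.169.254",
		"localhost":        "127.0.0.1",
	})
	for _, raw := range []string{
		"https://papers.example/a.pdf",
		"http://metadata.example/latest",
		"http://localhost:8080/health",
		"file:///etc/passwd",
	} {
		fmt.Println(raw, "->", checkURL(raw, resolve))
	}
}
```

A check made before the request does not by itself stop DNS rebinding, where a host resolves differently at connect time; a hardened version would also verify the address the connection actually uses.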
Caia Library embraces the principle that AI systems should be:
- Auditable: Every decision traceable to source data
- Reproducible: The same inputs always produce the same outputs
- Transparent: Clear visibility into data transformations
- Correctable: Errors can be identified and fixed
- Attributable: All data sources properly credited
By using Git as our foundation, we ensure these properties are not just features, but fundamental guarantees of the system architecture.
Marvin Tutt
Chief Executive Officer, Caia Tech
[email protected]
Built with 🧠 by Caia Tech - Experimental Intelligence Infrastructure