File Indexer

A high-performance Rust-based file indexing system that provides semantic search capabilities over your codebase using vector embeddings.

Overview

File Indexer recursively scans directories, generates embeddings for text files via Ollama, and stores the resulting vectors and metadata in a Qdrant vector database for fast semantic and metadata-based search.

Key Features

  • 🚀 Fast Parallel Processing - Processes files concurrently (10 at a time)
  • 🧠 Semantic Search - Find files by meaning, not just keywords
  • 📦 Batch Operations - Efficient batch uploads (100 files per batch)
  • 🔍 Smart Binary Detection - Extension whitelist + UTF-8 validation
  • 📊 Job Tracking - Monitor indexing progress in real-time
  • 🐳 Docker Ready - Full containerization with docker-compose
  • ✅ Fully Tested - 67 tests including comprehensive E2E tests

Quick Start

Prerequisites

  • Rust 1.75+ (for building from source)
  • Docker & Docker Compose (for running services)
  • Ollama with qwen3-embedding:4b model

1. Start Services

# Start Qdrant
docker-compose up -d qdrant

# Ensure Ollama is running with the model
ollama pull qwen3-embedding:4b

2. Configure

Edit config.toml:

[indexer]
base_path = "/path/to/index"
max_lines_to_index = 100

[ollama]
host = "http://localhost:11434"
embedding_model = "qwen3-embedding:4b"

[qdrant]
host = "http://localhost:6334"
collection_name = "file_index"
vector_size = 2560

[server]
host = "0.0.0.0"
port = 8080

3. Run

# Build and run
cargo run --release

# Or use Docker
docker-compose up -d

The API will be available at http://localhost:8080

API Documentation

Index Operations

Start Indexing

POST /index
Content-Type: application/json

{
  "path": "/optional/override/path",
  "recursive": true,
  "force_reindex": false
}

# Response
{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "started",
  "message": "Indexing initiated"
}

Check Job Status

GET /index/status/{job_id}

# Response
{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "files_processed": 1543,
  "files_total": 1543,
  "errors": []
}
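
As a usage sketch, the indexing workflow above can be driven from Rust roughly as follows. This assumes the reqwest (with the json feature), serde_json, and tokio crates; the function and variable names are illustrative, and error handling is minimal.

use serde_json::{json, Value};
use std::time::Duration;

async fn index_and_wait(base: &str) -> Result<(), reqwest::Error> {
    let client = reqwest::Client::new();

    // Kick off an indexing job (POST /index).
    let job: Value = client
        .post(format!("{base}/index"))
        .json(&json!({ "recursive": true, "force_reindex": false }))
        .send()
        .await?
        .json()
        .await?;
    let job_id = job["job_id"].as_str().unwrap_or_default().to_string();

    // Poll GET /index/status/{job_id} until the job reports "completed".
    // (Other terminal states, e.g. failures, would need extra handling.)
    loop {
        let status: Value = client
            .get(format!("{base}/index/status/{job_id}"))
            .send()
            .await?
            .json()
            .await?;
        println!(
            "{} / {} files processed",
            status["files_processed"], status["files_total"]
        );
        if status["status"] == "completed" {
            break;
        }
        tokio::time::sleep(Duration::from_secs(2)).await;
    }
    Ok(())
}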

Search Operations

Semantic Search

POST /search
Content-Type: application/json

{
  "query": "authentication implementation",
  "limit": 10,
  "filters": {
    "file_extension": ["rs", "py"],
    "modified_after": "2025-01-01T00:00:00Z",
    "path_pattern": "*/src/*"
  },
  "include_binary": false
}

# Response
{
  "results": [
    {
      "score": 0.85,
      "metadata": {
        "absolute_path": "/path/to/auth.rs",
        "file_name": "auth.rs",
        "file_extension": "rs",
        "is_binary": false,
        "line_count": 150,
        ...
      },
      "content_preview": "impl Authentication { ... }"
    }
  ],
  "total": 10,
  "query_time_ms": 45
}
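
For example, the same search could be issued from Rust (again a sketch assuming reqwest with the json feature and serde_json; the helper name is illustrative):

use serde_json::{json, Value};

async fn search(base: &str, query: &str) -> Result<Value, reqwest::Error> {
    // POST /search with a minimal body; filters are optional.
    reqwest::Client::new()
        .post(format!("{base}/search"))
        .json(&json!({ "query": query, "limit": 10 }))
        .send()
        .await?
        .json()
        .await
}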

Metadata Search

GET /search/metadata?file_extension=rs&limit=20

# Response
{
  "results": [ /* array of FileMetadata */ ],
  "total": 15
}

System Operations

Health Check

GET /health

# Response
{
  "status": "healthy",
  "qdrant_connected": true,
  "ollama_connected": true,
  "indexed_files_count": 1543
}

Statistics

GET /stats

# Response
{
  "total_files": 1543,
  "text_files": 1200,
  "binary_files": 343,
  "total_size_bytes": 52428800,
  "last_index_time": "2025-10-18T10:30:00Z",
  "collection_name": "file_index"
}

Architecture

Data Flow

Directory
    ↓
Scanner (walkdir)
    ↓
Processor (binary detection, content reading)
    ↓
Embedder (Ollama API)
    ↓
Qdrant Storage (vector DB)
    ↓
REST API (search & retrieval)

Components

  1. File Scanner - Recursive directory traversal with metadata collection
  2. Content Processor - Binary/text detection and content extraction (first 100 lines)
  3. Embedding Generator - Ollama integration for vector embeddings
  4. Qdrant Client - Vector database operations with batch support
  5. REST API - Full-featured HTTP API with job management
  6. Indexing Orchestrator - Parallel processing with job tracking

Configuration

Environment Variables

All config values can be overridden via environment variables with the INDEXER_ prefix:

export INDEXER_BASE_PATH="/home/user/code"
export INDEXER_MAX_LINES_TO_INDEX=100
export INDEXER_OLLAMA_HOST="http://localhost:11434"
export INDEXER_OLLAMA_EMBEDDING_MODEL="qwen3-embedding:4b"
export INDEXER_QDRANT_HOST="http://localhost:6334"
export INDEXER_QDRANT_COLLECTION_NAME="file_index"
export INDEXER_QDRANT_VECTOR_SIZE=2560
export INDEXER_SERVER_HOST="0.0.0.0"
export INDEXER_SERVER_PORT=8080

Supported Text Extensions

txt, md, rs, py, js, ts, json, yaml, yml, toml, xml, html, css,
sh, bash, c, cpp, h, java, go, rb, php, sql, log, csv, ini, env,
jsx, tsx, vue, svelte, astro, dockerfile, makefile, cmake, gradle, maven

Files with unknown extensions are checked via UTF-8 validation of the first 8 KB.

Development

Running Tests

# All tests
cargo test

# Unit tests only
cargo test --lib

# E2E tests only
cargo test --lib services::indexer::tests::test_e2e

# With output
cargo test -- --nocapture

Building

# Debug build
cargo build

# Release build (optimized)
cargo build --release

# Docker build
docker-compose build

Project Structure

src/
├── main.rs              # HTTP server entry point
├── lib.rs               # Library exports
├── config.rs            # Configuration management
├── errors.rs            # Error types
├── models/
│   ├── file_metadata.rs # FileMetadata struct + helpers
│   └── api.rs           # API request/response models
├── services/
│   ├── scanner.rs       # Directory scanning
│   ├── processor.rs     # Content processing
│   ├── embedder.rs      # Ollama integration
│   └── indexer.rs       # Main orchestration
├── storage/
│   └── qdrant.rs        # Qdrant client wrapper
└── api/
    ├── handlers.rs      # Route handlers
    └── routes.rs        # Route definitions

Docker Deployment

Using Docker Compose

# Start all services (Qdrant + file-indexer)
docker-compose up -d

# View logs
docker-compose logs -f file-indexer

# Stop services
docker-compose down

# Rebuild after code changes
docker-compose up -d --build

WSL2 Notes

When running in WSL2 with Ollama on Windows:

  • Ollama is accessible via http://localhost:11434 from WSL2
  • Docker containers use http://host.docker.internal:11434 to reach Windows Ollama
  • Qdrant runs in Docker and communicates with file-indexer via Docker network

Performance

Benchmarks

  • Indexing Speed: ~10-20 files/second (depends on Ollama latency)
  • Batch Efficiency: 100 files per Qdrant upsert
  • Search Latency: ~45ms for typical queries
  • Concurrent Processing: 10 files in parallel

Optimizations

  • Parallel file processing with a Tokio semaphore (see the sketch after this list)
  • Batch Qdrant upserts to reduce network calls
  • Async I/O throughout the pipeline
  • Minimal lock contention (mutex only for job state)
  • Connection pooling for HTTP clients
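
A minimal sketch of that bounded-concurrency pattern (10 permits, matching the "10 files in parallel" figure above), assuming Tokio; process_file is a placeholder for the real per-file pipeline step:

use std::sync::Arc;
use tokio::sync::Semaphore;

async fn index_all(paths: Vec<std::path::PathBuf>) {
    // At most 10 files are processed at any one time.
    let semaphore = Arc::new(Semaphore::new(10));
    let mut handles = Vec::new();

    for path in paths {
        let permit = semaphore.clone().acquire_owned().await.unwrap();
        handles.push(tokio::spawn(async move {
            let _permit = permit; // released when this task finishes
            process_file(&path).await;
        }));
    }

    for handle in handles {
        let _ = handle.await;
    }
}

// Placeholder for the real work: read, embed via Ollama, upsert into Qdrant.
async fn process_file(_path: &std::path::Path) {}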

Technical Details

File ID Generation

Files are identified by a deterministic UUID v5 generated from the file's absolute path. This ensures:

  • Idempotent updates (re-indexing updates existing points)
  • No duplicate entries
  • Fast lookups

A SHA-256 hash of the path is also stored in the payload for reference.
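
A minimal sketch of this scheme, assuming the uuid crate (v5 feature) and the sha2 crate; the URL namespace shown here is illustrative, not necessarily the one the indexer uses:

use sha2::{Digest, Sha256};
use uuid::Uuid;

/// Derive a stable point ID and a path hash from an absolute path.
fn file_identity(absolute_path: &str) -> (Uuid, String) {
    // UUID v5 is deterministic: the same path always yields the same ID,
    // so re-indexing overwrites the existing point instead of duplicating it.
    let id = Uuid::new_v5(&Uuid::NAMESPACE_URL, absolute_path.as_bytes());

    // SHA-256 of the path, kept in the payload for reference.
    let path_hash: String = Sha256::digest(absolute_path.as_bytes())
        .iter()
        .map(|b| format!("{:02x}", b))
        .collect();

    (id, path_hash)
}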

Vector Storage

  • Text files: Full embedding vector (2560 dimensions)
  • Binary files: Zero vector + metadata only
  • Payload: Complete file metadata + content preview (first 100 lines)
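
In rough terms, the choice of stored vector looks like this (names are illustrative; vector_size corresponds to the 2560 configured above):

/// Illustrative only: pick the vector to store for a file.
fn vector_to_store(is_binary: bool, embedding: Option<Vec<f32>>, vector_size: usize) -> Vec<f32> {
    if is_binary {
        // Binary files: a zero vector of the configured size, metadata only.
        vec![0.0; vector_size]
    } else {
        // Text files: the embedding produced by Ollama for the content preview.
        embedding.expect("text files are embedded before storage")
    }
}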

Binary Detection Strategy

  1. Check against text extension whitelist
  2. If unknown extension, read first 8KB
  3. Attempt UTF-8 decoding
  4. Valid UTF-8 → text, invalid → binary
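
A rough sketch of that check, with the whitelist abbreviated; the names are illustrative rather than the indexer's actual ones:

use std::fs::File;
use std::io::Read;
use std::path::Path;

// Abbreviated whitelist; the full list is shown above.
const TEXT_EXTENSIONS: &[&str] = &["txt", "md", "rs", "py", "js", "ts", "json", "toml"];

fn is_text_file(path: &Path) -> std::io::Result<bool> {
    // 1. Known text extension: treat as text immediately.
    if let Some(ext) = path.extension().and_then(|e| e.to_str()) {
        if TEXT_EXTENSIONS.contains(&ext.to_ascii_lowercase().as_str()) {
            return Ok(true);
        }
    }

    // 2. Unknown extension: read at most the first 8 KB...
    let mut buf = Vec::with_capacity(8192);
    File::open(path)?.take(8192).read_to_end(&mut buf)?;

    // 3./4. ...and call it text only if that prefix is valid UTF-8.
    Ok(std::str::from_utf8(&buf).is_ok())
}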

Testing

Test Coverage

  • Unit Tests: 45 tests (config, models, services)
  • Integration Tests: 18 tests (Qdrant, Ollama)
  • End-to-End Tests: 4 comprehensive pipeline tests

E2E Test Scenarios

  1. Full indexing pipeline (text + binary files)
  2. Semantic search with relevance ranking
  3. Binary file handling (metadata-only)
  4. Large directory batch processing (150 files)

All tests verify:

  • ✅ Files stored in Qdrant
  • ✅ Embeddings generated correctly
  • ✅ Data is searchable
  • ✅ Metadata is retrievable

Troubleshooting

Qdrant Connection Issues

# Check if Qdrant is running
curl http://localhost:6333/collections

# Restart Qdrant
docker-compose restart qdrant

Ollama Connection Issues

# Check if Ollama is running
curl http://localhost:11434/api/version

# Verify model is installed
ollama list | grep qwen3-embedding

Server Won't Start

# Check if port 8080 is already in use
lsof -i :8080

# View server logs
RUST_LOG=debug cargo run

Contributing

The project follows vertical slice development with comprehensive testing. See docs/master.TODO.md for the complete development roadmap.

Development Workflow

  1. All changes must have tests
  2. Run cargo test before committing
  3. Follow existing code patterns
  4. Update documentation as needed

License

[Your License Here]

Acknowledgments

Built with Rust, Tokio, Axum, Qdrant, Ollama, and Docker.
