File Indexer

A high-performance Rust-based file indexing system that provides semantic search capabilities over your codebase using vector embeddings.

Overview

File Indexer recursively scans directories, generates embeddings for text files via Ollama, and stores the resulting vectors and metadata in a Qdrant vector database for fast semantic and metadata-based search.

Key Features

  • 🚀 Fast Parallel Processing - Processes files concurrently (10 at a time)
  • 🧠 Semantic Search - Find files by meaning, not just keywords
  • 📦 Batch Operations - Efficient batch uploads (100 files per batch)
  • 🔍 Smart Binary Detection - Extension whitelist + UTF-8 validation
  • 📊 Job Tracking - Monitor indexing progress in real-time
  • 🐳 Docker Ready - Full containerization with docker-compose
  • ✅ Fully Tested - 67 tests including comprehensive E2E tests

Quick Start

Prerequisites

  • Rust 1.75+ (for building from source)
  • Docker & Docker Compose (for running services)
  • Ollama with qwen3-embedding:4b model

1. Start Services

# Start Qdrant
docker-compose up -d qdrant

# Ensure Ollama is running with the model
ollama pull qwen3-embedding:4b

2. Configure

Edit config.toml:

[indexer]
base_path = "/path/to/index"
max_lines_to_index = 100

[ollama]
host = "http://localhost:11434"
embedding_model = "qwen3-embedding:4b"

[qdrant]
host = "http://localhost:6334"
collection_name = "file_index"
vector_size = 2560

[server]
host = "0.0.0.0"
port = 8080

3. Run

# Build and run
cargo run --release

# Or use Docker
docker-compose up -d

The API will be available at http://localhost:8080

API Documentation

Index Operations

Start Indexing

POST /index
Content-Type: application/json

{
  "path": "/optional/override/path",
  "recursive": true,
  "force_reindex": false
}

# Response
{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "started",
  "message": "Indexing initiated"
}

Check Job Status

GET /index/status/{job_id}

# Response
{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "files_processed": 1543,
  "files_total": 1543,
  "errors": []
}
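
As a usage sketch, the indexing workflow above can be driven from Rust roughly as follows. This assumes the reqwest (with the json feature), serde_json, and tokio crates; the function and variable names are illustrative, and error handling is minimal.

use serde_json::{json, Value};
use std::time::Duration;

async fn index_and_wait(base: &str) -> Result<(), reqwest::Error> {
    let client = reqwest::Client::new();

    // Kick off an indexing job (POST /index).
    let job: Value = client
        .post(format!("{base}/index"))
        .json(&json!({ "recursive": true, "force_reindex": false }))
        .send()
        .await?
        .json()
        .await?;
    let job_id = job["job_id"].as_str().unwrap_or_default().to_string();

    // Poll GET /index/status/{job_id} until the job reports "completed".
    // (Other terminal states, e.g. failures, would need extra handling.)
    loop {
        let status: Value = client
            .get(format!("{base}/index/status/{job_id}"))
            .send()
            .await?
            .json()
            .await?;
        println!(
            "{} / {} files processed",
            status["files_processed"], status["files_total"]
        );
        if status["status"] == "completed" {
            break;
        }
        tokio::time::sleep(Duration::from_secs(2)).await;
    }
    Ok(())
}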

Search Operations

Semantic Search

POST /search
Content-Type: application/json

{
  "query": "authentication implementation",
  "limit": 10,
  "filters": {
    "file_extension": ["rs", "py"],
    "modified_after": "2025-01-01T00:00:00Z",
    "path_pattern": "*/src/*"
  },
  "include_binary": false
}

# Response
{
  "results": [
    {
      "score": 0.85,
      "metadata": {
        "absolute_path": "/path/to/auth.rs",
        "file_name": "auth.rs",
        "file_extension": "rs",
        "is_binary": false,
        "line_count": 150,
        ...
      },
      "content_preview": "impl Authentication { ... }"
    }
  ],
  "total": 10,
  "query_time_ms": 45
}
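
For example, the same search could be issued from Rust (again a sketch assuming reqwest with the json feature and serde_json; the helper name is illustrative):

use serde_json::{json, Value};

async fn search(base: &str, query: &str) -> Result<Value, reqwest::Error> {
    // POST /search with a minimal body; filters are optional.
    reqwest::Client::new()
        .post(format!("{base}/search"))
        .json(&json!({ "query": query, "limit": 10 }))
        .send()
        .await?
        .json()
        .await
}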

Metadata Search

GET /search/metadata?file_extension=rs&limit=20

# Response
{
  "results": [ /* array of FileMetadata */ ],
  "total": 15
}

System Operations

Health Check

GET /health

# Response
{
  "status": "healthy",
  "qdrant_connected": true,
  "ollama_connected": true,
  "indexed_files_count": 1543
}

Statistics

GET /stats

# Response
{
  "total_files": 1543,
  "text_files": 1200,
  "binary_files": 343,
  "total_size_bytes": 52428800,
  "last_index_time": "2025-10-18T10:30:00Z",
  "collection_name": "file_index"
}

Architecture

Data Flow

Directory
    ↓
Scanner (walkdir)
    ↓
Processor (binary detection, content reading)
    ↓
Embedder (Ollama API)
    ↓
Qdrant Storage (vector DB)
    ↓
REST API (search & retrieval)

Components

  1. File Scanner - Recursive directory traversal with metadata collection
  2. Content Processor - Binary/text detection and content extraction (first 100 lines)
  3. Embedding Generator - Ollama integration for vector embeddings
  4. Qdrant Client - Vector database operations with batch support
  5. REST API - Full-featured HTTP API with job management
  6. Indexing Orchestrator - Parallel processing with job tracking

Configuration

Environment Variables

All config values can be overridden via environment variables with the INDEXER_ prefix:

export INDEXER_BASE_PATH="/home/user/code"
export INDEXER_MAX_LINES_TO_INDEX=100
export INDEXER_OLLAMA_HOST="http://localhost:11434"
export INDEXER_OLLAMA_EMBEDDING_MODEL="qwen3-embedding:4b"
export INDEXER_QDRANT_HOST="http://localhost:6334"
export INDEXER_QDRANT_COLLECTION_NAME="file_index"
export INDEXER_QDRANT_VECTOR_SIZE=2560
export INDEXER_SERVER_HOST="0.0.0.0"
export INDEXER_SERVER_PORT=8080

Supported Text Extensions

txt, md, rs, py, js, ts, json, yaml, yml, toml, xml, html, css,
sh, bash, c, cpp, h, java, go, rb, php, sql, log, csv, ini, env,
jsx, tsx, vue, svelte, astro, dockerfile, makefile, cmake, gradle, maven

Files with unknown extensions are checked via UTF-8 validation of the first 8 KB.

Development

Running Tests

# All tests
cargo test

# Unit tests only
cargo test --lib

# E2E tests only
cargo test --lib services::indexer::tests::test_e2e

# With output
cargo test -- --nocapture

Building

# Debug build
cargo build

# Release build (optimized)
cargo build --release

# Docker build
docker-compose build

Project Structure

src/
├── main.rs              # HTTP server entry point
├── lib.rs               # Library exports
├── config.rs            # Configuration management
├── errors.rs            # Error types
├── models/
│   ├── file_metadata.rs # FileMetadata struct + helpers
│   └── api.rs           # API request/response models
├── services/
│   ├── scanner.rs       # Directory scanning
│   ├── processor.rs     # Content processing
│   ├── embedder.rs      # Ollama integration
│   └── indexer.rs       # Main orchestration
├── storage/
│   └── qdrant.rs        # Qdrant client wrapper
└── api/
    ├── handlers.rs      # Route handlers
    └── routes.rs        # Route definitions

Docker Deployment

Using Docker Compose

# Start all services (Qdrant + file-indexer)
docker-compose up -d

# View logs
docker-compose logs -f file-indexer

# Stop services
docker-compose down

# Rebuild after code changes
docker-compose up -d --build

WSL2 Notes

When running in WSL2 with Ollama on Windows:

  • Ollama is accessible via http://localhost:11434 from WSL2
  • Docker containers use http://host.docker.internal:11434 to reach Windows Ollama
  • Qdrant runs in Docker and communicates with file-indexer via Docker network

Performance

Benchmarks

  • Indexing Speed: ~10-20 files/second (depends on Ollama latency)
  • Batch Efficiency: 100 files per Qdrant upsert
  • Search Latency: ~45ms for typical queries
  • Concurrent Processing: 10 files in parallel

Optimizations

  • Parallel file processing with a Tokio semaphore (see the sketch after this list)
  • Batch Qdrant upserts to reduce network calls
  • Async I/O throughout the pipeline
  • Minimal lock contention (mutex only for job state)
  • Connection pooling for HTTP clients
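
A minimal sketch of that bounded-concurrency pattern (10 permits, matching the "10 files in parallel" figure above), assuming Tokio; process_file is a placeholder for the real per-file pipeline step:

use std::sync::Arc;
use tokio::sync::Semaphore;

async fn index_all(paths: Vec<std::path::PathBuf>) {
    // At most 10 files are processed at any one time.
    let semaphore = Arc::new(Semaphore::new(10));
    let mut handles = Vec::new();

    for path in paths {
        let permit = semaphore.clone().acquire_owned().await.unwrap();
        handles.push(tokio::spawn(async move {
            let _permit = permit; // released when this task finishes
            process_file(&path).await;
        }));
    }

    for handle in handles {
        let _ = handle.await;
    }
}

// Placeholder for the real work: read, embed via Ollama, upsert into Qdrant.
async fn process_file(_path: &std::path::Path) {}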

Technical Details

File ID Generation

Files are identified by a deterministic UUID v5 generated from the file's absolute path. This ensures:

  • Idempotent updates (re-indexing updates existing points)
  • No duplicate entries
  • Fast lookups

A SHA-256 hash of the path is also stored in the payload for reference.
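
A minimal sketch of this scheme, assuming the uuid crate (v5 feature) and the sha2 crate; the URL namespace shown here is illustrative, not necessarily the one the indexer uses:

use sha2::{Digest, Sha256};
use uuid::Uuid;

/// Derive a stable point ID and a path hash from an absolute path.
fn file_identity(absolute_path: &str) -> (Uuid, String) {
    // UUID v5 is deterministic: the same path always yields the same ID,
    // so re-indexing overwrites the existing point instead of duplicating it.
    let id = Uuid::new_v5(&Uuid::NAMESPACE_URL, absolute_path.as_bytes());

    // SHA-256 of the path, kept in the payload for reference.
    let path_hash: String = Sha256::digest(absolute_path.as_bytes())
        .iter()
        .map(|b| format!("{:02x}", b))
        .collect();

    (id, path_hash)
}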

Vector Storage

  • Text files: Full embedding vector (2560 dimensions)
  • Binary files: Zero vector + metadata only
  • Payload: Complete file metadata + content preview (first 100 lines)
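
In rough terms, the choice of stored vector looks like this (names are illustrative; vector_size corresponds to the 2560 configured above):

/// Illustrative only: pick the vector to store for a file.
fn vector_to_store(is_binary: bool, embedding: Option<Vec<f32>>, vector_size: usize) -> Vec<f32> {
    if is_binary {
        // Binary files: a zero vector of the configured size, metadata only.
        vec![0.0; vector_size]
    } else {
        // Text files: the embedding produced by Ollama for the content preview.
        embedding.expect("text files are embedded before storage")
    }
}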

Binary Detection Strategy

  1. Check against text extension whitelist
  2. If unknown extension, read first 8KB
  3. Attempt UTF-8 decoding
  4. Valid UTF-8 → text, invalid → binary
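
A rough sketch of that check, with the whitelist abbreviated; the names are illustrative rather than the indexer's actual ones:

use std::fs::File;
use std::io::Read;
use std::path::Path;

// Abbreviated whitelist; the full list is shown above.
const TEXT_EXTENSIONS: &[&str] = &["txt", "md", "rs", "py", "js", "ts", "json", "toml"];

fn is_text_file(path: &Path) -> std::io::Result<bool> {
    // 1. Known text extension: treat as text immediately.
    if let Some(ext) = path.extension().and_then(|e| e.to_str()) {
        if TEXT_EXTENSIONS.contains(&ext.to_ascii_lowercase().as_str()) {
            return Ok(true);
        }
    }

    // 2. Unknown extension: read at most the first 8 KB...
    let mut buf = Vec::with_capacity(8192);
    File::open(path)?.take(8192).read_to_end(&mut buf)?;

    // 3./4. ...and call it text only if that prefix is valid UTF-8.
    Ok(std::str::from_utf8(&buf).is_ok())
}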

Testing

Test Coverage

  • Unit Tests: 45 tests (config, models, services)
  • Integration Tests: 18 tests (Qdrant, Ollama)
  • End-to-End Tests: 4 comprehensive pipeline tests

E2E Test Scenarios

  1. Full indexing pipeline (text + binary files)
  2. Semantic search with relevance ranking
  3. Binary file handling (metadata-only)
  4. Large directory batch processing (150 files)

All tests verify:

  • ✅ Files stored in Qdrant
  • ✅ Embeddings generated correctly
  • ✅ Data is searchable
  • ✅ Metadata is retrievable

Troubleshooting

Qdrant Connection Issues

# Check if Qdrant is running
curl http://localhost:6333/collections

# Restart Qdrant
docker-compose restart qdrant

Ollama Connection Issues

# Check if Ollama is running
curl http://localhost:11434/api/version

# Verify model is installed
ollama list | grep qwen3-embedding

Server Won't Start

# Check if port 8080 is already in use
lsof -i :8080

# View server logs
RUST_LOG=debug cargo run

Contributing

The project follows vertical slice development with comprehensive testing. See docs/master.TODO.md for the complete development roadmap.

Development Workflow

  1. All changes must have tests
  2. Run cargo test before committing
  3. Follow existing code patterns
  4. Update documentation as needed

License

[Your License Here]

Acknowledgments

Built with Rust, Tokio, Axum, Qdrant, Ollama, and Docker.
