wouldchunk is an AI-powered content chunking microservice designed to intelligently process, segment, and prepare company documents, code, and data for downstream LLM applications like RAG, Q&A, and semantic search.
It takes in raw input (like a PDF, `.cob` file, Slack export, or Jira data), classifies it, routes it to the appropriate domain-specific chunker, uses an LLM (like Mistral via Ollama) to chunk the content meaningfully, and returns structured output with rich metadata.
- 🧠 LLM-Powered Semantic Chunking — uses Mistral (locally via Ollama) to create meaningful chunks instead of naive slicing
- 📂 Multi-type Ingestion — supports PDFs (coming soon: text files, code, Git logs, Slack exports, Jira issues, Jira comments, Confluence, Google Docs...)
- 🔎 Smart Routing — auto-detects file type and content category (e.g. `culture`, `code`, `git`) and routes to the correct handler (see the routing sketch after this list)
- 🏗️ Single vs Multiple Handling — dynamically chooses the right chunking pipeline based on the number of files uploaded
- 🏷️ Metadata-Rich Output — includes category, chunk index, timestamps, file source, versioning (more coming soon)
- 📜 Output Logging — every ingestion is saved as `.jsonl` for later reprocessing or training
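To illustrate the routing step, here is a minimal sketch of MIME sniffing with `python-magic` plus a category guess. The `guess_category` heuristics below are purely illustrative assumptions, not the service's actual logic:

```python
import magic  # python-magic, used here for MIME sniffing


def guess_category(filename: str, mime_type: str) -> str:
    """Very rough category guess; the real heuristics are not shown here."""
    if mime_type.startswith("text/x-") or filename.endswith((".py", ".cob")):
        return "code"
    if "git" in filename.lower():
        return "git"
    return "culture"  # default bucket for prose documents in this sketch


def route_file(path: str) -> str:
    """Sniff the MIME type and map the file to a content category."""
    mime_type = magic.from_file(path, mime=True)  # e.g. "application/pdf"
    return guess_category(path, mime_type)


if __name__ == "__main__":
    print(route_file("Our Company Culture.pdf"))  # "culture" in this sketch
```

In the real service, a category like this decides which domain-specific chunker handles the file.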
- Python 3.10+
- `ollama` running locally with Mistral (`ollama run mistral`; see the chunking sketch below)
- `pdfplumber` for PDF ingestion
- `python-magic` for MIME sniffing
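To give a feel for the LLM-powered chunking step, here is a minimal, self-contained sketch (not the service's internal code) that extracts text with `pdfplumber` and asks a locally running Mistral model, via Ollama's default HTTP endpoint and `requests`, to split it into semantic chunks. The prompt wording and the one-chunk-per-line convention are assumptions:

```python
import json

import pdfplumber
import requests  # assumed HTTP client; any client that can POST JSON works

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def extract_pdf_text(path: str) -> str:
    """Concatenate the text of every page in the PDF."""
    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def semantic_chunks(text: str) -> list[str]:
    """Ask Mistral (via Ollama) to split the text into self-contained chunks."""
    prompt = (
        "Split the following document into self-contained semantic chunks. "
        "Return one chunk per line.\n\n" + text
    )
    response = requests.post(
        OLLAMA_URL,
        json={"model": "mistral", "prompt": prompt, "stream": False},
        timeout=120,
    )
    response.raise_for_status()
    answer = response.json()["response"]
    return [line.strip() for line in answer.splitlines() if line.strip()]


if __name__ == "__main__":
    chunks = semantic_chunks(extract_pdf_text("Our Company Culture.pdf"))
    print(json.dumps(chunks, indent=2))
```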
Install dependencies:
```bash
pip install -r requirements.txt
```
Start the FastAPI server:
```bash
uvicorn app.main:app --reload
```
Access the API docs (coming soon):
http://localhost:8000/docs
Testing with curl:
```bash
curl -X POST http://localhost:8000/auto-chunk \
  -F 'files=@/path/to/Your Document.pdf'
```
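For anyone who prefers Python over curl, the same upload can be made with `requests` (an assumed client library; the form field name `files` mirrors the curl flag above):

```python
import requests

# Upload a PDF to the auto-chunk endpoint (mirror of the curl call above).
with open("/path/to/Your Document.pdf", "rb") as f:
    response = requests.post(
        "http://localhost:8000/auto-chunk",
        files={"files": ("Your Document.pdf", f, "application/pdf")},
    )

response.raise_for_status()
print(response.json())
```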
You’ll get back:

- A list of chunked content
- Metadata per chunk
- The file’s category and type
- The saved path to your `.jsonl` log file
Chunk Output Example:
```json
{
  "chunk_text": "Trust is earned. We do what we say.",
  "category": "culture",
  "chunk_index": 0,
  "source_file": "Our Company Culture.pdf",
  "version": "v1.0",
  "timestamp": "2025-04-08T22:30:00Z"
}
```
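Since each ingestion is also saved as `.jsonl`, records with this shape can be re-loaded later for reprocessing or training. A minimal sketch (the log path below is a made-up placeholder; use the path the API returns):

```python
import json

# Placeholder log path; use the .jsonl path reported in the API response.
log_path = "chunk_logs/our_company_culture.jsonl"

with open(log_path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

for record in records:
    print(record["chunk_index"], record["category"], record["chunk_text"][:60])
```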
- Add support for text files, code, Git logs, Slack exports, Jira issues, Jira comments, Confluence, Google Docs
- Add unit tests
- Allow CLI-based batch ingestion