Semantica is a lightweight semantic search engine for PDF documents. It processes PDF files, converts them into vectorized text chunks, and enables intelligent retrieval using embedding-based similarity search — all without using an LLM.
- 📄 Upload any PDF file
- ✂️ Automatic chunking of document content
- 🔢 Embedding with HuggingFace (MiniLM)
- 🧠 Vector search with Qdrant
- ⚡ Fast and local — no OpenAI API required
- 📆 Built with FastAPI, LangChain, and Qdrant
| Layer | Tool |
|---|---|
| Backend | FastAPI |
| Parsing | pymupdf4llm |
| Chunking | LangChain MarkdownTextSplitter |
| Embedding | sentence-transformers/all-MiniLM-L6-v2 |
| Vector DB | Qdrant via Docker |
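End to end, the flow is: parse the PDF to Markdown with pymupdf4llm, split it with the Markdown splitter, embed each chunk with MiniLM, and upsert the vectors into Qdrant. Below is a minimal sketch of how these pieces fit together; the collection name, file name, and chunk sizes are illustrative assumptions rather than the project's actual configuration.

```python
# Minimal pipeline sketch: PDF -> Markdown -> chunks -> embeddings -> Qdrant.
# "semantica_chunks", "sample.pdf", and the chunk sizes are assumptions for illustration.
import pymupdf4llm
from langchain_text_splitters import MarkdownTextSplitter  # older LangChain: langchain.text_splitter
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

# 1. Parse the PDF into Markdown text.
markdown_text = pymupdf4llm.to_markdown("sample.pdf")

# 2. Split the Markdown into overlapping chunks.
splitter = MarkdownTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(markdown_text)

# 3. Embed each chunk with MiniLM (384-dimensional vectors).
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
vectors = model.encode(chunks)

# 4. Store the vectors in Qdrant, keeping the text and source file as payload.
client = QdrantClient(url="http://localhost:6333")
client.recreate_collection(
    collection_name="semantica_chunks",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)
client.upsert(
    collection_name="semantica_chunks",
    points=[
        PointStruct(
            id=i,
            vector=vec.tolist(),
            payload={"text": chunk, "source_file": "sample.pdf", "chunk_id": i},
        )
        for i, (chunk, vec) in enumerate(zip(chunks, vectors))
    ],
)
```

The 384-dimensional vector size and cosine distance match what all-MiniLM-L6-v2 produces.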
```bash
git clone https://github.com/yourname/semantica.git
cd semantica
pip install -r requirements.txt
docker run -p 6333:6333 qdrant/qdrant
fastapi dev main.py
```

Then open the Swagger UI at:
📍 http://localhost:8000/docs
Uploads and parses a PDF file, then chunks it and stores the chunks in Qdrant along with their embeddings.
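Once the server is running, the endpoint can be exercised with a small client script like the one below; the `/upload` path and field name are assumptions based on the description above, so check the Swagger UI for the actual route.

```python
# Hypothetical client call; the "/upload" path is an assumption, see
# http://localhost:8000/docs for the real route and parameters.
import requests

with open("sample.pdf", "rb") as f:
    response = requests.post(
        "http://localhost:8000/upload",
        files={"file": ("sample.pdf", f, "application/pdf")},
    )
print(response.json())
```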
Send a semantic query and receive relevant chunks. Example request:

```json
{
  "query": "Does this PDF mention 'fun' keyword?"
}
```

Example response:
```json
[
  {
    "score": 0.92,
    "text": "This is a simple PDF file. Fun fun fun.",
    "source_file": "sample.pdf",
    "chunk_id": 1
  }
]
```
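Under the hood, serving a query like the one above only needs to embed it with the same MiniLM model and ask Qdrant for the nearest stored chunks. The sketch below illustrates that step; the collection name, `search` helper, and result limit are illustrative assumptions rather than the project's actual code.

```python
# Hedged sketch of the search step; collection name and limit are assumptions.
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
qdrant = QdrantClient(url="http://localhost:6333")

def search(query: str, limit: int = 5):
    # Embed the query with the same model used at ingestion time.
    query_vector = model.encode(query).tolist()
    hits = qdrant.search(
        collection_name="semantica_chunks",
        query_vector=query_vector,
        limit=limit,
    )
    # Shape the hits like the example response above.
    return [
        {
            "score": hit.score,
            "text": hit.payload["text"],
            "source_file": hit.payload["source_file"],
            "chunk_id": hit.payload["chunk_id"],
        }
        for hit in hits
    ]

print(search("Does this PDF mention 'fun' keyword?"))
```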
- LLM-based answer generation
- Multi-document support
- Frontend interface for document search (possibly as a separate project)
Pull requests, ideas, and feedback are always welcome. If you use this project, feel free to ⭐️ the repo and share your experience.
MIT