An intelligent, configurable microservice for document text extraction, AI-powered summarization, text embedding, and token counting. Built with FastAPI and designed for scalability, QuickDoc lets you enable only the features you need to optimize resource usage.
- Multi-format Support: PDF, DOCX, ODT, RTF, Markdown, EPUB, and images (JPG, PNG, BMP, TIFF)
- Page-by-Page Extraction: Extract text from PDFs page by page or as a complete document
- Chapter-by-Chapter Extraction: Extract text from EPUBs chapter by chapter or as a complete book
- Intelligent OCR: Automatic text extraction from scanned PDFs and images using PaddleOCR
- Configurable Processing: Enable/disable specific document types to save resources
- Text Summarization: Advanced summarization with configurable quality levels using transformer models
- Document Embeddings: Convert PDFs/EPUBs to page-by-page embeddings with intelligent chunking
- Text Embeddings: Generate semantic embeddings for texts and documents
- Token Counting: Accurate token counting for Llama 3, Mistral, and Gemini models
- Async Processing: Non-blocking AI operations with queue management
- Modular Features: Enable only the services you need
- Resource Optimization: Conditional model loading based on configuration
- Configurable Models: Choose your preferred AI models via environment variables
- Production Ready: Docker support with health checks and proper logging
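The conditional model loading mentioned above can be sketched as follows. This is an illustrative sketch driven by the same environment flags the service uses; the helper names (`env_flag`, `load_models`) are hypothetical, not QuickDoc's actual internals:

```python
import os

# Hypothetical sketch of settings-driven conditional loading; QuickDoc's
# real implementation may differ.

def env_flag(name: str, default: bool = True) -> bool:
    """Parse a boolean feature flag such as ENABLE_SUMMARIZATION=true."""
    return os.getenv(name, str(default)).strip().lower() in ("1", "true", "yes")

def load_models() -> dict:
    """Load only the models whose feature flags are enabled."""
    models = {}
    if env_flag("ENABLE_SUMMARIZATION"):
        # e.g. transformers.pipeline("summarization", model=os.getenv("SUMMARIZATION_MODEL"))
        models["summarizer"] = "loaded"
    if env_flag("ENABLE_EMBEDDING_MODEL"):
        # e.g. SentenceTransformer(os.getenv("EMBEDDING_MODEL", "all-MiniLM-L6-v2"))
        models["embedder"] = "loaded"
    return models
```

Disabled features never touch their model weights, which is what keeps a minimal deployment lightweight.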
- Clone the repository

  ```bash
  git clone https://github.com/digitaldrreamer/quickdoc.git
  cd quickdoc
  ```

- Configure your deployment

  ```bash
  cp env.example .env
  # Edit .env to enable/disable features as needed
  ```

- Start with Docker Compose

  ```bash
  docker-compose up --build
  ```

- Test the service

  ```bash
  # Test document extraction
  echo "Hello, QuickDoc!" > test.md
  curl -F "file=@test.md" http://localhost:8005/extract

  # Check service status
  curl http://localhost:8005/health
  ```

The service will be available at http://localhost:8005 with interactive documentation at http://localhost:8005/docs. You can set the port in .env.
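The same quick test can be run from Python using only the standard library. A minimal sketch, assuming the service is running on localhost:8005 as above (the `build_multipart` helper is illustrative, not part of QuickDoc):

```python
import json
import urllib.request
import uuid


def build_multipart(filename: str, content: bytes) -> tuple[bytes, str]:
    """Encode one file as a multipart/form-data body (field name "file").

    Returns the body bytes and the Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def extract_text(filename: str, content: bytes,
                 base_url: str = "http://localhost:8005") -> dict:
    """POST a file to /extract and return the parsed JSON response."""
    body, content_type = build_multipart(filename, content)
    req = urllib.request.Request(
        f"{base_url}/extract", data=body,
        headers={"Content-Type": content_type}, method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)


if __name__ == "__main__":
    result = extract_text("test.md", b"Hello, QuickDoc!")
    print(result["text"], result["character_count"])
```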
System Dependencies:

```bash
# Ubuntu/Debian
sudo apt-get update && sudo apt-get install -y \
    pandoc poppler-utils libmagic1 tesseract-ocr \
    libgl1-mesa-glx libglib2.0-0 libsm6 libxext6 libgomp1

# macOS
brew install pandoc poppler libmagic tesseract
```

Python Setup:
```bash
# Clone and setup
git clone https://github.com/digitaldrreamer/quickdoc.git
cd quickdoc

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Copy and configure environment
cp env.example .env
# Edit the .env file with your preferences

# Development
python -m uvicorn app.main:app --host 0.0.0.0 --port 8005 --reload

# Production
gunicorn app.main:app -w 4 -k uvicorn.workers.UvicornWorker --bind 0.0.0.0:8005
```

QuickDoc is highly configurable through environment variables. Copy env.example to .env and customize:
```bash
# Enable/disable major components
ENABLE_SUMMARIZATION=true          # AI text summarization
ENABLE_EMBEDDING_MODEL=true        # Text embedding generation
ENABLE_TOKEN_COUNTING=true         # Token counting for various models
ENABLE_DOCUMENT_PROCESSING=true    # Document text extraction

# Fine-grained document type control
ENABLE_PDF_PROCESSING=true         # PDF text extraction & OCR
ENABLE_DOCX_PROCESSING=true        # Word/ODT/RTF processing
ENABLE_IMAGE_OCR=true              # Image text extraction
ENABLE_MARKDOWN_PROCESSING=true    # Markdown processing

# Specify which models to use
SUMMARIZATION_MODEL=google/flan-t5-small   # Hugging Face model for summarization
EMBEDDING_MODEL=all-MiniLM-L6-v2           # Sentence transformer model
```

Minimal Deployment (Text extraction only):
```bash
ENABLE_SUMMARIZATION=false
ENABLE_EMBEDDING_MODEL=false
ENABLE_TOKEN_COUNTING=false
ENABLE_IMAGE_OCR=false
```

PDF-only Service:

```bash
ENABLE_DOCX_PROCESSING=false
ENABLE_IMAGE_OCR=false
ENABLE_MARKDOWN_PROCESSING=false
```

AI-only Service (no document processing):

```bash
ENABLE_DOCUMENT_PROCESSING=false
```

Extract text from documents.
```bash
curl -X POST -F "file=@document.pdf" http://localhost:8005/extract
```

Response:

```json
{
  "text": "Extracted text content...",
  "filename": "document.pdf",
  "file_type": ".pdf",
  "character_count": 1250,
  "metrics": {
    "processing_duration_ms": 150.2,
    "memory_usage_mb": 75.3,
    "processing_method": "pdfminer"
  }
}
```

Convert documents to PDF format.
Generate embeddings for text.
```bash
curl -X POST -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "normalize": true}' \
  http://localhost:8005/ai/embed/text
```

Extract text from a document and generate embeddings.

```bash
curl -X POST -F "file=@document.pdf" http://localhost:8005/ai/embed/document
```

Convert PDF/EPUB to page-by-page embeddings with intelligent chunking.

```bash
curl -X POST -F "file=@document.pdf" -F "chunking_strategy=semantic" http://localhost:8005/embed/document
```

Response:
```json
{
  "success": true,
  "filename": "document.pdf",
  "file_type": ".pdf",
  "chunks": [
    {
      "chunk_id": "document.pdf_page_1_chunk_0",
      "text": "Chapter 1: Introduction...",
      "embedding": [0.123, -0.456, 0.789, ...],
      "metadata": {
        "page_number": 1,
        "chunk_index": 0,
        "char_count": 1250,
        "word_count": 200,
        "contains_headers": true,
        "semantic_boundary": "paragraph"
      }
    }
  ],
  "stats": {
    "total_chunks": 45,
    "total_pages": 20,
    "embedding_dimensions": 384,
    "processing_time_ms": 2340.5
  }
}
```

Summarize text with configurable quality.

```bash
curl -X POST -H "Content-Type: application/json" \
  -d '{"text": "Long text to summarize...", "max_length": 150, "quality": "high"}' \
  http://localhost:8005/ai/summarize
```

Count tokens for specific models (llama3, mistral, gemini).

```bash
curl -X POST -H "Content-Type: application/json" \
  -d '{"text": "Text to count tokens for"}' \
  http://localhost:8005/ai/tokens/count/llama3
```

- `GET /health` - Service health check with feature status
- `GET /` - API overview and available endpoints
- `GET /docs` - Interactive API documentation (Swagger UI)
Basic deployment:
```bash
docker-compose up -d
```

With custom configuration:

```bash
# Create custom .env file
cp env.example .env
# Edit .env with your settings
docker-compose up -d
```

Using Docker with resource limits:
```yaml
version: '3.8'
services:
  quickdoc:
    image: quickdoc:latest
    ports:
      - "8005:8005"
    env_file: .env
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: '1.0'
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8005/health"]
      interval: 30s
      timeout: 10s
      retries: 3
```

Environment Variables for Production:
```bash
# Resource optimization
MAX_FILE_SIZE_MB=50
SUMMARIZATION_TIMEOUT=600
MAX_SUMMARIZATION_QUEUE_SIZE=50
LOG_LEVEL=WARNING

# Security (if using external APIs)
HUGGING_FACE_HUB_TOKEN=your_secure_token
```

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: quickdoc
spec:
  replicas: 3
  selector:
    matchLabels:
      app: quickdoc
  template:
    metadata:
      labels:
        app: quickdoc
    spec:
      containers:
        - name: quickdoc
          image: quickdoc:latest
          ports:
            - containerPort: 8005
          env:
            - name: ENABLE_SUMMARIZATION
              value: "true"
            - name: ENABLE_EMBEDDING_MODEL
              value: "true"
          resources:
            requests:
              memory: "1Gi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
          livenessProbe:
            httpGet:
              path: /health
              port: 8005
            initialDelaySeconds: 30
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: quickdoc-service
spec:
  selector:
    app: quickdoc
  ports:
    - port: 80
      targetPort: 8005
  type: LoadBalancer
```
```bash
# Clone and setup
git clone https://github.com/digitaldrreamer/quickdoc.git
cd quickdoc

# Setup virtual environment
python -m venv venv
source venv/bin/activate

# Install development dependencies
pip install -r requirements.txt

# Run in development mode
python -m uvicorn app.main:app --reload --host 0.0.0.0 --port 8005
```

```bash
# Run the test suite
python -m pytest

# Test specific endpoints
python test_pdf_endpoints.py
python test_enhanced_pdf.py
```

The project follows the Google Python Style Guide and includes:
- Type hints throughout the codebase
- Comprehensive error handling
- Structured logging
- Resource tracking and metrics
- Async/await patterns for scalability
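A hypothetical handler illustrating these conventions together (type hints, docstrings, error handling, structured logging, and async/await); this is a sketch, not QuickDoc's actual code:

```python
# Illustrative only: the function name, fields, and "extraction" step below
# are hypothetical and stand in for real document processing.
import logging
import time

logger = logging.getLogger("quickdoc.example")


async def extract_with_metrics(filename: str, data: bytes) -> dict[str, object]:
    """Extract text from raw bytes and report simple processing metrics."""
    start = time.perf_counter()
    try:
        text = data.decode("utf-8", errors="replace")  # placeholder "extraction"
    except Exception:
        logger.exception("extraction failed", extra={"filename": filename})
        raise
    duration_ms = (time.perf_counter() - start) * 1000
    logger.info(
        "extraction complete",
        extra={"filename": filename, "chars": len(text), "ms": round(duration_ms, 1)},
    )
    return {
        "text": text,
        "character_count": len(text),
        "processing_duration_ms": duration_ms,
    }
```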
- Framework: FastAPI 0.104+
- AI/ML:
  - Transformers 4.41+ (summarization)
  - Sentence Transformers 2.7+ (embeddings)
  - PaddleOCR 2.7+ (OCR)
- Document Processing:
  - PDFMiner.six (PDF text extraction)
  - Pandoc (document conversion)
  - PyMuPDF (PDF rendering)
- Infrastructure:
  - Docker & Docker Compose
  - Uvicorn/Gunicorn (ASGI servers)
  - Pydantic (configuration & validation)
We welcome contributions! Please see our contributing guidelines:
- Fork the repository
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Make your changes with tests
- Follow the coding standards: Google Python Style Guide
- Submit a pull request
- Add type hints to all functions
- Include docstrings for public methods
- Write tests for new features
- Update documentation for API changes
- Use Better Comments style for inline comments
- Use descriptive branch names so their purpose is clear at first glance
This project is licensed under the MIT License - see the LICENSE file for details.
- PaddleOCR for excellent OCR capabilities
- Hugging Face Transformers for state-of-the-art NLP models
- FastAPI for the excellent web framework
- The open source community for inspiration and tools
- Documentation: Check the /docs endpoint when the service is running
- Issues: Please report bugs via GitHub Issues
- Discussions: Start discussions for feature requests and questions