A FastAPI-based web service that converts PDF files to various formats (Markdown, JSON, HTML, or Chunks) using the datalab-to/marker library. The API supports OCR, image extraction, and optional LLM integration, delivering results as a ZIP file.
- Multiple Output Formats: Convert PDFs to Markdown, JSON, HTML, or structured chunks.
- OCR Support: Force OCR or strip existing OCR for better accuracy.
- Image Extraction: Extract images from PDFs (can be disabled).
- Page Range Selection: Process specific pages or ranges.
- Multi-language OCR: Support for multiple languages.
- LLM Integration: Optional LLM for enhanced conversion accuracy.
- Custom Prompts: Use custom prompts for LLM block correction.
- Math Enhancement: Optional inline math conversion with LLM.
- API Key Authentication: Secure endpoint access with an API key.
- CORS Restrictions: Configurable allowed origins for browser requests.
- File Name Normalization: Handles spaces and invalid characters in file names.
- Interactive Documentation: Scalar API documentation interface.
- Automatic Cleanup: Temporary files are deleted after processing.
- Python 3.10 or higher
- pip
- Clone or download the project files:

  ```bash
  git clone <repository-url>
  cd pdf-to-md-api
  ```

- Create a virtual environment:

  ```bash
  python -m venv .venv
  ```

- Activate the virtual environment:

  ```bash
  # Windows
  .\.venv\Scripts\activate

  # macOS/Linux
  source .venv/bin/activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

  Note: Install `marker-pdf` and `scalar_fastapi` separately if required (check their documentation).

- Configure environment variables: create a `.env` file in the project root:

  ```env
  API_KEY=your-secure-api-key
  ALLOWED_ORIGINS=http://localhost:3000,http://127.0.0.1:3000
  PORT=8080
  ```

  Generate a secure API key:

  ```bash
  python -c "import secrets; print(secrets.token_urlsafe(32))"
  ```

- Run the application:

  ```bash
  uvicorn main:app --host 127.0.0.1 --port 8080 --reload
  ```
The API will be available at `http://localhost:8080`.
- Scalar Docs: `http://localhost:8080/docs`
- OpenAPI Schema: `http://localhost:8080/openapi.json`
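As a quick check that the server is running, you can fetch the OpenAPI schema; whether the docs routes require the API key depends on how authentication is wired in `main.py`, so adjust if needed:

```python
import requests

# Quick smoke test: fetch the OpenAPI schema and print the API title.
schema = requests.get("http://localhost:8080/openapi.json", timeout=5)
schema.raise_for_status()
print(schema.json()["info"]["title"])  # e.g. "PDF to Markdown/JSON Converter"
```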
The API info endpoint returns metadata including the title, description, version, and tags.
Response:
```json
{
  "title": "PDF to Markdown/JSON Converter",
  "description": "API to convert PDF files to various formats using datalab-to/marker (LLM optional)",
  "version": "1.0.0",
  "openapi_url": "/openapi.json",
  "openapi_tags": [
    {"name": "PDF Conversion", "description": "Endpoints for converting PDF files to Markdown, JSON, HTML, or Chunks."},
    {"name": "API Info", "description": "Endpoint to retrieve API metadata."}
  ]
}
```

The `/convert` endpoint converts a PDF file to the specified format, returning a ZIP file containing the output, metadata, and images.
Headers:
- `X-API-Key`: Required API key from `.env`.
- `Origin`: Must match an allowed origin in `ALLOWED_ORIGINS`.
Parameters:
- `file` (required): PDF file to upload.
- `output_format` (optional): `markdown` (default), `json`, `html`, or `chunks`.
- `force_ocr` (optional): Force OCR on all pages (default: `false`).
- `strip_existing_ocr` (optional): Remove existing OCR (default: `false`).
- `disable_image_extraction` (optional): Skip image extraction (default: `false`).
- `page_range` (optional): Pages to process, e.g., `"0,5-10,20"`.
- `langs` (optional): Comma-separated languages for OCR, e.g., `"en,fr"`.
- `use_llm` (optional): Enable LLM for improved accuracy (default: `false`).
- `block_correction_prompt` (optional): Custom prompt for LLM correction.
- `llm_service` (optional): LLM service (see supported services below).
- `redo_inline_math` (optional): Enhance math conversion (default: `false`).
Supported LLM Services:
- `marker.services.gemini.GoogleGeminiService`
- `marker.services.vertex.GoogleVertexService`
- `marker.services.ollama.OllamaService`
- `marker.services.claude.ClaudeService`
- `marker.services.openai.OpenAIService`
- `marker.services.azure_openai.AzureOpenAIService`
Response: A ZIP file containing:
- `output.{md|json|html|chunks}`: Converted content.
- `metadata.json`: Conversion metadata (e.g., table of contents, page stats).
- Extracted images as JPEGs (if enabled).
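Once a request succeeds (see the curl and Python examples below), the returned archive can be unpacked with Python's standard library; the exact file names depend on the chosen `output_format` and the images in the PDF:

```python
import zipfile

# Inspect and extract the ZIP returned by /convert.
with zipfile.ZipFile("output.zip") as archive:
    print(archive.namelist())        # e.g. ['output.md', 'metadata.json', ...]
    archive.extractall("converted")  # write all files into ./converted/

with open("converted/output.md", encoding="utf-8") as f:
    markdown = f.read()
```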
```bash
curl -X POST "http://localhost:8080/convert" \
  -H "X-API-Key: your-secure-api-key" \
  -H "Origin: http://localhost:3000" \
  -F "file=@document.pdf"
```

```bash
curl -X POST "http://localhost:8080/convert" \
  -H "X-API-Key: your-secure-api-key" \
  -H "Origin: http://localhost:3000" \
  -F "file=@document.pdf" \
  -F "output_format=json" \
  -F "force_ocr=true"
```

```bash
curl -X POST "http://localhost:8080/convert" \
  -H "X-API-Key: your-secure-api-key" \
  -H "Origin: http://localhost:3000" \
  -F "file=@document.pdf" \
  -F "page_range=0,5-10" \
  -F "use_llm=true" \
  -F "llm_service=marker.services.openai.OpenAIService"
```

```python
import requests

url = "http://localhost:8080/convert"
headers = {
    "X-API-Key": "your-secure-api-key",
    "Origin": "http://localhost:3000"
}

# Basic conversion
with open("document.pdf", "rb") as f:
    files = {"file": f}
    data = {"output_format": "markdown"}
    response = requests.post(url, headers=headers, files=files, data=data)

with open("output.zip", "wb") as f:
    f.write(response.content)
```
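The same client pattern works with any of the form fields documented above; for example, a JSON conversion of selected pages with LLM assistance (values here are illustrative):

```python
import requests

url = "http://localhost:8080/convert"
headers = {
    "X-API-Key": "your-secure-api-key",
    "Origin": "http://localhost:3000"
}

with open("document.pdf", "rb") as f:
    files = {"file": f}
    data = {
        "output_format": "json",
        "page_range": "0,5-10",
        "use_llm": "true",
        "llm_service": "marker.services.openai.OpenAIService",
    }
    response = requests.post(url, headers=headers, files=files, data=data)

response.raise_for_status()
with open("output.zip", "wb") as out:
    out.write(response.content)
```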
Set in `.env`:

- `API_KEY`: Secure key for endpoint access.
- `ALLOWED_ORIGINS`: Comma-separated allowed origins (e.g., `http://localhost:3000`).
- `PORT`: Port for the server (default: `8000`).
For LLM services, add the variables below (a startup-check sketch follows the list):

- OpenAI: `OPENAI_API_KEY`
- Google Gemini: `GOOGLE_API_KEY`
- Claude: `ANTHROPIC_API_KEY`
- Azure OpenAI: `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT`
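A minimal sketch of such a check, assuming `python-dotenv` is used to load `.env`; the mapping simply restates the list above (Vertex and Ollama use their own configuration and are omitted), and the function name is illustrative:

```python
import os
from dotenv import load_dotenv

# Required environment variables per LLM service, as listed above.
REQUIRED_LLM_ENV = {
    "marker.services.openai.OpenAIService": ["OPENAI_API_KEY"],
    "marker.services.gemini.GoogleGeminiService": ["GOOGLE_API_KEY"],
    "marker.services.claude.ClaudeService": ["ANTHROPIC_API_KEY"],
    "marker.services.azure_openai.AzureOpenAIService": ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"],
}

def check_llm_env(llm_service: str) -> None:
    """Raise early if the selected LLM service is missing its credentials."""
    load_dotenv()
    missing = [v for v in REQUIRED_LLM_ENV.get(llm_service, []) if not os.getenv(v)]
    if missing:
        raise RuntimeError(f"Missing environment variables for {llm_service}: {', '.join(missing)}")
```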
The application creates uploads and output directories for temporary storage, which are cleaned up after processing.
```
project/
├── main.py
├── .env
├── requirements.txt
├── README.md
├── uploads/   # Auto-created for temporary files
└── output/    # Auto-created for temporary files
```
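A sketch of the temporary-file lifecycle described above; the real `main.py` may differ, and the function and variable names here are illustrative:

```python
import shutil
from pathlib import Path
from uuid import uuid4

UPLOAD_DIR = Path("uploads")
OUTPUT_DIR = Path("output")

def convert_and_zip(pdf_bytes: bytes) -> bytes:
    """Write the upload to disk, convert it, zip the results, and clean up."""
    UPLOAD_DIR.mkdir(exist_ok=True)
    OUTPUT_DIR.mkdir(exist_ok=True)
    upload_path = UPLOAD_DIR / f"{uuid4().hex}.pdf"
    job_dir = OUTPUT_DIR / uuid4().hex
    try:
        upload_path.write_bytes(pdf_bytes)
        job_dir.mkdir()
        # ... run the marker conversion here, writing output and images into job_dir ...
        zip_path = shutil.make_archive(str(job_dir), "zip", job_dir)
        return Path(zip_path).read_bytes()
    finally:
        # Temporary files are removed whether or not the conversion succeeded.
        upload_path.unlink(missing_ok=True)
        shutil.rmtree(job_dir, ignore_errors=True)
        job_dir.with_suffix(".zip").unlink(missing_ok=True)
```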
The API provides clear error responses:

- 400 Bad Request: Invalid file type or parameters (e.g., `{"detail": "Only PDF files are supported"}`).
- 401 Unauthorized: Missing or invalid API key.
- 500 Internal Server Error: Unexpected processing errors (e.g., `{"detail": "Internal Server Error"}`).
For development:

```bash
uvicorn main:app --host 127.0.0.1 --port 8080 --reload
```

For production:

- Configure `ALLOWED_ORIGINS` strictly.
- Use HTTPS with a reverse proxy (e.g., Nginx).
- Use a production ASGI server:

  ```bash
  gunicorn main:app -w 4 -k uvicorn.workers.UvicornWorker
  ```

- Set environment variables securely.
- Limit file upload size (e.g., via Nginx or in application code; see the sketch after this list).
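For the last item, one way to enforce a size limit in application code is an HTTP middleware that checks the declared `Content-Length`. A minimal sketch, where the limit value and middleware name are illustrative and `main.py` would use its existing `app`:

```python
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

MAX_UPLOAD_BYTES = 50 * 1024 * 1024  # 50 MB; adjust as needed

app = FastAPI()

@app.middleware("http")
async def limit_upload_size(request: Request, call_next):
    # Reject requests whose declared body size exceeds the limit.
    # A proxy-level limit (e.g. Nginx client_max_body_size) is still advisable,
    # since Content-Length may be absent for chunked uploads.
    content_length = request.headers.get("content-length")
    if content_length is not None and int(content_length) > MAX_UPLOAD_BYTES:
        return JSONResponse(status_code=413, content={"detail": "Uploaded file is too large"})
    return await call_next(request)
```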
- Memory Usage: ZIP files are read into memory; large PDFs may increase memory usage.
- Processing Time: OCR and LLM processing can be slow.
- Concurrent Requests: Scale workers for high load.
- File Cleanup: Temporary files are automatically removed.
Key dependencies:
- FastAPI: Web framework
- uvicorn: ASGI server
- python-dotenv: Environment variable management
- pillow: Image processing
- marker-pdf: PDF conversion library (install separately)
- scalar-fastapi: API documentation (install separately)
See requirements.txt for core dependencies.
Check the license of datalab-to/marker for usage terms.
- Fork the repository.
- Create a feature branch.
- Make and test changes.
- Submit a pull request.
For issues:
- PDF conversion: See marker documentation.
- API errors: Check response messages and server logs.
- LLM integration: Verify API keys and service configuration.