PDF to Markdown/JSON Converter API

A FastAPI-based web service that converts PDF files to various formats (Markdown, JSON, HTML, or Chunks) using the datalab-to/marker library. The API supports OCR, image extraction, and optional LLM integration, delivering results as a ZIP file.

Features

Multiple Output Formats: Convert PDFs to Markdown, JSON, HTML, or structured chunks.
OCR Support: Force OCR or strip existing OCR for better accuracy.
Image Extraction: Extract images from PDFs (can be disabled).
Page Range Selection: Process specific pages or ranges.
Multi-language OCR: Support for multiple languages.
LLM Integration: Optional LLM for enhanced conversion accuracy.
Custom Prompts: Use custom prompts for LLM block correction.
Math Enhancement: Optional inline math conversion with LLM.
API Key Authentication: Secure endpoint access with an API key.
CORS Restrictions: Configurable allowed origins for browser requests.
File Name Normalization: Handles spaces and invalid characters in file names.
Interactive Documentation: Scalar API documentation interface.
Automatic Cleanup: Temporary files are deleted after processing.

Installation

Prerequisites

Python 3.10 or higher
pip

Setup

Clone or download the project files:

git clone <repository-url>
cd pdf-to-md-api

Create a virtual environment:
```
python -m venv .venv
```

Activate the virtual environment:

# Windows
.\.venv\Scripts\activate

# macOS/Linux
source .venv/bin/activate

Install dependencies:
```
pip install -r requirements.txt
```
Note: Install marker-pdf and scalar_fastapi separately if required (check their documentation).

Configure environment variables: Create a .env file in the project root:

API_KEY=your-secure-api-key
ALLOWED_ORIGINS=http://localhost:3000,http://127.0.0.1:3000
PORT=8080

Generate a secure API key:

python -c "import secrets; print(secrets.token_urlsafe(32))"

Run the application:
```
uvicorn main:app --host 127.0.0.1 --port 8080 --reload
```
The API will be available at http://localhost:8080.

API Documentation

Interactive Documentation

Scalar Docs: http://localhost:8080/docs
OpenAPI Schema: http://localhost:8080/openapi.json

Endpoints

GET `/`

Get API metadata including title, description, version, and tags.

Response:

{
  "title": "PDF to Markdown/JSON Converter",
  "description": "API to convert PDF files to various formats using datalab-to/marker (LLM optional)",
  "version": "1.0.0",
  "openapi_url": "/openapi.json",
  "openapi_tags": [
    {"name": "PDF Conversion", "description": "Endpoints for converting PDF files to Markdown, JSON, HTML, or Chunks."},
    {"name": "API Info", "description": "Endpoint to retrieve API metadata."}
  ]
}

POST `/convert`

Convert a PDF file to the specified format, returning a ZIP file containing the output, metadata, and images.

Headers:

X-API-Key: Required API key from .env.
Origin: Must match an allowed origin in ALLOWED_ORIGINS.
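
The source does not show how the service checks `X-API-Key`. The following is a minimal sketch of one common approach, a constant-time comparison against the configured key. The function name `is_valid_api_key` is hypothetical, and reading the expected key from the `API_KEY` environment variable is an assumption based on the `.env` file described earlier.

```python
import hmac
import os


def is_valid_api_key(provided, expected=None):
    """Return True when the provided key matches the expected one.

    `expected` defaults to the API_KEY environment variable (an assumption
    about where the service keeps its key).
    """
    if expected is None:
        expected = os.environ.get("API_KEY", "")
    if not provided or not expected:
        # Reject missing keys, and never accept anything if no key is configured.
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(provided.encode("utf-8"), expected.encode("utf-8"))
```

A request whose key fails this check would map to the 401 response listed under Error Handling.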

Parameters:

file (required): PDF file to upload.
output_format (optional): markdown (default), json, html, or chunks.
force_ocr (optional): Force OCR on all pages (default: false).
strip_existing_ocr (optional): Remove existing OCR text before processing (default: false).
disable_image_extraction (optional): Skip image extraction (default: false).
page_range (optional): Pages to process, e.g., "0,5-10,20".
langs (optional): Comma-separated languages for OCR, e.g., "en,fr".
use_llm (optional): Enable LLM for improved accuracy (default: false).
block_correction_prompt (optional): Custom prompt for LLM block correction.
llm_service (optional): LLM service class path (see supported services below).
redo_inline_math (optional): Enhance inline math conversion with the LLM (default: false).
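
To illustrate the `page_range` syntax, here is a small sketch that expands a value such as "0,5-10,20" into individual page indices. It assumes zero-based indices and inclusive ranges, as the example value suggests. `parse_page_range` is a hypothetical helper for client-side validation, not the parser used by marker.

```python
def parse_page_range(spec):
    """Expand a string such as "0,5-10,20" into a sorted list of page indices."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(start, end + 1))  # ranges are treated as inclusive
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_page_range("0,5-10,20"))  # [0, 5, 6, 7, 8, 9, 10, 20]
```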

Supported LLM Services:

marker.services.gemini.GoogleGeminiService
marker.services.vertex.GoogleVertexService
marker.services.ollama.OllamaService
marker.services.claude.ClaudeService
marker.services.openai.OpenAIService
marker.services.azure_openai.AzureOpenAIService

Response: A ZIP file containing:

output.{md|json|html|chunks}: Converted content.
metadata.json: Conversion metadata (e.g., table of contents, page stats).
Extracted images as JPEGs (if enabled).
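
On the client side, the returned archive can be unpacked without writing it to disk first. The sketch below uses only the standard library. `unpack_result` is a hypothetical helper, and the stand-in archive exists only so the example runs without a server; in practice the bytes would come from the HTTP response body.

```python
import io
import zipfile


def unpack_result(zip_bytes):
    """Return a dict mapping each archive member name to its raw bytes."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}


# Build a small stand-in archive so the sketch runs without a server.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("output.md", "# Title\n")
    archive.writestr("metadata.json", "{}")

members = unpack_result(buffer.getvalue())
print(sorted(members))  # ['metadata.json', 'output.md']
```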

Usage Examples

cURL Examples

Basic Conversion (PDF to Markdown)

curl -X POST "http://localhost:8080/convert" \
  -H "X-API-Key: your-secure-api-key" \
  -H "Origin: http://localhost:3000" \
  -F "file=@document.pdf"

Convert to JSON with OCR

curl -X POST "http://localhost:8080/convert" \
  -H "X-API-Key: your-secure-api-key" \
  -H "Origin: http://localhost:3000" \
  -F "file=@document.pdf" \
  -F "output_format=json" \
  -F "force_ocr=true"

Convert Specific Pages with LLM

curl -X POST "http://localhost:8080/convert" \
  -H "X-API-Key: your-secure-api-key" \
  -H "Origin: http://localhost:3000" \
  -F "file=@document.pdf" \
  -F "page_range=0,5-10" \
  -F "use_llm=true" \
  -F "llm_service=marker.services.openai.OpenAIService"

Python Example

import requests

url = "http://localhost:8080/convert"
headers = {
    "X-API-Key": "your-secure-api-key",
    "Origin": "http://localhost:3000"
}

# Basic conversion
with open("document.pdf", "rb") as f:
    files = {"file": f}
    data = {"output_format": "markdown"}
    response = requests.post(url, headers=headers, files=files, data=data)

# Stop on 4xx/5xx responses instead of saving an error body as a ZIP
response.raise_for_status()

with open("output.zip", "wb") as f:
    f.write(response.content)

Configuration

Environment Variables

Set in .env:

API_KEY: Secure key for endpoint access.
ALLOWED_ORIGINS: Comma-separated allowed origins (e.g., http://localhost:3000).
PORT: Port for the server (default: 8000; the examples in this README use 8080).
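
As an illustration of how these three values might be read, the sketch below parses them from a mapping of environment variables. The helper names `parse_allowed_origins` and `read_settings` are hypothetical, and trimming whitespace and trailing slashes from origins is an assumption, not documented behavior of the service.

```python
import os


def parse_allowed_origins(raw):
    """Split a comma-separated origin list, dropping blanks and trailing slashes."""
    origins = []
    for item in raw.split(","):
        origin = item.strip().rstrip("/")
        if origin:
            origins.append(origin)
    return origins


def read_settings(env=os.environ):
    """Collect the three settings described above from a mapping of variables."""
    return {
        "api_key": env.get("API_KEY", ""),
        "allowed_origins": parse_allowed_origins(env.get("ALLOWED_ORIGINS", "")),
        "port": int(env.get("PORT", "8000")),  # 8000 is the documented default
    }
```

Passing the mapping explicitly, for example `read_settings({"PORT": "8080"})`, keeps the function easy to test without touching the real environment.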

For LLM services, add:

OpenAI: OPENAI_API_KEY
Google Gemini: GOOGLE_API_KEY
Claude: ANTHROPIC_API_KEY
Azure OpenAI: AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT
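
A quick pre-flight check can catch a missing key before a request with `use_llm=true` is sent. The sketch below pairs the service class paths listed under Supported LLM Services with the variables named above. The pairing and the helper `missing_llm_variables` are assumptions for illustration. Vertex and Ollama are left out because this README names no variables for them.

```python
import os

# Variable names taken from the list above; services without an entry there
# (Vertex, Ollama) are deliberately left out rather than guessed.
REQUIRED_ENV = {
    "marker.services.openai.OpenAIService": ["OPENAI_API_KEY"],
    "marker.services.gemini.GoogleGeminiService": ["GOOGLE_API_KEY"],
    "marker.services.claude.ClaudeService": ["ANTHROPIC_API_KEY"],
    "marker.services.azure_openai.AzureOpenAIService": [
        "AZURE_OPENAI_API_KEY",
        "AZURE_OPENAI_ENDPOINT",
    ],
}


def missing_llm_variables(llm_service, env=os.environ):
    """Return the documented variables that are unset or empty for a service."""
    return [name for name in REQUIRED_ENV.get(llm_service, []) if not env.get(name)]
```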

Directory Structure

The application creates uploads and output directories for temporary storage, which are cleaned up after processing.

project/
├── main.py
├── .env
├── requirements.txt
├── README.md
├── uploads/  # Auto-created for temporary files
└── output/   # Auto-created for temporary files
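
The cleanup behavior described above can be sketched with a context manager that creates a per-request working directory and removes it when processing ends, whether or not the conversion succeeded. This is an illustrative pattern, not the code in `main.py`, and `job_workspace` is a hypothetical name.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def job_workspace(base_dir):
    """Create a per-request directory under base_dir and remove it afterwards."""
    base = Path(base_dir)
    base.mkdir(parents=True, exist_ok=True)  # mirrors the auto-created folders
    workdir = Path(tempfile.mkdtemp(dir=base))
    try:
        yield workdir
    finally:
        # Runs on success and on error, so failed conversions leave nothing behind.
        shutil.rmtree(workdir, ignore_errors=True)
```

Giving each request its own subdirectory also keeps concurrent uploads with the same file name from overwriting each other.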

Error Handling

The API provides clear error responses:

400 Bad Request: Invalid file type or parameters (e.g., {"detail": "Only PDF files are supported"}).
401 Unauthorized: Missing or invalid API key.
500 Internal Server Error: Unexpected processing errors (e.g., {"detail": "Internal Server Error"}).
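
A client can turn these responses into readable messages. The sketch below covers only the status codes listed above and assumes the body is the decoded JSON object with a `detail` field, as in the examples. `explain_failure` is a hypothetical helper.

```python
def explain_failure(status_code, body):
    """Turn a status code and decoded JSON body into a short message."""
    detail = body.get("detail") if isinstance(body, dict) else None
    if status_code == 400:
        return f"Bad request: {detail or 'invalid file type or parameters'}"
    if status_code == 401:
        return "Unauthorized: check the X-API-Key header"
    if status_code >= 500:
        return f"Server error: {detail or 'see the server logs'}"
    return f"Unexpected status {status_code}"


print(explain_failure(400, {"detail": "Only PDF files are supported"}))
```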

Development

Running in Development Mode

uvicorn main:app --host 127.0.0.1 --port 8080 --reload

Production Deployment

For production:

Configure ALLOWED_ORIGINS strictly.
Use HTTPS with a reverse proxy (e.g., Nginx).

Use a production ASGI server:

gunicorn main:app -w 4 -k uvicorn.workers.UvicornWorker

Set environment variables securely.
Limit file upload size (e.g., via Nginx or code).
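
If the upload limit is enforced in code instead of at the proxy, one option is to read the upload in chunks and abort once a byte budget is exceeded. The sketch below works on any binary file-like object. `read_limited` and `UploadTooLarge` are hypothetical names, and the 64 KiB chunk size is arbitrary.

```python
import io


class UploadTooLarge(Exception):
    """Raised when a stream exceeds the configured byte limit."""


def read_limited(stream, max_bytes, chunk_size=64 * 1024):
    """Read a binary stream fully, aborting once more than max_bytes arrive."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if total > max_bytes:
            raise UploadTooLarge(f"upload exceeds {max_bytes} bytes")
        chunks.append(chunk)
    return b"".join(chunks)
```

Checking the running total while reading avoids trusting a client-supplied `Content-Length` header.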

Performance Considerations

Memory Usage: ZIP files are read into memory; large PDFs may increase memory usage.
Processing Time: OCR and LLM processing can be slow.
Concurrent Requests: Scale workers for high load.
File Cleanup: Temporary files are automatically removed.

Dependencies

Key dependencies:

FastAPI: Web framework
uvicorn: ASGI server
python-dotenv: Environment variable management
pillow: Image processing
marker-pdf: PDF conversion library from datalab-to/marker (install separately)
scalar-fastapi: API documentation (install separately)

See requirements.txt for core dependencies.

License

Check the license of datalab-to/marker for usage terms.

Contributing

Fork the repository.
Create a feature branch.
Make and test changes.
Submit a pull request.

Support

For issues:

PDF conversion: See marker documentation.
API errors: Check response messages and server logs.
LLM integration: Verify API keys and service configuration.
