FastAPI service for converting PDFs to Markdown/JSON/HTML using marker library with OCR and optional LLM integration

pitzzahh/pdf-to-md-api

PDF to Markdown/JSON Converter API

A FastAPI-based web service that converts PDF files to various formats (Markdown, JSON, HTML, or Chunks) using the datalab-to/marker library. The API supports OCR, image extraction, and optional LLM integration, delivering results as a ZIP file.

Features

  • Multiple Output Formats: Convert PDFs to Markdown, JSON, HTML, or structured chunks.
  • OCR Support: Force OCR on every page, or strip a PDF's existing OCR text so it is recognized again.
  • Image Extraction: Extract images from PDFs (can be disabled).
  • Page Range Selection: Process specific pages or ranges.
  • Multi-language OCR: Support for multiple languages.
  • LLM Integration: Optional LLM for enhanced conversion accuracy.
  • Custom Prompts: Use custom prompts for LLM block correction.
  • Math Enhancement: Optional inline math conversion with LLM.
  • API Key Authentication: Secure endpoint access with an API key.
  • CORS Restrictions: Configurable allowed origins for browser requests.
  • File Name Normalization: Handles spaces and invalid characters in file names.
  • Interactive Documentation: Scalar API documentation interface.
  • Automatic Cleanup: Temporary files are deleted after processing.
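As an illustration of the file name normalization feature, here is a minimal sketch of one way to sanitize an uploaded name. The function name `normalize_filename`, the fallback name, and the allowed character set are assumptions for illustration; the service's actual rules may differ.

```python
import re
from pathlib import PurePosixPath


def normalize_filename(name: str, default: str = "document.pdf") -> str:
    """Return a filesystem-friendly version of an uploaded file name (hypothetical helper)."""
    # Keep only the last path component so "../" tricks cannot escape the upload directory.
    base = PurePosixPath(name.replace("\\", "/")).name
    # Collapse whitespace runs into underscores.
    cleaned = re.sub(r"\s+", "_", base.strip())
    # Drop everything except letters, digits, dot, underscore, and hyphen.
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "", cleaned)
    # Fall back to a default when nothing meaningful is left.
    return cleaned if cleaned.strip("._") else default


print(normalize_filename("My Report (final).pdf"))
```

Reducing the name to its last path component first means directory parts sent by a client never reach the file system.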

Installation

Prerequisites

  • Python 3.10 or higher
  • pip

Setup

  1. Clone or download the project files:

    git clone <repository-url>
    cd pdf-to-md-api
  2. Create a virtual environment:

    python -m venv .venv
  3. Activate the virtual environment:

    # Windows
    .\.venv\Scripts\activate
    
    # macOS/Linux
    source .venv/bin/activate
  4. Install dependencies:

    pip install -r requirements.txt

    Note: Install marker-pdf and scalar-fastapi (imported as scalar_fastapi) separately if they are not already present; check their documentation for details.

  5. Configure environment variables: Create a .env file in the project root:

    API_KEY=your-secure-api-key
    ALLOWED_ORIGINS=http://localhost:3000,http://127.0.0.1:3000
    PORT=8080

    Generate a secure API key:

    python -c "import secrets; print(secrets.token_urlsafe(32))"
  6. Run the application:

    uvicorn main:app --host 127.0.0.1 --port 8080 --reload

    The API will be available at http://localhost:8080.
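To show how the three variables from step 5 might be consumed, here is a minimal sketch that reads them from a mapping such as `os.environ`. The `Settings` class and `load_settings` function are hypothetical names, not taken from the project's main.py; in the real application python-dotenv would typically load the .env file into the environment first.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    api_key: str
    allowed_origins: list[str]
    port: int


def load_settings(environ: Optional[Mapping[str, str]] = None) -> Settings:
    """Build settings from environment variables (hypothetical helper)."""
    env = os.environ if environ is None else environ
    api_key = env.get("API_KEY", "")
    if not api_key:
        raise RuntimeError("API_KEY is not set")
    # ALLOWED_ORIGINS is a comma-separated list; ignore blanks and stray spaces.
    origins = [o.strip() for o in env.get("ALLOWED_ORIGINS", "").split(",") if o.strip()]
    # Fall back to 8000 when PORT is unset, matching the documented default.
    port = int(env.get("PORT", "8000"))
    return Settings(api_key=api_key, allowed_origins=origins, port=port)


example = load_settings({
    "API_KEY": "your-secure-api-key",
    "ALLOWED_ORIGINS": "http://localhost:3000,http://127.0.0.1:3000",
    "PORT": "8080",
})
print(example.allowed_origins, example.port)
```

Failing fast on a missing API_KEY avoids starting a server that would accept an empty key.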

API Documentation

Interactive Documentation

  • Scalar Docs: http://localhost:8080/docs
  • OpenAPI Schema: http://localhost:8080/openapi.json

Endpoints

GET /

Get API metadata including title, description, version, and tags.

Response:

{
  "title": "PDF to Markdown/JSON Converter",
  "description": "API to convert PDF files to various formats using datalab-to/marker (LLM optional)",
  "version": "1.0.0",
  "openapi_url": "/openapi.json",
  "openapi_tags": [
    {"name": "PDF Conversion", "description": "Endpoints for converting PDF files to Markdown, JSON, HTML, or Chunks."},
    {"name": "API Info", "description": "Endpoint to retrieve API metadata."}
  ]
}

POST /convert

Convert a PDF file to the specified format, returning a ZIP file containing the output, metadata, and images.

Headers:

  • X-API-Key: Required API key from .env.
  • Origin: Must match an allowed origin in ALLOWED_ORIGINS.
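The two header checks can be sketched as plain functions. This is an assumption about how such checks are commonly written, not the project's code: `api_key_matches` and `origin_allowed` are hypothetical names, and whether the service rejects a request with no Origin header at all is not stated in this README.

```python
import secrets
from typing import Iterable, Optional


def api_key_matches(provided: Optional[str], expected: str) -> bool:
    """Compare the X-API-Key header with the configured key in constant time."""
    if not provided or not expected:
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return secrets.compare_digest(provided.encode("utf-8"), expected.encode("utf-8"))


def origin_allowed(origin: Optional[str], allowed: Iterable[str]) -> bool:
    """Return True when the Origin header equals one of the allowed origins."""
    if origin is None:
        # Assumption: treat a missing Origin header as not allowed.
        return False
    return origin.rstrip("/") in {a.rstrip("/") for a in allowed}


print(api_key_matches("your-secure-api-key", "your-secure-api-key"))
print(origin_allowed("http://localhost:3000", ["http://localhost:3000"]))
```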

Parameters:

  • file (required): PDF file to upload.
  • output_format (optional): markdown (default), json, html, or chunks.
  • force_ocr (optional): Force OCR on all pages (default: false).
  • strip_existing_ocr (optional): Remove existing OCR (default: false).
  • disable_image_extraction (optional): Skip image extraction (default: false).
  • page_range (optional): Pages to process, e.g., "0,5-10,20".
  • langs (optional): Comma-separated languages for OCR, e.g., "en,fr".
  • use_llm (optional): Enable LLM for improved accuracy (default: false).
  • block_correction_prompt (optional): Custom prompt for LLM correction.
  • llm_service (optional): LLM service (see supported services below).
  • redo_inline_math (optional): Enhance math conversion (default: false).
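To make the page_range syntax concrete, the following sketch expands a value such as "0,5-10,20" into individual page numbers. It assumes ranges are inclusive on both ends; `parse_page_range` is a hypothetical helper, and marker performs its own parsing inside the library.

```python
def parse_page_range(spec: str) -> list[int]:
    """Expand a string like "0,5-10,20" into a sorted list of page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            # Assumption: both ends of the range are included.
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_page_range("0,5-10,20"))
```

Validating the value on the client before uploading catches typos without waiting for a conversion to fail.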

Supported LLM Services:

  • marker.services.gemini.GoogleGeminiService
  • marker.services.vertex.GoogleVertexService
  • marker.services.ollama.OllamaService
  • marker.services.claude.ClaudeService
  • marker.services.openai.OpenAIService
  • marker.services.azure_openai.AzureOpenAIService

Response: A ZIP file containing:

  • output.{md|json|html|chunks}: Converted content.
  • metadata.json: Conversion metadata (e.g., table of contents, page stats).
  • Extracted images as JPEGs (if enabled).
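A client can inspect the returned archive without writing it to disk. The sketch below assumes the member names described above (an output.* file, metadata.json, and JPEG images); `summarize_archive` is a hypothetical helper, and the demo archive is built locally only to show the shape of the result.

```python
import io
import json
import zipfile


def summarize_archive(zip_bytes: bytes) -> dict:
    """List the converted output, metadata, and images found in a response ZIP."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = archive.namelist()
        # Assumption: the converted document is the member named output.<ext>.
        output_name = next((n for n in names if n.startswith("output.")), None)
        images = [n for n in names if n.lower().endswith((".jpg", ".jpeg"))]
        metadata = json.loads(archive.read("metadata.json")) if "metadata.json" in names else None
    return {"output": output_name, "images": images, "metadata": metadata}


# Build a small stand-in archive to demonstrate the helper.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("output.md", "# Title\n")
    demo.writestr("metadata.json", json.dumps({"page_stats": []}))
    demo.writestr("_page_0_Picture_1.jpeg", b"\xff\xd8\xff")
print(summarize_archive(buffer.getvalue()))
```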

Usage Examples

cURL Examples

Basic Conversion (PDF to Markdown)

curl -X POST "http://localhost:8080/convert" \
  -H "X-API-Key: your-secure-api-key" \
  -H "Origin: http://localhost:3000" \
  -F "file=@document.pdf" \
  --output output.zip

Convert to JSON with OCR

curl -X POST "http://localhost:8080/convert" \
  -H "X-API-Key: your-secure-api-key" \
  -H "Origin: http://localhost:3000" \
  -F "file=@document.pdf" \
  -F "output_format=json" \
  -F "force_ocr=true" \
  --output output.zip

Convert Specific Pages with LLM

curl -X POST "http://localhost:8080/convert" \
  -H "X-API-Key: your-secure-api-key" \
  -H "Origin: http://localhost:3000" \
  -F "file=@document.pdf" \
  -F "page_range=0,5-10" \
  -F "use_llm=true" \
  -F "llm_service=marker.services.openai.OpenAIService" \
  --output output.zip

Python Example

import requests

url = "http://localhost:8080/convert"
headers = {
    "X-API-Key": "your-secure-api-key",
    "Origin": "http://localhost:3000"
}

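# Basic conversion: upload the PDF as multipart form data
with open("document.pdf", "rb") as f:
    files = {"file": f}
    data = {"output_format": "markdown"}
    response = requests.post(url, headers=headers, files=files, data=data)

# Raise on a 4xx/5xx response instead of saving an error body as a ZIP
response.raise_for_status()

# The response body is the ZIP archive
with open("output.zip", "wb") as f:
    f.write(response.content)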

Configuration

Environment Variables

Set in .env:

  • API_KEY: Secure key for endpoint access.
  • ALLOWED_ORIGINS: Comma-separated allowed origins (e.g., http://localhost:3000).
  • PORT: Port for the server (default: 8000; the examples in this README set it to 8080).

For LLM services, add:

  • OpenAI: OPENAI_API_KEY
  • Google Gemini: GOOGLE_API_KEY
  • Claude: ANTHROPIC_API_KEY
  • Azure OpenAI: AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT
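Before enabling use_llm, it can help to confirm that the variables for the chosen provider are present. A minimal sketch based only on the variable names listed above; `REQUIRED_LLM_ENV` and `missing_llm_env` are hypothetical names, and other providers (Vertex, Ollama) are left out because their variables are not listed here.

```python
import os
from typing import Mapping, Optional

# Variable names per provider, as listed in this README.
REQUIRED_LLM_ENV = {
    "OpenAI": ("OPENAI_API_KEY",),
    "Google Gemini": ("GOOGLE_API_KEY",),
    "Claude": ("ANTHROPIC_API_KEY",),
    "Azure OpenAI": ("AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"),
}


def missing_llm_env(provider: str, environ: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required variables that are unset or empty for a provider."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_LLM_ENV[provider] if not env.get(name)]


print(missing_llm_env("Azure OpenAI", {"AZURE_OPENAI_API_KEY": "secret"}))
```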

Directory Structure

The application creates uploads and output directories for temporary storage, which are cleaned up after processing.

project/
├── main.py
├── .env
├── requirements.txt
├── README.md
├── uploads/  # Auto-created for temporary files
└── output/   # Auto-created for temporary files
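The cleanup behavior can be sketched with a try/finally block that removes the working directory even when conversion fails. Note the substitution: this sketch uses a fresh directory from `tempfile` rather than the project's uploads and output folders, and `process_with_cleanup` and the `convert` callback are hypothetical names.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def process_with_cleanup(pdf_bytes: bytes, convert: Callable[[Path], T]) -> T:
    """Write the upload to a temporary directory, convert it, then always clean up."""
    workdir = Path(tempfile.mkdtemp(prefix="pdf_to_md_"))
    try:
        source = workdir / "input.pdf"
        source.write_bytes(pdf_bytes)
        # `convert` stands in for the real conversion step.
        return convert(source)
    finally:
        # Runs on success and on error, so no temporary files are left behind.
        shutil.rmtree(workdir, ignore_errors=True)


size = process_with_cleanup(b"%PDF-1.4 demo", lambda path: path.stat().st_size)
print(size)
```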

Error Handling

The API provides clear error responses:

  • 400 Bad Request: Invalid file type or parameters (e.g., {"detail": "Only PDF files are supported"}).
  • 401 Unauthorized: Missing or invalid API key.
  • 500 Internal Server Error: Unexpected processing errors (e.g., {"detail": "Internal Server Error"}).
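On the client side, the status codes above can be turned into readable messages. A small sketch, assuming error bodies are JSON objects with a detail field as in the examples; `explain_error` is a hypothetical helper.

```python
import json

# Short hints for the status codes documented above.
STATUS_HINTS = {
    400: "invalid file type or parameters",
    401: "missing or invalid API key",
    500: "unexpected processing error",
}


def explain_error(status_code: int, body: str) -> str:
    """Combine a status hint with the server's detail message, if any."""
    try:
        detail = json.loads(body).get("detail", "")
    except (ValueError, AttributeError):
        # The body was not a JSON object; fall back to the raw text.
        detail = body.strip()
    hint = STATUS_HINTS.get(status_code, "unexpected status")
    return f"{status_code} {hint}" + (f": {detail}" if detail else "")


print(explain_error(400, '{"detail": "Only PDF files are supported"}'))
```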

Development

Running in Development Mode

uvicorn main:app --host 127.0.0.1 --port 8080 --reload

Production Deployment

For production:

  1. Configure ALLOWED_ORIGINS strictly.
  2. Use HTTPS with a reverse proxy (e.g., Nginx).
  3. Use a production ASGI server:
    gunicorn main:app -w 4 -k uvicorn.workers.UvicornWorker
  4. Set environment variables securely.
  5. Limit file upload size (e.g., via Nginx or code).
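For the "or code" option in step 5, one approach is to read the upload in chunks and stop as soon as a limit is exceeded. This is a sketch of that idea, not the project's implementation: `read_limited`, `UploadTooLarge`, and the chunk size are assumptions.

```python
import io
from typing import BinaryIO


class UploadTooLarge(Exception):
    """Raised when an upload exceeds the configured size limit."""


def read_limited(stream: BinaryIO, max_bytes: int, chunk_size: int = 64 * 1024) -> bytes:
    """Read a stream fully, aborting once more than max_bytes have been seen."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if total > max_bytes:
            # Stop early instead of buffering the whole oversized upload.
            raise UploadTooLarge(f"upload exceeds {max_bytes} bytes")
        chunks.append(chunk)
    return b"".join(chunks)


print(len(read_limited(io.BytesIO(b"x" * 1000), max_bytes=2048)))
```

Checking the running total per chunk keeps memory bounded even when a client sends no Content-Length header or an inaccurate one.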

Performance Considerations

  • Memory Usage: Output ZIP files are read into memory, so large PDFs increase memory usage.
  • Processing Time: OCR and LLM processing can be slow.
  • Concurrent Requests: Scale workers for high load.
  • File Cleanup: Temporary files are automatically removed.

Dependencies

Key dependencies:

  • FastAPI: Web framework
  • uvicorn: ASGI server
  • python-dotenv: Environment variable management
  • pillow: Image processing
  • marker-pdf: PDF conversion library, imported as marker (install separately)
  • scalar-fastapi: API documentation (install separately)

See requirements.txt for core dependencies.

License

Check the license of datalab-to/marker for usage terms.

Contributing

  1. Fork the repository.
  2. Create a feature branch.
  3. Make and test changes.
  4. Submit a pull request.

Support

For issues:

  • PDF conversion: See marker documentation.
  • API errors: Check response messages and server logs.
  • LLM integration: Verify API keys and service configuration.
