This project provides an API for uploading PDF documents and extracting structured summaries using either OpenAI's GPT-4o or Anthropic's Claude models. It is optimized for environmental and agricultural reports, transforming unstructured PDF content into structured JSON for downstream use.
- Framework: Node.js with Express
- PDF Parsing: pdf-parse
- LLM Integration: OpenAI GPT-4o and Claude Opus 4
- Storage: In-memory via Multer
- Language: TypeScript
- Upload a PDF and extract structured data in a single API call
- Choose between OpenAI and Claude as the LLM provider
- Validate and clean the JSON response so it matches a consistent schema
- Return an empty fallback structure if the model call or JSON parsing fails
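The clean-up and fallback behaviour described in the last two points could look roughly like the sketch below. The field names in `ExtractedReport` (`title`, `summary`, `keyFindings`) and the helpers `extractJsonObject` and `parseReport` are hypothetical: the real interface lives in utils/types.ts and is not shown here, so treat this as an illustration of the idea and not the project's actual code.

```typescript
// Hypothetical shape; the real ExtractedReport interface is defined in utils/types.ts.
interface ExtractedReport {
  title: string;
  summary: string;
  keyFindings: string[];
}

// Fallback returned whenever the model output cannot be used.
const emptyReport = (): ExtractedReport => ({ title: "", summary: "", keyFindings: [] });

// Models often wrap JSON in extra prose or a code fence, so keep only
// the text between the first "{" and the last "}".
function extractJsonObject(raw: string): string {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  return start !== -1 && end > start ? raw.slice(start, end + 1) : raw;
}

// Parse the model output and coerce it to the expected schema,
// dropping fields of the wrong type instead of throwing.
function parseReport(raw: string): ExtractedReport {
  try {
    const data: unknown = JSON.parse(extractJsonObject(raw));
    if (typeof data !== "object" || data === null || Array.isArray(data)) {
      return emptyReport();
    }
    const obj = data as Record<string, unknown>;
    return {
      title: typeof obj.title === "string" ? obj.title : "",
      summary: typeof obj.summary === "string" ? obj.summary : "",
      keyFindings: Array.isArray(obj.keyFindings)
        ? obj.keyFindings.filter((x): x is string => typeof x === "string")
        : [],
    };
  } catch {
    // JSON.parse failed: return the empty structure so callers always get the same shape.
    return emptyReport();
  }
}
```

Returning the empty structure instead of throwing keeps the response shape stable for downstream consumers, at the cost of hiding the failure unless it is also logged.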
- Clone the Repository
  git clone https://github.com/your-org/pdf-extractor.git
  cd pdf-extractor
- Install Dependencies
  npm install
- Environment Configuration: create a .env file:
  OPENAI_API_KEY=your_openai_key
  ANTHROPIC_API_KEY=your_claude_key
- Run the Server
  npm run dev
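A missing API key tends to surface only when the first request reaches the provider, so a startup check can save some debugging. The sketch below is one possible way to do that; the helper `providersMissingKeys` is hypothetical, and only the two variable names come from the .env example. How the project actually loads .env is not shown here.

```typescript
type Provider = "openai" | "claude";

// Environment variable names taken from the .env example.
const KEY_FOR_PROVIDER: Record<Provider, string> = {
  openai: "OPENAI_API_KEY",
  claude: "ANTHROPIC_API_KEY",
};

// Returns the providers whose API key is missing or blank.
function providersMissingKeys(env: Record<string, string | undefined>): Provider[] {
  return (Object.keys(KEY_FOR_PROVIDER) as Provider[]).filter(
    (provider) => (env[KEY_FOR_PROVIDER[provider]] ?? "").trim() === ""
  );
}
```

At startup this could be called as `providersMissingKeys(process.env)`, logging a warning for each provider it returns.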
Upload a PDF and extract a summary:
- file: PDF file (multipart/form-data)
- provider: query parameter, openai (default) or claude
curl -X POST "http://localhost:3000/upload?provider=claude" -F "file=@./sample.pdf"
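On the server side, the `provider` query parameter has to be mapped to one of the two supported values, with `openai` as the default. A minimal sketch of that step follows. The function name `resolveProvider` is hypothetical, and returning `null` for an unrecognised value (so the route can reject the request) is an assumption, since the project's actual handling of bad input is not described.

```typescript
type LlmProvider = "openai" | "claude";

const PROVIDERS: readonly LlmProvider[] = ["openai", "claude"];

// A query parameter may arrive as a string, an array, or undefined,
// so the input is typed as unknown and checked explicitly.
function resolveProvider(value: unknown): LlmProvider | null {
  // No parameter supplied: use the documented default.
  if (value === undefined || value === "") return "openai";
  if (typeof value !== "string") return null;
  const normalized = value.trim().toLowerCase();
  // Assumption: unknown names yield null so the caller can reject the request.
  return PROVIDERS.find((p) => p === normalized) ?? null;
}
```

In an Express handler this would be called with `req.query.provider`, with a `null` result translated into an error response.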
├── routes/
│   └── upload.ts        → Express route for file upload
├── services/
│   ├── pdfService.ts    → PDF parsing logic
│   └── llmService.ts    → LLM integration (OpenAI & Claude)
├── utils/
│   ├── types.ts         → ExtractedReport interface
│   └── constants.ts     → Prompt templates
├── .env
└── README.md