Add OCRPDFLoader for extracting text from scanned PDFs- Implements OC… #321

Jugal-lachhwani · 2025-09-15T06:23:37Z

…R-based PDF text extraction using Tesseract
- Supports custom page ranges and OCR configurations
- Includes comprehensive unit tests with 14 test cases
- Handles errors gracefully and filters empty pages
- Adds proper documentation and type hints
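
The behaviour listed in the commit message (per-page OCR, an optional page range, graceful error handling, and empty-page filtering) could look roughly like the sketch below. It is not the code from this PR: `Document`, `ocr_pages`, and `tesseract_ocr` are hypothetical names chosen for illustration, and skipping a page on any OCR error is an assumed policy. The only third-party call is `pytesseract.image_to_string`, imported lazily so the rest runs without Tesseract installed.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, Optional, Tuple


@dataclass
class Document:
    """Hypothetical stand-in for the project's document type."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def tesseract_ocr(image, lang: str = "eng", config: str = "") -> str:
    """OCR one page image with Tesseract (needs pytesseract and the tesseract binary)."""
    import pytesseract  # lazy import: only required when this default is used
    return pytesseract.image_to_string(image, lang=lang, config=config)


def ocr_pages(
    images: Iterable,
    source: str,
    ocr_fn: Callable[[object], str] = tesseract_ocr,
    page_range: Optional[Tuple[int, int]] = None,
) -> Iterator[Document]:
    """Yield one Document per non-empty page.

    Page numbers are 1-based and page_range is inclusive on both ends.
    """
    for page_number, image in enumerate(images, start=1):
        if page_range is not None:
            first, last = page_range
            if page_number < first:
                continue
            if page_number > last:
                break
        try:
            text = ocr_fn(image).strip()
        except Exception:
            # Assumed policy: skip a page that fails OCR instead of aborting the file.
            continue
        if not text:
            continue  # filter out pages where OCR found no text
        yield Document(text, {"source": source, "page": page_number})
```

For a real scanned file, the page images would typically come from a rasteriser such as `pdf2image.convert_from_path("scan.pdf", dpi=300)`. Passing `ocr_fn` as a parameter keeps the page-range and filtering logic testable without Tesseract.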

Jugal-lachhwani · 2025-09-15T14:09:47Z

I have added a document loader for extracting text from scanned PDFs using OCR. The document loaders already present for this task are not maintained and are not compatible with recent versions (I have tried them), so there is a need for an OCR-based loader. Without one, users have to rely on a third-party library that doesn't provide proper metadata. I think this ocr_pdf document loader solves that problem for users.

Thank You

Jugal-lachhwani and others added 2 commits September 3, 2025 15:19

Merge branch 'main' into add-ocr-pdf-loader

9a8905d
