build-ai-applications/Eval-PDF


📄 PDF to Searchable Text Evaluation

🚀 Evaluate frameworks for converting PDFs into searchable text using a Gradio app. Upload a PDF and optionally provide ground truth images, text, or tables for benchmarking extraction results.

⚖️ Frameworks Compared

📌 PyMuPDF – Fast text & image extraction
📌 Tesseract – OCR-based, best for scanned PDFs
📌 PyPDF – Basic text extraction, lacks formatting retention
📌 Camelot – Best for tabular data extraction
📌 PDFMiner – Powerful text extraction with layout analysis
📌 PDFPlumber – Structured data extraction, including tables & images
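
As a rough illustration of how several of these libraries can sit behind one interface, the sketch below maps framework names to small text-extraction functions. It is not taken from the notebook: the helper names (`available_extractors`, `_pymupdf_text`, and so on) are hypothetical, "PyPDF" is assumed to mean the `pypdf` package, and only libraries that are actually installed get registered. Tesseract and Camelot are left out because they work on rendered page images or return tables rather than plain text.

```python
from importlib.util import find_spec


def _pymupdf_text(path: str) -> str:
    import fitz  # PyMuPDF is importable under the module name "fitz"

    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def _pypdf_text(path: str) -> str:
    from pypdf import PdfReader  # assumption: "PyPDF" refers to the pypdf package

    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def _pdfminer_text(path: str) -> str:
    from pdfminer.high_level import extract_text

    return extract_text(path)


def _pdfplumber_text(path: str) -> str:
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        # extract_text() may return None for pages without a text layer
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


# Hypothetical registry: display name -> (module to probe, extractor function)
_CANDIDATES = {
    "PyMuPDF": ("fitz", _pymupdf_text),
    "PyPDF": ("pypdf", _pypdf_text),
    "PDFMiner": ("pdfminer", _pdfminer_text),
    "PDFPlumber": ("pdfplumber", _pdfplumber_text),
}


def available_extractors() -> dict:
    """Return only the extractors whose library is installed."""
    return {
        name: extract
        for name, (module, extract) in _CANDIDATES.items()
        if find_spec(module) is not None
    }
```

Calling `available_extractors()["PyMuPDF"]("sample.pdf")` would return the document text as one string, assuming PyMuPDF is installed and `sample.pdf` (a placeholder name) exists.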

📊 Key Metrics

✅ Text Accuracy – How well text is extracted
✅ Table Detection – Table extraction accuracy
✅ Image Preservation – Retention of embedded images
✅ Format Retention – Maintenance of document structure
✅ Processing Time – Speed of extraction
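
The README does not say how these metrics are computed, so the following is only one plausible sketch of two of them, using the Python standard library. Text accuracy is approximated here as a character-level similarity ratio after whitespace normalization, and processing time as wall-clock seconds around an extractor call. The function names `normalize`, `text_accuracy`, and `timed` are hypothetical and do not come from the notebook.

```python
import difflib
import time


def normalize(text: str) -> str:
    """Collapse whitespace runs so layout differences do not dominate the score."""
    return " ".join(text.split())


def text_accuracy(extracted: str, ground_truth: str) -> float:
    """Similarity in [0, 1] between extracted text and the ground truth."""
    a, b = normalize(extracted), normalize(ground_truth)
    # autojunk=False keeps long documents from having frequent characters ignored
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


def timed(extract, *args):
    """Run an extractor callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = extract(*args)
    return result, time.perf_counter() - start
```

A call such as `text, seconds = timed(extractor, "sample.pdf")` followed by `text_accuracy(text, ground_truth)` yields one accuracy score and one timing per framework. A stricter measure, such as word error rate, could be swapped in without changing the calling code.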

🛠️ Feature Comparison

| Feature | PyMuPDF | PDFPlumber | Tesseract | PDFMiner | PyPDF | Camelot |
| --- | --- | --- | --- | --- | --- | --- |
| 📄 Text Extraction | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| 🏛️ Table Extraction | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ |
| 🖼️ Image Extraction | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ |
| 📑 Scanned PDFs Support | ✅ (OCR) | ✅ (OCR) | ✅ | ❌ | ❌ | ❌ |

🎯 Summary

  • ⚡ PyMuPDF is fast and efficient.
  • 📜 Tesseract is best for scanned PDFs.
  • 📊 Camelot excels at table extraction.
  • 🏗 PDFPlumber & PDFMiner provide detailed structure.

🚀 How to Use

1️⃣ Clone the repo: git clone <repo-url>
2️⃣ Run the app: open Evaluation_pdf_to_text.ipynb and run its cells
3️⃣ Upload a PDF in the Gradio app to get the results

🛠️ Contributing

We welcome contributions from the community! If you'd like to contribute:
📌 Bug Reports & Issues: Open an issue if you find any bugs.
📌 Feature Requests: Suggest new features via discussions or pull requests.
📌 Pull Requests: Fork the repo, make changes, and submit a pull request!

Contribution Guidelines:

  • Ensure your changes follow best practices.
  • Keep the repository structured and well-documented.
  • Respect the licenses of individual frameworks.

📜 License

Eval-PDF is open source under the Apache 2.0 license. Use it freely and contribute to make it better! 🚀

📬 Contact Us

For any questions, suggestions, or feature requests, open an issue or reach out to the maintainers! 💡

🔗 References

🔹 PyMuPDF
🔹 Tesseract OCR
🔹 PDFMiner
🔹 PDFPlumber
🔹 Camelot