Evaluate frameworks for converting PDFs into searchable text using a Gradio app. Upload a PDF and optionally provide ground truth images, text, or tables for benchmarking extraction results.
- PyMuPDF - Fast text & image extraction
- Tesseract - OCR-based, best for scanned PDFs
- PyPDF - Basic text extraction, lacks formatting retention
- Camelot - Best for tabular data extraction
- PDFMiner - Powerful text extraction with layout analysis
- PDFPlumber - Structured data extraction, including tables & images
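
As a rough illustration of how the frameworks listed above are typically called, here is a minimal sketch with one thin wrapper per library. The wrapper names and the `TEXT_EXTRACTORS` registry are hypothetical and are not taken from the notebook. The sketch assumes the PyPI packages `pymupdf`, `pdfplumber`, `pypdf`, `pdfminer.six`, `pytesseract` with `pdf2image`, and `camelot-py`. Imports are done lazily so only the libraries you actually call need to be installed.

```python
def extract_pymupdf(path):
    """Text from every page using PyMuPDF."""
    import fitz  # PyMuPDF; assumed installed as "pymupdf"
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def extract_pdfplumber(path):
    """Text from every page using pdfplumber."""
    import pdfplumber
    with pdfplumber.open(path) as pdf:
        # extract_text() can return None for pages without a text layer
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def extract_pypdf(path):
    """Text from every page using pypdf (assumed to be the 'PyPDF' meant above)."""
    from pypdf import PdfReader
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_pdfminer(path):
    """Text from the whole document using pdfminer.six."""
    from pdfminer.high_level import extract_text
    return extract_text(path)


def extract_tesseract(path):
    """OCR route: rasterise each page, then recognise its text."""
    from pdf2image import convert_from_path  # requires Poppler on the system
    import pytesseract  # requires the Tesseract binary on the system
    images = convert_from_path(path)
    return "\n".join(pytesseract.image_to_string(image) for image in images)


def extract_tables_camelot(path):
    """Tables (not running text) as a list of pandas DataFrames."""
    import camelot
    tables = camelot.read_pdf(path, pages="all")
    return [table.df for table in tables]


# Hypothetical registry so a benchmark loop can iterate over text extractors.
TEXT_EXTRACTORS = {
    "PyMuPDF": extract_pymupdf,
    "PDFPlumber": extract_pdfplumber,
    "PyPDF": extract_pypdf,
    "PDFMiner": extract_pdfminer,
    "Tesseract": extract_tesseract,
}
```

Camelot is kept outside the registry because it returns tables rather than a text string.
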
- Text Accuracy - How well text is extracted
- Table Detection - Table extraction accuracy
- Image Preservation - Retention of embedded images
- Format Retention - Maintenance of document structure
- Processing Time - Speed of extraction
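
The exact scoring formulas are not given here, so the following is only one plausible way to compute four of these metrics with the standard library. All function names are hypothetical. Text accuracy is approximated with a `difflib` similarity ratio, table detection with a position-wise cell match, image preservation with a count ratio, and processing time with `time.perf_counter`. Format retention is left out because it depends on how document structure is represented.

```python
import difflib
import time


def text_accuracy(extracted, ground_truth):
    """Similarity in [0, 1] between whitespace-normalised texts."""
    a = " ".join(extracted.split())
    b = " ".join(ground_truth.split())
    if not a and not b:
        return 1.0
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


def table_cell_accuracy(extracted_rows, truth_rows):
    """Fraction of ground-truth cells found at the same row/column position."""
    total = sum(len(row) for row in truth_rows)
    if total == 0:
        return 0.0 if any(extracted_rows) else 1.0
    matches = 0
    for i, truth_row in enumerate(truth_rows):
        row = extracted_rows[i] if i < len(extracted_rows) else []
        for j, cell in enumerate(truth_row):
            if j < len(row) and str(row[j]).strip() == str(cell).strip():
                matches += 1
    return matches / total


def image_preservation(extracted_count, expected_count):
    """Share of expected embedded images that were recovered, capped at 1."""
    if expected_count == 0:
        return 1.0 if extracted_count == 0 else 0.0
    return min(extracted_count, expected_count) / expected_count


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start
```

A character-level ratio is easy to compute but sensitive to reading order, so a word-level or edit-distance metric may be a better fit for multi-column documents.
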
Feature | PyMuPDF | PDFPlumber | Tesseract | PDFMiner | PyPDF | Camelot |
---|---|---|---|---|---|---|
Text Extraction | β | β | β | β | β | β |
Table Extraction | β | β | β | β | β | β |
Image Extraction | β | β | β | β | β | β |
Scanned PDFs Support | β (OCR) | β (OCR) | β | β | β | β |
- PyMuPDF is fast and efficient.
- Tesseract is best for scanned PDFs.
- Camelot excels at table extraction.
- PDFPlumber & PDFMiner provide detailed structure.
1. Clone the repo: `git clone <repo-url>`
2. Open and run the notebook: `Evaluation_pdf_to_text.ipynb`
3. Upload a PDF in the Gradio app to get the results
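
The notebook's own code is not reproduced here. To give a sense of how the upload-and-evaluate step could be wired together, here is a minimal sketch that assumes `gradio` and PyMuPDF are installed. The names `evaluate_pdf`, `gradio_handler`, and `build_app` are hypothetical, and only a single extractor and a single metric are shown.

```python
import difflib
import time


def extract_with_pymupdf(path):
    """Default extractor for this sketch; any callable taking a path works."""
    import fitz  # PyMuPDF, imported lazily
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def evaluate_pdf(path, ground_truth="", extractor=extract_with_pymupdf):
    """Extract text, time the extraction, and score it if ground truth is given."""
    start = time.perf_counter()
    text = extractor(path)
    elapsed = time.perf_counter() - start
    result = {"characters": len(text), "seconds": round(elapsed, 3)}
    if ground_truth and ground_truth.strip():
        ratio = difflib.SequenceMatcher(None, text, ground_truth).ratio()
        result["text_accuracy"] = round(ratio, 3)
    return result


def gradio_handler(pdf_file, ground_truth):
    """Adapter between Gradio's file input and evaluate_pdf."""
    if pdf_file is None:
        return {"error": "No PDF uploaded"}
    # Depending on the Gradio version, the upload arrives as a path string
    # or as a temp-file object with a .name attribute.
    path = pdf_file if isinstance(pdf_file, str) else pdf_file.name
    return evaluate_pdf(path, ground_truth)


def build_app():
    """Assemble the interface; gradio is imported only when this is called."""
    import gradio as gr
    return gr.Interface(
        fn=gradio_handler,
        inputs=[
            gr.File(label="PDF", file_types=[".pdf"]),
            gr.Textbox(label="Ground truth text (optional)", lines=8),
        ],
        outputs=gr.JSON(label="Results"),
        title="PDF-to-text evaluation (sketch)",
    )
```

Calling `build_app().launch()` in a notebook cell would start the interface. Passing a different `extractor` to `evaluate_pdf` lets the same handler benchmark another framework.
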
We welcome contributions from the community! If you'd like to contribute:
- Bug Reports & Issues: Open an issue if you find any bugs.
- Feature Requests: Suggest new features via discussions or pull requests.
- Pull Requests: Fork the repo, make changes, and submit a pull request!
- Ensure your changes follow best practices.
- Keep the repository structured and well-documented.
- Respect the licenses of individual frameworks.
Eval-STT is open source under the Apache 2.0 license. Use it freely and contribute to make it better!
For any questions, suggestions, or feature requests, open an issue or reach out to the maintainers!
- PyMuPDF
- Tesseract OCR
- PDFMiner
- PDFPlumber
- Camelot