📄 PDF to Searchable Text Evaluation

🚀 Evaluate frameworks for converting PDFs into searchable text using a Gradio app. Upload a PDF and optionally provide ground truth images, text, or tables for benchmarking extraction results.

⚖️ Frameworks Compared

📌 PyMuPDF – Fast text & image extraction
📌 Tesseract – OCR-based, best for scanned PDFs
📌 PyPDF – Basic text extraction, lacks formatting retention
📌 Camelot – Best for tabular data extraction
📌 PDFMiner – Powerful text extraction with layout analysis
📌 PDFPlumber – Structured data extraction, including tables & images
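As a rough illustration of how the text-oriented frameworks above can be called, here is a minimal sketch. It is not taken from the notebook: the function names and the `EXTRACTORS` registry are hypothetical, and each library is imported lazily, so only the ones you actually call need to be installed.

```python
from typing import Callable, Dict


def extract_with_pymupdf(pdf_path: str) -> str:
    """Extract plain text page by page with PyMuPDF."""
    import fitz  # PyMuPDF, imported lazily

    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)


def extract_with_pypdf(pdf_path: str) -> str:
    """Extract plain text with pypdf (no layout retention)."""
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() may return an empty value for image-only pages
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_with_pdfminer(pdf_path: str) -> str:
    """Extract text with pdfminer.six, which performs layout analysis."""
    from pdfminer.high_level import extract_text

    return extract_text(pdf_path)


def extract_with_pdfplumber(pdf_path: str) -> str:
    """Extract text with pdfplumber."""
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


# Hypothetical registry mapping a display name to its extractor function.
EXTRACTORS: Dict[str, Callable[[str], str]] = {
    "PyMuPDF": extract_with_pymupdf,
    "PyPDF": extract_with_pypdf,
    "PDFMiner": extract_with_pdfminer,
    "PDFPlumber": extract_with_pdfplumber,
}
```

Calling `EXTRACTORS["PyMuPDF"]("sample.pdf")` would return the document text as one string, where `sample.pdf` is a placeholder path. Tesseract and Camelot are left out because they work on page images and tables rather than on the embedded text layer.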

📊 Key Metrics

✅ Text Accuracy – How well text is extracted
✅ Table Detection – Table extraction accuracy
✅ Image Preservation – Retention of embedded images
✅ Format Retention – Maintenance of document structure
✅ Processing Time – Speed of extraction
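To make two of these metrics concrete, the sketch below scores text accuracy as a character-level similarity ratio against a ground-truth string and measures processing time with a wall-clock timer. The scoring method is an assumption chosen for illustration (the notebook may compute accuracy differently), and `text_accuracy` and `timed` are hypothetical helper names.

```python
import difflib
import time
from typing import Any, Callable, Tuple


def text_accuracy(extracted: str, ground_truth: str) -> float:
    """Return a similarity ratio in [0, 1] after collapsing whitespace."""
    a = " ".join(extracted.split())
    b = " ".join(ground_truth.split())
    # autojunk=False keeps the ratio stable on long documents
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


def timed(fn: Callable[..., Any], *args: Any) -> Tuple[Any, float]:
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

A ratio of 1.0 means the extracted text matches the ground truth exactly once whitespace is normalized. Lower values indicate missing, extra, or reordered characters.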

🛠️ Feature Comparison

Feature	PyMuPDF	PDFPlumber	Tesseract	PDFMiner	PyPDF	Camelot
📄 Text Extraction	✅	✅	✅	✅	✅	❌
🏛️ Table Extraction	✅	✅	❌	❌	❌	✅
🖼️ Image Extraction	✅	✅	❌	✅	✅	❌
📑 Scanned PDFs Support	✅ (OCR)	✅ (OCR)	✅	❌	❌	❌
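The table above can also be encoded as a small lookup to shortlist frameworks for a given need. This is only a sketch: the `FEATURES` dictionary mirrors the table as written, and the key names and `frameworks_supporting` helper are hypothetical.

```python
from typing import Dict, List

# Feature matrix copied from the comparison table (True = supported).
FEATURES: Dict[str, Dict[str, bool]] = {
    "PyMuPDF":    {"text": True,  "tables": True,  "images": True,  "scanned": True},
    "PDFPlumber": {"text": True,  "tables": True,  "images": True,  "scanned": True},
    "Tesseract":  {"text": True,  "tables": False, "images": False, "scanned": True},
    "PDFMiner":   {"text": True,  "tables": False, "images": True,  "scanned": False},
    "PyPDF":      {"text": True,  "tables": False, "images": True,  "scanned": False},
    "Camelot":    {"text": False, "tables": True,  "images": False, "scanned": False},
}


def frameworks_supporting(*required: str) -> List[str]:
    """Return the frameworks that support every requested feature."""
    return [
        name
        for name, support in FEATURES.items()
        if all(support[feature] for feature in required)
    ]


print(frameworks_supporting("tables"))
print(frameworks_supporting("text", "scanned"))
```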

🎯 Summary

⚡ PyMuPDF is the fast option for text and image extraction.
📜 Tesseract is best for scanned PDFs.
📊 Camelot excels at table extraction.
🏗 PDFPlumber and PDFMiner give the most detailed view of layout and document structure.

🚀 How to Use

1️⃣ Clone the repo: git clone <repo-url>
2️⃣ Open the notebook and run all cells: Evaluation_pdf_to_text.ipynb
3️⃣ Upload a PDF in the Gradio app, optionally add ground truth, and view the results

🛠️ Contributing

We welcome contributions from the community! If you’d like to contribute:
📌 Bug Reports & Issues: Open an issue if you find any bugs.
📌 Feature Requests: Suggest new features via discussions or pull requests.
📌 Pull Requests: Fork the repo, make changes, and submit a pull request!

Contribution Guidelines:

Ensure your changes follow best practices.
Keep the repository structured and well-documented.
Respect the licenses of individual frameworks.

📜 License

Eval-PDF is open source under the Apache 2.0 License. Use it freely and contribute to make it better! 🚀

📬 Contact Us

For any questions, suggestions, or feature requests, open an issue or reach out to the maintainers! 💡

🔗 References

🔹 PyMuPDF
🔹 Tesseract OCR
🔹 PDFMiner
🔹 PDFPlumber
🔹 Camelot
