
Extracting tables from PDFs or any other files using deep learning, OCR, and Tabula


karan171/table-extractor


table-extractor

Extracting tables from PDFs or any other files using deep learning, OCR, and Tabula

Support for Tabula has been added in the tables.py file.

We used Mask R-CNN here and trained it on images of PDF files to detect tables.

Things to do:
1) Add a deep learning model that can detect the columns of the tables.
2) Feed the coordinates of the detected tables to Tabula.
3) Add OCR support for cases where PDFs are not available.
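To-do item 2 could be sketched roughly as follows. This is not the code in tables.py, only an illustration under stated assumptions: the detector returns a box as (x1, y1, x2, y2) in pixels of a page image rendered at a known DPI, with the origin at the top-left and no cropping or rotation. The function names `pixel_box_to_tabula_area` and `extract_table` are hypothetical. The tabula-py call `read_pdf` takes its `area` argument as top, left, bottom, right in PDF points.

```python
from typing import List, Sequence

POINTS_PER_INCH = 72.0  # PDF points per inch


def pixel_box_to_tabula_area(box_px: Sequence[float], dpi: float) -> List[float]:
    """Convert an (x1, y1, x2, y2) pixel box to [top, left, bottom, right] in points.

    Assumes the page image was rendered at `dpi` from the full, unrotated page.
    """
    x1, y1, x2, y2 = box_px
    scale = POINTS_PER_INCH / dpi
    # Sort each pair so the result is valid even if the corners arrive swapped.
    left, right = sorted((x1 * scale, x2 * scale))
    top, bottom = sorted((y1 * scale, y2 * scale))
    return [top, left, bottom, right]


def extract_table(pdf_path: str, page: int, box_px: Sequence[float], dpi: float):
    """Hypothetical helper: read only the detected region of one page with Tabula."""
    import tabula  # imported lazily; needs tabula-py and a Java runtime

    area = pixel_box_to_tabula_area(box_px, dpi)
    return tabula.read_pdf(pdf_path, pages=page, area=area)
```

If the detector reports boxes in a different order, such as (y1, x1, y2, x2), the unpacking line would need to change accordingly.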

Note: Use Tabula if you need data extracted from PDFs until the pytesseract branch is merged.

Standalone APIs for extracting tables from document images or PDFs are either expensive (ABBYY) or not good enough (Tabula, Camelot, pdfminer, etc.). Documents contain many types of tables: some are easy to detect and load into a database, while others are very unorthodox. ABBYY uses deep learning to solve that problem and is probably the best API out there, but it is expensive.

Camelot and Tabula, on the other hand, work only on PDFs because they do not use OCR. Instead they rely on a rule-based approach, classic edge-detection algorithms, and Ghostscript. They are free, but they do not work well when the table structure is poor or when a table continues onto another page.

We plan to solve those problems soon with this approach, which combines all of the approaches above to give you a free and flexible solution for your use case.
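To illustrate the multi-page problem, here is one simple heuristic for stitching tables that continue across pages. It is a sketch, not what this repository implements. It assumes each extracted table is a list of rows of strings, in page order, and it treats consecutive tables with the same number of columns as one table, dropping a header row that repeats. The name `stitch_tables` is hypothetical.

```python
from typing import List

Row = List[str]
Table = List[Row]


def stitch_tables(page_tables: List[Table]) -> List[Table]:
    """Merge consecutive tables that look like one table split across pages."""
    merged: List[Table] = []
    for table in page_tables:
        if not table:
            continue  # skip empty extractions
        previous = merged[-1] if merged else None
        if previous is not None and len(previous[0]) == len(table[0]):
            # Same column count: treat as a continuation of the previous table.
            # Drop the first row if it repeats the previous table's header.
            rows = table[1:] if table[0] == previous[0] else table
            previous.extend(list(row) for row in rows)
        else:
            # Different shape: start a new table (copy rows to avoid aliasing).
            merged.append([list(row) for row in table])
    return merged
```

Matching on column count alone will wrongly merge unrelated tables that happen to have the same width, so a real pipeline would likely also compare column positions or headers.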
