Hacker News

Hey, Tabula maintainer here. tabula-java only works with "vector" PDFs. That is, tables drawn with vector lines, squiggles and glyphs.

Integrating an OCR library is something we always wanted to do.



I had some success last year integrating Tesseract OCR and OpenCV with Tabula (compiled to JavaScript). The goal was to build a Google Docs PDF table import addon that doesn't require a backend. Happy to get in touch to figure out how I could contribute the work back to Tabula (if that makes sense).

Here is a GIF of table detection on a scanned PDF doc (the first run is slower because it has to fetch the OpenCV.js bundle): https://lh3.googleusercontent.com/-OobUBBtnydg/X6Vn_Ls3juI/A...

Here's a demo of the addon running outside of Google Docs: https://pdftableutil.possiblenull.com/app/
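
For anyone wondering what table detection on a scanned page can look like, here is a minimal sketch of one common idea: find rows and columns that are almost entirely dark (the ruling lines) and pair them up into cells. This is not the addon's code and not how Tabula works internally. The function names, the `min_fill` threshold, and the input format (an already binarized page as a list of rows, 1 = dark pixel) are all assumptions made for illustration. A real pipeline would also need deskewing, thresholding, and handling of thick or broken lines.

```python
def find_ruling_lines(image, min_fill=0.8):
    """Return (rows, cols): indices whose dark-pixel fraction is >= min_fill.

    image is a list of equal-length rows, with 1 = dark pixel, 0 = background.
    Hypothetical helper for illustration only.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    # A horizontal ruling line is a row that is mostly dark.
    rows = [y for y in range(height) if sum(image[y]) >= min_fill * width]
    # A vertical ruling line is a column that is mostly dark.
    cols = [
        x for x in range(width)
        if sum(image[y][x] for y in range(height)) >= min_fill * height
    ]
    return rows, cols


def cells_from_lines(rows, cols):
    """Pair consecutive ruling lines into (top, left, bottom, right) boxes.

    Adjacent indices (a gap of 1) are skipped, since they enclose no pixels.
    """
    return [
        (top, left, bottom, right)
        for top, bottom in zip(rows, rows[1:])
        if bottom - top > 1
        for left, right in zip(cols, cols[1:])
        if right - left > 1
    ]


# Tiny synthetic "page": ruling lines at rows 0, 2, 4 and columns 0, 3, 6.
page = [
    [1 if (y in (0, 2, 4) or x in (0, 3, 6)) else 0 for x in range(7)]
    for y in range(5)
]
rows, cols = find_ruling_lines(page)
cells = cells_from_lines(rows, cols)
```

Each resulting box could then be cropped and handed to an OCR engine to read the cell text, which is the part a vector-only extractor cannot do on its own.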



