mgm__'s comments

On an adjacent note, the work being done and posted by Collabora motivated me to try to get Mesa + Gallium softpipe working with Emscripten. Here is a demo of the classic glxgears: https://martinmullins.github.io/mesa-softpipe-emscripten/


That is gloriously perverse!


I had some success last year integrating Tesseract OCR and OpenCV with Tabula (compiled to JavaScript). The purpose was to build a Google Docs PDF table import addon that doesn't require a backend. Happy to get in touch to figure out how I could contribute the work back to Tabula (if that makes sense).

Here is a GIF of table detection for a scanned PDF doc (the first run is slower because it has to fetch the opencv.js bundle): https://lh3.googleusercontent.com/-OobUBBtnydg/X6Vn_Ls3juI/A...

Here's a demo of the addon running outside of Google Docs: https://pdftableutil.possiblenull.com/app/


I created a PDF table extractor tool last year with the same idea that it should be local only. Try it here: https://pdftableutil.possiblenull.com/app/ It is also available as a Google Docs addon (still local only): https://workspace.google.com/marketplace/app/pdf_table_impor...

I had a bad case of scope creep, so the tool can also extract tables from scanned/image PDFs using OpenCV.js and a wasm build of Tesseract OCR!


Wow, that looks awesome. What did you use to display the PDF in the browser? It all feels really responsive!


I used Mozilla's PDF.js: https://mozilla.github.io/pdf.js/ It is what Firefox uses on desktop to show PDFs!


Thanks, really great work!


This is interesting. How accurate would you say it is?


I haven't seen anything better. It started as a PoC, and I decided not to include automatic table detection on the page; instead the user has to draw a box around the table.

I use Tabula under the hood for the cell/row detection and it is really good, provided the correct mode is selected for the type of table. The modes are stream (find cells by spacing) or lattice (find cells by ruling lines).
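
To give a rough feel for the difference between the two modes, here is a toy Python sketch for a single text row. It is not Tabula's actual algorithm or API; the function names, the (x0, x1, text) word tuples, and the min_gap threshold are all made up for illustration.

```python
from bisect import bisect_right
from typing import List, Tuple

# Hypothetical word representation: (left edge, right edge, text) on one text line.
Word = Tuple[float, float, str]


def split_row_by_spacing(words: List[Word], min_gap: float = 10.0) -> List[str]:
    """Stream-style idea: start a new cell wherever the horizontal gap is wide."""
    cells: List[List[str]] = []
    prev_right = None
    for x0, x1, text in sorted(words):
        if prev_right is None or x0 - prev_right > min_gap:
            cells.append([text])      # wide gap (or first word): new cell
        else:
            cells[-1].append(text)    # narrow gap: same cell as previous word
        prev_right = x1
    return [" ".join(cell) for cell in cells]


def split_row_by_rulings(words: List[Word], rulings: List[float]) -> List[str]:
    """Lattice-style idea: put each word in the column between two vertical lines."""
    xs = sorted(rulings)
    cells: List[List[str]] = [[] for _ in range(len(xs) - 1)]
    for x0, x1, text in sorted(words):
        centre = (x0 + x1) / 2
        column = bisect_right(xs, centre) - 1
        if 0 <= column < len(cells):  # ignore words outside the outer rulings
            cells[column].append(text)
    return [" ".join(cell) for cell in cells]


row = [(0, 30, "Item"), (34, 60, "name"), (120, 150, "Qty"), (200, 240, "Price")]
print(split_row_by_spacing(row, min_gap=10.0))
print(split_row_by_rulings(row, rulings=[0, 100, 180, 260]))
```

On this row both approaches group "Item" and "name" into one cell. The spacing version breaks down when columns sit close together, and the rulings version needs the table to actually have drawn lines, which is roughly why picking the right mode matters.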

The OCR/OpenCV seemed to be fine as well, as long as the text isn't too blurry. Here is a GIF of the OCR/OpenCV running on an example image PDF: https://lh3.googleusercontent.com/-OobUBBtnydg/X6Vn_Ls3juI/A...

