mgm__'s comments

mgm__ · on March 27, 2021

On an adjacent note, the work being done/posted by Collabora motivated me to try to get Mesa + Gallium softpipe working with Emscripten. Here is a demo of the classic glxgears: https://martinmullins.github.io/mesa-softpipe-emscripten/

robert_foss · on March 27, 2021

That is gloriously perverse!

mgm__ · on March 10, 2021

I had some success last year integrating Tesseract OCR and OpenCV with Tabula (compiled to JavaScript). The purpose was to build a Google Docs PDF table import addon without requiring a backend. Happy to get in touch to figure out how I could contribute the work back to Tabula (if that makes sense).

Here is a GIF of table detection for a scanned PDF doc (the first run is slower as it requires fetching the OpenCV.js bundle): https://lh3.googleusercontent.com/-OobUBBtnydg/X6Vn_Ls3juI/A...

Here's a demo of the addon running outside of Google Docs: https://pdftableutil.possiblenull.com/app/

mgm__ · on March 3, 2021

I created a PDF table extractor tool last year with the same idea that it should be local only. Try it here: https://pdftableutil.possiblenull.com/app/ It is also available as a Google Docs addon (still local only): https://workspace.google.com/marketplace/app/pdf_table_impor...

I had a bad case of scope creep, so the tool can also extract tables from scanned/image PDFs using OpenCV.js and a Tesseract OCR wasm build!

kickbeak · on March 3, 2021

Wow, that looks awesome! What did you use to display the PDF in the browser? It all feels really responsive!

mgm__ · on March 3, 2021

I used Mozilla's PDF.js: https://mozilla.github.io/pdf.js/ It is what Firefox uses on desktop to show PDFs!

kickbeak · on March 3, 2021

Thanks, really great work!

redman25 · on March 3, 2021

This is interesting. How accurate would you say it is?

mgm__ · on March 3, 2021

I haven't seen anything better. It started as a PoC, and I decided not to include table detection on the page and instead require the user to draw a box around the table.

I use Tabula under the hood for the cell/row detection, and it is really good provided the correct mode is selected for the type of table. The modes are stream (find cells by spacing) and lattice (find cells by ruling lines).
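
To make the difference between the two modes concrete, here is a toy sketch in Python. It is not Tabula's code or API: the function names, the `min_gap` threshold, and the input shapes are all made up for illustration. The stream-style helper groups text spans into columns wherever there is enough horizontal whitespace between them; the lattice-style helper builds cell boxes directly from ruling line positions.

```python
from typing import List, Tuple

Span = Tuple[float, float]                # (x0, x1) of a piece of text
Box = Tuple[float, float, float, float]   # (x0, y0, x1, y1) of a cell


def stream_columns(spans: List[Span], min_gap: float = 5.0) -> List[Span]:
    """Stream-style idea (hypothetical helper): merge text spans into
    columns, starting a new column when the whitespace gap to the
    previous column is at least `min_gap`."""
    columns: List[Span] = []
    for x0, x1 in sorted(spans):
        if columns and x0 - columns[-1][1] < min_gap:
            # Overlaps or nearly touches the previous column: extend it.
            prev_x0, prev_x1 = columns[-1]
            columns[-1] = (prev_x0, max(prev_x1, x1))
        else:
            # Enough whitespace: this span starts a new column.
            columns.append((x0, x1))
    return columns


def lattice_cells(xs: List[float], ys: List[float]) -> List[Box]:
    """Lattice-style idea (hypothetical helper): given the x positions of
    vertical ruling lines and the y positions of horizontal ruling lines,
    every pair of adjacent lines in each direction bounds one cell."""
    xs, ys = sorted(set(xs)), sorted(set(ys))
    return [
        (x0, y0, x1, y1)
        for y0, y1 in zip(ys, ys[1:])
        for x0, x1 in zip(xs, xs[1:])
    ]


if __name__ == "__main__":
    # Three visually separate columns of text, no ruling lines needed.
    print(stream_columns([(10, 40), (12, 38), (60, 90), (62, 95), (120, 150)]))
    # A 2x2 grid drawn with three vertical and three horizontal lines.
    print(lattice_cells([0, 50, 100], [0, 20, 40]))
```

This also shows why picking the wrong mode hurts: a table without ruling lines gives the lattice approach nothing to work with, and a ruled table with tightly packed text may not have gaps wide enough for the stream approach.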

The OCR/OpenCV seemed to be fine as well, as long as the text isn't too blurry. Here is a GIF of the OCR/OpenCV running on an example image PDF: https://lh3.googleusercontent.com/-OobUBBtnydg/X6Vn_Ls3juI/A...