InvoiceNet: Neural network to extract information from invoice documents (github.com/naivehobo)
189 points by homarp on Aug 13, 2020 | 33 comments


Fantastic work! I will be testing it out for sure.

Coincidentally I'm just about to begin a project intending to use the Form Recognizer service in Azure:

https://azure.microsoft.com/en-us/services/cognitive-service...

I will definitely do a side-by-side comparison with InvoiceNet.

Xero also has a related service: https://www.xero.com/au/features-and-tools/accounting-softwa...


Take a look at Rossum (https://rossum.ai/): hopefully the highest accuracy out there, and as much momentum in terms of dev community etc. as the big cloud services. (Disclaimer: founder here :)


Have a look at Odoo, you can test our engine here: https://www.odoo.com/page/invoice-automation

We will open our API next week at last.


Are you going to make only the API available, or the whole library?


Founder of https://extracttable.com here. Would you like to give us a try? We are running a closed beta with our premium customers and are happy to provide you access to it.


We've tried two external parties for this task. First got bought by a competitor to most of our customers so became a big no-go. Second got bought and went off in another direction.

That's why we're looking at doing this in-house (via Azure or something à la this), even though we really don't want to.


Hi, I get why going with an early-stage service can feel dangerous. Maybe Rossum (https://rossum.ai/) can be a good compromise: a more mature startup with a significant portfolio of enterprise customers already and big momentum, so nothing is likely to veer us off the path anymore. Our product isn't the cheapest, but our customers keep telling us it's the best in class (and in the end it's definitely cheaper than trying to build your own).


Looks interesting, thanks. I've forwarded it to the relevant parties.


This really needs a dataset to go with it. Preferably one from different countries and with different currencies and tax identification number styles.

Invoice recognition is a tricky subject. The companies that specialize in this field have spent a large amount of time and money on the problem. It would be great to see some kind of benchmark against the commercial services.


>> This really needs a dataset to go with it.

Addressed in the disclaimer section. :)


Yes. Let me outline some of the challenges in this field. Invoice information extraction is a subset of forms parsing, which is for many companies a hellishly difficult problem to deal with. The more companies are automated, the more such information will be presented in a machine-readable way, which means you will only have to do field matching rather than actually reading the field. OCR is anything but perfect, so errors creep into digitized forms and require human review. This relegates most of these solutions to aids for a human reviewer rather than a zero-touch process. Finally, when working with international customers and suppliers you will have to deal concurrently with many different kinds of forms and languages, which may require a step before the forms extraction that decides what language the form is in and what kind of form it is.

For each form type that you intend to extract data from, you will need a substantial training database. The good news is that as you use it you build up more data, but for legal reasons you may not be able to use that data to train on.

So that's why it really needs a dataset. One way to get one is to generate it based on a real dataset. I think that stands a much higher chance of happening than some company shipping its highly confidential invoice stack to an unknown entity to make it world-readable. That would likely cause that company serious problems, and its legal department would never sign off on it.
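
To make the "generate it" idea concrete, here is a minimal sketch of producing synthetic ground-truth records with mixed currencies and tax-ID styles. Everything in it is an assumption for illustration: the field names, the three locale profiles, the tax rates and the ID patterns are placeholders rather than validated national formats, and rendering the records onto varied page layouts (the hard part) is left out.

```python
import random
from datetime import date, timedelta


def _digits(rng, n):
    """Return a string of n random decimal digits."""
    return "".join(rng.choice("0123456789") for _ in range(n))


# Hypothetical locale profiles: (currency code, tax-ID generator).
# The patterns are illustrative only and are not checksum-valid.
LOCALES = {
    "DE": ("EUR", lambda rng: "DE" + _digits(rng, 9)),
    "GB": ("GBP", lambda rng: "GB" + _digits(rng, 9)),
    "US": ("USD", lambda rng: _digits(rng, 2) + "-" + _digits(rng, 7)),
}


def synthetic_invoice(rng):
    """Build one fake invoice record to use as a training label."""
    country = rng.choice(sorted(LOCALES))
    currency, make_tax_id = LOCALES[country]
    net = round(rng.uniform(10, 5000), 2)
    tax = round(net * rng.choice([0.0, 0.07, 0.19, 0.20]), 2)
    issued = date(2020, 1, 1) + timedelta(days=rng.randint(0, 364))
    return {
        "invoice_number": f"INV-{rng.randint(1, 99999):05d}",
        "issue_date": issued.isoformat(),
        "country": country,
        "currency": currency,
        "seller_tax_id": make_tax_id(rng),
        "net_amount": net,
        "tax_amount": tax,
        "total_amount": round(net + tax, 2),
    }


def generate_dataset(n, seed=0):
    """Generate n records; a fixed seed makes the dataset reproducible."""
    rng = random.Random(seed)
    return [synthetic_invoice(rng) for _ in range(n)]
```

Seeding the generator keeps a published benchmark reproducible without anyone having to share real invoices.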


What could an invoice app do to make this easy?

For example, what if I embed the invoice data in a JSON file in a PDF? Could that make it easier for the user?

I really don't know much about PDF, but from what little I just read after checking it is possible to do that.


The invoicing standard in France and Germany (factur-x) embeds the XML inside the PDF, which is very convenient.

Here is a python lib that does it: https://pypi.org/project/factur-x/

Odoo, an open source invoicing software, produces factur-x invoices systematically (whatever the country): very convenient as it's parsed automatically.


Invoices in Sweden have a design created for easy OCR-scanning. I prefer this way because it means that there can be no discrepancy between what I see and what the machine sees.

Another benefit is that it's really easy to recognize an invoice at a glance, and you know exactly where all the info is.


Most invoicing software can generate a UBL[1] document (XML) and send this alongside a PDF. But this can (as far as I know) unfortunately not be embedded into the PDF.

[1] https://en.wikipedia.org/wiki/Universal_Business_Language
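
When a UBL file does arrive alongside the PDF, reading the fields is plain XML parsing rather than recognition. A minimal sketch with Python's standard library follows; the sample document is a hand-written, deliberately incomplete fragment (a real UBL invoice has many more required elements), and the choice of which four fields to pull out is an assumption.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# UBL 2.x namespaces for the Invoice document and its component libraries.
NS = {
    "cbc": "urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2",
    "cac": "urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2",
}

# Hand-written sample for illustration; not a schema-valid invoice.
SAMPLE_UBL = """
<Invoice xmlns="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2"
         xmlns:cbc="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2"
         xmlns:cac="urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2">
  <cbc:ID>INV-0001</cbc:ID>
  <cbc:IssueDate>2020-08-13</cbc:IssueDate>
  <cbc:DocumentCurrencyCode>EUR</cbc:DocumentCurrencyCode>
  <cac:LegalMonetaryTotal>
    <cbc:PayableAmount currencyID="EUR">119.00</cbc:PayableAmount>
  </cac:LegalMonetaryTotal>
</Invoice>
"""


def parse_ubl_invoice(xml_text):
    """Extract a few header fields; missing elements come back as None."""
    root = ET.fromstring(xml_text.strip())
    payable = root.find("cac:LegalMonetaryTotal/cbc:PayableAmount", NS)
    return {
        "id": root.findtext("cbc:ID", namespaces=NS),
        "issue_date": root.findtext("cbc:IssueDate", namespaces=NS),
        "currency": root.findtext("cbc:DocumentCurrencyCode", namespaces=NS),
        "payable_amount": Decimal(payable.text) if payable is not None else None,
    }


invoice = parse_ubl_invoice(SAMPLE_UBL)
```

Amounts are parsed as `Decimal` rather than `float` so that monetary values are not subject to binary rounding.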


Add some kind of markup around keys and values in the PDF. While it's important to know the desired field values, their position on the page could be necessary as well.

If you have access to a large number of vendors (invoice templates), these templates are going to be really valuable for creating better models. If you provide samples as a free/open dataset, you could make all invoice models that use your dataset, even if developed by other companies, better on your distribution.
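
As a rough idea of what key/value markup with positions could look like when shipped next to an invoice, here is a small sketch. The record layout, the field names and the bounding-box convention (x0, y0, x1, y1 in page units) are all assumptions made up for this example, not an existing standard.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class FieldAnnotation:
    key: str     # logical field name, e.g. "total_amount" (hypothetical)
    value: str   # text exactly as printed on the page
    page: int    # zero-based page index
    bbox: tuple  # (x0, y0, x1, y1) in page units; an assumed convention


def annotations_to_json(annotations):
    """Serialize annotations to JSON (tuples become JSON arrays)."""
    return json.dumps([asdict(a) for a in annotations], indent=2)


# Made-up values, only to show the shape of the data.
example = [
    FieldAnnotation("invoice_number", "INV-0001", 0, (400.0, 720.0, 520.0, 735.0)),
    FieldAnnotation("total_amount", "119.00", 0, (450.0, 120.0, 520.0, 135.0)),
]
payload = annotations_to_json(example)
```

Keeping the printed text and its position together lets a model learn both what a field says and where it tends to sit on the page.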


That would be the smart approach, not too much money in it though.


Really excited to try this out. I work in document capture and the products that the company I work for offers have absolute garbage "machine learning" capabilities. If this works well it could save us a lot of time building out our complex rules for extracting Invoice data.


If you want to do this in-house, I have some libs I can share for extracting PDFs to JSON/structured data:

js https://www.npmjs.com/package/pdf2json

py https://py-pdf-parser.readthedocs.io/en/latest/ or https://pypi.org/project/pdfminer/

php https://pdfparser.org/documentation


Would be nice to get some benchmarks on this e.g. vs https://aws.amazon.com/textract/ etc


And Azure's form processing module.


I'm curious, how do you prevent overfitting, where the model simply learns the exact formats of the training data and then fails to generalize to a format it has never seen before?

Unless the training data is an extremely diverse set of invoices, maybe randomly generated?
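
One way to at least measure this is to hold out entire layouts, so the test set only contains formats the model never saw during training. A minimal sketch, assuming each sample is a dict carrying a hypothetical "vendor" key that identifies its invoice template:

```python
import random


def split_by_vendor(samples, test_fraction=0.2, seed=0):
    """Split so that no vendor (layout) appears in both train and test."""
    vendors = sorted({s["vendor"] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(vendors)
    # Hold out at least one vendor, even for tiny datasets.
    n_test = max(1, int(len(vendors) * test_fraction))
    test_vendors = set(vendors[:n_test])
    train = [s for s in samples if s["vendor"] not in test_vendors]
    test = [s for s in samples if s["vendor"] in test_vendors]
    return train, test
```

A random per-document split would put invoices from the same vendor on both sides and make the score look better than it is for new formats.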


In general I think invoice extraction models will only generalise up to about 90% F1, after which they plateau off. What you can do at this point is to include examples of the specific formats your clients need to process and overfit on those formats, making the system 95% or 97% accurate. But for new layouts it's only going to be around 90%.
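
For reference, this is roughly how a field-level F1 can be computed. It is a sketch under assumptions: exact string match per field, micro-averaging over all documents, and a wrong value counted as both a false positive and a false negative. Real evaluations often normalize dates and amounts before comparing.

```python
def field_f1(predictions, gold):
    """Micro-averaged F1 over fields; both arguments are lists of dicts."""
    tp = fp = fn = 0
    for pred, truth in zip(predictions, gold):
        for field, value in truth.items():
            guess = pred.get(field)
            if guess is None:
                fn += 1          # field missed entirely
            elif guess == value:
                tp += 1          # exact match
            else:
                fp += 1          # wrong value was emitted...
                fn += 1          # ...and the right one was missed
        # Predicted fields that the gold annotation does not contain.
        fp += sum(1 for f, v in pred.items() if f not in truth and v is not None)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Running it separately on layouts seen in training and on held-out layouts gives the two numbers being contrasted here.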

The problem with invoices is that some fields have extreme variability - the address, the company names and the product descriptions. So a synthetic invoice generation approach might not work when you want to process in a new industry or language.


Look at the loss and val_loss numbers in the sample image, it's way overfitting.
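
For anyone wanting to check this on their own runs, a small sketch of reading overfitting off the two curves follows. It assumes the per-epoch values are available as a dict with "loss" and "val_loss" lists (the shape Keras-style training histories use); the numbers in the usage line are invented and are not the ones from the project's sample image.

```python
def overfitting_report(history):
    """Summarize how far training ran past the best validation epoch."""
    train, val = history["loss"], history["val_loss"]
    # Epoch (zero-based) where validation loss was lowest.
    best_epoch = min(range(len(val)), key=val.__getitem__)
    return {
        "best_epoch": best_epoch,
        "epochs_past_best": len(val) - 1 - best_epoch,
        "final_gap": val[-1] - train[-1],
    }


# Invented curves: training loss keeps falling while validation loss turns up.
report = overfitting_report({
    "loss": [1.0, 0.6, 0.3, 0.1, 0.05],
    "val_loss": [1.1, 0.8, 0.7, 0.9, 1.2],
})
```

A large positive final gap together with several epochs past the best one is the usual sign that training should have stopped earlier.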


That's exactly the first thing that popped into my mind too.


Very curious to test it and compare it with other solutions.

FWIW, one can test the AI of Odoo here: https://www.odoo.com/page/invoice-automation


I think by far the coolest part about this is that you don't need to tag your dataset on a token basis. It could be interesting to see results using the graph convolution approach by Liu et al. as opposed to just feeding raw images.


Is there a public data set with invoice photos? The best I can find is to use Google photo search (I've been searching for about eight years). For proper training and experimentation we would need a couple of thousand.


I'm curious: what hosted software does really good invoice recognition (e.g. roger.ai)?


UiPath has free community models (hosted in cloud) for invoice and receipt processing. They integrate with RPA workflows or can be used as JSON APIs.


Check Rossum's (https://rossum.ai) free trial - happy to hear feedback!


I’ve been using Veryfi (née IQBoxy) with fairly good results.

https://www.veryfi.com/


This would be a great tool to use in-house for speeding up certain processes.



