Take a look at Rossum (https://rossum.ai/): hopefully the highest accuracy out there, and as much momentum in terms of dev community etc. as the big cloud services. (Disclaimer: founder here :)
Founder of https://extracttable.com here. Would you like to give us a try? We are running a closed beta with our premium customers and are happy to provide you access to it.
We've tried two external parties for this task. First got bought by a competitor to most of our customers so became a big no-go. Second got bought and went off in another direction.
That's why we're looking at doing this in-house (via Azure or something along those lines) even though we really don't want to.
Hi, I get why going with an early stage service can feel dangerous. Maybe Rossum (https://rossum.ai/) can be a good compromise: a more mature startup with a significant portfolio of enterprise customers already and big momentum, so nothing is likely to veer us off the path anymore. Our product isn't the cheapest, but our customers keep telling us it's the best in class (and in the end it's definitely cheaper than trying to build your own).
This really needs a dataset to go with it. Preferably one from different countries and with different currencies and tax identification number styles.
Invoice recognition is a tricky subject. The companies that specialize in this field have spent a large amount of time and money on the problem, so it would be great to see some kind of benchmark against the commercial services.
Yes. Let me outline some of the challenges in this field. Invoice information extraction is a subset of forms parsing, which for many companies is a hellishly difficult problem to deal with. The more companies are automated, the more such information will be presented in a machine-readable way, which means you will only have to do field matching rather than actually reading the field. OCR is anything but perfect, so errors creep into digitized forms and require human review. This relegates most of these solutions to aids for a human reviewer rather than a zero-touch process. Finally, when working with international customers and suppliers you will have to deal concurrently with a lot of different kinds of forms and languages, which may require a step before the forms extraction that decides what language the form is in and what kind of form it is.
For each form type that you intend to extract data from, you will need a substantial training database. The good news is that as you use it you build up more data, but for legal reasons you may not be able to use that data to train on.
So that's why it really needs a dataset. One way to get one is to generate it based on a real dataset. I think that stands a much higher chance of happening than some company shipping its highly confidential invoice stack to an unknown entity to make it world-readable. That would likely cause that company serious problems and their legal department would never sign off on it.
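To make the generation idea a bit more concrete, here is a rough sketch that only produces the ground-truth field values for synthetic invoices; rendering them onto page layouts would be a separate step. Everything in it is an assumption for illustration: the field names, the three locales, the simplified tax ID patterns, and the flat 20% tax rate. None of it comes from InvoiceNet or from a real dataset.

```python
import random

# Hypothetical per-country settings; the tax ID patterns are simplified
# illustrations and are not validated against national rules.
LOCALES = {
    "DE": {"currency": "EUR", "tax_id": "DE#########"},
    "GB": {"currency": "GBP", "tax_id": "GB#########"},
    "SE": {"currency": "SEK", "tax_id": "SE############"},
}


def fill_pattern(pattern, rng):
    """Replace every '#' in the pattern with a random digit."""
    return "".join(str(rng.randint(0, 9)) if ch == "#" else ch for ch in pattern)


def synthetic_invoice(rng):
    """Build one record of made-up invoice field values."""
    country = rng.choice(sorted(LOCALES))
    locale = LOCALES[country]
    net = round(rng.uniform(10, 5000), 2)
    tax = round(net * 0.2, 2)  # flat 20% rate, purely an assumption
    return {
        "country": country,
        "currency": locale["currency"],
        "tax_id": fill_pattern(locale["tax_id"], rng),
        "invoice_number": f"INV-{rng.randint(1, 99999):05d}",
        "net_amount": net,
        "tax_amount": tax,
        "total_amount": round(net + tax, 2),
    }


def generate_dataset(n, seed=0):
    """Generate n records; the same seed always gives the same records."""
    rng = random.Random(seed)
    return [synthetic_invoice(rng) for _ in range(n)]
```

Seeding the generator keeps the dataset reproducible, which matters if different people are supposed to benchmark against the same records.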
Odoo, an open source invoicing software, produces Factur-X invoices systematically (whatever the country): very convenient as it's parsed automatically.
Invoices in Sweden have a design created for easy OCR-scanning. I prefer this way because it means that there can be no discrepancy between what I see and what the machine sees.
Another benefit is that it's really easy to recognize an invoice at a glance, and you know exactly where all the info is.
Most invoicing software can generate a UBL[1] document (XML) and send this alongside a PDF. But this can (as far as I know) unfortunately not be embedded into the PDF.
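For anyone who has not seen one, reading fields out of a UBL invoice is plain XML parsing with no OCR involved. A minimal sketch, assuming a heavily trimmed sample document (not schema-valid, just enough elements to show the idea) and a hypothetical parse_ubl helper; the namespace URIs are the standard UBL 2.x ones:

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Standard UBL 2.x namespaces for the common basic/aggregate components.
NS = {
    "cbc": "urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2",
    "cac": "urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2",
}

# Trimmed illustrative sample; a real UBL invoice has many more mandatory elements.
SAMPLE = """<Invoice xmlns="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2"
  xmlns:cbc="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2"
  xmlns:cac="urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2">
  <cbc:ID>INV-001</cbc:ID>
  <cbc:IssueDate>2020-01-15</cbc:IssueDate>
  <cbc:DocumentCurrencyCode>EUR</cbc:DocumentCurrencyCode>
  <cac:LegalMonetaryTotal>
    <cbc:PayableAmount currencyID="EUR">121.00</cbc:PayableAmount>
  </cac:LegalMonetaryTotal>
</Invoice>"""


def parse_ubl(xml_text):
    """Pull a few header fields out of a UBL invoice document."""
    root = ET.fromstring(xml_text)
    payable = root.find("cac:LegalMonetaryTotal/cbc:PayableAmount", NS)
    return {
        "invoice_number": root.findtext("cbc:ID", namespaces=NS),
        "issue_date": root.findtext("cbc:IssueDate", namespaces=NS),
        "currency": root.findtext("cbc:DocumentCurrencyCode", namespaces=NS),
        # Decimal avoids float rounding on monetary amounts.
        "payable_amount": Decimal(payable.text) if payable is not None else None,
    }
```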
Add some kind of markup around keys and values in the PDF. While it's important to know the desired field values, their position on the page could be necessary as well.
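One possible shape for such an annotation, as a sketch only: each extracted field keeps its key, its printed value, and where it sits on the page. The FieldAnnotation name, its attributes, and the coordinate convention (page points, top-left origin) are all assumptions for illustration, not an existing format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class FieldAnnotation:
    key: str     # field name, e.g. "total_amount"
    value: str   # text exactly as printed on the page
    page: int    # 1-based page number
    bbox: tuple  # (x0, y0, x1, y1) in page points, top-left origin assumed


def to_json(annotations):
    """Serialize annotations so they can travel alongside the PDF."""
    return json.dumps([asdict(a) for a in annotations], indent=2)


example = [
    FieldAnnotation("invoice_number", "INV-001", 1, (72.0, 90.0, 140.0, 102.0)),
    FieldAnnotation("total_amount", "121.00", 1, (400.0, 700.0, 460.0, 712.0)),
]
```

Calling to_json(example) gives a JSON list with one object per field, with the bounding box written out as a four-element array.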
If you have access to a large number of vendors (invoice templates), these templates are going to be really valuable for creating better models. If you provide samples as a free/open dataset, you could make all invoice models that use your dataset, even if developed by other companies, better on your distribution.
Really excited to try this out. I work in document capture and the products that the company I work for offers have absolute garbage "machine learning" capabilities. If this works well it could save us a lot of time building out our complex rules for extracting Invoice data.
I'm curious, how do you prevent overfitting, where the model simply learns the exact formats of the training data and then fails to generalize to a format it has never seen before?
Unless the training data is an extremely diverse set of invoices, maybe randomly generated?
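This does not prevent overfitting by itself, but one way to at least measure it is to hold out entire layouts, so that every layout in the test set is one the model never saw during training. A minimal sketch, assuming each invoice record carries a hypothetical layout_id (for example the vendor or template it came from):

```python
import random


def split_by_layout(invoices, test_fraction=0.2, seed=0):
    """Split so that train and test share no layout_id."""
    layouts = sorted({inv["layout_id"] for inv in invoices})
    rng = random.Random(seed)
    rng.shuffle(layouts)
    # Hold out at least one layout, even for tiny datasets.
    n_test = max(1, int(len(layouts) * test_fraction))
    test_layouts = set(layouts[:n_test])
    train = [inv for inv in invoices if inv["layout_id"] not in test_layouts]
    test = [inv for inv in invoices if inv["layout_id"] in test_layouts]
    return train, test
```

A plain random split over invoices would put the same layouts on both sides and could hide exactly the failure mode being asked about.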
In general I think invoice extraction models will only generalise up to about 90% F1, after which they plateau. What you can do at this point is to include examples of the specific formats your clients need to process and overfit on those formats, making the system 95% or 97% accurate. But for new layouts it's only going to be around 90%.
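For reference, this is roughly how a field-level F1 could be computed. It is a sketch under assumptions: exact string match per field, micro-averaged over all invoices, and a hypothetical field_f1 helper. Real benchmarks may normalize values or score fields separately.

```python
def field_f1(predictions, truths):
    """Micro-averaged F1 over extracted fields, using exact value match.

    predictions and truths are parallel lists of {field_name: value} dicts.
    """
    tp = fp = fn = 0
    for pred, truth in zip(predictions, truths):
        for field, value in pred.items():
            if field in truth and truth[field] == value:
                tp += 1  # field extracted with the right value
            else:
                fp += 1  # extracted something wrong or spurious
        for field, value in truth.items():
            if field not in pred or pred[field] != value:
                fn += 1  # true field missed or extracted wrongly
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Note that a wrongly extracted value counts against both precision and recall here, which is one reason the last few points of F1 are hard to gain.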
The problem with invoices is that some fields have extreme variability - the address, the company names and the product descriptions. So a synthetic invoice generation approach might not work when you want to process in a new industry or language.
I think by far the coolest part about this is that you don't need to tag your dataset on a token basis. It could be interesting to see results using the graph convolution approach by Liu et al. as opposed to just feeding raw images.
Is there a public dataset with invoice photos? The best I can find is to use Google image search. (I've been searching for about eight years.) For proper training and experimentation we would need a couple of thousand.
Coincidentally I'm just about to begin a project intending to use the Form Recognizer service in Azure:
https://azure.microsoft.com/en-us/services/cognitive-service...
I will definitely do a side-by-side comparison with InvoiceNet.
Xero also has a related service: https://www.xero.com/au/features-and-tools/accounting-softwa...