Traditional OCRs usually have a detection + recognition pipeline: they detect every word and then predict the text for each one. Errors can happen in both parts, e.g. some words aren't detected and get dropped from the output, or a word is recognized incorrectly, which is also common and more comparable to hallucination. However, since the recognizer is trained to work only on a small patch, accuracy is often higher. Compare this to VLMs, which look at the entire image/context and auto-regressively generate tokens/text, which can also carry a lot of language bias, hence hallucinations.
Traditional OCRs are trained for a single task: recognize characters. They do this through visual features (and sometimes there's an implicit (or even explicit) "language" model: see https://arxiv.org/abs/1805.09441). As such, the extent of their "hallucination", or errors, is when there's ambiguity in characters, e.g. 0 vs O (that's where the implicit language model comes in). Because they're trained with a singular purpose, you would expect their confidence scores (i.e. logprobs) to be well calibrated. Also, depending on the OCR model, you usually do a text detection (get bounding boxes) followed by a text recognition (read the characters), and so it's fairly local (you're only dealing with a small crop).
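To make the "fairly local" point concrete, here's a minimal sketch of that two-stage flow. `detect_boxes` and `recognize_crop` are stand-ins for whatever detector/recognizer you'd actually use, not a real library API:

```python
# Minimal sketch of a detection + recognition OCR pipeline.
# detect_boxes and recognize_crop are placeholders for your actual
# detector/recognizer; they are not a specific library's API.
from dataclasses import dataclass

@dataclass
class OcrWord:
    text: str
    confidence: float   # per-word score; tends to be well calibrated
    box: tuple          # (x1, y1, x2, y2)

def run_ocr(image, detect_boxes, recognize_crop):
    words = []
    for box in detect_boxes(image):        # stage 1: find word-level boxes
        crop = image.crop(box)             # recognition only ever sees a small patch
        text, conf = recognize_crop(crop)  # stage 2: read the characters in that patch
        words.append(OcrWord(text, conf, box))
    # Failure modes stay local: a missed box drops a word, a bad crop misreads
    # a word, but one error can't rewrite the rest of the page.
    return words
```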
On the other hand, these VLMs are very generic models – yes, they're trained on OCR tasks, but also on a dozen other tasks. As such, they're really good OCR models, but they tend not to be as well calibrated. We use VLMs at work (Qwen2-VL to be specific), and we don't find it hallucinates that often, but we're not dealing with long documents. I would assume that as you deal with a larger set of documents, you have a much larger context, which increases the chances of the model getting confused and hallucinating.
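For contrast, the VLM path looks roughly like the Qwen2-VL quick-start pattern below (exact argument names can shift between transformers versions, so treat it as illustrative, not canonical): the whole page sits in context and the transcription is generated token by token.

```python
# Rough sketch of the whole-image, autoregressive VLM approach,
# following the Qwen2-VL quick-start pattern; details may vary by version.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "page.png"},  # local path; URL also works
        {"type": "text", "text": "Transcribe all text in this document."},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[prompt], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

# The whole page is in context and tokens are generated autoregressively,
# so the language prior can "fill in" text that isn't actually on the page.
output_ids = model.generate(**inputs, max_new_tokens=1024)
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```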
When I deploy a webapp on Azure, it expects me to put env variables in a file or use their own tool (key/value fields) where you add env variables one by one.
Is there a way to use envelope in places like those?
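If envelope (or any env manager) can dump its variables as KEY=VALUE lines, one option is to feed that file to the real `az webapp config appsettings set` command. The export step and the resource names below are assumptions, just to show the shape:

```python
# Hedged sketch: read a .env-style file (assumed to be produced by envelope
# or any other env manager) and push the values to an Azure Web App using
# the Azure CLI. Resource group / app names are hypothetical.
import subprocess

def push_env_file_to_azure(env_path: str, resource_group: str, app_name: str) -> None:
    settings = []
    with open(env_path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            settings.append(line)  # already in KEY=VALUE form
    subprocess.run(
        ["az", "webapp", "config", "appsettings", "set",
         "--resource-group", resource_group,
         "--name", app_name,
         "--settings", *settings],
        check=True,
    )

# e.g. push_env_file_to_azure(".env", "my-rg", "my-webapp")  # hypothetical names
```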
We did an exact dedup across all 84 dumps; there are 100T tokens before this exact dedup, and 30T tokens after. If we do a further fuzzy dedup (we have simhash signatures pre-computed for different similarity levels), this can potentially be reduced further.
There is quite a lot of redundancy across dumps, but also a lot of unique/distinct documents.
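For anyone curious what the fuzzy pass looks like, here is a minimal simhash sketch (not our exact pipeline): a 64-bit fingerprint per document, with near-duplicates flagged by Hamming distance, where the distance threshold maps to the similarity level.

```python
# Minimal simhash sketch: 64-bit fingerprint per document,
# near-duplicates detected by Hamming distance.
import hashlib

def simhash(tokens, bits=64):
    v = [0] * bits
    for tok in tokens:
        h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

def near_duplicate(doc_a, doc_b, max_dist=3):
    # Lower max_dist = stricter similarity level; pick per dedup pass.
    return hamming(simhash(doc_a.split()), simhash(doc_b.split())) <= max_dist
```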
We do end up doing a GPT-based rewrite. The initial description is really valuable too, though, and we want to keep it throughout the workflow. It's kind of similar to a spelling correction or query intent system: if the rewrite is high confidence you can override the query, but ideally you use the original one too.
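Roughly the idea (illustrative only, not our actual code): the rewrite only takes the lead when confidence clears a threshold, and the original description is always carried along.

```python
# Sketch of "override only when confident, keep the original anyway".
# The threshold and list-of-queries shape are assumptions for illustration.
def build_search_queries(original: str, rewrite: str, confidence: float,
                         threshold: float = 0.8) -> list[str]:
    if confidence >= threshold:
        return [rewrite, original]  # rewrite leads, original still kept
    return [original]
```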
One thing we also do is match any files mentioned in the issue. So if you mention sweepai/api.py, we'll find that and add it to the fetched files. There's still more work to be done here, so look out for those!
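Something like the sketch below, where the regex and filtering are just an illustration of the idea rather than our actual matcher:

```python
# Illustrative sketch: pull explicit file mentions like sweepai/api.py
# out of an issue body and keep the ones that exist in the repo.
import re

PATH_RE = re.compile(r"\b[\w./-]+\.(?:py|ts|tsx|js|go|rs|java|md)\b")

def mentioned_files(issue_body: str, repo_files: set[str]) -> set[str]:
    candidates = set(PATH_RE.findall(issue_body))
    # Keep only paths that actually exist in the repo file listing.
    return {f for f in repo_files
            if f in candidates or any(f.endswith(c) for c in candidates)}
```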
Likely filename-based scoring and other rules, plus finetuned retrieval models (opt-in, of course).
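A very rough sketch of what blending those signals could look like; the weights and signals here are made up for illustration, not how Sweep actually scores files.

```python
# Hypothetical blend of a filename/rule signal with a retrieval-model score.
def combined_score(query: str, path: str, embedding_score: float) -> float:
    filename_bonus = 1.0 if any(tok in path.lower() for tok in query.lower().split()) else 0.0
    return 0.7 * embedding_score + 0.3 * filename_bonus
```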
Modal is great; it's been able to handle us chunking 10k files/second. Most of the costs come from embedding (a couple hundred dollars to embed tens of thousands of repos a month). Our chunker was in the tens of dollars as well.
The developer experience is also great, so we highly recommend it :)
I didn't mention this point, but we actually do that during the modification. We ask the LLM to extract the necessary subcontext from the main context. It doesn't increase the costs much, but it does help performance because the unnecessary context is stripped away.
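As a rough illustration (the prompt wording and the `llm` callable are placeholders, not our actual implementation):

```python
# Hedged sketch of asking a model to strip a large context down to only
# the parts needed for the change. `llm` is a placeholder chat-completion
# callable; the prompt wording is an assumption.
def extract_subcontext(llm, main_context: str, task: str) -> str:
    prompt = (
        "You are preparing context for a code modification.\n"
        f"Task: {task}\n\n"
        "From the context below, copy out verbatim only the snippets needed "
        "to do the task, and omit everything else.\n\n"
        f"{main_context}"
    )
    return llm(prompt)
```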
I saw your comment, got curious, and looked at a lot of your old comments. Lots of interesting insights - thanks for sharing them.
If you don't mind me asking, what do you do? I'm a researcher at FAANG working on language models and starting a new company in the space. Would love to connect. Feel free to email me - idyllic.bilges0p@icloud.com