Hacker News | applgo443's comments

What's the simple explanation for why these VLM OCRs hallucinate but previous versions of OCR don't?


Traditional OCR systems usually have a detection + recognition pipeline: they detect every word and then predict the text for each one. Errors can obviously happen in both parts, e.g. some words are not detected and get dropped from the output, or a word is recognized incorrectly, which is also common and is the closer analogue to hallucination. However, given that recognition is trained to work only on a small patch, accuracy is often higher. VLMs, by contrast, look at the entire image/context and auto-regressively generate tokens/text, which can also carry a lot of language bias, hence hallucinations.


Why are traditional OCRs better in terms of hallucination and confidence scores?

Can we use the logprobs of an LLM as confidence scores?


Traditional OCRs are trained for a single task: recognize characters. They do this through visual features (and sometimes there's an implicit (or even explicit) "language" model: see https://arxiv.org/abs/1805.09441). As such, the extent of their "hallucination", or errors, is when there's ambiguity in characters, e.g. 0 vs O (that's where the implicit language model comes in). Because they're trained with a singular purpose, you would expect their confidence scores (i.e. logprobs) to be well calibrated. Also, depending on the OCR model, you usually do a text detection (get bounding boxes) followed by a text recognition (read the characters), and so it's fairly local (you're only dealing with a small crop).
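For a concrete example, a classic engine like Tesseract returns a per-word confidence alongside each detected box, which you can threshold directly. A minimal sketch with pytesseract (not tied to any particular model discussed here; the exact 'conf' formatting varies by version):

    # Detect-then-recognize pipeline with a local, per-word confidence score.
    # Assumes pytesseract and Pillow are installed.
    from PIL import Image
    import pytesseract
    from pytesseract import Output

    data = pytesseract.image_to_data(Image.open("page.png"), output_type=Output.DICT)
    for word, conf in zip(data["text"], data["conf"]):
        if word.strip() and float(conf) >= 0:        # -1 marks non-word rows
            print(f"{word}\t{float(conf):.1f}")      # confidence is local to one crop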

On the other hand, these VLMs are very generic models – yes, they're trained on OCR tasks, but also on a dozen other tasks. As such, they're really good OCR models, but they tend not to be as well calibrated. We use VLMs at work (Qwen2-VL to be specific), and we don't find it hallucinates that often, but we're not dealing with long documents. I would assume that as you deal with a larger set of documents, you have a much larger context, which increases the chances of the model getting confused and hallucinating.
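To the logprobs question: you can pull per-token log-probabilities out of the generation loop and use them as a rough confidence signal, with the caveat above that they may not be well calibrated. A minimal sketch with Hugging Face transformers, using a small text-only causal LM to stand in for the VLM decoder (not the Qwen2-VL pipeline, which adds image preprocessing):

    # Per-token logprobs from generation as a rough confidence proxy.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tok("Transcribe the field: INVOICE NO.", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=8,
                         return_dict_in_generate=True, output_scores=True)
    # Log-probability of each generated token under the model.
    logprobs = model.compute_transition_scores(out.sequences, out.scores,
                                               normalize_logits=True)
    new_tokens = out.sequences[0][inputs.input_ids.shape[1]:]
    for tok_id, logp in zip(new_tokens, logprobs[0]):
        print(tok.decode(int(tok_id)), round(float(logp), 3))  # low = less confident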


When I deploy a web app on Azure, it expects me to put env variables in a file or in its own tool (key-value fields) where you add env variables one by one.

Is there a way to use envelope in places like those?


If it's 5 Common Crawl dumps, isn't the data across multiple dumps mostly similar?


We did an exact dedup across all 84 dumps; there are 100T tokens before this exact dedup, and 30T tokens after. If we do further fuzzy dedup (we have simhash signatures pre-computed for different similarity levels), this can potentially be reduced further.

There are quite a lot of redundancies across dumps, but also a lot of unique/distinct documents.
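For intuition, the exact pass can be as simple as hashing normalized document text and keeping the first occurrence; the fuzzy pass then compares simhash signatures within some Hamming-distance threshold. A minimal sketch of the exact step only (the hashing and normalization choices are illustrative, not the actual pipeline):

    # Exact dedup across dumps: keep one copy per content hash.
    import hashlib

    def dedup_exact(docs):
        seen, kept = set(), []
        for doc in docs:
            key = hashlib.sha256(doc.strip().encode("utf-8")).hexdigest()
            if key not in seen:
                seen.add(key)
                kept.append(doc)
        return kept

    print(dedup_exact(["same page text", "same page text", "a distinct page"]))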


Is ETL/ELT the same as writing SQL scripts and periodically executing them? I assumed there's more to it.


Sometimes there is more to it, like pulling data from external services or running ML, but other than that, yeah, it's SQL, DAGs, and cron jobs.
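In its simplest form that's a script cron runs on a schedule, executing a SQL transform against the warehouse; orchestrators mostly add dependency DAGs, retries, and backfills on top. A minimal sketch (SQLite stands in for a real warehouse, and the table names are made up):

    # A cron-invoked transform step: rebuild a daily rollup from a raw table.
    import sqlite3

    def run_daily_rollup(db_path="warehouse.db"):
        con = sqlite3.connect(db_path)
        con.executescript("""
            -- the raw table would normally be populated by the extract/load step
            CREATE TABLE IF NOT EXISTS raw_orders (id INTEGER, created_at TEXT);
            -- the "T": a scheduled SQL transform
            DROP TABLE IF EXISTS daily_orders;
            CREATE TABLE daily_orders AS
            SELECT date(created_at) AS day, count(*) AS n_orders
            FROM raw_orders
            GROUP BY date(created_at);
        """)
        con.commit()
        con.close()

    if __name__ == "__main__":
        run_daily_rollup()   # e.g. crontab entry: 0 3 * * * python rollup.py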


Maybe rewriting the user's description would help you match code better?

Similar to the prompt engineering for previous-era GPT completion models.


We do end up doing a GPT-based rewrite. The initial description is really valuable too though, and we want to keep that throughout the workflow. It's kind of similar to a spelling correction or query intent system: if it's high confidence you can override their query, but ideally you use the original one too.
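A minimal sketch of that rewrite-but-keep-the-original pattern (rewrite_query, confidence, and search are hypothetical placeholders, not Sweep's actual implementation):

    # Rewrite the user's description for retrieval, but only let the rewrite lead
    # when the rewriter is confident; the original query is always kept.
    CONFIDENCE_THRESHOLD = 0.9   # assumed cutoff, purely illustrative

    def build_queries(description, rewrite_query, confidence):
        rewritten = rewrite_query(description)            # e.g. a GPT call
        if confidence(rewritten, description) >= CONFIDENCE_THRESHOLD:
            return [rewritten, description]               # rewrite leads
        return [description, rewritten]                   # original stays primary

    def retrieve(description, rewrite_query, confidence, search):
        results = []
        for q in build_queries(description, rewrite_query, confidence):
            results.extend(search(q))                     # dedupe/rank downstream
        return results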


How do you approach the problem of which files to look into to fix a bug? Embeddings alone don't seem to cut it.


We use some simple ranking heuristics detailed here: https://docs.sweep.dev/blogs/building-code-search

One thing we also do is match any files mentioned in the issue. So if you mention sweepai/api.py, we'll find that and add it to the fetched files. There's still more work to be done here, so look out for those!
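The file-mention matching can be as simple as running a path-like regex over the issue text and pinning any hits ahead of the ranked retrieval results. A minimal sketch (the regex and extension list are assumptions, not the heuristics from the linked post):

    # Pull file paths mentioned in an issue and pin them to the fetched-files list.
    import re

    PATH_RE = re.compile(r"\b[\w./-]+\.(?:py|ts|js|go|rs|java|md)\b")

    def mentioned_files(issue_text, repo_files):
        mentioned = set(PATH_RE.findall(issue_text))
        return [f for f in repo_files
                if f in mentioned or f.split("/")[-1] in mentioned]

    print(mentioned_files("Bug in sweepai/api.py when webhooks fire",
                          ["sweepai/api.py", "sweepai/utils/diff.py"]))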

Likely file-name-based scoring and other rules, plus fine-tuned retrieval models (opt-in, of course).


How is your experience with Modal?

I'm also curious to know more about your costs of deploying and running on Modal.


Modal is great; it's been able to handle us chunking 10k files/second. Most of the costs come from embedding (a couple hundred dollars to embed tens of thousands of repos a month). Our chunker was in the tens of dollars as well.

The developer experience is also great, so we highly recommend it :)


Did you consider first asking the LLM to explain what a code snippet does and using that instead?

It'd significantly increase the costs though.


I didn't mention this point, but we actually do that during the modification. We ask the LLM to extract the necessary subcontext from the main context. It doesn't increase the costs much, but it does help performance because the unnecessary context is stripped away.
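A minimal sketch of that extraction step (the prompt wording and the llm() callable are placeholders, not the actual prompts used):

    # Ask the model to quote only the code relevant to the change, then modify
    # against that smaller subcontext instead of the full file. llm() is a stub.
    def extract_subcontext(llm, file_contents, change_request):
        prompt = (
            "Quote, verbatim, only the code sections needed for this change.\n\n"
            f"Change request: {change_request}\n\n"
            f"File:\n{file_contents}"
        )
        return llm(prompt)

    def modify(llm, file_contents, change_request):
        subcontext = extract_subcontext(llm, file_contents, change_request)
        return llm(f"Apply this change:\n{change_request}\n\n"
                   f"Relevant code:\n{subcontext}")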


I saw your comment, got curious, and looked at a lot of your old comments. Lots of interesting insights - Thanks for sharing them.

If you don't mind me asking, what do you do? I'm a researcher at FAANG working on language models and starting a new company in the space. Would love to connect. Feel free to email me - idyllic.bilges0p@icloud.com

