
This matches my experience working with LLMs. I've built several applications that require an LLM to consider several factors or "zoom levels," and I have yet to work with a model that can do that in a single shot. Instead, you need to have multiple passes for each area of focus. Rather than "edit this manuscript," you want "find all the typos," then "find all the run-on sentences," etc.
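A minimal sketch of that multi-pass approach, assuming a generic call_llm(prompt) helper as a stand-in for whatever SDK you use (the helper name, the pass list, and the prompts are illustrative, not anyone's exact setup):

    # One focused pass per concern instead of a single "edit this manuscript" prompt.
    # `call_llm` is a hypothetical placeholder, not a real library call.

    PASSES = [
        "List every typo in the text below, one per line.",
        "List every run-on sentence in the text below, one per line.",
        "List every unclear or ambiguous passage in the text below, one per line.",
    ]

    def call_llm(prompt: str) -> str:
        # Swap in your provider's chat/completion call here.
        raise NotImplementedError

    def review(manuscript: str) -> dict[str, str]:
        """Run each narrow pass separately and collect the findings."""
        findings = {}
        for instruction in PASSES:
            findings[instruction] = call_llm(f"{instruction}\n\n---\n{manuscript}")
        return findings

The point is that each pass gives the model a single area of focus, which in my experience is what keeps the subtler issues from being skipped.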


Interesting. I wonder if this is related to the model architecture and attention mechanism.

The author seems to be implying it could be: "Even a single mention of ‘code enhancement suggestions’ in our instructions seemed to hijack the model’s attention"


The attention is probably just latching on to strong statistical patterns. Obvious errors create sharp spikes in the attention weights and drown out subtler signals that may actually matter more.
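A toy illustration of the "sharp spike" intuition, using nothing but a plain softmax; this says nothing about real model internals (which involve many heads and layers), it just shows how one large logit grabs nearly all of the mass:

    import math

    def softmax(xs):
        m = max(xs)
        exps = [math.exp(x - m) for x in xs]
        total = sum(exps)
        return [e / total for e in exps]

    # First entry = a glaring error, the rest = subtle issues.
    logits = [9.0, 2.0, 1.5, 1.0]
    print([round(w, 4) for w in softmax(logits)])
    # -> approximately [0.9982, 0.0009, 0.0006, 0.0003]
    # The sharp spike takes ~99.8% of the weight; the subtle signals get almost nothing.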


The weirdness of LLMs is that they're so damn good at so many things, but then you see these glaring gaps that instantly make them seem dumb. We desperately need benchmarks and evals that test these kinds of hard-to-pin-down cognitive abilities.


Absolutely. This is not a new observation, but another thing they struggle with is self-reporting confidence. When I've asked LLMs to classify/tag things along with a confidence score, the number seems random and has no connection to the quality or difficulty of the classification.
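One way to quantify that is a simple calibration check: bucket the self-reported confidences and compare each bucket to its observed accuracy. A sketch assuming you've already collected (confidence, was_correct) pairs; the sample values at the bottom are synthetic, just to show the shape of the output:

    from collections import defaultdict

    def calibration_table(results, n_bins=5):
        """Group predictions by reported confidence and compare to observed accuracy."""
        bins = defaultdict(list)
        for confidence, correct in results:
            idx = min(int(confidence * n_bins), n_bins - 1)
            bins[idx].append(correct)
        table = []
        for idx in sorted(bins):
            lo, hi = idx / n_bins, (idx + 1) / n_bins
            accuracy = sum(bins[idx]) / len(bins[idx])
            table.append((f"{lo:.1f}-{hi:.1f}", len(bins[idx]), round(accuracy, 2)))
        return table

    # If the self-reported numbers meant anything, accuracy would rise with the
    # confidence bucket; per the comment above, in practice it often doesn't.
    sample = [(0.9, False), (0.9, True), (0.5, True), (0.3, True), (0.95, False)]
    for row in calibration_table(sample):
        print(row)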



