
All of these anecdotal stories about "LLM" failures need to go into more detail about what model, prompt, and scaffolding was used. It makes a huge difference. Were they using Deep Research, which searches for relevant articles and brings facts from them into the report? Or did they type a few sentences into ChatGPT Free and blindly take it on faith?

LLMs are _tools_, not oracles. They require thought and skill to use, and not every LLM is fungible with every other one, just like flathead, Phillips, and hex-head screwdrivers aren't freely interchangeable.
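
If it helps, here's a rough sketch of what 'scaffolding' means here: the code around the model that finds real sources before it writes anything. Everything below is a hypothetical stand-in, not any vendor's actual API:

    # Sketch of "scaffolding": fetch real sources first, then ask the
    # model to write only from those sources (roughly what Deep
    # Research-style tools do under the hood).
    def web_search(query: str) -> list[str]:
        """Hypothetical search helper; would return article texts."""
        return [f"[article text about {query}]"]  # placeholder

    def call_llm(prompt: str) -> str:
        """Hypothetical model call; would hit a real chat API."""
        return "[model output]"  # placeholder

    def deep_research(question: str) -> str:
        sources = web_search(question)
        context = "\n\n".join(sources[:5])  # ground the model in real text
        prompt = (
            f"Using ONLY the sources below, answer: {question}\n\n"
            f"Sources:\n{context}\n\n"
            "Cite which source supports each claim."
        )
        return call_llm(prompt)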



If any non-trivial ask of an LLM also requires the prompt and scaffolding to be listed and independently verified along with its output, the tool's utility is severely diminished. It should be saving us time, not giving us extra homework.

Far better to just get these problems resolved.


That isn't what I'm saying. I'm saying you can't make a blanket statement that LLMs in general aren't fit for some particular task. There are certainly tasks where no LLM is competent, but for others, some LLMs might be suitable while others are not. At least some level of detail beyond "they used an LLM" is required to know whether a) there was user error involved, or b) an inappropriate tool was chosen.


Then they shouldn't market it as one-size-fits-all.


Are they? Every foundation model release includes benchmarks with different levels of performance in different task domains. I don't think I've seen any model advertised by its creating org as either perfect or even equally competent across all domains.

The secondary-market snake-oil salesmen <cough>Manus</cough>? That's another matter entirely, and a very high degree of skepticism toward their claims is certainly warranted. But that's no different from many other huckster-saturated domains.


People like Zuckerberg go around claiming that most of their code will be written by AI starting sometime this year. Other companies hear that and use it as a reason (or false cover) for layoffs. The reality is that LLMs still have a way to go before replacing experienced devs, and even when they start getting there, there will be a period where we’re learning what we can and can’t trust them with and how to use them effectively and responsibly. That feels like at least a few years from now, but the marketing says it’s now.


In many, many cases those problems are resolved by improvements to the model. The point is that making a big deal about LLM fuck-ups in 3-year-old models that don't reproduce in new ones is a complete waste of time and just spreads FUD.


Did you read the original tweet? She mentions the models and gives high-level versions of her prompts. I'm not sure what "scaffolding" is.

You're right that they're tools, but I think the complaint here is that they're bad tools, much worse than they are hyped to be, to the point that they actually make you less efficient because you have to do more legwork to verify what they're saying. And I'm not sure that "prompt training," which is what I think you're suggesting, is an answer.

I've had several bad experiences lately. With Claude 3.7, I asked how to restore a running database in AWS to a snapshot (RDS, if anyone cares). It basically said "Sure, just go to the db in the AWS console and select 'Restore from snapshot' in the actions menu." There was no such button. I later read AWS docs that said you cannot restore a running database to a snapshot; you have to create a new one from the snapshot.
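
(For anyone curious, the documented path is to restore the snapshot into a new instance and cut over. A rough boto3 sketch, with made-up identifiers:)

    import boto3

    rds = boto3.client("rds")

    # You can't roll a live RDS instance back in place; you restore the
    # snapshot into a NEW instance, then point your app at it.
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier="mydb-restored",         # new instance name
        DBSnapshotIdentifier="mydb-snap-2024-01-01",  # existing snapshot
    )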

I'm not sure that any amount of prompting will make me feel confident that it's finally not making stuff up.


I was responding to the "they used an LLM" story about the Norwegian school report, not the original tweet. The original tweet has a great level of detail.

I agree that hallucination is still a problem, albeit a lot less of one than it was in the recent past. If you're using an LLM for tasks where you're not directly providing it the context it needs, or where it doesn't have solid tooling to find and incorporate that context itself, that risk increases.
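
Concretely, the difference looks something like this (a sketch; call_llm is a hypothetical stand-in for whatever chat API you use):

    def call_llm(prompt: str) -> str:
        """Hypothetical stand-in for any chat API."""
        return "[model output]"  # placeholder

    # Risky: the model answers from training data alone and may confabulate.
    risky = call_llm("How do I restore a running RDS database to a snapshot?")

    # Safer: paste authoritative text in and constrain the answer to it.
    docs = open("rds_restore_docs.txt").read()  # e.g. copied AWS docs
    grounded = call_llm(
        "Answer using ONLY the documentation below. If it doesn't cover "
        "the question, say so.\n\n" + docs +
        "\n\nQuestion: how do I restore an RDS database to a snapshot?"
    )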


Why do you think these details are important? The entire point of these tools is that I am supposed to be able to trust what they say. The hard work is precisely in spotting which claims are true and which are false. If I could do that myself, I wouldn't need an assistant.


> The entire point of these tools is that I am supposed to be able to trust what they say

Hard disagree, and I feel like this assumption might be at the root of why some people seem so down on LLMs.

They’re a tool. When they’re useful to me, they’re so useful they save me hours (sometimes days) and allow me to do things I couldn’t otherwise, and when they’re not, they’re not.

It never takes me very long to figure out which scenario I’m in, but I 100% understand and accept that figuring that out is on me and part of the deal!

Sure, if you think you can “vibe code” (or “vibe founder”) your way to massive success by getting LLMs to do stuff you’re clueless about, with no way to check, you’re going to have a bad time. But the fact that they can’t (so far) do that doesn’t make them worthless.


Because then I can know whether the hallucinations they encountered are a little surprising, or not surprising at all.


Because it's the difference between a fleshy hallucination and something that might relate to reality.


> Why do you think these details are important?

It's https://en.wikipedia.org/wiki/Sealioning



