
All the environments they test (Tower of Hanoi, Checker Jumping, River Crossing, Blocks World) could easily be solved perfectly by any of the LLMs if the authors had allowed it to write code.

I don't really see how this is different from "LLMs can't multiply 20-digit numbers"--which, btw, most humans can't either. I tried it once (using pen and paper) and consistently made errors somewhere.



> I don't really see how this is different from "LLMs can't multiply 20-digit numbers"--which, btw, most humans can't either. I tried it once (using pen and paper) and consistently made errors somewhere.

People built missiles and precision-engineered machines like jet aircraft before we had computers. Humans can do all of those computations reliably just by spending more time thinking, inventing better strategies, and using more paper.

Our brains weren't made to do such computations, but a general intelligence can solve the problem anyway by using what it has in a smart way.


Some specialized people could probably do a 20x20-digit multiplication, but I'd still expect them to make a mistake at 100x100. The level of calculation we needed for spacecraft was much less than that, and we had many layers of checks to help catch errors afterwards.

I'd wager that 95% of humans wouldn't be able to do a 10x10-digit multiplication without errors, even if we paid them $100 to get it right. There's a reason we had to invent lots of machines to help us.
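
To get a feel for how much mechanical work that actually is, here is a rough sketch of schoolbook long multiplication - the same procedure done on paper - with a counter for the single-digit multiplications involved; a 10-digit by 10-digit product already takes 100 of them, before counting the carries and the final addition (illustrative Python, not anyone's actual method):

    def schoolbook_multiply(a, b):
        # Multiply two non-negative integers digit by digit, as done on paper.
        xs = [int(d) for d in str(a)][::-1]  # least-significant digit first
        ys = [int(d) for d in str(b)][::-1]
        digits = [0] * (len(xs) + len(ys))
        single_digit_ops = 0
        for i, x in enumerate(xs):
            carry = 0
            for j, y in enumerate(ys):
                single_digit_ops += 1
                total = digits[i + j] + x * y + carry
                digits[i + j] = total % 10
                carry = total // 10
            digits[i + len(ys)] += carry
        value = int("".join(str(d) for d in reversed(digits)))
        return value, single_digit_ops

    a, b = 1234567890, 9876543210
    product, ops = schoolbook_multiply(a, b)
    assert product == a * b
    print(ops)  # 100 single-digit multiplications for a 10x10-digit product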

It would be an interesting social studies paper to try and recreate some "LLMs can't think" papers with humans.


> There's a reason we had to invent lots of machines to help us.

The reason was efficiency, not that we couldn't do it. If a machine can do it then we don't need expensive humans to do it, so human time can be used more effectively.


I don't think you got @Jensson's point.

With enough effort and time we can arrive at a perfect solution to those problems without a computer.

This is not a hypothetical, it was like that for at least hundreds of years.


With enough time and effort you can build an entire science of how arbitrarily complex computations can be done with pen and paper without errors in an arbitrarily long amount of time.

But then you're not measuring the ability to perform the calculations, but the ability to invent the methods that make the calculation possible.


No. A huge population of humans did, while standing on the shoulders of giants.


Humans aren't giants; they stood on the shoulders of other humans. So for AI to be equivalent, it should stand on the shoulders of other AI models.


Building for thousands of years, with a population somewhere between millions and billions at any given time.


Right, and when we have AI that can do the same with millions/billions of computers then we can replace humans.

But as long as AI cannot do that, it cannot replace humans, and we are very far from that. Currently AI cannot even replace individual humans in most white-collar jobs. Replacing an entire team is much harder than replacing an individual, and replacing the workers in an entire field is harder still, because then the AI has to do the research and make the advances on its own.

So like, we are still very far from AI being able to completely replace human thinking and thus be called AGI.

Or in other words, AI has to replace those giants to be able to replace humanity, since those giants are humans.


> if the authors had allowed it to write code.

Yeah, and FWIW, doing this by writing code is trivial for an LLM / LRM - after testing locally, it took not even a minute to get a working solution, no matter the number of disks.

Your analogy makes sense: no reasonable person would try to solve a Tower of Hanoi-type problem with e.g. 15 disks and sit there for 32,767 moves non-programmatically.
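
For reference, a minimal sketch of that kind of solution (a standard recursive solver in Python; for 15 disks it enumerates all 32,767 moves):

    def hanoi(n, source, target, spare, moves):
        # Move n disks from source to target, using spare as scratch space.
        if n == 0:
            return
        hanoi(n - 1, source, spare, target, moves)
        moves.append((source, target))
        hanoi(n - 1, spare, target, source, moves)

    moves = []
    hanoi(15, "A", "C", "B", moves)
    print(len(moves))  # 32767, i.e. 2**15 - 1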


> but humans can't do it either

This argument is tired, as it keeps getting repeated for any flaw seen in LLMs. And the other tired argument is: wait! This is a sigmoid curve, and we have not seen the inflection point yet. If someone gave me a penny for every comment saying these things, I'd be rich by now.

Humans invented machines because they could not do certain things. Everything from simple machines in physics (Archimedes' lever) to the modern computer.


> Humans invented machines because they could not do certain things.

If your disappointment is that the LLM didn't invent a computer to solve the problem, maybe you need to give it access to physical tools, robots, labs etc.


Nah, even if we follow such a weak "argument", the fact is that, ironically, the evidence shown in this and other papers points towards the idea that even if LRMs did have access to physical tools, robots, labs, etc.*, they probably would not be able to harness them properly. So even if we had an API-first world (i.e. every object and subject in the world could be mediated via an MCP server), they wouldn't be able to perform as well as we hope.

Sure, humans may fail at a 20-digit multiplication problem, but I don't think that's relevant. Most aligned, educated and well-incentivized humans (such as the ones building and running labs) will follow complex and probably ill-defined instructions correctly and predictably - instructions harder to follow and interpret than an exact Towers of Hanoi solving algorithm. Don't misinterpret me, human errors do happen in those contexts because, well, we're talking about humans, but not as catastrophically as the errors committed by LRMs in this paper.

I'm kind of tired of people comparing humans to machines in such simple and dishonest ways. Such thoughts pollute the AI field.

*In this case, for some of the problems, the LRMs were given an exact algorithm to follow, and they still didn't follow it. I wouldn't keep my hopes up for an LRM handling a full physical laboratory/factory.


> Don't misinterpret me, human errors do happen in those contexts because, well, we're talking about humans, but not as catastrophically as the errors committed by LRMs in this paper.

If your argument is just that LRMs are more noisy and error prone in their reasoning, then I don't disagree.

> I'm kind of tired of people comparing humans to machines in such simple and dishonest ways.

The issue is people who say "see, the AI makes mistakes at very complex reasoning problems, so their 'thinking is an illusion'". That's the title of the paper.

This mistake comes not from people "comparing humans to machines", but from people fundamentally misunderstanding what thinking is. If thinking is what humans do, then errors are expected.

There is this armchair philosophical idea that a human can simulate any Turing machine and thus our reasoning is "maximally general", and anything that can't do this is not general intelligence. But this is the complete opposite of reality. In our world, anything we know of that can perfectly simulate a Turing machine is not a general intelligence, and vice versa.


> The issue is people who say "see, the AI makes mistakes at very complex reasoning problems, so their 'thinking is an illusion'". That's the title of the paper.

That's not what the paper proposes (i.e. "it commits errors => thinking is an illusion"). It in fact looks at the failure modes and then argues that, due to HOW they fail and in which contexts/conditions, their thinking may be "illusory" (not that the word illusory matters that much; papers of this calibre always strive for interesting-sounding titles). Hell, they even gave the exact algorithm to the LRM - it can hardly get more enabling than that.

Humans are lossy thinkers and error-prone biological "machines", but an educated+aligned+incentivized one shouldn't have problems following complex instructions/algorithms (not in a zero-errors way, but in a self-correcting way). We thought that LRMs did that too, but the paper shows how they even start using fewer "thinking" tokens after a complexity threshold, and that's terribly worrisome - akin to someone getting frustrated and giving up once a problem gets too difficult - which goes contrary to the idea that these machines can run laboratories by themselves. It is not the last nail in the coffin, because more evidence is needed as always, but taken together with other papers it points towards the limitations of LLMs/LRMs, and towards how those limitations may not be solvable with more compute/tokens but rather by exploring new paradigms (long overdue in my opinion; the industry usually forces a paradigm as a panacea during hype cycles in the name of hypergrowth/sales).

In short, the argument you say the paper and posters ITT are making is very different from what they are actually saying, so beware of the logical leap you are making.

> There is this armchair philosophical idea that a human can simulate any Turing machine and thus our reasoning is "maximally general", and anything that can't do this is not general intelligence. But this is the complete opposite of reality. In our world, anything we know of that can perfectly simulate a Turing machine is not a general intelligence, and vice versa.

That's typical goalpost moving, and it happens in both directions when talking about "general intelligence", as you say - it has since the dawn of AI and the first neural networks. I'm not following why this is relevant to the discussion, though.


>Large Language Model as a Policy Teacher for Training Reinforcement Learning Agents

>In this paper, we introduce a novel framework that addresses these challenges by training a smaller, specialized student RL agent using instructions from an LLM-based teacher agent. By incorporating the guidance from the teacher agent, the student agent can distill the prior knowledge of the LLM into its own model. Consequently, the student agent can be trained with significantly less data. Moreover, through further training with environment feedback, the student agent surpasses the capabilities of its teacher for completing the target task.

https://arxiv.org/abs/2311.13373


> humans can't

The reasons humans can't and the reasons LLMs can't are completely different though. LLMs are often incapable of performing multiplication. Many humans just wouldn't care to do it.


>write code

Doesn't that come down to allowing it to directly regurgitate training data? Surely it's seen dozens of such solutions.


The goal isn't to assess the LLMs' capability at solving any of those problems. The point isn't how good they are at Blocks World puzzles.

The point is to construct non-circular ways of quantifying model performance at reasoning. That the LLM has access to prior exemplars of any given problem is exactly the issue in establishing reasoning performance, as opposed to historical synthesis.


How are these problems more interesting than simple arithmetic or algorithmic problems?


Towers of Hanoi IS an algorithmic problem. It is a high-school/college-level problem when designing algorithms, and probably kid-level when solving it intuitively, heuristically, or via brute force for a few disks (e.g. when playing Mass Effect 1 or similar games that embed it as a minigame*).

* https://www.youtube.com/watch?v=1vTBVyhX7n4


The problems themselves aren’t particularly interesting, I suppose. The interesting part is how the complexity of each problem scales as a function of the number of inputs (e.g. the number of disks in the tower of Hanoi).
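
To make the scaling concrete for Tower of Hanoi (assuming the standard result that the minimum solution for n disks is 2^n - 1 moves, so the solution length doubles with every disk added):

    # Minimum number of moves for an n-disk Tower of Hanoi.
    for n in (3, 7, 10, 15, 20):
        print(n, 2**n - 1)  # 7, 127, 1023, 32767, 1048575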


Well that's because all these LLMs have memorized a ton of code bases with solutions to all these problems.



