


> LLMs are human culture compiled into code. They will strongly tend to follow patterns of human behavior, to the point of that being their main (only?) feature.

Perhaps, but this already disproves the idea of superhuman PhD+ level AI agents.

---

I think the paper has a good motivation. It would be great if we could somehow define the complexity of problems AIs are able to tackle. But the Tower of Hanoi puzzle does not seem like a good fit, especially since the solution can be generated mechanically once you know the very simple algorithm.
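For reference, a minimal sketch of that mechanical solution in Python (the standard recursive algorithm; the function and peg names here are mine, not from the paper):

    def hanoi(n, source="A", target="C", spare="B", moves=None):
        # Move n disks from source to target, using spare as scratch space.
        if moves is None:
            moves = []
        if n > 0:
            hanoi(n - 1, source, spare, target, moves)  # park the top n-1 disks on the spare peg
            moves.append((source, target))              # move the largest disk
            hanoi(n - 1, spare, target, source, moves)  # stack the n-1 disks back on top
        return moves

    print(len(hanoi(10)))  # 1023 moves, i.e. 2**10 - 1

The full move sequence for any number of disks falls out of a dozen lines, which is what makes the puzzle "mechanical" rather than a test of reasoning depth.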


> Perhaps, but this already disproves the idea of superhuman PhD+ level AI agents.

The value of AI is not some higher level of understanding, but rather the ability to (potentially) carry on millions of independent, interleaved thought threads/conversations at once, and to work tirelessly at high throughput.

Perhaps with some kind of recursive approach with iterative physical grounding that tests hypotheses against ground truth, AI can transcend human levels of understanding, but for now we need to accept that AI is going to be more of an intern-level assistant with occasional episodes of drunk uncle Bob.


Your comment sparked a thought that might be of value.

If LLMs are indeed primarily useful for solving simpler tasks and will systematically balk at complex problems, perhaps a variant of the “thinking” approach is of value?

A high-iteration task would be approached by solving one part of the problem and then identifying the next -small part- of the problem, in an iteration loop.

I can also see this easily going off the rails, but perhaps posing each subsequent iteration as -small-, and carrying over only the relevant context while trimming off the older context tail, could work as an agentic behavior spawned from the main supervisory task?
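Very roughly, a hand-wavy sketch of that loop in Python (the llm callable and the prompt wording are placeholders I made up, not any real API):

    def solve_iteratively(task, llm, max_steps=20):
        # Keep a short running summary instead of the whole context tail.
        summary = ""
        for _ in range(max_steps):
            prompt = (
                f"Overall task: {task}\n"
                f"Progress so far: {summary}\n"
                "Do the next SMALL step only, then say what remains. "
                "Reply DONE if the task is finished."
            )
            step_result = llm(prompt)  # placeholder call to whatever model/agent is in use
            if "DONE" in step_result:
                return step_result
            # Trim the tail: fold this step into a fresh short summary.
            summary = llm(
                f"Summarize the progress so far in a few sentences:\n{summary}\n{step_result}"
            )
        return summary

The supervisory task would just be the thing that spawns this loop per sub-problem and checks the result.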

Of course, just having the model write and run code would probably be better for many classes of problems.

Maybe something to be looked at, if it’s not already being used.


Huh, I wonder why my post that you replied to got deleted?


A pet peeve of mine is that these papers practically never benchmark against humans before making grandiose claims about how bad these models are.

If you tried to get a human to write out a solution to the Tower of Hanoi with a large number of disks, you'd get a whole lot of refusals, and even if you cajoled people into doing it, you'd get a whole lot of errors because people would get sloppy.
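For scale, the optimal solution for n disks takes 2^n - 1 moves: 10 disks is already 1,023 moves, and 15 disks is 32,767.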

That said, there were some useful bits: basically the notion that we can't expect to just keep ramping inference budgets higher and higher for reasoning models and keep getting gains without improving the underlying LLMs too.

I just wish they'd been more rigorous, and stuck to that.



