A big problem I keep facing when reviewing junior engineers' code is not the code quality itself but the direction the solution took. I'm not sure whether LLMs are capable of replying with a question about why you want to do it that way (yes, like the famous Stack Overflow answers).
Nothing fundamentally prevents an LLM from achieving this. You can ask an LLM to produce a PR, another LLM to review the PR, another LLM to critique the review, then another LLM to question the validity of the original issue, and so on...
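As a rough sketch (not anything a particular product actually does), that chain is just a handful of calls with different system prompts. This assumes the OpenAI Python SDK; the model name, prompts, and example issue are all placeholders:

```
# Hypothetical sketch of the review chain: each stage critiques the previous one.
from openai import OpenAI

client = OpenAI()

def ask(role_instructions: str, material: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system", "content": role_instructions},
            {"role": "user", "content": material},
        ],
    )
    return resp.choices[0].message.content

issue = "Users report the export button does nothing on Safari."
pr = ask("You are an engineer. Propose a PR (diff + description) for this issue.", issue)
review = ask("You are a reviewer. Point out problems with this PR.", pr)
critique = ask("You critique code reviews. Is this review fair and complete?", review)
sanity = ask("Question whether the original issue is even worth solving.", issue)
```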
The reason LLMs are such a big deal is that they are humanity's first tool general enough to support recursion (besides humans, of course). If you can use an LLM, there's like a 99% chance you can program another LLM to use that LLM the same way you do:
People learn the hard way how to properly prompt an LLM agent product X to achieve results -> some company encodes these learnings in a system prompt -> we now get a new agent product Y that is capable of using X just like a human -> we no longer use X directly. Instead, we move up one level in the command chain and use product Y. And this recursion goes on and on, until there's no level left for us to go up to.
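A minimal sketch of that one step up the chain, assuming an entirely hypothetical `agent-x` CLI standing in for product X:

```
# Hypothetical sketch: agent Y drives product X the way a human used to.
# `agent-x` is a made-up CLI; the system prompt encodes the hard-won usage lessons.
import subprocess

Y_SYSTEM_PROMPT = """You operate the agent-x coding tool on the user's behalf.
Lessons humans learned the hard way:
- give it one small, well-scoped task at a time
- make it state a plan before it edits any files
- make it run the tests after every change
"""

def run_agent_x(task: str) -> str:
    # From Y's point of view, X is just another tool to call.
    result = subprocess.run(
        ["agent-x", "--task", task], capture_output=True, text=True
    )
    return result.stdout
```

The only thing Y adds is the system prompt, i.e. the encoded experience of using X, so the human now only has to talk to Y.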
We have basically been seeing this play out in real time with coding agents over the past few months.
I assume you ignored "teleology" because you concede the point; otherwise, feel free to take it up.
" Is there an “inventiveness test” that humans can pass but LLMs don’t?"
Of course: any topic for which no training data is available and that cannot be extrapolated by simply remixing existing data. Admittedly, that is harder to test on current unknowns and unknown unknowns.
But it is trivial to test on retrospective knowledge. Just train the AI on text up to, say, 1800 and see whether it can come up with antibiotics and general relativity, or whether it will simply repeat outdated theories of disease and Newtonian gravity.
I don't think it would settle things even if we did manage to train an 1800-cutoff LLM of sufficient size.
LLMs are blank slates (like an uneducated, primitive human being; an LLM does come with built-in knowledge, but that is mostly irrelevant here). An LLM's output is purely a function of its input (the context), so an agentic system's capabilities are not the same as the underlying LLM's capabilities.
If you ask such an LLM to "overturn Newtonian physics, come up with a better theory", of course it won't give you relativity just like that, in the same way an uneducated human has no chance of coming up with relativity either.
However, ask it this:
```
You are Einstein ...
<omitted: 10 million tokens establishing Einstein's early life and learnings>
... Recent experiments have cast doubt on these ideas, ...<another bunch of tokens explaining the Michelson–Morley experiment>... Any idea why this occurs?
```
and provide it with tools to find books, speak with others, run experiments, etc. Conceivably, the result will be different.
Again, we pretty much see this play out in coding agents:
Claude the LLM has no prior knowledge of my codebase, so of course it has zero chance of fixing a bug in it. Claude 4 is a blank slate.
Claude Code, the agentic system, can (a rough sketch of this loop follows the list):
- look at a screenshot.
- know what the overarching goal is from past interactions & various documentation it has generated about the codebase, as well as higher-level docs describing the company and products.
- realize the screenshot is showing a problem with the program.
- form hypotheses about why the bug occurs.
- verify hypotheses by observing the world ("the world" to Claude Code is the codebase it lives in, so by "observing" I mean it reads the code).
- run experiments: modify code, then run a type check or unit test (although usually the final observation is outsourced to me, so I am the AI's tool as much as the other way around).
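Here is that rough sketch, under heavy assumptions: `ask_llm` is a stand-in for whatever model API you use, the "experiments" are just pytest, and none of this is Claude Code's actual internals.

```
# Hypothetical observe -> hypothesize -> experiment loop.
# `ask_llm(prompt) -> str` is an assumed stand-in for a model API call.
import subprocess
from pathlib import Path

def observe(path: str) -> str:
    # "Observing the world" here just means reading code.
    return Path(path).read_text()

def experiment() -> str:
    # Run the project's checks; using pytest is an assumption about the codebase.
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return result.stdout + result.stderr

def debug_loop(ask_llm, goal: str, suspect_file: str, max_rounds: int = 3) -> str:
    context = f"Goal: {goal}\n\nCode:\n{observe(suspect_file)}"
    for _ in range(max_rounds):
        hypothesis = ask_llm(f"{context}\n\nWhy might this bug occur? Propose one fix.")
        patched = ask_llm(f"{context}\n\nApply this fix and return the full file:\n{hypothesis}")
        Path(suspect_file).write_text(patched)
        report = experiment()
        if "failed" not in report:
            return f"Fixed. Hypothesis: {hypothesis}"
        context += f"\n\nTest output after last attempt:\n{report}"
    return "Gave up; handing the final observation back to the human."
```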
They are definitely capable. Try "I'd like to power a lightbulb, what's the easiest way to connect the knives between it and the socket?" The reply will start by saying it's a bad idea. My output also included:
> If you’re doing a DIY project Let me know what you're trying to achieve
That is basically the SO-style question you mentioned.
The more nuanced the issue, the more explicitly you have to say in the prompt that you're looking for sanity checks and analysis of the idea, not just a direct implementation. But it's always possible.
You can ask it the "why", and if it proposes the wrong approach, just ask it to change course toward what you want. What is wrong with iteration?
I frequently have the LLM write a proposal.MD first and iterate on that, then have it produce the full solution and iterate on that.
It is always interesting to see whether the proposal matches what I had in mind; many times it uses tech or ideas I didn't know about myself, so I am constantly learning too.
I might not have been clear in my original reply. I don't have this problem when using an LLM myself; I sometimes notice it when I review code by new joiners that was written with the help of an LLM. The code quality is usually OK unless I want to be pedantic, but sometimes the agent helps newcomers dig themselves deeper into the wrong approach, whereas if they had asked a human coworker they would probably have noticed from the start that the solution was going the wrong way. This touches on what the original article is about. I don't know if that counts as incompetence acceleration, but used wrongly, or without clear direction, it can produce something that works but has monstrous, unneeded complexity.
We had the same worries about StackOverflow years ago. Juniors were going to start copying code from there without any understanding, with unnecessary complexity and without respect for existing project norms.