What is the purpose of an AGENTS.md file when there are so many different models? Which model or version of the model is the file written for? So much depends on assumptions here. It only makes sense when you know exactly which model you are writing for. No wonder the impact is 'all over the place'.
I can follow the arguments, and I find many of them plausible. But LLMs are still unreliable and require attention and verification. Ultimately, it's an economic question: the cost of training the model and the computing power required to produce accurate results.
The strongest argument is the one about the interface. LLMs will definitely have a large impact. But under the hood, I still expect to see a lot of formally verified code, written by engineers with domain knowledge, with support from AI.
I am curious to know what he has in mind. This 'process engineering' could be a solution to problems that BPM and COBOL are trying to solve. He might end up with another formalized layer of indirection (with rules and constraints for everyone to learn) that integrates better with LLM interactions (which are also evolving rapidly).
I like the idea that 'code is truth' (as opposed to 'correct'). An AI should be able to use this truth and mutate it according to a specification. If the output of an LLM is incorrect, it is unclear whether the specification is incorrect or if the model itself is incapable (training issue, biases). This is something that 'process engineering' simply cannot solve.
I'm also curious about what a process engineering abstraction layer looks like. Though the final section does hint at it: tighter integration of more stakeholders, closer to the construction of the code.
Though I have to push back on the idea of "code as truth". Thinking about all the layers of abstraction and indirection... haven't the data and the database layer typically been the source of truth?
Maybe I'm missing something in this iteration of the industry where code becomes something other than what it's always been: an intermediary between business and data.
Yes, the database layer and the data itself are also sources of truth. Code (including code run inside the database, such as SQL, triggers, stored procedures and other native modules) defines behaviour. The data influences behaviour. This is why we can only test code with data that is as close to reality as possible, or even production data.
> It's so interesting to watch an agent relentlessly work at something. They never get tired, they never get demoralized, they just keep going and trying things where a person would have given up long ago to fight another day. It's a "feel the AGI" moment to watch it struggle with something for a long time just to come out victorious 30 minutes later.
Somewhere, there are GPUs/NPUs running hot. You send all the necessary data, including information that you would never otherwise share. And you most likely do not pay the actual costs. It might become cheaper or it might not, because reasoning is a sticking plaster on the accuracy problem. You and your business become dependent on this major gatekeeper. It may seem like a good trade-off today. However, the personal, professional, political and societal issues will become increasingly difficult to overlook.
This quote stuck out to me as well, for a slightly different reason.
The “tenacity” referenced here has been, in my opinion, the key ingredient in the secret sauce of a successful career in tech, at least in these past 20 years. Every industry job has its intricacies, but for every engineer who earned their pay with novel work on a new protocol, framework, or paradigm, there were 10 or more providing value by putting the myriad pieces together, muddling through the ever-waxing complexity, and crucially never saying die.
We all saw others weeded out along the way for lacking the tenacity. Think the boot camp dropouts or undergrads who changed majors when first grappling with recursion (or emacs). The sole trait of stubbornness to “keep going” outweighs analytical ability, leetcode prowess, soft skills like corporate political tact, and everything else.
I can’t tell what this means for the job market. Tenacity may not be enough on its own. But it’s the most valuable quality in an employee in my mind, and Claude has it.
There is an old saying back home: an idiot never tires, only sweats.
Claude isn't tenacious. It is an idiot that never stops digging because it lacks the meta cognition to ask 'hey, is there a better way to do this?'. Chain of thought's whole raison d'être was so the model could get out of the local minima it pushed itself into. The issue is that after a year it still falls into slightly deeper local minima.
This is fine when a human is in the loop. It isn't what you want when you have a thousand idiots each doing a depth first search on what the limit of your credit card is.
> it lacks the meta cognition to ask 'hey, is there a better way to do this?'.
Recently had an AI tell me that this code (which it wrote) is a mess, and it suggested wiping it and starting from scratch with a more structured plan. That seems to hint at some outline of meta cognition.
Haha, it has the human developer traits of thinking all old code is garbage, failing to identify oneself as the dummy who wrote this particular code, and wanting to start from scratch.
Perhaps. I've had LLMs tell me some code is deeply flawed garbage that should be rewritten about code that exact same LLM wrote minutes before. It could be a sign of deep meta cognition, or it might be due to some cognitive gaps where it has no idea why it did something a minute ago and suddenly has a different idea.
This is not a fair criticism. There is _nobody_ there, so you can't be saying 'code the exact same LLM wrote minutes before'. There is no 'exact same LLM' and no ideas for it to have, you're trying to make sense of sparkles off the surface of a pond. There's no 'it' to have an idea and then a different idea, much less deep meta cognition.
I'm not sure we disagree. I was pushing back against the idea that suggesting a rewrite of some code implies meta cognition abilities on the part of the LLM. That seems like weak evidence to me.
I asked Claude to analyze something and report back. It thought for a while, said "Wow, this analysis is great!", and then went back to thinking before delivering the report. They're auto-sycophantic now!
I mean, not always. I've seen Claude step back and reconsider things after hitting a dead end, and go down a different path. There are also workflows, loops that can increase the likelihood of this occurring.
This is a major concern for junior programmers. For many senior ones, after 20 (or even 10) years of tenacious work, they realize that such work will always be there, and they long ago stopped growing on that front (i.e. they had already peaked). For those folks, LLMs are a life saver.
At a company I worked for, lots of senior engineers became managers because they no longer wanted to obsess over whether their algorithm had an off-by-one error. I think fewer will go the management route now.
(There was always the senior tech lead path, but there are far more roles for management than tech lead).
I feel like if you're really spending a ton of time on off by one errors after twenty years in the field you haven't actually grown much and have probably just spent a ton of time in a single space.
Otherwise you'd be in the senior staff to principal range and doing architecture, mentorship, coordinating cross-team work, interviewing, evaluating technical decisions, etc.
I got to code this week a bit and it's been a tremendous joy! I see many peers at similar and lower levels (and higher) who have more years and less technical experience and still write lots of code and I suspect that is more what you're talking about. In that case, it's not so much that you've peaked, it's that there's not much to learn and you're doing a bunch of the same shit over and over and that's of course tiring.
I think it also means that everything you interact with outside your space does feel much harder because of the infrequency with which you have interacted with it.
If you've spent your whole career working the whole stack from interfaces to infrastructure then there's really not going to be much that hits you as unfamiliar after a point. Most frameworks recycle the same concepts and abstractions, same thing with programming languages, algorithms, data management etc.
But if you've spent most of your career in one space cranking tickets, those unknown corners are going to be as numerous as the day you started and be much more taxing.
Aren't you still better off than the rest of us, who found what we love and invested decades in it before it lost its value? Isn't it better to lose your love while you still have time to find a new one?
Depends on if their new love provides as much money as their old one, which is probably not likely. I'd rather have had those decades to stash and invest.
A lot of pre-FAANG engineers don't have the stash you're thinking about. What you meant was "right when I found a lucrative job that I love". What was going on in tech these last 15 years, unfortunately, was probably once in a lifetime.
It's crazy to think that back in the '80s programmers had "mild" salaries despite programming back then being worlds more punishing. No libraries, no Stack Exchange, no forums, no endless memory or infinite compute. If you had a challenging bug, you had better also be proficient in reading schematics and probing circuits.
It has not lost its value yet, but the future will shift that value. All of the past experience you have is an asset for you to move with that shift. The problem will not be you losing value, it will be you not following where the value goes.
It might be a bit more difficult to love where the shift goes, but that is no different from loving being an artist, which often shares a bed with loving being poor. What will make you happier?
Especially on the topic of value! We are all intuitively aware that value is highly contextual, but get in a knot trying to rationalize value long past genuine engagement!
Imagine a senior dev who just approves PRs, approves production releases, and prioritizes bug reports and feature requests. An LLM watches for errors ceaselessly and reports an issue. The senior dev reviews the issue and assigns a severity to it. Another LLM has a backlog of features and errors to go solve; it makes a fix and submits a PR after running tests and verifying things work on its end.
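A rough sketch of what that loop might look like (purely illustrative; every helper here is a made-up stub, not a real agent framework):

```python
# Minimal sketch of the imagined workflow: a monitor reports issues, a human assigns
# severity, and a coding agent works the highest-severity item and opens a PR.
# All the "agent" parts are stand-in stubs for real monitoring/CI/agent calls.
import heapq

backlog = []  # (severity, description); lower number = more urgent

def report_issue(description: str, severity: int):
    """Monitoring LLM reports an issue; the senior dev supplies the severity."""
    heapq.heappush(backlog, (severity, description))

def agent_work_one_item():
    """Coding agent takes the highest-severity item, 'fixes' it, and submits a PR."""
    if not backlog:
        return None
    severity, description = heapq.heappop(backlog)
    patch = f"patch for: {description}"   # stand-in for the agent's actual code change
    tests_pass = True                     # stand-in for running the test suite
    return {"pr": patch, "severity": severity} if tests_pass else None

report_issue("NullPointerException in checkout flow", severity=1)
report_issue("typo in settings page", severity=3)
print(agent_work_one_item())  # the senior dev would still review this PR before merging
```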
Why are we pretending like the need for tenacity will go away? Certain problems are easier now. We can tackle larger problems now that also require tenacity.
Even right at this very moment, when we have a high-tenacity AI, I'd argue that working with the AI (that is, doing AI coding itself and dealing with the novel challenges that brings) requires a lot of stubborn persistence.
Fittingly, Geoffrey Hinton toiled away for years in relative obscurity before finally being recognized for his work. I was always quite impressed by his "tenacity".
So although I don't think he should have won the Nobel Prize (since it's not really physics), I felt his perseverance and hard work merited something.
I still find in these instances there's at least a 50% chance it has taken a shortcut somewhere: created a new, bigger bug in something that just happened not to have a unit test covering it, or broke an "implicit" requirement that was so obvious to any reasonable human that nobody thought to document it. These can be subtle because you're not looking for them, because no human would ever think to do such a thing.
Then even if you do catch it, AI: "ah, now I see exactly the problem. just insert a few more coins and I'll fix it for real this time, I promise!"
The value extortion plan writes itself. How long before someone pitches the idea that the models explicitly almost keep solving your problem to get you to keep spending? Would you even know?
That’s far-fetched. It’s in the interest of the model builders to solve your problem as efficiently as possible token-wise. High value to user + lower compute costs = better pricing power and better margins overall.
What are the details of this? I'm not playing dumb, and of course I've noticed the decline, but I thought it was a combination of losing the battle with SEO shite and leaning further and further into a 'give the user what you think they want, rather than what they actually asked for' philosophy.
As recently as 15 years ago, Google _explicitly_ stated in their employee handbook that they would NOT, as a matter of principle, include ads in the search results. (Source: worked there at that time.)
Now, they do their best to deprioritize and hide non-ad results...
Only if you are paying per token on the API. If you are paying a fixed monthly fee, then they lose money when you need to burn more tokens, and they lose customers when you can’t solve your problems within that month, max out your session limits, and end up with idle time, which you use to check whether the other providers have caught up with or surpassed your current favourite.
> It’s in the interest of the model builders to solve your problem as efficiently as possible token-wise. High value to user + lower compute costs = better pricing power and better margins overall.
It's in the interests of the model builders to do that only if the user can actually tell that the model is giving them the best value per dollar.
Interesting point about model variation. It would be useful to run multiple trials and look at the statistical distribution of results rather than single runs. This could help identify which models are more consistent in their outputs.
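Something like this, as a minimal sketch (score_run is a hypothetical stand-in for whatever evaluation you already use; the random scores are placeholders):

```python
# Run the same prompt N times per model and compare the spread of scores
# instead of judging on a single run.
import statistics
import random

def score_run(model_name: str) -> float:
    # Placeholder: pretend each run yields a quality score in [0, 1].
    return random.random()

def consistency_report(model_name: str, trials: int = 20) -> dict:
    scores = [score_run(model_name) for _ in range(trials)]
    return {
        "model": model_name,
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),  # lower stdev = more consistent outputs
        "min": min(scores),
        "max": max(scores),
    }

for model in ["model-a", "model-b"]:
    print(consistency_report(model))
```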
> Interesting point about model variation. It would be useful to run multiple trials and look at the statistical distribution of results rather than single runs. This could help identify which models are more consistent in their outputs.
That doesn't help in practical usage: all you'd know is their consistency at the time of testing. After all, five minutes after your test is done, a request to the API might be routed to a different model in the background because the limits of the current one were reached.
The free market proposition is that competition (especially with Chinese labs and Grok) means that Anthropic is welcome to do that. They're even welcome to illegally collude with OpenAI such that ChatGPT is similarly gimped. But switching costs are pretty low. If it turns out I can one-shot an issue with Qwen, DeepSeek, or Kimi Thinking, Anthropic loses not just my monthly subscription, but that of everyone else I show it to. So no, I think that's some grade A conspiracy theory nonsense you've got there.
It’s not that crazy. It could even happen by accident in pursuit of another unrelated goal. And if it did, a decent chunk of the tech industry would call it “revealed preference” because usage went up.
LLMs became sycophantic and effusive because those responses were rated higher during RLHF, until it became newsworthy how obviously eager-to-please they got, so yes, being highly factually correct and "intelligent" was already not the only priority.
This is a good point. For example if you have access to a bunch of slot machines, one of them is guaranteed to hit the jackpot. Since switching from one slot machine to another is easy, it is trivial to go from machine to machine until you hit the big bucks. That is why casinos have such large selections of them (for our benefit).
"for our benefit" lol! This is the best description of how we are all interacting with LLMs now. It's not working? Fire up more "agents" ala gas town or whatever
To be clear I don't think that's what they're doing intentionally. Especially on a subscription basis, they'd rather me maximize my value per token, or just not use them. Lulling users into using tokens unproductively is the worst possible option.
The way agents work right now though just sometimes feels that way; they don't have a good way of saying "You're probably going to have to figure this one out yourself".
As a rational consumer, how would you distinguish between some intentional "keep pulling the slot machine" failure rate and the intrinsic failure rate?
I feel like saying "the market will fix the incentives" handwaves away the lack of information on internals. After all, look at the market response to Google making their search less reliable - sure, an invested nerd might try Kagi, but Google's still the market leader by a long shot.
I was thinking more of a deliberate backdoor in code. RCE is an obvious example, but another one could be bias. "I'm sorry ma'am, the computer says you are ineligible for a bank account." These ideas aren't new. They were already there in the '90s, when we still thought about privacy and accountability regarding technology, and dystopian novels described them long, long ago.
> These can be subtle because you're not looking for them
After any agent run, I'm always looking at the git comparison between the new version and the previous one. This helps catch things that you might otherwise not notice.
You are using it wrong, or are using a weak model if your failure rate is over 50%. My experience is nothing like this. It very consistently works for me. Maybe there is a <5% chance it takes the wrong approach, but you can quickly steer it in the right direction.
A lot of people are getting good results using it on hard things. Obviously not perfect, but > 50% success.
That said, more and more people seem to be arriving at the conclusion that if you want a fairly large-sized, complex task in a large existing codebase done right, you'll have better odds with Codex GPT-5.2-Codex-XHigh than with Claude Code Opus 4.5. It's far slower than Opus 4.5 but more likely to get things correct, and complete, in its first turn.
I think a lot of it comes down to how well the user understands the problem, because that determines the quality of instructions and feedback given to the LLM.
For instance, I know some people have had success with getting claude to do game development. I have never bothered to learn much of anything about game development, but have been trying to get claude to do the work for me. Unsuccessful. It works for people who understand the problem domain, but not for those who don't. That's my theory.
It works for hard problems when the person has already solved it and just needs the grunt work done.
It also works for problems that have been solved a thousand times before, which impresses people and makes them think it is actually solving those problems
Which matches what they are. They're first and foremost pattern recognition engines extraordinaire. If they can identify some pattern that's out of whack in your code compared to something in the training data, or a bug that is similar to others that have been fixed in their training set, they can usually thwack those patterns over to your latent space and clean up the residuals. On pattern matching alone, they are significantly superhuman.
"Reasoning", however, is a feature that has been bolted on with a hacksaw and duct tape. Their ability to pattern match makes reasoning seem more powerful than it actually is. If your bug is within some reasonable distance of a pattern it has seen in training, reasoning can get it over the final hump. But if your problem is too far removed from what it has seen in its latent space, it's not likely to figure it out by reasoning alone.
Exactly. I go back to a recent ancestor of LLMs, seq2seq. Its purpose was to translate things. That's all. It needed representation learning and an attention mechanism, and it led to some really freaky emergent capabilities, but it's trained to translate language.
And that's exactly what it's good for. It works great if you have already solved a tough problem and provide it the solution in natural language, because the program is already there; it just needs to translate it to Python.
Anything more than that which might emerge from this is going to be unreliable sleight-of-next-token-prediction at best.
We need a new architectural leap to have these things reason, maybe something that involves reinforcement learning at the token representation level, I don't know. But scaling the context window and training data aren't going to cut it.
It's meant in the literal sense, but with metaphorical hacksaws and duct tape.
Early on, some advanced LLM users noticed they could get better results by forcing insertion of a word like "Wait," or "Hang on," or "Actually," and then running the model for a few more paragraphs. This would increase the chance of a model noticing a mistake it made.
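One concrete way to try this today is response prefill. Below is a rough sketch assuming the Anthropic Python SDK; the model id, prompt, and draft text are placeholders:

```python
# Sketch of the "forced 'Wait,'" trick via prefill: append the model's own draft
# plus "Wait," as the start of the assistant turn and let it continue from there.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

draft_answer = "The bug is in the parser, so I'll patch tokenize()."  # placeholder draft

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model id; substitute whatever you use
    max_tokens=512,
    messages=[
        {"role": "user", "content": "Where is the bug in this stack trace? ..."},
        # Prefill: the model must continue from its own draft followed by "Wait,",
        # which nudges it to re-examine the claim it just made.
        {"role": "assistant", "content": draft_answer + " Wait,"},
    ],
)
print(response.content[0].text)  # continuation that often includes a self-correction
```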
Not the core foundation model. The foundation model still only predicts the next token in a static way. The reasoning is tacked onto the InstructGPT-style fine-tuning step, and it's done through prompt engineering. Which is the shittiest way a model like this could have been done, and it shows.
I mean problems not worth solving, because they've already been solved. If you just need to do the grunt work of retrieving the solution to a trite and worn-out problem from the model's training data, then they work great.
But if you want to do interesting things, like all the shills keep claiming they do, then this won't do it for you. You have to do it for it.
> But if you want to do interesting things, like all the shills keep trying to claim they do
I don't know where this is coming from. I've seen some over-enthusiastic hype for sure, but most of the day-to-day conversations I see aren't people saying they're curing cancer with Claude, they're people saying they're automating their bread and butter tasks with great success.
> But anyway, it already costs half compared to last year
You could not have bought Claude Opus 4.5 at any price one year ago I'm quite certain. The things that were available cost half of what they did then, and there are new things available. These are both true.
I'm agreeing with you, to be clear.
There are two pieces I expect to continue: inference for existing models will continue to get cheaper. Models will continue to get better.
Three things, actually.
The "hitting a wall" / "plateau" people will continue to be loud and wrong. Just as they have been since 2018[0].
As a user of LLMs since GPT-3 there was noticeable stagnation in LLM utility after the release of GPT-4. But it seems the RLHF, tool calling, and UI have all come together in the last 12 months. I used to wonder what fools could be finding them so useful to claim a 10x multiplier - even as a user myself. These days I’m feeling more and more efficiency gains with Claude Code.
That's the thing people are missing: the models plateaued a while ago, still making minor gains to this day, but not huge ones. The difference is that now we've had time to figure out the tooling. I think there's still a ton of ground to cover there, and maybe the models will improve given the extra time, but I think it's foolish to consider the people who predicted this completely wrong. There are also a lot of mathematical concerns that will cause problems in the near and distant future. Infinite progress is far from a given; we're already way behind where all the boosters thought we'd be by now.
I believe Sam Altman, perhaps the greatest grifter in today’s Silicon Valley, claimed that software engineering would be obsolete by the end of last year.
> The "hitting a wall" / "plateau" people will continue to be loud and wrong. Just as they have been since 2018[0].
Everybody who bet against Moore's Law was wrong ... until they weren't.
And AI is the reaction to Moore's Law having broken. Nobody gave one iota of damn about trying to make programming easier until the chips couldn't double in speed anymore.
This is exactly backwards: Dennard scaling stopped. Moore’s Law has continued and it’s what made training and running inference on these models practical at interactive timescales.
You are technically correct. The best kind of correct.
However, most people don't know the difference between the proper Moore's Law scaling (the cost of a transistor halves every 2 years) which is still continuing (sort of) and the colloquial version (the speed of a transistor doubles every 2 years) which got broken when Dennard scaling ran out. To them, Moore's Law just broke.
Nevertheless, you are reinforcing my point. Nobody gave a damn about improving the "programming" side of things until the hardware side stopped speeding up.
And rather than try to apply some human brainpower to fix the "programming" side, they threw a hideous number of those free (except for the electricity--but we don't mention that--LOL) transistors at the wall to create a broken, buggy, unpredictable machine simulacrum of a "programmer".
(Side note: And to be fair, it looks like even the strong form of Moore's Law is finally slowing down, too)
If you can turn a few dollars of electricity per hour into a junior-level programmer who never gets bored, tired, or needs breaks, that fundamentally changes the economics of information technology.
And in fact, the agentic looped LLMs are executing much better than that today. They could stop advancing right now and still be revolutionary.
I don't think it is harmless; otherwise we are incentivising people to just say whatever they want without any care for truth. People's reputations should be attached to their predictions.
Some people definitely do, but how do they go about addressing it? A fresh example, in that it involves pure misinformation: I just screwed up and told some neighbors that garbage collection was delayed for a day because of almost 2 ft of snow. Turns out it was just food waste; I was distracted checking the app and read the notification poorly.
I went back to tell them (I don't know them at all; everyone is just chattier digging out of a storm) and they were not there. I feel terrible, and there's no real viable remedy. I hope they check it themselves and realize I am an idiot. It's even harder on the internet.
That's not true. Many technologies get more expensive over time, as labor gets more expensive or as certain skills fall by the wayside, not everything is mass market. Have you tried getting a grandfather clock repaired lately?
Repairing grandfather clocks isn't more expensive now because it's gotten any harder; it's because the popularity of grandfather clocks is basically nonexistent compared to anything else to tell time.
Of course it's silly to talk about manufacturing methods, yield, and cost efficiency without an economy to embed all of this into, but... "technology got cheaper" means that we have practical knowledge of how to make cheap clocks (given certain supply chains, given certain volume, and so on).
We can make very cheap, very accurate clocks that can be embedded into whatever devices, but it requires the availability of fabs capable of making MEMS components, supplying materials, etc.
You can look at a basket of goods that doesn't include your specific product and compare directly.
But inflation is the general increase in the price level; it can be used as a deflator to express the price of a product in past/future money and see how the price changed in "real" terms (i.e. relative to the change in the general price level).
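A tiny worked example with made-up index numbers:

```python
# Using a price index as a deflator: all numbers here are invented for illustration.
cpi_1995 = 152.4          # hypothetical price index in the base year
cpi_2025 = 320.0          # hypothetical price index today

clock_repair_1995 = 80.0  # nominal price then, in 1995 dollars

# Express the 1995 price in today's money so it can be compared with today's quote.
clock_repair_in_2025_dollars = clock_repair_1995 * (cpi_2025 / cpi_1995)
print(round(clock_repair_in_2025_dollars, 2))  # ~167.98; anything above this is a "real" increase
```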
Instead of advancing tenuous examples, you could suggest a realistic mechanism by which costs could rise, such as a Chinese advance on Taiwan affecting TSMC, etc.
You will get a different bridge. With very different technology. Same as "I can't repair my grandfather clock cheaply".
In general, there are several things that are true for bridges that aren't true for most technology:
* Technology has massively improved, but most people are not realizing that. (E.g. the Bay Bridge cost significantly more than the previous version, but that's because we'd like to not fall down again in the next earthquake)
* We still have little idea how to reason about the cost of bridges in general. (Seriously. It's an active research topic)
* It's a tiny market, with the major vendors forming an oligopoly
* It's infrastructure, not a standard good
* The buy side is almost exclusively governments.
All of these mean expensive goods that are completely non-repeatable. You can't build the same bridge again. And on top of that, in a distorted market.
But sure, the cost of "one bridge, please" has gone up over time.
This seems largely the same as any other technology. The prices of new technologies go down initially as we scale up and optimize their production, but as soon as demand fades, due to newer technology or whatever, the cost of that technology goes up again.
I don't think the question is answerable in a meaningful way. Bridges are one-off projects with long life spans, comparing cost over time requires a lot of squinting just so.
Time-keeping is vastly cheaper. People don't want grandfather clocks. They want to tell time. And they can, more accurately, more easily, and much cheaper than their ancestors.
The chart shows that they’re right though. Newer models cost more than older models.
Sure they’re better but that’s moot if older models are not available or can’t solve the problem they’re tasked with.
On the link you shared, compare 4o vs 3.5-turbo price per 1M tokens.
There’s no such thing as "the same task by the old model"; you might get comparable results or you might not (and this is why the comparison fails: it's not a comparison). The reason you pick the newer models is to increase the chances of getting a good result.
> The dataset for this insight combines data on large language model (LLM) API prices and benchmark scores from Artificial Analysis and Epoch AI. We used this dataset to identify the lowest-priced LLMs that match or exceed a given score on a benchmark. We then fit a log-linear regression model to the prices of these LLMs over time, to measure the rate of decrease in price. We applied the same method to several benchmarks (e.g. MMLU, HumanEval) and performance thresholds (e.g. GPT-3.5 level, GPT-4o level) to determine the variation across performance metrics
This should answer it. In your case, GPT-3.5 is definitely cheaper per token than 4o but much, much less capable. So for the analysis they used a model that is cheaper than GPT-3.5 and achieves better performance.
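A small sketch of that method with invented placeholder prices (not the actual Artificial Analysis / Epoch AI data):

```python
# Take the cheapest model that clears a fixed benchmark threshold at each point in
# time, then fit a log-linear trend to its price to estimate the rate of decrease.
import numpy as np

years = np.array([2023.0, 2023.5, 2024.0, 2024.5, 2025.0])
price_per_mtok = np.array([20.0, 8.0, 3.0, 1.2, 0.5])  # cheapest model above the threshold

slope, intercept = np.polyfit(years, np.log(price_per_mtok), 1)
annual_factor = np.exp(slope)  # multiplicative change in price per year
print(f"price multiplies by ~{annual_factor:.2f}x per year "
      f"(i.e. roughly {1 / annual_factor:.0f}x cheaper each year)")
```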
Not according to their pricing table. Then again I’m not sure what OpenAI model versions even mean anymore, but I would assume 5.2 is in the same family as 5 and 5.2-pro as 5-pro
Not true. Bitcoin has continued to rise in cost since its introduction (as in the aggregate cost incurred to run the network).
LLMs will face their own challenges with respect to reducing costs, since self-attention grows quadratically. These are still early days, so there remains a lot of low hanging fruit in terms of optimizations, but all of that becomes negligible in the face of quadratic attention.
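Back-of-the-envelope on the quadratic part (head count and fp16 precision below are assumptions; optimized kernels like FlashAttention avoid materializing the full matrix, but compute still grows with the square of sequence length):

```python
# Naive self-attention materializes an (n x n) score matrix per head,
# so memory and compute for that step grow quadratically in sequence length n.
def attention_matrix_bytes(seq_len: int, n_heads: int = 32, bytes_per_value: int = 2) -> int:
    # One (seq_len x seq_len) score matrix per head, fp16 values assumed.
    return seq_len * seq_len * n_heads * bytes_per_value

for n in (8_000, 32_000, 128_000):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"{n:>7} tokens -> ~{gib:,.0f} GiB of attention scores per layer")
```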
I don't think computation is going to become more expensive, but there are techs that have become so: Nuclear power plants. Mobile phones. Oil extraction.
(Oil rampdown is a survival imperative due to the climate catastrophe so there it's a very positive thing of course, though not sufficient...)
There are plenty of technologies that have not become cheaper, or at least not cheap enough, to go big and change the world. You probably haven't heard of them because obviously they didn't succeed.
Supersonic jet engines, rockets to the moon, nuclear power plants, etc. etc. all have become more expensive. Superconductors were discovered in 1911, and we have been making them for as long as we have been making transistors in the 1950s, yet superconductors show no sign of becoming cheaper any time soon.
There have been plenty of technologies in history which do not in fact become cheaper. LLMs are very likely to become such, as I suspect their usefulness will be superseded by cheaper (much cheaper in fact) specialized models.
With optimizations and new hardware, power is almost a negligible cost. You can get 5.5M tokens/s/MW[1] for Kimi K2 (= 20M tokens/kWh = 181M tokens/$), which is 400x cheaper than current pricing. It's just Nvidia/TSMC/other manufacturers eating up the profit now because they can. My bet is that China will match current Nvidia within 5 years.
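Checking that arithmetic, with the electricity price as an assumed ~$0.11/kWh:

```python
# Convert the quoted throughput-per-megawatt figure into tokens per kWh and per dollar.
tokens_per_s_per_mw = 5.5e6
tokens_per_kwh = tokens_per_s_per_mw * 3600 / 1000  # 1 MW for 1 hour = 1000 kWh
price_per_kwh = 0.11                                 # assumed electricity price, $/kWh
tokens_per_dollar = tokens_per_kwh / price_per_kwh

print(f"{tokens_per_kwh / 1e6:.0f}M tokens/kWh, "
      f"{tokens_per_dollar / 1e6:.0f}M tokens per $ of electricity")
# ~20M tokens/kWh and ~180M tokens/$, matching the figures quoted above.
```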
Electricity is negligible but the dominant cost is the hardware depreciation itself. Also inference is typically memory bandwidth bound so you are limited by how fast you can move weights rather than raw compute efficiency.
Yes, because the margin is like 80% for Nvidia, and 80% again for manufacturers like Samsung and TSMC. Once fixed costs like R&D are amortized, the same node technology and hardware capacity could cost just a few single-digit percent of what it does now.
> And you most likely do not pay the actual costs.
This is one of the weakest anti AI postures. "It's a bubble and when free VC money stops you'll be left with nothing". Like it's some kind of mystery how expensive these models are to run.
You have open weight models right now like Kimi K2.5 and GLM 4.7. These are very strong models, only months behind the top labs. And they are not very expensive to run at scale. You can do the math. In fact there are third parties serving these models for profit.
The money pit is training these models (and not even that much if you are efficient, like the Chinese labs). Once they are trained, they are served with large profit margins compared to the inference cost.
OpenAI and Anthropic are without a doubt selling their API for a lot more than the cost of running the model.
> You send all the necessary data, including information that you would never otherwise share.
I've never sent the type of data that isn't already either stored by GitHub or a cloud provider, so no difference there.
> And you most likely do not pay the actual costs.
So? Even if costs double once investor subsidies stop, that doesn't change much of anything. And the entire history of computing is that things tend to get cheaper.
> You and your business become dependent on this major gatekeeper.
Not really. Switching between Claude and Gemini or whatever new competition shows up is pretty easy. I'm no more dependent on it than I am on any of another hundred business services or providers that similarly mostly also have competitors.
To me this tenacity is often like watching someone trying to get a screw into board using a hammer.
There’s often a better faster way to do it, and while it might get to the short term goal eventually, it’s often created some long term problems along the way.
I don’t understand this POV. Unfortunately, I'd pay $10k/mo for my CC sub. I wish I could invest in Anthropic; they're going to be the most profitable company on earth.
My agent struggled for 45 minutes because it tried to do `go run` on a _test.go file, which the compiler repeatedly rejected, exiting with an error message that files named like this cannot be executed using the run command.
So yeah, that wasted a lot of GPU cycles for a very unimpressive result, but with a renewed superficial feeling of competence.
We can observe how much generic inference providers like deepinfra or together-ai charge for large SOTA models. Since they are not subsidized and they don’t charge 7x what OpenAI does, that means OAI also doesn’t have outrageously high per-token costs.
OAI is running boundary pushing large models. I don’t think those “second tier” applications can even get the GPUs with the HBM required at any reasonable scale for customer use.
Not to mention training costs of foundation models
AI geniuses discover brute forcing... what a time to be alive. /s
Like... bro, that's THE foundation of CS. That's the principle behind the Bombe in Turing's time. One can still marvel at it, but it's been with us since the beginning.
In the case of those big 'foundation models': Fine-tune for whom and how? I doubt it is possible to fine-tune things like this in a way that satisfies all audiences and training set instances. Much of this is probably due to the training set itself containing a lot of propaganda (advertising) or just bad style.