What is the purpose of an AGENTS.md file when there are so many different models? Which model or version of the model is the file written for? So much depends on assumptions here. It only makes sense when you know exactly which model you are writing for. No wonder the impact is 'all over the place'.
I can follow the arguments, and I find many of them plausible. But LLMs are still unreliable and require attention and verification. Ultimately, it's an economic question: the cost of training the model and the computing power required to produce accurate results.
The strongest argument is the one about the interface. LLMs will definitely have a large impact. But under the hood, I still expect to see a lot of formally verified code, written by engineers with domain knowledge, with support from AI.
I am curious to know what he has in mind. This 'process engineering' could be a solution to problems that BPM and COBOL are trying to solve. He might end up with another formalized layer of indirection (with rules and constraints for everyone to learn) that integrates better with LLM interactions (which are also evolving rapidly).
I like the idea that 'code is truth' (as opposed to 'correct'). An AI should be able to use this truth and mutate it according to a specification. If the output of an LLM is incorrect, it is unclear whether the specification is incorrect or if the model itself is incapable (training issue, biases). This is something that 'process engineering' simply cannot solve.
I'm also curious about what a process engineering abstraction layer looks like. Though the final section does hint at it: tighter integration of more stakeholders, closer to the construction of the code.
Though I have to push back on the idea of "code as truth". Thinking about all the layers of abstraction and indirection... haven't the data and the database layer typically been the source of truth?
Maybe I'm missing something in this iteration of the industry where code becomes something other than what it's always been: an intermediary between business and data.
Yes, the database layer and the data itself are also sources of truth. Code (including code run inside the database, such as SQL, triggers, stored procedures and other native modules) defines behaviour. The data influences behaviour. This is why we can only test code with data that is as close to reality as possible, or even production data.
> It's so interesting to watch an agent relentlessly work at something. They never get tired, they never get demoralized, they just keep going and trying things where a person would have given up long ago to fight another day. It's a "feel the AGI" moment to watch it struggle with something for a long time just to come out victorious 30 minutes later.
Somewhere, there are GPUs/NPUs running hot. You send all the necessary data, including information that you would never otherwise share. And you most likely do not pay the actual costs. It might become cheaper or it might not, because reasoning is a sticking plaster on the accuracy problem. You and your business become dependent on this major gatekeeper. It may seem like a good trade-off today. However, the personal, professional, political and societal issues will become increasingly difficult to overlook.
This quote stuck out to me as well, for a slightly different reason.
The “tenacity” referenced here has been, in my opinion, the key ingredient in the secret sauce of a successful career in tech, at least in these past 20 years. Every industry job has its intricacies, but for every engineer who earned their pay with novel work on a new protocol, framework, or paradigm, there were 10 or more providing value by putting the myriad pieces together, muddling through the ever-waxing complexity, and crucially never saying die.
We all saw others weeded out along the way for lacking the tenacity. Think the boot camp dropouts or undergrads who changed majors when first grappling with recursion (or emacs). The sole trait of stubbornness to “keep going” outweighs analytical ability, leetcode prowess, soft skills like corporate political tact, and everything else.
I can’t tell what this means for the job market. Tenacity may not be enough on its own. But it’s the most valuable quality in an employee in my mind, and Claude has it.
There is an old saying back home: an idiot never tires, only sweats.
Claude isn't tenacious. It is an idiot that never stops digging because it lacks the meta cognition to ask 'hey, is there a better way to do this?'. Chain of thought's whole raison d'être was so the model could get out of the local minima it pushed itself into. The issue is that after a year it still falls into slightly deeper local minima.
This is fine when a human is in the loop. It isn't what you want when you have a thousand idiots each doing a depth first search on what the limit of your credit card is.
> it lacks the meta cognition to ask 'hey, is there a better way to do this?'.
Recently had an AI tell me that this code (which it wrote) is a mess, and it suggested wiping it and starting from scratch with a more structured plan. That seems to hint at some outline of meta cognition.
Haha, it has the human developer traits of thinking all old code is garbage, failing to identify oneself as the dummy who wrote this particular code, and wanting to start from scratch.
Perhaps. I've had LLMs tell me some code is deeply flawed garbage that should be rewritten about code that exact same LLM wrote minutes before. It could be a sign of deep meta cognition, or it might be due to some cognitive gaps where it has no idea why it did something a minute ago and suddenly has a different idea.
This is not a fair criticism. There is _nobody_ there, so you can't be saying 'code the exact same LLM wrote minutes before'. There is no 'exact same LLM' and no ideas for it to have, you're trying to make sense of sparkles off the surface of a pond. There's no 'it' to have an idea and then a different idea, much less deep meta cognition.
I'm not sure we disagree. I was pushing back against the idea that suggesting a rewrite of some code implies meta cognition abilities on the part of the LLM. That seems like weak evidence to me.
I asked Claude to analyze something and report back. It thought for a while, said "Wow, this analysis is great!", and then went back to thinking before delivering the report. They're auto-sycophantic now!
I mean, not always. I've seen Claude step back and reconsider things after hitting a dead end, and go down a different path. There are also workflows, loops that can increase the likelihood of this occurring.
This is a major concern for junior programmers. For many senior ones, after 20 (or even 10) years of tenacious work, they realize that such work will always be there, and they long ago stopped growing on that front (i.e. they had already peaked). For those folks, LLMs are a life saver.
At a company I worked for, lots of senior engineers became managers because they no longer wanted to obsess over whether their algorithm had an off-by-one error. I think fewer will go the management route now.
(There was always the senior tech lead path, but there are far more roles for management than tech lead).
I feel like if you're really spending a ton of time on off by one errors after twenty years in the field you haven't actually grown much and have probably just spent a ton of time in a single space.
Otherwise you'd be in the senior staff to principal range and doing architecture, mentorship, coordinating cross-team work, interviewing, evaluating technical decisions, etc.
I got to code this week a bit and it's been a tremendous joy! I see many peers at similar and lower levels (and higher) who have more years and less technical experience and still write lots of code and I suspect that is more what you're talking about. In that case, it's not so much that you've peaked, it's that there's not much to learn and you're doing a bunch of the same shit over and over and that's of course tiring.
I think it also means that everything you interact with outside your space does feel much harder because of the infrequency with which you have interacted with it.
If you've spent your whole career working the whole stack from interfaces to infrastructure then there's really not going to be much that hits you as unfamiliar after a point. Most frameworks recycle the same concepts and abstractions, same thing with programming languages, algorithms, data management etc.
But if you've spent most of your career in one space cranking tickets, those unknown corners are going to be as numerous as the day you started and be much more taxing.
Aren't you still better off than the rest of us, who found what we love and invested decades in it before it lost its value? Isn't it better to lose your love while you still have time to find a new one?
Depends on if their new love provides as much money as their old one, which is probably not likely. I'd rather have had those decades to stash and invest.
A lot of pre-FAANG engineers don't have the stash you're thinking about. What you meant was "right when I found a lucrative job that I love". What was going on in tech these last 15 years, unfortunately, was probably once in a lifetime.
It's crazy to think that back in the '80s programmers had "mild" salaries despite programming back then being worlds more punishing. No libraries, no Stack Exchange, no forums, no endless memory or infinite compute. If you had a challenging bug, you had better also be proficient in reading schematics and probing circuits.
It has not lost its value yet, but the future will shift that value. All of the past experience you have is an asset for you to move with that shift. The problem will not be you losing value, it will be you not following where the value goes.
It might be a bit more difficult to love where the shift goes, but that is no different from loving being an artist, which often shares a bed with loving being poor. What will make you happier?
Especially on the topic of value! We are all intuitively aware that value is highly contextual, but get in a knot trying to rationalize value long past genuine engagement!
Imagine a senior dev who just approves PRs, approves production releases, and prioritizes bug reports and feature requests. An LLM watches for errors ceaselessly and reports an issue. The senior dev reviews the issue and assigns a severity to it. Another LLM has a backlog of features and errors to go solve; it makes a fix and submits a PR after running tests and verifying things work on its end.
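A rough sketch of what that loop might look like (purely illustrative; every helper here is a made-up stub, not a real agent framework):

```python
# Minimal sketch of the imagined workflow: a monitor reports issues, a human assigns
# severity, and a coding agent works the highest-severity item and opens a PR.
# All the "agent" parts are stand-in stubs for real monitoring/CI/agent calls.
import heapq

backlog = []  # (severity, description); lower number = more urgent

def report_issue(description: str, severity: int):
    """Monitoring LLM reports an issue; the senior dev supplies the severity."""
    heapq.heappush(backlog, (severity, description))

def agent_work_one_item():
    """Coding agent takes the highest-severity item, 'fixes' it, and submits a PR."""
    if not backlog:
        return None
    severity, description = heapq.heappop(backlog)
    patch = f"patch for: {description}"   # stand-in for the agent's actual code change
    tests_pass = True                     # stand-in for running the test suite
    return {"pr": patch, "severity": severity} if tests_pass else None

report_issue("NullPointerException in checkout flow", severity=1)
report_issue("typo in settings page", severity=3)
print(agent_work_one_item())  # the senior dev would still review this PR before merging
```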
Why are we pretending like the need for tenacity will go away? Certain problems are easier now. We can tackle larger problems now that also require tenacity.
Even right at this very moment, when we have a high-tenacity AI, I'd argue that working with the AI (that is, doing AI coding itself and dealing with the novel challenges that brings) requires a lot of stubborn persistence.
Fittingly, Geoffrey Hinton toiled away for years in relative obscurity before finally being recognized for his work. I was always quite impressed by his "tenacity".
So although I don't think he should have won the Nobel Prize (since it's not really physics), I felt his perseverance and hard work merited something.
I still find in these instances there's at least a 50% chance it has taken a shortcut somewhere: created a new, bigger bug in something that just happened not to have a unit test covering it, or broke an "implicit" requirement that was so obvious to any reasonable human that nobody thought to document it. These can be subtle because you're not looking for them, because no human would ever think to do such a thing.
Then even if you do catch it, AI: "ah, now I see exactly the problem. just insert a few more coins and I'll fix it for real this time, I promise!"
The value extortion plan writes itself. How long before someone pitches the idea that the models explicitly almost keep solving your problem to get you to keep spending? Would you even know?
That’s far-fetched. It’s in the interest of the model builders to solve your problem as efficiently as possible token-wise. High value to user + lower compute costs = better pricing power and better margins overall.
What are the details of this? I'm not playing dumb, and of course I've noticed the decline, but I thought it was a combination of losing the battle with SEO shite and leaning further and further into a 'give the user what you think they want, rather than what they actually asked for' philosophy.
As recently as 15 years ago, Google _explicitly_ stated in their employee handbook that they would NOT, as a matter of principle, include ads in the search results. (Source: worked there at that time.)
Now, they do their best to deprioritize and hide non-ad results...
Only if you are paying per token on the API. If you are paying a fixed monthly fee, then they lose money when you need to burn more tokens, and they lose customers when you can’t solve your problems within that month, max out your session limits, and end up with idle time, which you use to check whether the other providers have caught up with or surpassed your current favourite.
> It’s in the interest of the model builders to solve your problem as efficiently as possible token-wise. High value to user + lower compute costs = better pricing power and better margins overall.
It's in the interests of the model builders to do that only if the user can actually tell that the model is giving them the best value per dollar.
Interesting point about model variation. It would be useful to run multiple trials and look at the statistical distribution of results rather than single runs. This could help identify which models are more consistent in their outputs.
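Something like this, as a minimal sketch (score_run is a hypothetical stand-in for whatever evaluation you already use; the random scores are placeholders):

```python
# Run the same prompt N times per model and compare the spread of scores
# instead of judging on a single run.
import statistics
import random

def score_run(model_name: str) -> float:
    # Placeholder: pretend each run yields a quality score in [0, 1].
    return random.random()

def consistency_report(model_name: str, trials: int = 20) -> dict:
    scores = [score_run(model_name) for _ in range(trials)]
    return {
        "model": model_name,
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),  # lower stdev = more consistent outputs
        "min": min(scores),
        "max": max(scores),
    }

for model in ["model-a", "model-b"]:
    print(consistency_report(model))
```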
> Interesting point about model variation. It would be useful to run multiple trials and look at the statistical distribution of results rather than single runs. This could help identify which models are more consistent in their outputs.
That doesn't help in practical usage: all you'd know is their consistency at the time of testing. After all, five minutes after your test is done, a request to the API might be routed to a different model in the background because the limits of the current one were reached.
The free market proposition is that competition (especially with Chinese labs and Grok) means that Anthropic is welcome to do that. They're even welcome to illegally collude with OpenAI such that ChatGPT is similarly gimped. But switching costs are pretty low. If it turns out I can one-shot an issue with Qwen, DeepSeek, or Kimi Thinking, Anthropic loses not just my monthly subscription, but that of everyone else I show it to. So no, I think that's some grade A conspiracy theory nonsense you've got there.
It’s not that crazy. It could even happen by accident in pursuit of another unrelated goal. And if it did, a decent chunk of the tech industry would call it “revealed preference” because usage went up.
LLMs became sycophantic and effusive because those responses were rated higher during RLHF, until it became newsworthy how obviously eager-to-please they got, so yes, being highly factually correct and "intelligent" was already not the only priority.
This is a good point. For example if you have access to a bunch of slot machines, one of them is guaranteed to hit the jackpot. Since switching from one slot machine to another is easy, it is trivial to go from machine to machine until you hit the big bucks. That is why casinos have such large selections of them (for our benefit).
"for our benefit" lol! This is the best description of how we are all interacting with LLMs now. It's not working? Fire up more "agents" ala gas town or whatever
To be clear I don't think that's what they're doing intentionally. Especially on a subscription basis, they'd rather me maximize my value per token, or just not use them. Lulling users into using tokens unproductively is the worst possible option.
The way agents work right now though just sometimes feels that way; they don't have a good way of saying "You're probably going to have to figure this one out yourself".
As a rational consumer, how would you distinguish between some intentional "keep pulling the slot machine" failure rate and the intrinsic failure rate?
I feel like saying "the market will fix the incentives" handwaves away the lack of information on internals. After all, look at the market response to Google making their search less reliable - sure, an invested nerd might try Kagi, but Google's still the market leader by a long shot.
I was thinking more of a deliberate backdoor in code. RCE is an obvious example, but another one could be bias. "I'm sorry ma'am, the computer says you are ineligible for a bank account." These ideas aren't new. They were already there in the '90s, when we still thought about privacy and accountability regarding technology, and dystopian novels described them long, long ago.
> These can be subtle because you're not looking for them
After any agent run, I'm always looking at the git comparison between the new version and the previous one. This helps catch things that you might otherwise not notice.
You are using it wrong, or are using a weak model if your failure rate is over 50%. My experience is nothing like this. It very consistently works for me. Maybe there is a <5% chance it takes the wrong approach, but you can quickly steer it in the right direction.
A lot of people are getting good results using it on hard things. Obviously not perfect, but > 50% success.
That said, more and more people seem to be arriving at the conclusion that if you want a fairly large-sized, complex task in a large existing codebase done right, you'll have better odds with Codex GPT-5.2-Codex-XHigh than with Claude Code Opus 4.5. It's far slower than Opus 4.5 but more likely to get things correct, and complete, in its first turn.
I think a lot of it comes down to how well the user understands the problem, because that determines the quality of instructions and feedback given to the LLM.
For instance, I know some people have had success with getting claude to do game development. I have never bothered to learn much of anything about game development, but have been trying to get claude to do the work for me. Unsuccessful. It works for people who understand the problem domain, but not for those who don't. That's my theory.
It works for hard problems when the person has already solved it and just needs the grunt work done.
It also works for problems that have been solved a thousand times before, which impresses people and makes them think it is actually solving those problems
Which matches what they are. They're first and foremost pattern recognition engines extraordinaire. If they can identify some pattern that's out of whack in your code compared to something in the training data, or a bug that is similar to others that have been fixed in their training set, they can usually thwack those patterns over to your latent space and clean up the residuals. On pattern matching alone, they are significantly superhuman.
"Reasoning", however, is a feature that has been bolted on with a hacksaw and duct tape. Their ability to pattern match makes reasoning seem more powerful than it actually is. If your bug is within some reasonable distance of a pattern it has seen in training, reasoning can get it over the final hump. But if your problem is too far removed from what it has seen in its latent space, it's not likely to figure it out by reasoning alone.
Exactly. I go back to a recent ancestor of LLMs, seq2seq. Its purpose was to translate things. That's all. It needed representation learning and an attention mechanism, and it led to some really freaky emergent capabilities, but it's trained to translate language.
And that's exactly what it's good for. It works great if you have already solved a tough problem and provide it the solution in natural language, because the program is already there; it just needs to translate it to Python.
Anything more than that which might emerge from this is going to be unreliable sleight-of-next-token-prediction at best.
We need a new architectural leap to have these things reason, maybe something that involves reinforcement learning at the token representation level, I don't know. But scaling the context window and training data aren't going to cut it.
It's meant in the literal sense, but with metaphorical hacksaws and duct tape.
Early on, some advanced LLM users noticed they could get better results by forcing insertion of a word like "Wait," or "Hang on," or "Actually," and then running the model for a few more paragraphs. This would increase the chance of a model noticing a mistake it made.
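One concrete way to try this today is response prefill. Below is a rough sketch assuming the Anthropic Python SDK; the model id, prompt, and draft text are placeholders:

```python
# Sketch of the "forced 'Wait,'" trick via prefill: append the model's own draft
# plus "Wait," as the start of the assistant turn and let it continue from there.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

draft_answer = "The bug is in the parser, so I'll patch tokenize()."  # placeholder draft

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model id; substitute whatever you use
    max_tokens=512,
    messages=[
        {"role": "user", "content": "Where is the bug in this stack trace? ..."},
        # Prefill: the model must continue from its own draft followed by "Wait,",
        # which nudges it to re-examine the claim it just made.
        {"role": "assistant", "content": draft_answer + " Wait,"},
    ],
)
print(response.content[0].text)  # continuation that often includes a self-correction
```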
Not the core foundation model. The foundation model still only predicts the next token in a static way. The reasoning is tacked onto the InstructGPT-style fine-tuning step, and it's done through prompt engineering. Which is the shittiest way a model like this could have been done, and it shows.
I mean problems not worth solving, because they've already been solved. If you just need to do the grunt work of retrieving the solution to a trite and worn-out problem from the model's training data, then they work great.
But if you want to do interesting things, like all the shills keep claiming they do, then this won't do it for you. You have to do it for it.
> But if you want to do interesting things, like all the shills keep trying to claim they do
I don't know where this is coming from. I've seen some over-enthusiastic hype for sure, but most of the day-to-day conversations I see aren't people saying they're curing cancer with Claude, they're people saying they're automating their bread and butter tasks with great success.
> But anyway, it already costs half compared to last year
You could not have bought Claude Opus 4.5 at any price one year ago I'm quite certain. The things that were available cost half of what they did then, and there are new things available. These are both true.
I'm agreeing with you, to be clear.
There are two pieces I expect to continue: inference for existing models will continue to get cheaper. Models will continue to get better.
Three things, actually.
The "hitting a wall" / "plateau" people will continue to be loud and wrong. Just as they have been since 2018[0].
As a user of LLMs since GPT-3 there was noticeable stagnation in LLM utility after the release of GPT-4. But it seems the RLHF, tool calling, and UI have all come together in the last 12 months. I used to wonder what fools could be finding them so useful to claim a 10x multiplier - even as a user myself. These days I’m feeling more and more efficiency gains with Claude Code.
That's the thing people are missing: the models plateaued a while ago, still making minor gains to this day, but not huge ones. The difference is that now we've had time to figure out the tooling. I think there's still a ton of ground to cover there, and maybe the models will improve given the extra time, but I think it's foolish to consider the people who predicted this completely wrong. There are also a lot of mathematical concerns that will cause problems in the near and distant future. Infinite progress is far from a given; we're already way behind where all the boosters thought we'd be by now.
I believe Sam Altman, perhaps the greatest grifter in today’s Silicon Valley, claimed that software engineering would be obsolete by the end of last year.
> The "hitting a wall" / "plateau" people will continue to be loud and wrong. Just as they have been since 2018[0].
Everybody who bet against Moore's Law was wrong ... until they weren't.
And AI is the reaction to Moore's Law having broken. Nobody gave one iota of damn about trying to make programming easier until the chips couldn't double in speed anymore.
This is exactly backwards: Dennard scaling stopped. Moore’s Law has continued and it’s what made training and running inference on these models practical at interactive timescales.
You are technically correct. The best kind of correct.
However, most people don't know the difference between the proper Moore's Law scaling (the cost of a transistor halves every 2 years) which is still continuing (sort of) and the colloquial version (the speed of a transistor doubles every 2 years) which got broken when Dennard scaling ran out. To them, Moore's Law just broke.
Nevertheless, you are reinforcing my point. Nobody gave a damn about improving the "programming" side of things until the hardware side stopped speeding up.
And rather than try to apply some human brainpower to fix the "programming" side, they threw a hideous number of those free (except for the electricity--but we don't mention that--LOL) transistors at the wall to create a broken, buggy, unpredictable machine simulacrum of a "programmer".
(Side note: And to be fair, it looks like even the strong form of Moore's Law is finally slowing down, too)
If you can turn a few dollars of electricity per hour into a junior-level programmer who never gets bored, tired, or needs breaks, that fundamentally changes the economics of information technology.
And in fact, the agentic looped LLMs are executing much better than that today. They could stop advancing right now and still be revolutionary.
I don't think it is harmless; otherwise we are incentivising people to just say whatever they want without any care for truth. People's reputations should be attached to their predictions.
Some people definitely do, but how do they go about addressing it? A fresh example, in that it involves pure misinformation: I just screwed up and told some neighbors that garbage collection was delayed for a day because of almost 2 ft of snow. Turns out it was just food waste; I was distracted checking the app and read the notification poorly.
I went back to tell them (I don't know them at all; everyone is just chattier digging out of a storm) and they were not there. I feel terrible, and there's no real viable remedy. I hope they check it themselves and realize I am an idiot. It's even harder on the internet.
That's not true. Many technologies get more expensive over time, as labor gets more expensive or as certain skills fall by the wayside, not everything is mass market. Have you tried getting a grandfather clock repaired lately?
Repairing grandfather clocks isn't more expensive now because it's gotten any harder; it's because the popularity of grandfather clocks is basically nonexistent compared to anything else to tell time.
Of course it's silly to talk about manufacturing methods, yield, and cost efficiency without an economy to embed all of this into, but... "technology got cheaper" means that we have practical knowledge of how to make cheap clocks (given certain supply chains, given certain volume, and so on).
We can make very cheap, very accurate clocks that can be embedded into whatever devices, but it requires the availability of fabs capable of making MEMS components, supplying materials, etc.
You can look at a basket of goods that doesn't include your specific product and compare directly.
But inflation is the general increase in the price level; it can be used as a deflator to express the price of a product in past/future money and see how the price changed in "real" terms (i.e. relative to the change in the general price level).
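A tiny worked example with made-up index numbers:

```python
# Using a price index as a deflator: all numbers here are invented for illustration.
cpi_1995 = 152.4          # hypothetical price index in the base year
cpi_2025 = 320.0          # hypothetical price index today

clock_repair_1995 = 80.0  # nominal price then, in 1995 dollars

# Express the 1995 price in today's money so it can be compared with today's quote.
clock_repair_in_2025_dollars = clock_repair_1995 * (cpi_2025 / cpi_1995)
print(round(clock_repair_in_2025_dollars, 2))  # ~167.98; anything above this is a "real" increase
```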
Instead of advancing tenuous examples, you could suggest a realistic mechanism by which costs could rise, such as a Chinese advance on Taiwan affecting TSMC, etc.
You will get a different bridge. With very different technology. Same as "I can't repair my grandfather clock cheaply".
In general, there are several things that are true for bridges that aren't true for most technology:
* Technology has massively improved, but most people are not realizing that. (E.g. the Bay Bridge cost significantly more than the previous version, but that's because we'd like to not fall down again in the next earthquake)
* We still have little idea how to reason about the cost of bridges in general. (Seriously. It's an active research topic)
* It's a tiny market, with the major vendors forming an oligopoly
* It's infrastructure, not a standard good
* The buy side is almost exclusively governments.
All of these mean expensive goods that are completely non-repeatable. You can't build the same bridge again. And on top of that, in a distorted market.
But sure, the cost of "one bridge, please" has gone up over time.
This seems largely the same as any other technology. The prices of new technologies go down initially as we scale up and optimize their production, but as soon as demand fades, due to newer technology or whatever, the cost of that technology goes up again.
I don't think the question is answerable in a meaningful way. Bridges are one-off projects with long life spans, comparing cost over time requires a lot of squinting just so.
Time-keeping is vastly cheaper. People don't want grandfather clocks. They want to tell time. And they can, more accurately, more easily, and much cheaper than their ancestors.
The chart shows that they’re right though. Newer models cost more than older models.
Sure they’re better but that’s moot if older models are not available or can’t solve the problem they’re tasked with.
On the link you shared, compare 4o vs 3.5-turbo price per 1M tokens.
There’s no such thing as "the same task by the old model"; you might get comparable results or you might not (and this is why the comparison fails: it's not a comparison). The reason you pick the newer models is to increase the chances of getting a good result.
> The dataset for this insight combines data on large language model (LLM) API prices and benchmark scores from Artificial Analysis and Epoch AI. We used this dataset to identify the lowest-priced LLMs that match or exceed a given score on a benchmark. We then fit a log-linear regression model to the prices of these LLMs over time, to measure the rate of decrease in price. We applied the same method to several benchmarks (e.g. MMLU, HumanEval) and performance thresholds (e.g. GPT-3.5 level, GPT-4o level) to determine the variation across performance metrics
This should answer it. In your case, GPT-3.5 is definitely cheaper per token than 4o but much, much less capable. So for the analysis they used a model that is cheaper than GPT-3.5 and achieves better performance.
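A small sketch of that method with invented placeholder prices (not the actual Artificial Analysis / Epoch AI data):

```python
# Take the cheapest model that clears a fixed benchmark threshold at each point in
# time, then fit a log-linear trend to its price to estimate the rate of decrease.
import numpy as np

years = np.array([2023.0, 2023.5, 2024.0, 2024.5, 2025.0])
price_per_mtok = np.array([20.0, 8.0, 3.0, 1.2, 0.5])  # cheapest model above the threshold

slope, intercept = np.polyfit(years, np.log(price_per_mtok), 1)
annual_factor = np.exp(slope)  # multiplicative change in price per year
print(f"price multiplies by ~{annual_factor:.2f}x per year "
      f"(i.e. roughly {1 / annual_factor:.0f}x cheaper each year)")
```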
Not according to their pricing table. Then again I’m not sure what OpenAI model versions even mean anymore, but I would assume 5.2 is in the same family as 5 and 5.2-pro as 5-pro
Not true. Bitcoin has continued to rise in cost since its introduction (as in the aggregate cost incurred to run the network).
LLMs will face their own challenges with respect to reducing costs, since self-attention grows quadratically. These are still early days, so there remains a lot of low hanging fruit in terms of optimizations, but all of that becomes negligible in the face of quadratic attention.
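Back-of-the-envelope on the quadratic part (head count and fp16 precision below are assumptions; optimized kernels like FlashAttention avoid materializing the full matrix, but compute still grows with the square of sequence length):

```python
# Naive self-attention materializes an (n x n) score matrix per head,
# so memory and compute for that step grow quadratically in sequence length n.
def attention_matrix_bytes(seq_len: int, n_heads: int = 32, bytes_per_value: int = 2) -> int:
    # One (seq_len x seq_len) score matrix per head, fp16 values assumed.
    return seq_len * seq_len * n_heads * bytes_per_value

for n in (8_000, 32_000, 128_000):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"{n:>7} tokens -> ~{gib:,.0f} GiB of attention scores per layer")
```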
I don't think computation is going to become more expensive, but there are techs that have become so: Nuclear power plants. Mobile phones. Oil extraction.
(Oil rampdown is a survival imperative due to the climate catastrophe so there it's a very positive thing of course, though not sufficient...)
There are plenty of technologies that have not become cheaper, or at least not cheap enough, to go big and change the world. You probably haven't heard of them because obviously they didn't succeed.
Supersonic jet engines, rockets to the moon, nuclear power plants, etc. etc. all have become more expensive. Superconductors were discovered in 1911, and we have been making them for as long as we have been making transistors in the 1950s, yet superconductors show no sign of becoming cheaper any time soon.
There have been plenty of technologies in history which do not in fact become cheaper. LLMs are very likely to become such, as I suspect their usefulness will be superseded by cheaper (much cheaper in fact) specialized models.
With optimizations and new hardware, power is almost a negligible cost. You can get 5.5M tokens/s/MW[1] for Kimi K2 (= 20M tokens/kWh = 181M tokens/$), which is 400x cheaper than current pricing. It's just Nvidia/TSMC/other manufacturers eating up the profit now because they can. My bet is that China will match current Nvidia within 5 years.
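Checking that arithmetic, with the electricity price as an assumed ~$0.11/kWh:

```python
# Convert the quoted throughput-per-megawatt figure into tokens per kWh and per dollar.
tokens_per_s_per_mw = 5.5e6
tokens_per_kwh = tokens_per_s_per_mw * 3600 / 1000  # 1 MW for 1 hour = 1000 kWh
price_per_kwh = 0.11                                 # assumed electricity price, $/kWh
tokens_per_dollar = tokens_per_kwh / price_per_kwh

print(f"{tokens_per_kwh / 1e6:.0f}M tokens/kWh, "
      f"{tokens_per_dollar / 1e6:.0f}M tokens per $ of electricity")
# ~20M tokens/kWh and ~180M tokens/$, matching the figures quoted above.
```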
Electricity is negligible but the dominant cost is the hardware depreciation itself. Also inference is typically memory bandwidth bound so you are limited by how fast you can move weights rather than raw compute efficiency.
Yes, because the margin is like 80% for Nvidia, and 80% again for manufacturers like Samsung and TSMC. Once fixed costs like R&D are amortized, the same node technology and hardware capacity could cost just a few single-digit percent of what it does now.
> And you most likely do not pay the actual costs.
This is one of the weakest anti AI postures. "It's a bubble and when free VC money stops you'll be left with nothing". Like it's some kind of mystery how expensive these models are to run.
You have open weight models right now like Kimi K2.5 and GLM 4.7. These are very strong models, only months behind the top labs. And they are not very expensive to run at scale. You can do the math. In fact there are third parties serving these models for profit.
The money pit is training these models (and not even that much if you are efficient, like the Chinese labs). Once they are trained, they are served with large profit margins compared to the inference cost.
OpenAI and Anthropic are without a doubt selling their API for a lot more than the cost of running the model.
> You send all the necessary data, including information that you would never otherwise share.
I've never sent the type of data that isn't already either stored by GitHub or a cloud provider, so no difference there.
> And you most likely do not pay the actual costs.
So? Even if costs double once investor subsidies stop, that doesn't change much of anything. And the entire history of computing is that things tend to get cheaper.
> You and your business become dependent on this major gatekeeper.
Not really. Switching between Claude and Gemini or whatever new competition shows up is pretty easy. I'm no more dependent on it than I am on any of another hundred business services or providers that similarly mostly also have competitors.
To me this tenacity is often like watching someone trying to get a screw into board using a hammer.
There’s often a better faster way to do it, and while it might get to the short term goal eventually, it’s often created some long term problems along the way.
I don’t understand this POV. Unfortunately, I'd pay $10k/mo for my CC sub. I wish I could invest in Anthropic; they're going to be the most profitable company on earth.
My agent struggled for 45 minutes because it tried to do `go run` on a _test.go file, which the compiler repeatedly rejected, exiting with an error message that files named like this cannot be executed using the run command.
So yeah, that wasted a lot of GPU cycles for a very unimpressive result, but with a renewed superficial feeling of competence.
We can observe how much generic inference providers like deepinfra or together-ai charge for large SOTA models. Since they are not subsidized and they don’t charge 7x what OpenAI does, that means OAI also doesn’t have outrageously high per-token costs.
OAI is running boundary pushing large models. I don’t think those “second tier” applications can even get the GPUs with the HBM required at any reasonable scale for customer use.
Not to mention training costs of foundation models
AI geniuses discover brute forcing... what a time to be alive. /s
Like... bro, that's THE foundation of CS. That's the principle behind the Bombe in Turing's time. One can still marvel at it, but it's been with us since the beginning.
In the case of those big 'foundation models': Fine-tune for whom and how? I doubt it is possible to fine-tune things like this in a way that satisfies all audiences and training set instances. Much of this is probably due to the training set itself containing a lot of propaganda (advertising) or just bad style.