
Is that really true?

I can name OpenAI's CEO but not Anthropic's off the top of my head. And I actually like Anthropic's work way more than what OpenAI is doing right now.


Which you can do for knockouts, but not for "splice in a new gene 400 bp long".

Because that kind of optimization takes effort. And a lot of it.

Recognize that a website is a Git repo web interface. Invoke elaborate Git-specific logic: get the repo link, git clone it, process the cloned data, mark it for re-indexing, and then keep re-indexing the site itself - but only for the things that aren't included in the repo, like issues and pull request messages.
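
A minimal sketch of what that could look like, in Python. The forge-detection heuristic, the ".git" clone URL, and the "index" object with its crawl_pages/ingest_files/paths_covered_by methods are all hypothetical stand-ins, not any real crawler's internals:

    import subprocess
    import tempfile

    # Hypothetical heuristic: spot common Git forge interfaces by URL.
    GIT_FORGE_HINTS = ("github.com", "gitlab.com", "/cgit/", "/gitweb/")

    def looks_like_git_forge(url: str) -> bool:
        return any(hint in url for hint in GIT_FORGE_HINTS)

    def index_site(url: str, index) -> None:
        if not looks_like_git_forge(url):
            index.crawl_pages(url)  # fall back to plain page-by-page crawling
            return
        repo_url = url.rstrip("/") + ".git"  # assumption: forge exposes a clone URL
        with tempfile.TemporaryDirectory() as workdir:
            # One shallow clone replaces thousands of per-file page fetches.
            subprocess.run(["git", "clone", "--depth=1", repo_url, workdir], check=True)
            index.ingest_files(workdir)
        # Keep crawling only what the repo doesn't contain: issues, PRs, wiki.
        index.crawl_pages(url, skip=index.paths_covered_by(repo_url))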

The scrapers that are designed with effort usually aren't the ones webmasters end up complaining about. The ones that go for quantity over quality are the worst offenders. AI inference-time data intake with no caching whatsoever is the second worst offender.


It sure is a weird thing, but yes, the first mobile devices that shipped with USB didn't really know how to charge off it.

Which, to be fair to them, makes sense: USB was never supposed to be a power delivery standard (at least not beyond the 5 volts and a few hundred milliamps needed to power a mouse).

Not necessarily. GPT-4.5 was a new pretrain on top of a sizeable raw model scale bump, and it only got a 0.5 version bump - because the gains from reasoning training in the o-series overshadowed GPT-4.5's natural advantage over GPT-4.

OpenAI might have learned not to overhype. They already shipped GPT-5 - which was only an incremental upgrade over o3, and was received poorly, with that being part of the reason why.


I jumped straight from 4o (free user) into GPT-5 (paid user).

It was a generational leap if there ever was one. Much bigger than 3.5 to 4.


Yes, if OpenAI released GPT-5 after GPT-4o, then it would have been seen as a proper generational leap.

But o3 existing and being good at what it does? Took the wind out of GPT-5's sails.


What kind of improvements do you expect when going from 5 straight to 6?

With this kind of thing, the tails ALWAYS come apart, in the end. They come apart later for more robust tests, but "later" isn't "never", far from it.

Having a high IQ helps a lot in chess. But there's a considerable "non-IQ" component in chess too.

Let's assume "all metrics are perfect" for now. Then, when you score people by "chess performance"? You wouldn't see the people with the highest intelligence ever at the top. You'd get people with pretty high intelligence, but extremely, hilariously strong chess-specific skills. The tails came apart.
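
You can watch this happen in a toy simulation: give everyone an IQ-like general factor plus an independent chess-specific skill, rank them on the sum, and check who lands on top. All numbers here are made up for illustration:

    import random

    random.seed(0)
    N = 100_000

    # Toy model: chess performance = general factor + chess-specific skill.
    people = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(N)]
    by_chess = sorted(people, key=lambda p: p[0] + p[1], reverse=True)

    top = by_chess[:100]
    print("mean IQ-factor of top 100 chess players:",
          sum(iq for iq, _ in top) / len(top))
    print("highest IQ-factor in the population:", max(iq for iq, _ in people))
    # The chess leaderboard is full of high-IQ people, but the single
    # highest-IQ person almost never tops it: the tails came apart.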

Same goes for things like ARC-AGI and ARC-AGI-2. It's an interesting metric (isomorphic to Raven's progressive matrices? usable for measuring human IQ, perhaps?), but no metric is perfect - and ARC-AGI is biased heavily towards spatial reasoning specifically.


Come on. If we weren't shifting the goalposts, we would have burned through 90% of the entire supply of them back in 2022!

It's less shifting goalposts and more of a "very jagged frontier of capabilities" problem.

The good old "benchmarks just keep saturating" problem.

Anthropic is genuinely one of the top companies in the field, and for good reason. Opus consistently punches above its weight, and only in part because it lacks OpenAI's atrocious personality tuning.

Yes, the next stop for AI is increasing the task length horizon and improving agentic behavior. The "raw general intelligence" component in bleeding edge LLMs is clearly far outpacing the "executive function".


Shouldn't the next stop be to improve general accuracy, which is what these tools have struggled with since their inception? How much longer are "AI" companies going to offload onto the user the responsibility of verifying their tools' output?

Optimizing for benchmark scores, which are highly gamed to begin with, by throwing more resources at this problem is exceedingly tiring. Surely they must've noticed the performance plateau and diminishing returns of this approach by now, yet every new announcement is the same.


What "performance plateau"? The "plateau" disappears the moment you get harder unsaturated benchmarks.

It's getting more and more challenging to do that - just not because the models don't improve. Quite the opposite.

Framing "improve general accuracy" as "something no one is doing" is really weird too.

You need "general accuracy" for agentic behavior to work at all. If you have a simple ten step plan, and each step has a 50% chance of an unrecoverable failure, then your plan is fucked, full stop. To advance on those benchmarks, the LLM has to fail less and recover better.

Hallucinations are a "solvable but very hard to solve" problem. Considerable progress is being made on it, but if there's "this one weird trick" that deletes hallucinations, we sure haven't found it yet. Humans get a body of meta-knowledge for free, which lets them dodge hallucinations decently well (not perfectly) if they want to. LLMs get pathetic crumbs of meta-knowledge and little skill in using it. Room for improvement, but not trivial to improve.


By now, I'm convinced that Kessler syndrome exists solely to be fear bait. Almost no one knows what it is or what it does - people just know the stupid "space is ruined forever" media picture.

A major weak point for AIs is long-term tasks and agentic behavior. Which, as it turns out, is its own realm of behavior that's hard to learn from text data, and also somewhat separate from g - the raw intelligence component.

An average human still has LLMs beat there, which might be distorting people's perceptions. But task length horizon is going up, so that moat holding isn't a given at all.

