> Large aircraft are the cheapest and most scalable way to deliver a ton of explosive on target.
An important variable missing from your calculus is distance from the munitions factory/supply depot. There are far cheaper and more scalable ways to deliver tons of explosives if your supply lines are short - such as rail, when you're defending your homeland. Carrier groups are both transport and FOBs.
> You should also consider that it is much more difficult to sink a large ship than a small ship.
How did that turn out for the Russian Black Sea flagship, the Moskva?
> It's just a matter of getting the performance good enough.
Who will pay for the ongoing development of (near-)SoTA local models? The good open-weight models are all developed by for-profit companies - you know how that story will end.
Apple, via customers paying for the whole solution (e.g. a laptop that can run decent local models)?
I think Apple had something in the region of $143 billion in revenue in the last quarter.
Not saying it will happen - just that there are a variety of business models out there and in the end it all depends on where consumers put their money.
Spurious keyboard inputs and broken ribbon cables may have been issues in 2003, but tablet-mode laptops made in the last 15 years face no such issues; e.g. the many generations of the Lenovo Yoga series in that period. In 2026, even 7mm-thick phones can have reliable 180°/-180° folding screens - laptops have a lot more volume to play with and fewer lifetime open/close events.
Apple's problems with touchscreen laptops are not mechanical. If Apple were to make a decent touchscreen laptop - say, a 12" MacBook Air with a 360° hinge - it'd cannibalize iPad sales, so they don't make that device, to preserve the segmentation that motivates people to buy both.
That wasn't the only time Jobs trashed a category Apple didn't currently have an on-sale model for but was actively developing; he also slurred 6-inch Android phones as "Hummers", and mocked 7-inch Android tablets as "too small" a little while before Apple launched the iPad Mini.
There’s no contradiction here. Jobs’ point was about the MAIN input method. A touchscreen that requires a stylus as the main input method is still a terrible idea. The Apple Pencil is meant for supplementary and creative input - things you can’t do well with your fingers.
Please, leave that reddit-esque “iSheep”-type of comment out of here.
Chinese companies, starting with CXMT, will own the consumer segment - until they are sanctioned/banned in the US. The rest of the world will be fine, but consumer desktop computing in the US will be akin to the cars in Cuba.
> Now I'm really curious. What field are you in that ndjson files of that size are common?
I'm not OP, but structured JSON logs can easily produce humongous ndjson files, even from a modest fleet of servers over a not-very-long period of time.
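Back of the envelope, with made-up but plausible figures for fleet size and log volume:

```python
# All figures are assumptions for illustration, not measurements.
servers = 200            # a modest fleet
lines_per_sec = 50       # structured log lines per server
bytes_per_line = 400     # a JSON record with a handful of fields
days = 30
total_bytes = servers * lines_per_sec * bytes_per_line * 86_400 * days
print(f"{total_bytes / 1e12:.1f} TB of ndjson")  # ~10.4 TB in a month
```

Tweak any of those numbers down and you still land in the multi-terabyte range pretty quickly.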
Replying here because the other comment is too deeply nested to reply.
Even if it's a once-off, some people handle a lot of once-offs; that's exactly where you need good CLI tooling to support them.
Sure, jq isn't exactly slow, but I've also avoided it in pipelines where I just needed more throughput.
rg was insanely useful on a project I once got with about 5GB of source files, a lot of them auto-generated, that you needed to search through. People were using Notepad++ and waiting minutes for a query to find something in the haystack; rg returned results in seconds.
The use case could be, e.g., processing an old trove of logs into something more easily indexed and queried, and you might want jq as part of that processing pipeline.
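A minimal sketch of that kind of streaming step (the field names "level", "ts", "msg" are assumptions about the log schema, not anything from a real system):

```python
import json

# Stream an ndjson trove line by line, keeping only the records
# worth indexing. Generators keep memory flat regardless of file size.
def extract_errors(lines):
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # old troves often contain corrupt lines
        if rec.get("level") == "error":
            yield {"ts": rec.get("ts"), "msg": rec.get("msg")}
```

The same shape works whether the input is a file, stdin, or a decompression stream.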
Fair, but for a once-off thing performance isn't usually a major factor.
The comment I was replying to implied this was something more regular.
EDIT: why is this being downvoted? I didn't think I was rude. The person I responded to made a good point, I was just clarifying that it wasn't quite the situation I was asking about.
At scale, low performance can very easily mean "longer than the lifetime of the universe to execute." The question isn't how quickly something will get done, but whether it can be done at all.
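To make that concrete, a toy calculation (the operation count and machine speed are illustrative, not tied to anything upthread):

```python
# Illustrative: a brute-force search over 2^100 states on a machine
# doing a billion operations per second.
ops = 2 ** 100
rate = 1e9                         # operations per second
years = ops / rate / (365 * 24 * 3600)
universe_age_years = 1.38e10
print(years > universe_age_years)  # True: roughly 4e13 years
```

Past a certain exponent, no amount of patience or hardware budget turns "slow" into "eventually".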
Good point. I said it above, but I'll repeat it here: I shouldn't have discounted how frequent once-offs can be. I've worked in support before, so I really should've known better.
Certain people/businesses deal with one-off things every day. Even for something truly one-off, if one tool is too slow it might still be the difference between being able to do it once or not at all.
> I feel like we are just inching closer and closer to a world where rapid iteration of software will be by default.
There's a lot of experimentation right now, but one thing that's guaranteed is that the data gatekeepers will slam the door shut[1] - or install a toll booth - once there's less money sloshing about and the winners and losers are clear. At some point in the future, Atlassian and Github may not grant Anthropic access to your tickets unless you're on the relevant tier with the appropriate "NIH AI" surcharge.
1. AI does not suspend or supplant good old capitalism and the cult of profit maximization.
Models aren't just the big bags of floats you imagine them to be. Those bags are there, but there's a whole layer of runtimes, caches, timers, load balancers, classifiers/sanitizers, etc. around them, all of which have tunable parameters that affect the user-perceptible output.
It's still engineering. Even magic alien tech from outer space would end up with an interface layer to manage it :).
ETA: reminds me of biology, too. In life, it turns out that the simpler some functional component looks, the more stupidly overcomplicated it is when you look at it under a microscope.
There's this[1]. Model providers have a strong incentive to switch (a part of) their inference fleet to quantized models during peak loads. From a systems perspective, it's just another lever. Better to have slightly nerfed models than complete downtime.
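A minimal sketch of that lever, with made-up thresholds and model names (nothing here reflects any provider's actual routing):

```python
# Illustrative load-shedding lever: route traffic to progressively
# more quantized models as fleet utilization climbs, instead of
# failing requests outright.
def pick_model(current_qps: float, capacity_qps: float) -> str:
    load = current_qps / capacity_qps
    if load < 0.8:
        return "model-fp16"   # full-precision fleet, normal operation
    if load < 0.95:
        return "model-int8"   # quantized: cheaper, slightly nerfed
    return "model-int4"       # last resort before shedding traffic
```

From the operator's side this is just graceful degradation; from the user's side it looks like the model got dumber for an afternoon.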
That isn't true. The whole point is to pick up statistically significant variations quickly, and with the volume of tests they're doing there is plenty of data.
If you turn on the 95% CI bands you can see there is plenty of statistical significance.
Anybody with more than five years in the tech industry has seen this done in every domain, time and again. What evidence do you have that AI is different? That is the extraordinary claim in this case...
Real world usage suggests otherwise. It's been a known trend for a while. Anthropic even confirmed as much ~6 months ago but said it was a "bug" - one that somehow just keeps happening 4-6 months after a model is released.
Real world usage is unlikely to give you the large sample sizes needed to reliably detect the differences between models. Standard error scales as the inverse square root of sample size, so even a difference as large as 10 percentage points would require hundreds of samples.
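Roughly, for two pass rates near 50% (the worst case for variance), a normal-approximation two-sample comparison at the 95% level needs on the order of:

```python
import math

# Back-of-the-envelope sample size to distinguish a 10-point gap
# between two pass rates near 50%, normal approximation, 95% level.
# No power adjustment - a properly powered test needs roughly double.
p, diff, z = 0.5, 0.10, 1.96
n = math.ceil(2 * p * (1 - p) * (z / diff) ** 2)
print(n)  # samples per model
```

That's ~193 tasks per model for a 10-point gap; halve the gap and the requirement quadruples.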
https://marginlab.ai/trackers/claude-code/ tries to track Claude Opus performance on SWE-Bench-Pro, but since they only sample 50 tasks per day, the confidence intervals are very wide. (This was submitted 2 months ago https://news.ycombinator.com/item?id=46810282 when they "detected" a statistically significant deviation, but that was because they used the first day's measurement as the baseline, so at some point they had enough samples to notice that this was significantly different from the long-term average. It seems like they have fixed this error by now.)
It's hard to trust public, high-profile benchmarks, because any change to a specific model (Opus 4.5 in this case) can be rejected internally if it regresses on SWE-Bench-Pro, so everything that actually gets released will perform well on that benchmark.
Any other benchmark at that sample size would have similarly huge error bars. Unless Anthropic makes a model that works 100% of the time or writes a bug that brings it all the way to zero, it's going to work sometimes and fail sometimes, and anyone who thinks they can spot small changes in how often it works without running an astonishingly large number of tests is fooling themselves with measurement noise.
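For a sense of scale, here is a normal-approximation 95% CI on a 56% pass rate measured over 50 tasks (the numbers are illustrative):

```python
import math

# 95% confidence interval (normal approximation) for a pass rate
# estimated from only 50 tasks.
n, passes = 50, 28            # 28/50 = 56% measured pass rate
p_hat = passes / n
half = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
lo, hi = p_hat - half, p_hat + half
print(f"{lo:.2f} to {hi:.2f}")  # roughly 0.42 to 0.70
```

An interval that wide swallows a 53%-vs-56% "regression" whole.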
They do. I'm currently seeing degradation on Opus 4.6 on tasks it could do without trouble a few months back. Obviously I'm a sample of n=1, but I'm also convinced a new model is around the corner and they preemptively nerf the current model so people notice the "improvement".
Well, I don't see 4.5 on there ... so I'm not sure what you're trying to say.
And today is a 53% pass rate vs. a baseline 56% pass rate. That's a huge difference. If we recall what Anthropic originally promised a "max 5" user https://github.com/anthropics/claude-code/issues/16157#issue... -- which they've since removed from their site...
50-200 prompts. That's an extra 1-6 "wrong solutions" per 5 hours ... and you have to get a lot of wrong answers to arrive at a wrong solution.
I think the conspiracy theories are silly, but equally I think pretending these black boxes are completely stable once they're released is incorrect as well.
Until the AI scrapers[1] come for you at 5k requests per second and you're doing operations in hard mode.
1. Most forges have HTTP pages for discoverability. I suppose one could hypothetically set up an ssh-only forge and statically generate an HTML site periodically, but this is already advanced ops for the average Github user.
This isn't a real thing, and if it ever becomes a thing you can sue them for DDoS and send Sam Altman to jail. AI scraping is in the realm of 1-5 requests per second, not 5,000.