Sites like simonwillison.net/2025/jul/ and channels like https://www.youtube.com/@aiexplained-official also cover new model releases pretty quickly, with some "out of the box thinking/reasoning" evaluations.
For my own usage, I can really only tell once I start using the new model on the tasks I actually use these models for.
My personal benchmark andrew.ginns.uk/merbench has full code and data on GitHub if you want a starting point!
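If you want to roll something similar yourself, the core of a personal benchmark is just your own prompts plus a checker. A minimal Python sketch of the idea (generic, not merbench's actual code; ask_model and the example cases are hypothetical stand-ins for whatever client and tasks you actually use):

    # Minimal personal-eval sketch: run your own prompts against a new model
    # and score the answers against simple expected substrings.
    CASES = [
        # (prompt taken from a real task, substring the answer must contain)
        ("Extract the invoice total from: 'Total due: $1,234.56'", "1,234.56"),
        ("Write SQL to count users who signed up in 2023", "count"),
    ]

    def ask_model(model: str, prompt: str) -> str:
        # Stand-in: call whatever API or local runtime you actually use here.
        raise NotImplementedError

    def run_eval(model: str) -> float:
        passed = 0
        for prompt, expected in CASES:
            answer = ask_model(model, prompt)
            passed += expected.lower() in answer.lower()
        return passed / len(CASES)

    # print(run_eval("new-model-name"))  # compare against your current daily driver

The point is less the scoring logic than the cases: if they come from your real workload, the score tells you something the public leaderboards can't.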
A complete desktop computer with the M2 Ultra w/64GB of RAM and 1TB of SSD is $4k.
The 7995WX processor alone is $10k, the motherboard is one grand, and the RAM is another $300. So you're up to $11,300, and you still don't have a PSU, case, SSD, GPU... or a heatsink that can handle the 300W TDP of the Threadripper processor; you're probably looking at a very large AIO radiator to keep it cool enough to get its quoted performance. So you're probably up past $12k, 3x the price of the Studio... more like $14k if you want to have a GPU of similar capability to the M2 Ultra.
Just the usual "aPPle cOMpuTeRs aRE EXpeNsIVE!" nonsense.
So from a CPU perspective you get 7x the CPU throughput for 3x to 4x the price, plus upgradable RAM that is massively cheaper. The M2 uses the GPU for LLMs though, and there it sits in a weird spot where 64GB of (slower) RAM plus midrange GPU performance is not something that exists in the PC space. The closest thing would probably be a (faster) 48GB Quadro RTX, which is in the $5000 ballpark. For other use cases where VRAM is not such a limiting factor, the comparably priced PC will blow the Mac out of the water, especially when it comes to GPU performance. The only reason we do not have cheap 96GB GDDR GPUs is that it would cannibalize NVIDIA/AMD's high-margin segment. If this were something that affected Apple, they would act the same.
I didn't see benchmarks that suggest the 7950X is faster than M2 Ultra. I only saw performance numbers for 7995WX which has 6x the cores and 6x the cache.
Either way, I think these comparisons are moot since an M2 Ultra comes with 2x M2 Max GPUs and an NPU and up to 192GB of unified memory running at 800GB/s. In other words, you wouldn't want to run your LLM on the CPU if you have an M2 Ultra.
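To put that 800GB/s in perspective, here's a rough back-of-the-envelope, assuming single-stream decoding is memory-bandwidth bound and has to stream all the weights once per token (illustrative numbers, not a benchmark):

    # Rough upper bound on single-stream token generation, assuming the
    # decoder is bandwidth bound and reads every weight once per token.
    bandwidth_gb_s = 800        # M2 Ultra unified memory bandwidth
    params_billion = 70         # e.g. a 70B-parameter model
    bytes_per_param = 0.5       # roughly 4-bit quantization

    weights_gb = params_billion * bytes_per_param   # ~35 GB of weights
    tokens_per_s = bandwidth_gb_s / weights_gb      # ~23 tokens/s ceiling

    print(f"~{weights_gb:.0f} GB of weights -> ~{tokens_per_s:.0f} tokens/s upper bound")

That ceiling is set by memory bandwidth, not compute, which is why the unified-memory GPU path matters so much more than raw CPU throughput here.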
The point of OP is to increase LLM performance when you don't have a capable GPU.
Indeed they do; however, companies like Meta (altruistically or not) are preventing OpenAI from building 'moats' by releasing models and architecture details in a very public way.
I think it's a safe bet to say it's not altruistic. And, if Meta were to wrestle away OpenAI's moat, they'd eagerly create their own, given the opportunity.
> And, if Meta were to wrestle away OpenAI's moat, they'd eagerly create their own
Meta is already capable of monetizing content generated by the models: these models complement their business and they could not care less which model you're using to earn them advertising dollars, as long as you keep the (preferably high quality) content coming.
> And, if Meta were to wrestle away OpenAI's moat, they'd eagerly create their own, given the opportunity.
At which point the new underdogs would have an interest in doing to them what they're doing to OpenAI.
That assumes LLM progress continues at a rapid pace for an extended period of time. It's not implausible that they'll get to a certain level past which non-trivial progress is hard, and if there is an open-source model at that level, there isn't going to be a moat.
Meta doesn't interact with its users in the very obvious ways that MS and Google do. All its model magic happens behind the scenes. Meta can continue to release second-best models to undercut the others and keep them from getting too far ahead, and the open-source community will take it from there. Dall-E is dead.
And if the open-source world extends their models, they can accrue those benefits back to themselves. This is already how they've become such a huge player in machine learning (by open sourcing amazing stuff).
True, but what you can do is SSH into the device and install a custom launcher for apps that can read standard EPUBs, play chess, or expose the Linux terminal on-device.
Not great for basic users but I've had significantly more use out of it with some advanced setup.
Take a look at Piper. It's the TTS solution used by the open-source home automation project Home Assistant. It produces decent quality speech in a couple of seconds on Raspberry Pi-class hardware.
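For example, you can drive it from a script by piping text into the piper binary. A minimal Python sketch; the flag names and voice filename are taken from the project's README and may differ on your installed version:

    import subprocess

    # Pipe text into the piper binary and write a WAV file.
    # --model / --output_file follow piper's README at the time of writing;
    # check your installed version and downloaded voice.
    text = "The kitchen lights are now off."
    subprocess.run(
        ["piper", "--model", "en_US-lessac-medium.onnx", "--output_file", "out.wav"],
        input=text.encode("utf-8"),
        check=True,
    )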
I think we will be seeing this more and more (less cache on the die, or cache stacked on top of the main chip instead of in it), since SRAM scaling is now near a halting point [1] (no more improvements), which means the fixed cost of cache goes up with every new node.
From what I've read, the L1/L2 cache per core is the same, and the L3 cache per chiplet is the same, but the core count per chiplet is doubled and the overall chiplet size is about the same (it's a little bigger, I think).
So the L3 cache didn't get smaller (in area or bytes), there's just less of it per core. L1/L2 is relatively small, but they did use techniques to make it smaller at the expense of performance.
I think the big difference really is the reductions in buffers, etc, needed to get a design that scales to the moon. This is likely a major factor for Apple's M series too. Apple computers are never getting the thermal design needed to clock to 5GHz, so setting the design target much lower means a smaller core, better power efficiency, lower heat, etc. The same thing applies here: you're not running your dense servers at 5GHz, there's just not enough capability to deliver power and remove heat; so a design with a more realistic target speed can be smaller and more efficient.
SemiAnalysis seems to indicate the core area itself is 35% smaller. Maybe that's process tweaks related to clock rate? But I don't think we know for absolutely certain it really is the same core with the same execution units and same everything, even though a ton of the listed stats are exactly the same.
Combined with Obtainium, it's easy to keep it updated. https://github.com/ImranR98/Obtainium