This is huge for AI/ML, at least for inference. Apple chips are among the most efficient out there for that sort of thing; the only downside is the lack of CUDA.
Lack of CUDA is not a problem for most ML frameworks. For example, in PyTorch you just tell it to use the “mps” (Metal Performance Shaders) device instead of the “cuda” device.
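To illustrate, here's a minimal sketch of the usual device-selection pattern (assuming a reasonably recent PyTorch build with the MPS backend compiled in):

    import torch

    # Pick the best available backend: CUDA on Nvidia boxes,
    # MPS (Metal) on Apple silicon, plain CPU otherwise.
    if torch.cuda.is_available():
        device = torch.device("cuda")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
    else:
        device = torch.device("cpu")

    model = torch.nn.Linear(1024, 1024).to(device)
    x = torch.randn(8, 1024, device=device)
    y = model(x)  # runs through Metal when device is "mps"

The rest of the training or inference code stays the same; only the device string changes.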
That simply isn't true in practice. Maybe for inference, but even then you're running up against common CUDA kernels such as FlashAttention, which are far from plug-and-play with PyTorch's MPS backend.
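For what it's worth, the portable route is to call PyTorch's built-in scaled_dot_product_attention and let it dispatch to whichever implementation the backend provides; on CUDA that can be a fused FlashAttention-style kernel, while on MPS you get whatever Metal path PyTorch ships. A rough sketch, assuming a recent PyTorch version:

    import torch
    import torch.nn.functional as F

    device = "mps" if torch.backends.mps.is_available() else "cpu"

    # Dummy attention inputs: (batch, heads, seq_len, head_dim)
    q = torch.randn(1, 8, 512, 64, device=device)
    k = torch.randn(1, 8, 512, 64, device=device)
    v = torch.randn(1, 8, 512, 64, device=device)

    # Backend-agnostic attention; a fused FlashAttention kernel is only
    # one of the possible implementations behind this call.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

Custom CUDA extensions that ship their own kernels are a different story; those simply won't load on Apple silicon.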
I tried training some models using tensorflow-metal a year ago and I was quite disappointed. Using a ReLU activation function led to very poor accuracy [0], and training time was an order of magnitude slower than just using the free tier of Google Colab.
To be totally honest, there's enough money in the ML/AI/LLM space now that I fully expect some companies to put forward alternative cards specifically for that purpose. Why Google doesn't sell its TPUs to consumers and datacenters instead of just letting you rent them is beyond me.
Yep, there are no performance x86 CPUs on the market with ambitious integrated GPUs, only laptop chips. Games are optimized for discrete GPUs; Apple didn't have that software inertia to deal with.
Sort of; obviously quite a few games are optimized for the PS5 and Xbox Series X.
GPU cores are generally identical between the iGPUs and the discrete GPUs. Adding a PCIe bus (high latency and low bandwidth) and having a separate memory pool doesn't create new opportunities for optimization.
On the other hand, having unified memory creates optimization opportunities, but even just making memcpy a no-op can be useful as well.
GPUs are all about compute and memory bandwidth. Using the same building blocks of compute units doesn't by itself make it go fast. You need a lot of compute units and a lot of bandwidth to feed them.
The performance dependency on dGPUs doesn't come from the existence of a PCIe bus and partitioned memory, but from the fact that the software running on the dGPU is written for a system with high-bandwidth memory like GDDR6X or HBM. It creates opportunities for optimization the same way hardware properties and machine balances tend to: the software gets written, benchmarked, and optimized against hardware with certain kinds of performance properties and constraints (here, the compute/bandwidth balance, memory capacity, and whether the CPU and GPU share memory).
> Apple chips are among the most efficient out there for that sort of thing
Not really? Apple is efficient because they ship moderately large GPUs manufactured on TSMC's leading-edge nodes. Their NPU hardware is more or less entirely ignored, and their GPUs use the same shader-based compute that Intel and AMD rely on. It's not efficient because Apple does anything different with their hardware the way Nvidia does; it's efficient because they're simply using denser silicon than most competitors.
Apple does make efficient chips, but AI is so much of an afterthought that I wouldn't consider them any more efficient than Intel or AMD.
For inference, Apple chips are great due to their high memory bandwidth. The Mac Studio is a popular choice in the local Llama community for this particular reason. It's a cost-effective option if you need a lot of memory plus high bandwidth. The downside is poor training performance and Metal being a less polished software stack compared to CUDA.
I wonder if a little cluster of Mac Minis is a good option for running concurrent LLM agents, or whether a single Mac Studio is still preferable.
The memory bandwidth on Apple silicon is only sometimes comparable to, and in many cases worse than, that of a GPU. For example, an Nvidia RTX 4060 Ti 16GB (not a high-end card by any means) has a memory bandwidth of 288 GB/s, which is more than double that of the base M4.
On the higher end, building a machine with six to eight 24GB GPUs such as RTX 3090s would be comparable in cost (as well as available memory) to a high-end Mac Studio, and would be at least an order of magnitude faster at inference. Yes, it's going to use an order of magnitude more power as well, but what you should probably care about here is W/token, which is in the same ballpark.
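Back-of-the-envelope, with purely illustrative numbers (not measurements; actual draw and throughput vary a lot by model and setup):

    # Hypothetical figures, chosen only to show the W/token logic.
    rig_power_w = 2400.0   # ~6-8 GPUs plus host under load
    rig_tok_s   = 100.0    # assume ~10x the Mac's throughput
    mac_power_w = 250.0    # Mac Studio under GPU load
    mac_tok_s   = 10.0

    # Energy per token (joules/token), i.e. the "W/token" above.
    print(rig_power_w / rig_tok_s)  # 24.0
    print(mac_power_w / mac_tok_s)  # 25.0 -> same ballpark

If both the power draw and the throughput scale by roughly the same factor, the energy cost per token ends up about equal.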
Apple silicon is a reasonable solution for inference only if you need as much memory as possible, you don't care about absolute performance, and you're unwilling to deal with a multi-GPU setup.
Edit: since my reply you have edited your comment to mention the Studio, but the fact remains that the M2 Max has at least ~40% greater bandwidth than the number you quoted as an example.
Exactly, the M2 Ultra is competitive for local inference use cases given its 800 GB/s bandwidth, relatively low cost, and energy efficiency.
The M4 Pro in the Mini has a bandwidth of 273 GB/s, which is probably less appealing. But I wonder how it'd compare, cost-wise and performance-wise, to have several Minis in a little cluster, each running a small LLM and exchanging messages. This could be interesting for a local agent architecture.
See my sibling reply below, but I disagree with your main point here. The M2 Ultra is only competitive for very specific use cases; it does not really cost less than a much higher-performing setup, and if what you care about is true efficiency (meaning W/token, or how much energy the computer uses to produce a given response), a multi-GPU setup and a Mac Studio are on about equal footing.
For reference, compared to what the big companies use, an H100 has over 3 TB/s of bandwidth. A nice home lab might be built around 4090s — two years old at this point — which have about 1 TB/s.
Apple's chips have the advantage of being able to be specced out with tons of RAM, but performance isn't going to be in the same ballpark as even fairly old Nvidia chips.
The cheapest 4090 is EUR 110 less than a complete 32GB M2 Max Mac Studio where I live. Spec out a full Intel 14700K computer (avoiding the expensive 14900) with 32 GB RAM, NVMe storage, case, power supply, motherboard, 10G Ethernet… and we are approaching the cost of the 64GB M2 Ultra, which has memory bandwidth more comparable to the Nvidia card's, but with more than twice the RAM available to the GPU.
That's my point. I would absolutely be willing to suffer a 20% memory bandwidth penalty if it means I can put 200% more data in the memory buffer to begin with. Not having to page in and out of disk storage quickly makes that 20% irrelevant.
If you have enough 4090s, you don't need to page in and out of disk: everything stays in VRAM and is fast. But it's true that if you just want it to work, and you don't need the fastest perf, Apple is cheaper!
How is that relevant, when the discussion from the start was a cost-benefit comparison of a two-year-old Mac with a two-year-old GPU?
In any case how are you going to fit 50+GB in two (theoretically 24+24 GB) Nvidia cards without swapping to disk when the Mac in question has 64GB (also theoretically) available?
You seem confused. Please feel free to read my post near the top of this very chain of comments, where I specifically compare a Mac Studio to a machine with 6 to 8 Nvidia GPUs. That was the discussion “from the start.”
> In any case how are you going to fit 50+GB in two (theoretically 24+24 GB) Nvidia cards
What seems like a joke about it? And relevant to what, exactly?
The parent of my initial comment in this thread said: "For inference, Apple chips are great due to their high memory bandwidth... It's a cost-effective option if you need a lot of memory plus high bandwidth."
My post was attempting to explain at a high level that 1) Apple SoCs do not really have high memory bandwidth compared to a cluster of GPUs, and 2) you can actually build that cluster of GPUs for the same cost as, or cheaper than, a loaded Mac Studio, and it will drastically outperform the Mac.
If you want specifics on how to build such a GPU cluster, you can search for "ROMED8-2T 3090" for some examples.
Yeah, sorry, I realized that as well, so I edited my post to add a higher-end example with multiple 3090s or similar cards. A single 3090 has just under 1 TB/s of memory bandwidth.
One more edit: I'd also like to point out that memory bandwidth is important, but not sufficient, for fast inference. My entire point here is that Apple silicon does have high memory bandwidth for sure, but for inference it's very much held back by the relative slowness of the GPU compared with dedicated Nvidia/AMD cards.
It's still "fast enough" for even 120b models in practice, and you don't need to muck around with building a multi-GPU rig (and figuring out how to e.g. cool it properly).
It's definitely not what you'd want for your data center, but for home tinkering it has a very clear niche.
> It's still "fast enough" for even 120b models in practice
Is it? This is very subjective. The Mac Studio would not be "fast enough" for me on even a 70b model, not necessarily because its output is slow, but because the prompt evaluation speed is quite bad. See [0] for example numbers; on Llama 3 70B at Q4_K_M quantization, it takes an M2 Ultra with 192GB about 8.5 seconds just to evaluate a 1024-token prompt. A machine with 6 3090s (which would likely come in cheaper than the Mac Studio) is over 6 times faster at prompt parsing.
A 120b model is likely going to be something like 1.5-2x slower at prompt evaluation, rendering it pretty much unusable (again, for me).
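Rough arithmetic from those numbers (the 1.5-2x slowdown for a 120b model is my own estimate, not a benchmark):

    # Back-of-envelope from the [0] figures quoted above.
    prompt_tokens = 1024
    m2_ultra_secs = 8.5                              # 70B Q4_K_M prompt eval
    m2_eval_rate  = prompt_tokens / m2_ultra_secs    # ~120 tokens/s

    rig_eval_rate = m2_eval_rate * 6                 # "over 6 times faster"

    # Assumed 1.5-2x slowdown going from ~70B to ~120B parameters:
    m2_120b_secs = [m2_ultra_secs * f for f in (1.5, 2.0)]  # ~13-17 s per 1K-token prompt

That's a long pause before the first generated token, which is what makes it feel unusable to me.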
You're mostly correct, though a 4060 Ti 16GB is 20-30% cheaper than the cheapest Mac Mini. More importantly, "fits inside a Mac Mini" is not a criterion I'm using to evaluate whether a particular solution is suitable for LLM inference. If it is for you, that's fine, but we have vastly different priorities.
I'm not sure what you mean. RTX 4060 Ti/4070 Ti Super/3090/4090 cards can be easily purchased at any major electronics store in person or online and have 16GB or 24GB depending on model. Once you get up to 32GB, your point would stand, but 16-24GB GPUs are common.
> I know ancient iGPUs had that thing for setting the GPU memory size in the BIOS, but that's aaaaaancient and completely obsolete. If you still have that, just set it to the minimum value. The rest of memory will be unified.
I hadn't used a PC in so long that I still thought that BIOS setting decided the division. TIL.