This is huge for AI/ML, at least for inference. Apple chips are among the most efficient out there for that sort of thing; the only downside is the lack of CUDA.
Lack of CUDA is not a problem for most ML frameworks. For example, in PyTorch you just tell it to use the “mps” (Metal Performance Shaders) device instead of the “cuda” device.
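To illustrate, here's a minimal sketch of the usual device-selection pattern (assuming a reasonably recent PyTorch build with the MPS backend compiled in):

    import torch

    # Pick the best available backend: CUDA on Nvidia boxes,
    # MPS (Metal) on Apple silicon, plain CPU otherwise.
    if torch.cuda.is_available():
        device = torch.device("cuda")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
    else:
        device = torch.device("cpu")

    model = torch.nn.Linear(1024, 1024).to(device)
    x = torch.randn(8, 1024, device=device)
    y = model(x)  # runs through Metal when device is "mps"

The rest of the training or inference code stays the same; only the device string changes.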
That simply isn't true in practice. Maybe for inference, but even then you're running up against common CUDA kernels such as FlashAttention, which are far from plug-and-play with PyTorch's MPS backend.
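For what it's worth, the portable route is to call PyTorch's built-in scaled_dot_product_attention and let it dispatch to whichever implementation the backend provides; on CUDA that can be a fused FlashAttention-style kernel, while on MPS you get whatever Metal path PyTorch ships. A rough sketch, assuming a recent PyTorch version:

    import torch
    import torch.nn.functional as F

    device = "mps" if torch.backends.mps.is_available() else "cpu"

    # Dummy attention inputs: (batch, heads, seq_len, head_dim)
    q = torch.randn(1, 8, 512, 64, device=device)
    k = torch.randn(1, 8, 512, 64, device=device)
    v = torch.randn(1, 8, 512, 64, device=device)

    # Backend-agnostic attention; a fused FlashAttention kernel is only
    # one of the possible implementations behind this call.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

Custom CUDA extensions that ship their own kernels are a different story; those simply won't load on Apple silicon.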
I tried training some models using tensorflow-metal a year ago and I was quite disappointed. Using a ReLU activation function led to very poor accuracy [0], and training time was an order of magnitude slower than just using the free tier of Google Colab.
To be totally honest, there's enough money in the ML/AI/LLM space now that I fully expect some companies to put forward alternative cards specifically for that purpose. Why Google doesn't sell its TPUs to consumers and datacenters instead of just letting you rent them is beyond me.
Yep, there are no performance x86 CPUs on the market with ambitious integrated GPUs, only laptop chips. Games are optimized for discrete GPUs; Apple didn't have that software inertia to deal with.
Sort of; obviously quite a few games are optimized for the PS5 and Xbox Series X.
GPU cores are generally identical between the iGPUs and the discrete GPUs. Adding a PCIe bus (high latency and low bandwidth) and having a separate memory pool doesn't create new opportunities for optimization.
On the other hand, having unified memory creates optimization opportunities, but even just making memcpy a no-op can be useful as well.
GPUs are all about compute and memory bandwidth. Using the same building blocks of compute units doesn't by itself make it go fast. You need a lot of compute units and a lot of bandwidth to feed them.
The performance dependency on dGPUs doesn't come from the existence of a PCIe bus and partitioned memory, but from the fact that the software running on the dGPU is written for a system with high-bandwidth memory like GDDR6X or HBM. It creates opportunities for optimization the same way hardware properties and machine balances tend to: the software gets written, benchmarked, and optimized against hardware with certain kinds of performance properties and constraints (here, the compute/bandwidth balance, memory capacity, and whether the CPU and GPU share memory).
> Apple chips are among the most efficient out there for that sort of thing
Not really? Apple is efficient because they ship moderately large GPUs manufactured on TSMC's leading-edge nodes. Their NPU hardware is more or less entirely ignored, and their GPUs use the same shader-based compute that Intel and AMD rely on. It's not efficient because Apple does anything different with their hardware the way Nvidia does; it's efficient because they're simply using denser silicon than most competitors.
Apple does make efficient chips, but AI is so much of an afterthought that I wouldn't consider them any more efficient than Intel or AMD.
For inference, Apple chips are great due to their high memory bandwidth. The Mac Studio is a popular choice in the local Llama community for this particular reason. It's a cost-effective option if you need a lot of memory plus high bandwidth. The downside is poor training performance and Metal being a less polished software stack compared to CUDA.
I wonder if a little cluster of Mac Minis is a good option for running concurrent LLM agents, or whether a single Mac Studio is still preferable.
The memory bandwidth on Apple silicon is only sometimes comparable to, and in many cases worse than, that of a GPU. For example, an Nvidia RTX 4060 Ti 16GB (not a high-end card by any means) has a memory bandwidth of 288 GB/s, which is more than double that of the base M4.
On the higher end, building a machine with six to eight 24GB GPUs such as RTX 3090s would be comparable in cost (as well as available memory) to a high-end Mac Studio, and would be at least an order of magnitude faster at inference. Yes, it's going to use an order of magnitude more power as well, but what you should probably care about here is W/token, which is in the same ballpark.
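Back-of-the-envelope, with purely illustrative numbers (not measurements; actual draw and throughput vary a lot by model and setup):

    # Hypothetical figures, chosen only to show the W/token logic.
    rig_power_w = 2400.0   # ~6-8 GPUs plus host under load
    rig_tok_s   = 100.0    # assume ~10x the Mac's throughput
    mac_power_w = 250.0    # Mac Studio under GPU load
    mac_tok_s   = 10.0

    # Energy per token (joules/token), i.e. the "W/token" above.
    print(rig_power_w / rig_tok_s)  # 24.0
    print(mac_power_w / mac_tok_s)  # 25.0 -> same ballpark

If both the power draw and the throughput scale by roughly the same factor, the energy cost per token ends up about equal.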
Apple silicon is a reasonable solution for inference only if you need as much memory as possible, you don't care about absolute performance, and you're unwilling to deal with a multi-GPU setup.
Edit: since my reply you have edited your comment to mention the Studio, but the fact remains that the M2 Max has at least ~40% greater bandwidth than the number you quoted as an example.
Exactly, the M2 Ultra is competitive for local inference use cases given its 800 GB/s bandwidth, relatively low cost, and energy efficiency.
The M4 Pro in the Mini has a bandwidth of 273 GB/s, which is probably less appealing. But I wonder how it'd compare, cost-wise and performance-wise, to have several Minis in a little cluster, each running a small LLM and exchanging messages. This could be interesting for a local agent architecture.
See my sibling reply below, but I disagree with your main point here. The M2 Ultra is only competitive for very specific use cases; it does not really cost less than a much higher-performing setup, and if what you care about is true efficiency (meaning W/token, or how much energy the computer uses to produce a given response), a multi-GPU setup and a Mac Studio are on about equal footing.
For reference, compared to what the big companies use, an H100 has over 3 TB/s of bandwidth. A nice home lab might be built around 4090s — two years old at this point — which have about 1 TB/s.
Apple's chips have the advantage of being able to be specced out with tons of RAM, but performance isn't going to be in the same ballpark as even fairly old Nvidia chips.
The cheapest 4090 is EUR 110 less than a complete 32GB M2 Max Mac Studio where I live. Spec out a full Intel 14700K computer (avoiding the expensive 14900) with 32 GB RAM, NVMe storage, case, power supply, motherboard, 10G Ethernet… and we are approaching the cost of the 64GB M2 Ultra, which has memory bandwidth more comparable to the Nvidia card's, but with more than twice the RAM available to the GPU.
That's my point. I would absolutely be willing to suffer a 20% memory bandwidth penalty if it means I can put 200% more data in the memory buffer to begin with. Not having to page in and out of disk storage quickly makes that 20% irrelevant.
If you have enough 4090s, you don't need to page in and out of disk: everything stays in VRAM and is fast. But it's true that if you just want it to work, and you don't need the fastest perf, Apple is cheaper!
How is that relevant, when the discussion from the start was a cost-benefit comparison of a two-year-old Mac with a two-year-old GPU?
In any case how are you going to fit 50+GB in two (theoretically 24+24 GB) Nvidia cards without swapping to disk when the Mac in question has 64GB (also theoretically) available?
You seem confused. Please feel free to read my post near the top of this very chain of comments, where I specifically compare a Mac Studio to a machine with 6 to 8 Nvidia GPUs. That was the discussion “from the start.”
> In any case how are you going to fit 50+GB in two (theoretically 24+24 GB) Nvidia cards
What seems like a joke about it? And relevant to what, exactly?
The parent of my initial comment in this thread said: "For inference, Apple chips are great due to their high memory bandwidth... It's a cost-effective option if you need a lot of memory plus high bandwidth."
My post was attempting to explain at a high level that 1) Apple SoCs do not really have high memory bandwidth compared to a cluster of GPUs, and 2) you can actually build that cluster of GPUs for the same cost as, or cheaper than, a loaded Mac Studio, and it will drastically outperform the Mac.
If you want specifics on how to build such a GPU cluster, you can search for "ROMED8-2T 3090" for some examples.
Yeah, sorry, I realized that as well, so I edited my post to add a higher-end example with multiple 3090s or similar cards. A single 3090 has just under 1 TB/s of memory bandwidth.
One more edit: I'd also like to point out that memory bandwidth is important, but not sufficient, for fast inference. My entire point here is that Apple silicon does have high memory bandwidth for sure, but for inference it's very much held back by the relative slowness of the GPU compared with dedicated Nvidia/AMD cards.
It's still "fast enough" for even 120b models in practice, and you don't need to muck around with building a multi-GPU rig (and figuring out how to e.g. cool it properly).
It's definitely not what you'd want for your data center, but for home tinkering it has a very clear niche.
> It's still "fast enough" for even 120b models in practice
Is it? This is very subjective. The Mac Studio would not be "fast enough" for me on even a 70b model, not necessarily because its output is slow, but because the prompt evaluation speed is quite bad. See [0] for example numbers; on Llama 3 70B at Q4_K_M quantization, it takes an M2 Ultra with 192GB about 8.5 seconds just to evaluate a 1024-token prompt. A machine with 6 3090s (which would likely come in cheaper than the Mac Studio) is over 6 times faster at prompt parsing.
A 120b model is likely going to be something like 1.5-2x slower at prompt evaluation, rendering it pretty much unusable (again, for me).
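Rough arithmetic from those numbers (the 1.5-2x slowdown for a 120b model is my own estimate, not a benchmark):

    # Back-of-envelope from the [0] figures quoted above.
    prompt_tokens = 1024
    m2_ultra_secs = 8.5                              # 70B Q4_K_M prompt eval
    m2_eval_rate  = prompt_tokens / m2_ultra_secs    # ~120 tokens/s

    rig_eval_rate = m2_eval_rate * 6                 # "over 6 times faster"

    # Assumed 1.5-2x slowdown going from ~70B to ~120B parameters:
    m2_120b_secs = [m2_ultra_secs * f for f in (1.5, 2.0)]  # ~13-17 s per 1K-token prompt

That's a long pause before the first generated token, which is what makes it feel unusable to me.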
You're mostly correct, though a 4060 Ti 16GB is 20-30% cheaper than the cheapest Mac Mini. More importantly, "fits inside a Mac Mini" is not a criterion I'm using to evaluate whether a particular solution is suitable for LLM inference. If it is for you, that's fine, but we have vastly different priorities.
I'm not sure what you mean. RTX 4060 Ti/4070 Ti Super/3090/4090 cards can be easily purchased at any major electronics store in person or online and have 16GB or 24GB depending on model. Once you get up to 32GB, your point would stand, but 16-24GB GPUs are common.
> I know ancient iGPUs had that thing for setting the GPU memory size in the BIOS, but that's aaaaaancient and completely obsolete. If you still have that, just set it to the minimum value. The rest of memory will be unified.
I hadn't used a PC in so long that I still thought that BIOS setting decided the division. TIL.