Somewhat as a general question, not just for AMD/Nvidia: At what point does RAM stop being the bottleneck? By which I mean, given current chips, how much RAM could you theoretically bolt on to these cards before the limiting factor in performance is the GPU instead of RAM? And does that change when the task is training vs. deployment/prompting?
What do you mean? Are you talking about capacity? Or bandwidth from RAM?
I'm in the HPC space, and pretty much everything I do on the GPU is bound by how quickly I can get data in and out of DRAM.
The point at which data motion to/from DRAM stops being the bottleneck is when you do enough work per byte of data moved. How much work is that? On today's server GPUs it's roughly 50--100 double precision floating point operations per 8-byte value moved (on the order of 10 per byte). You can work out the exact number by taking the theoretical maximum floating point operations per unit time you can execute and dividing by DRAM throughput (data moved per unit time).
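A quick sketch of that division, using made-up numbers that are merely in the ballpark of a current server GPU (not any vendor's official specs):

```python
# Back-of-envelope machine balance with assumed, illustrative numbers.
peak_fp64 = 34e12        # assumed peak double-precision FLOP/s
dram_bw   = 3.35e12      # assumed DRAM bandwidth, bytes/s

flops_per_byte   = peak_fp64 / dram_bw   # flops you must do per byte moved
flops_per_double = flops_per_byte * 8    # per 8-byte double loaded

print(f"{flops_per_byte:.0f} flops/byte, "
      f"{flops_per_double:.0f} flops per double")  # ~10 and ~81 here
```

Do less work than that per byte and you're bandwidth-bound; do more and you're compute-bound.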
O(50--100) double precision flops per value is a _lot_ of work. We're talking BLAS-3 type operations. Anything level 2 or lower, as well as sparse operations, is typically bandwidth bound.
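You can see why from the flop-to-traffic ratios. A rough sketch, assuming double precision and a cache large enough to avoid redundant re-reads:

```python
# Arithmetic intensity (flops per byte of mandatory DRAM traffic).
n = 4096

# BLAS-3: C = A @ B does 2n^3 flops over 3n^2 values (A, B read; C written).
matmul_intensity = (2 * n**3) / (3 * n**2 * 8)   # grows linearly with n

# BLAS-2: y = A @ x does 2n^2 flops; reading A (n^2 values) dominates traffic.
matvec_intensity = (2 * n**2) / (n**2 * 8)       # constant 0.25 flops/byte

print(matmul_intensity, matvec_intensity)
```

Matmul intensity grows with problem size, so it can clear the compute-bound threshold; matvec is stuck at a constant fraction of a flop per byte no matter how big you make it.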
The problem with a lot of machine learning algorithms is that you do hundreds or even thousands of operations per value, but you do them in specialized orderings over incredibly large numbers of values, forcing you to swap partitions in and out of RAM. So while you may do thousands of ops per value overall, you may only manage tens of ops per value per trip to RAM.
The more RAM you have on device, the fewer swaps you need to do (none at all if it's big enough), and the more of those operations you get per byte moved, bringing you closer to theoretical max throughput.
Matrix multiplies are such that going from fitting 75% of your values on device to fitting 100% can mean an order of magnitude speedup.
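A toy model of that effect (purely illustrative, not a measurement): every extra pass over the non-resident fraction of the working set adds DRAM traffic without adding flops, so effective intensity collapses fast.

```python
# Toy model: effective flops/byte when a fraction of the working set
# must be re-streamed from DRAM on each of `passes` reuse passes.
def effective_intensity(base_intensity, resident_fraction, passes):
    # traffic = 1 mandatory pass + re-streaming the non-resident part
    traffic = 1.0 + passes * (1.0 - resident_fraction)
    return base_intensity / traffic

print(effective_intensity(300, 1.00, 10))  # everything fits: 300.0
print(effective_intensity(300, 0.75, 10))  # 25% swapped 10x: ~85.7
```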
Disclaimer: I have no idea how machine learning algorithms work.
I work with problems so huge they do not fit on a single device. Multiple devices each own a small piece of the global problem. They solve a local problem and they must communicate over a network in order to solve the global problem. This is almost universally true.
You would know more than I in this field, and I expect it really is better to swap a partition than it is to use a network.
There are certain methods in HPC applications that are almost universally avoided because of how terribly they scale to a large distributed memory system. Matrix multiplies are one of them. Outside of a handful of ab initio computational chemistry algorithms (which are incredibly important), the only reason someone does a large dense matrix multiply on a supercomputer is usually that they're running a benchmark, not solving a real science problem.
Folks more knowledgeable than me here feel free to jump in.
You guys are talking past each other but really talking about the same thing - arithmetic intensity. You're talking about FEA or some other grid solver/discretized PDE/DFT type thing where the matmuls are small because the mesh is highly refined and you've assumed the potentials/fields/effects are hyper-local. But that's not an accident or dumb luck - the problems in scientific HPC are modeled using these kinds of potentials post hoc, i.e. so that they can be distributed across so many cores.
What I'm saying is, it's not like a global solver (i.e. taking into account all-to-all interactions) wouldn't be more accurate, right? It's just an insane proposition because, surprise surprise, that would require an enormous matmul during the update, which you can't do efficiently, even on a GPU, for the same reason the ML folks can't: the arithmetic intensity isn't high enough, so you incur I/O costs (memory or network, same thing at this scale).
> There are certain methods in HPC applications that are almost universally avoided because of how terribly they scale to a large distributed memory system. Matrix multiplies are one of them.
Neural networks, which are the basis for nearly all modern AI, are implemented as a mixture of sparse and dense matrix multiplies, depending on the neural architecture.
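To make that concrete: a fully connected layer is literally a matrix multiply plus a bias vector, y = Wx + b. A minimal framework-free sketch:

```python
# A dense (fully connected) layer is a matrix-vector multiply plus bias.
def dense_layer(W, x, b):
    # y_i = sum_j W[i][j] * x[j] + b[i]
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

W = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, 1.0]
b = [0.5, 0.5]
print(dense_layer(W, x, b))  # [3.5, 7.5]
```

In training and batched inference, x becomes a matrix of many inputs, which is exactly what turns this into the large dense matmuls discussed above.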
Thanks - that's exactly the sort of framework for thinking about this that I was looking for. I think in the realm of LLMs there are other components, but this is a piece of it.
The ideal ram capacity is determined by the biggest available model you want to run. So... ~48GB for llama 70B? Maybe more with batching and very long contexts.
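The ~48GB figure presumably assumes quantization, since at fp16 a 70B-parameter model is far bigger. A rough capacity sketch (weights only, ignoring KV cache and runtime overhead):

```python
# Rough VRAM needed just for the weights of a model.
def model_gb(params_billion, bits_per_weight):
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(model_gb(70, 16))  # fp16: 140.0 GB
print(model_gb(70, 4))   # 4-bit quantized: 35.0 GB (plus context overhead)
```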
RAM bandwidth is basically always going to be a bottleneck. Microsoft's proposal to get around this is to just keep everything in SRAM and pipe chips together.