prelude: I realize I typed out a ton of words, but in the end engineering is all about tradeoffs. So, fine: if there's a way I can teach some existing GPU, or some existing PCIe TPU, to access system RAM over an existing PCIe slot, that sounds like a fine step forward. I just don't have enough experience with that setup to know whether only certain video cards allow it or what
Bearing in mind the aforementioned "I'm not a hardware guy," my mental model of any system RAM access for GPUs is (rough code sketch after the list):
1. copy weights from SSD to RAM
2. trigger GPU with that RAM location
3. GPU copies weights over PCIe bus to do calculation
4. GPU copies activations over PCIe bus back to some place in RAM
5. goto 3
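In CUDA terms, and hedging hard since I'm going off documentation rather than experience, I believe steps 3-5 would look roughly like the sketch below. The layer loop, buffer sizes, and the run_layer kernel are placeholders I made up, not anything from a real inference engine:

    // Rough sketch of the "stage weights over PCIe each step" flow (steps 3-5).
    // Standard CUDA runtime API; run_layer is a stand-in for real layer math.
    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    __global__ void run_layer(const float *weights, const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * weights[i];   // placeholder "layer" math
    }

    int main(void) {
        const int n = 1 << 20;            // elements per layer (made up)
        const int num_layers = 8;         // made up
        size_t bytes = n * sizeof(float);

        // Steps 1-2: weights already loaded from SSD into plain system RAM.
        float *host_weights = (float *)malloc((size_t)num_layers * bytes);
        float *host_act     = (float *)malloc(bytes);
        for (size_t i = 0; i < (size_t)num_layers * n; i++) host_weights[i] = 1.0f;
        for (int i = 0; i < n; i++) host_act[i] = 1.0f;

        float *d_weights, *d_in, *d_out;
        cudaMalloc(&d_weights, bytes);
        cudaMalloc(&d_in, bytes);
        cudaMalloc(&d_out, bytes);
        cudaMemcpy(d_in, host_act, bytes, cudaMemcpyHostToDevice);

        for (int layer = 0; layer < num_layers; layer++) {
            // Step 3: pull this layer's weights across PCIe into VRAM.
            cudaMemcpy(d_weights, host_weights + (size_t)layer * n, bytes,
                       cudaMemcpyHostToDevice);
            run_layer<<<(n + 255) / 256, 256>>>(d_weights, d_in, d_out, n);
            // Step 5: "goto 3" -- reuse this output as the next layer's input.
            float *tmp = d_in; d_in = d_out; d_out = tmp;
        }

        // Step 4: copy the final activations back over PCIe into system RAM.
        cudaMemcpy(host_act, d_in, bytes, cudaMemcpyDeviceToHost);
        printf("first activation: %f\n", host_act[0]);

        cudaFree(d_weights); cudaFree(d_in); cudaFree(d_out);
        free(host_weights); free(host_act);
        return 0;
    }

The point being: every trip through that loop is another pass over the PCIe link before the GPU can even start computing.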
If my understanding is correct, that PCIe link (even at 16 lanes) is still shared with everything else on the motherboard that's also using PCIe, to say nothing of the protocol handshaking, since a shared bus needs contention management. I'd presume pulling such a stunt would, at bare minimum, have to contend with SSD traffic and the actual graphical part of the GPU's job[1][2]
Contrast this with memory socket(s) on the "GPU's mainboard," where it is, what, 3mm of trace wire away from ripping data back and forth between its RAM and its processors, and only goes out over PCIe to push results back to system RAM. It could also have its own PCIe link to talk to sibling GPGPU setups for multi-device inference[3]
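On the footnote-[3] point, my (secondhand) understanding is that CUDA already exposes the "talk to a sibling card" idea as peer-to-peer access; whether it actually works depends on the specific cards and how they're wired up, so treat this as a sketch of the API shape rather than a recipe:

    // Sketch of GPU-to-GPU ("sibling GPGPU") copies using CUDA peer access.
    // Needs two devices; whether true P2P is available depends on the hardware.
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void) {
        int ndev = 0;
        cudaGetDeviceCount(&ndev);
        if (ndev < 2) { printf("need two GPUs for this sketch\n"); return 0; }

        int can01 = 0;
        cudaDeviceCanAccessPeer(&can01, 0, 1);
        printf("peer access 0->1 supported: %d\n", can01);

        size_t bytes = (size_t)64 << 20;   // 64 MB test buffer (arbitrary)
        float *buf0, *buf1;
        cudaSetDevice(0); cudaMalloc(&buf0, bytes);
        cudaSetDevice(1); cudaMalloc(&buf1, bytes);

        if (can01) { cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, 0); }

        // Copy device 0's buffer to device 1. With P2P enabled this goes
        // card-to-card over PCIe (or NVLink); otherwise the driver bounces
        // it through system RAM behind the scenes.
        cudaSetDevice(0);
        cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
        cudaDeviceSynchronize();

        cudaFree(buf0);
        cudaSetDevice(1); cudaFree(buf1);
        return 0;
    }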
I would entertain people saying "but what a waste having 128GB of RAM only usable for GPGPU tasks" but if all these folks are right in claiming that it's the end of software engineering as we know it, I would guess it's not going to be that idle
1: I wish I had made a bigger deal out of wanting a GPGPU, since for this purpose I don't care at all whether it runs DirectX or Vulkan or whatever
2: furthermore, if "just use system RAM" were such a hot idea, I don't think it would be 2025 with graphics cards still shipping with only 8GB of RAM on them. I'm not considering the Apple architecture because they already solder the RAM and mark it up so much that normal people can't afford a sane system anyway, so I give no shits how awesome their unified architecture is
3: I also should have drawn more attention to the inference need, since AIUI things like the TPUs I have on my desk aren't (able to do|good at) training jobs, but that's where my expertise grinds to a halt because I have no idea why that is or how to fix it
Oh, it's not a good idea at all from a performance perspective to use system memory, because it's slow as heck. The important thing is that you can do it. Some way of allowing the GPU to page in data from system RAM (or even storage) on an as-needed basis has been supported by Nvidia since at least the Tesla generation.
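To make the "page it in as needed" flavor concrete, the modern incarnation is managed (unified) memory; a minimal sketch, with arbitrary sizes, assuming a reasonably recent card and driver:

    // Sketch of "the GPU pages data in from system RAM as needed" using CUDA
    // managed (unified) memory. On newer GPUs the allocation can even be larger
    // than VRAM; pages migrate on first touch. Sizes here are arbitrary.
    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void touch(const float *w, float *sum, size_t n) {
        // Read every element so the pages actually get faulted onto the GPU.
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) atomicAdd(sum, w[i]);
    }

    int main(void) {
        size_t n = ((size_t)512 << 20) / sizeof(float);   // ~512 MB of floats

        float *weights, *sum;
        // One allocation visible to both CPU and GPU; the driver migrates
        // pages between system RAM and VRAM as they are touched.
        cudaMallocManaged(&weights, n * sizeof(float));
        cudaMallocManaged(&sum, sizeof(float));

        for (size_t i = 0; i < n; i++) weights[i] = 0.001f;  // CPU touch: pages land in RAM
        *sum = 0.0f;

        // Optional hint: start streaming pages toward GPU 0 before the kernel runs.
        cudaMemPrefetchAsync(weights, n * sizeof(float), 0, 0);

        touch<<<(unsigned)((n + 255) / 256), 256>>>(weights, sum, n);
        cudaDeviceSynchronize();            // page faults/migrations happen here

        printf("sum = %f\n", *sum);         // pages migrate back for CPU access
        cudaFree(weights);
        cudaFree(sum);
        return 0;
    }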
There's actually a multitude of different ways now, each with its own performance tradeoffs: direct DMA from the Nvidia card, data copied via the CPU, GPUDirect Storage, and so on. You seem to understand the gist, though, so these are mainly implementation details. Sometimes one method has weird limitations, like being limited to Quadro cards or capped at a fixed percentage of system memory.
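For the "direct DMA from the card" flavor specifically, the old-school version is mapped (zero-copy) pinned memory, where a kernel dereferences a pointer that actually lives in system RAM and every access crosses PCIe. Roughly:

    // Sketch of zero-copy ("mapped pinned") memory: the kernel reads system RAM
    // directly over PCIe, with no explicit cudaMemcpy of the weights at all.
    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void scale(const float *w, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 2.0f * w[i];   // every read of w[] crosses the bus
    }

    int main(void) {
        const int n = 1 << 20;
        cudaSetDeviceFlags(cudaDeviceMapHost);   // allow mapping host memory

        // Pinned, mapped host allocation: lives in system RAM, visible to the GPU.
        float *h_w;
        cudaHostAlloc(&h_w, n * sizeof(float), cudaHostAllocMapped);
        for (int i = 0; i < n; i++) h_w[i] = 0.5f;

        // Device-side alias for the same system-RAM buffer.
        float *d_w;
        cudaHostGetDevicePointer(&d_w, h_w, 0);

        float *d_out;
        cudaMalloc(&d_out, n * sizeof(float));

        scale<<<(n + 255) / 256, 256>>>(d_w, d_out, n);
        cudaDeviceSynchronize();

        float first;
        cudaMemcpy(&first, d_out, sizeof(float), cudaMemcpyDeviceToHost);
        printf("first = %f\n", first);     // expect 1.0

        cudaFree(d_out);
        cudaFreeHost(h_w);
        return 0;
    }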
The short answer is that all of them suck to different degrees and you don't want to use them if you can avoid it. They're enabled by default on virtually all systems because they significantly simplify CUDA programming. DDR is much less suitable than GDDR for feeding a bandwidth-hungry monster like a GPU, PCIe introduces high latency and further constraints, and any CPU involvement is a further slowdown. This would also apply to socketed memory on a GPU though: Significantly slower and less bandwidth.
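To put very rough numbers on that (ballpark spec-sheet figures, not measurements; exact values depend on the platform):

    // Back-of-envelope: time to stream a 16 GB weight set once at rough peak
    // bandwidths. These are approximate spec numbers, not benchmarks.
    #include <stdio.h>

    int main(void) {
        const double gigabytes = 16.0;   // arbitrary model size
        const struct { const char *name; double gbps; } paths[] = {
            { "PCIe 4.0 x16 (host <-> GPU link)",      32.0 },
            { "dual-channel DDR5-6000 (system RAM)",   96.0 },
            { "GDDR6X-class VRAM (high-end card)",   1000.0 },
        };
        for (int i = 0; i < 3; i++)
            printf("%-40s ~%4.0f GB/s -> ~%.0f ms per pass\n",
                   paths[i].name, paths[i].gbps,
                   1000.0 * gigabytes / paths[i].gbps);
        return 0;
    }

So even in the best case, feeding the GPU from system RAM over PCIe sits an order of magnitude or two behind keeping the weights in VRAM.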
There are also some additional downsides to accessing system RAM that we don't need to get into, like sometimes losing the benefits of caching and paying full-cost memory accesses every time.
That's interesting, thanks for making me aware. I'll try to dig up some reading material, but in some sense this goes the opposite of how I want the world to work: Nvidia is already a supply-chain bottleneck, so saying "the solution to this supply-and-demand problem is more CUDA" doesn't get me where I want to go
> any CPU involvement is a further slowdown. This would also apply to socketed memory on a GPU though: Significantly slower and less bandwidth
I'm afraid what I'm about to say doubles down on my inexperience, but: I could have sworn that series of problems is exactly what DMA was designed to solve: peripherals do their own handshaking without requiring the CPU's involvement (aside from the "accounting" bits of marking regions as in-use). And thus, if a GPGPU comes already owning its own RAM, it most certainly doesn't need to ask the CPU to do jack squat to talk to that RAM, because there's no one else who could possibly be using it
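To check my own understanding of the DMA point: my reading of the CUDA docs is that pinned ("page-locked") host memory plus an async copy is more or less this in practice. The CPU only queues the transfer, and the card's copy engine DMAs the bytes out of system RAM while the CPU goes off and does something else. A sketch, with made-up sizes and busy-work:

    // Sketch of the DMA idea: the CPU just queues the transfer; the GPU's copy
    // engine moves the pinned buffer out of system RAM while the CPU keeps working.
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void) {
        size_t bytes = (size_t)256 << 20;   // 256 MB (arbitrary)

        // Pinned (page-locked) system RAM, so the DMA engine can address it
        // directly instead of the driver staging through a bounce buffer.
        float *h_buf;
        cudaHostAlloc(&h_buf, bytes, cudaHostAllocDefault);
        for (size_t i = 0; i < bytes / sizeof(float); i++) h_buf[i] = 1.0f;

        float *d_buf;
        cudaMalloc(&d_buf, bytes);

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Returns almost immediately: it only enqueues the copy. The actual
        // byte-moving is done by the card's DMA engine, not the CPU.
        cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);

        // Meanwhile the CPU is free to do unrelated "accounting" work.
        double busy = 0.0;
        for (int i = 0; i < 1000000; i++) busy += i * 1e-9;

        cudaStreamSynchronize(stream);      // wait for the DMA to finish
        printf("copy done; cpu busy-work result: %f\n", busy);

        cudaFree(d_buf);
        cudaFreeHost(h_buf);
        cudaStreamDestroy(stream);
        return 0;
    }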
I was looking for an example of things that carried their own RAM and found this, which strictly speaking is what I searched for but is mostly just funny so I hope others get a chuckle too: a SCSI ram disk <https://micha.freeshell.org/ramdisk/RAM_disk.jpg>
Sorry if that was confusing. I was trying to communicate a generality about multiple very different means of accessing the memory: the way we currently build GPUs is a local maximum for performance. Changing anything, even putting dedicated memory on sockets, has a dramatic negative impact on performance. On the latest board I worked on, the layout team worked overtime to place the memory practically on top of the chip, and they were still upset it couldn't be closer.
Also, other systems have similar technologies, I'm just mentioning Nvidia as an example.