
> Is there really no way to partition the workload to run with 16gb memory per card?

It depends on the model architecture you are using. Once a single instance of your model no longer fits on a single GPU, or at minimum on a single node, things get complicated quickly. If you are lucky and have a generic transformer model, you can just use DeepSpeed with its transformer kernel. But if you have another architecture, it will likely not be compatible with DeepSpeed, FairScale, or any of the other scaling frameworks, and you will end up having to write your own CUDA kernels.

So the RAM per GPU is quite important.
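
For the lucky case, here is a minimal sketch of what that looks like with DeepSpeed's ZeRO stage 3, which shards parameters, gradients, and optimizer state across GPUs so no single card has to hold the whole model. The model and all config values below are placeholders, not a drop-in recipe:

    import torch
    import deepspeed

    ds_config = {
        "train_micro_batch_size_per_gpu": 1,
        "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
        "zero_optimization": {
            "stage": 3,  # shard params, grads, and optimizer state across ranks
            "offload_param": {"device": "cpu"},  # optionally spill params to host RAM
        },
        "fp16": {"enabled": True},
    }

    # Stand-in for a real model that is too big for one 16GB card
    model = torch.nn.Transformer(d_model=1024, nhead=16)

    # After initialize(), each rank holds only its shard of the model state
    engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )

Launched with something like `deepspeed --num_gpus=4 train.py`, this is roughly how you spread a too-big model over several 16GB cards, provided the architecture plays nicely with the framework in the first place.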


