> 96GB of weights. You won't be able to run this on your home GPU. This seems li...

ttul · on Dec 8, 2023

Someone smarter will probably correct me, but I don’t think that is how MoE works. With MoE, a feed-forward network assesses the tokens and selects the best two of eight experts to generate the next token. The choice of experts can change with each new token. For example, let’s say you have two experts that are really good at answering physics questions. For some of the generation, those two will be selected. But later on, maybe the context suggests you need two models better suited to generate French language. This is a silly simplification of what I understand to be going on.

wongarsu · on Dec 8, 2023

One viable strategy might be to offload as many experts as possible to the GPU, and evaluate the other ones on the CPU. If you collect some statistics which experts are used most in your use cases and select those for GPU acceleration you might get some cheap but notable speedups over other approaches.

ttul · on Dec 8, 2023

This being said, presumably if you’re running a huge farm of GPUs, you could put each expert onto its own slice of GPUs and orchestrate data to flow between GPUs as needed. I have no idea how you’d do this…

alchemist1e9 · on Dec 8, 2023

Ideally those many GPUs could be on different hosts connected with a commodity interconnect like 10gbe.

If MOE models do well it could be great for commodity hw based distributed inference approaches.

Philpax · on Dec 8, 2023

Yes, that's more or less it - there's no guarantee that the chosen expert will still be used for the next token, so you'll need to have all of them on hand at any given moment.

read_if_gay_ · on Dec 8, 2023

however, if you need to swap experts on each token, you might as well run on cpu.

tarruda · on Dec 8, 2023

> Presumably, the same expert would frequently be selected for a number of tokens in a row

In other words, assuming you ask a coding question and there's a coding expert in the mix, it would answer it completely.

ttul · on Dec 8, 2023

See my poorly educated answer above. I don’t think that’s how MoE actually works. A new mixture of experts is chosen for every new context.

read_if_gay_ · on Dec 8, 2023

yes I read that. do you think it's reasonable to assume that the same expert will be selected so consistently that model swapping times won't dominate total runtime?

tarruda · on Dec 8, 2023

No idea TBH, we'll have to wait and see. Some say it might be possible to efficiently swap the expert weights if you can fit everything in RAM: https://x.com/brandnarb/status/1733163321036075368?s=20

numeri · on Dec 8, 2023

You're not necessarily wrong, but I'd imagine this is almost prohibitively slow. Also, this model seems to use two experts per token.

tarruda · on Dec 8, 2023

I will be super happy if this is true.

Even if you can't fit all of them in the VRAM, you could load everything in tmpfs, which at least removes disk I/O penalty.

cjbprime · on Dec 8, 2023

Just mentioning in case it helps anyone out: Linux already has a disk buffer cache. If you have available RAM, it will hold on to pages that have been read from disk until there is enough memory pressure to remove them (and then it will only remove some of them, not all of them). If you don't have available RAM, then the tmpfs wouldn't work. The tmpfs is helpful if you know better than the paging subsystem about how much you really want this data to always stay in RAM no matter what, but that is also much less flexible, because sometimes you need to burst in RAM usage.