Most people are running these at 4 bits per parameter for speed and RAM reasons. That means the model would take just about all of the RAM. But instead of swap (writing data to disk and then reading it again later), I would expect a good implementation to only run into cache eviction (deleting data from RAM and then reading it back from disk later), which should be a lot faster and cause less wear and tear on SSDs.
Training uses gradient descent, so you want to have good precision during that process. But once you have the overall structure of the network, https://arxiv.org/abs/2210.17323 (GPTQ) showed that you can cut down the precision quite a bit without losing a lot of accuracy. It seems you can cut down further for larger models. For the 13B Llama-based ones, going below 5 bits per parameter is noticeably worse, but for 30B models you can do 4 bits.
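GPTQ itself solves a per-layer reconstruction problem against calibration data, but the mechanics of "n bits per parameter" can be shown with a toy round-to-nearest sketch (plain NumPy, illustrative only, not the paper's algorithm):

```python
import numpy as np

def quantize_rtn(w, bits=4):
    # Map each weight to one of 2**bits evenly spaced levels
    # between the array's min and max (round-to-nearest).
    levels = 2 ** bits
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / (levels - 1)
    q = np.round((w - lo) / scale).astype(np.uint8)  # integer codes
    return q, scale, lo

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)
q, scale, lo = quantize_rtn(w, bits=4)
w_hat = q * scale + lo                 # dequantize back to floats

print(q.max())                         # codes fit in [0, 15]: 16 levels
print(np.abs(w - w_hat).max() <= scale / 2 + 1e-6)  # error bounded by half a step
```

At 4 bits each code needs only half a byte instead of the 4 bytes of a float32, which is where the 8x memory saving comes from; real quantizers like GPTQ choose the codes more carefully than this, which is why they lose less accuracy than naive rounding.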
The same group did another paper https://arxiv.org/abs/2301.00774 which shows that in addition to reducing the precision of each parameter, you can also prune out a bunch of parameters entirely. It's harder to apply this optimization because models are usually loaded into RAM densely, but I hope someone figures out how to do it for popular models.
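The "loaded densely" problem is easy to see in a sketch: zeroing out weights saves nothing until you switch to a sparse layout, and a naive index+value layout only wins at high sparsity. Toy NumPy illustration (my own, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=10_000).astype(np.float32)

# Prune: zero out the 60% of weights with the smallest magnitude.
threshold = np.quantile(np.abs(w), 0.6)
pruned = np.where(np.abs(w) >= threshold, w, 0.0).astype(np.float32)

# Dense storage is the same size whether or not values are zero...
print(pruned.nbytes == w.nbytes)       # True: zeros still occupy 4 bytes each

# ...so you need a sparse layout (indices + surviving values) to benefit.
idx = np.nonzero(pruned)[0].astype(np.int32)
vals = pruned[idx]
print(idx.nbytes + vals.nbytes < pruned.nbytes)  # True at 60% sparsity
```

With 4-byte indices the sparse form breaks even around 50% sparsity, which is part of why pruning is harder to cash in than quantization.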
I wonder if specialization of the LLM is another way to reduce the RAM requirements. For example, if you can tell which nodes are touched across billions of web searches on a topic, then you can delete the ones that are never touched.
Some people are having some success improving token rates and clawing back VRAM using a group-size flag of 0, but YMMV; I haven't tested this yet (they were discussing GPTQ, btw).
The resources required are directly related to the memory devoted to each weight. If the weights are stored as 32-bit floating-point numbers, then each weight takes 32 bits, which adds up when we are talking about billions of weights. But if the weights are first converted to 16-bit floating-point numbers (precise to fewer decimal places), then fewer resources are needed to store and compute them. Research has shown that simply chopping off some of the precision of the weights still yields good AI performance in many cases.
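The halving is literal. A quick NumPy sketch with made-up weights (any real model would be far larger, but the ratio is the same):

```python
import numpy as np

# A million fake "weights" in float32, then chopped down to float16.
weights32 = np.random.default_rng(1).normal(size=1_000_000).astype(np.float32)
weights16 = weights32.astype(np.float16)

print(weights32.nbytes)  # 4000000 bytes
print(weights16.nbytes)  # 2000000 bytes: exactly half the memory

# The values only move slightly in the conversion:
print(np.abs(weights32 - weights16.astype(np.float32)).max())
```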
Note too that the number formats are standardized, e.g. floats are defined by the IEEE 754 standard. These formats have specialized hardware to do math with them, so when considering which number format to use, it's difficult to go outside of the established ones (float32, float16, int8).
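The layout IEEE 754 defines is easy to inspect yourself; here's a quick look at a float32's bits using Python's struct module:

```python
import struct

# IEEE 754 binary32: 1 sign bit, 8 exponent bits, 23 fraction bits.
# 1.0 encodes as sign=0, exponent=127 (biased), fraction=0.
bits = struct.unpack(">I", struct.pack(">f", 1.0))[0]
print(f"{bits:032b}")   # 00111111100000000000000000000000
print(hex(bits))        # 0x3f800000
```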
You’ll notice that a lot of languages allow you more control when dealing with number representations, such as C/C++, NumPy in Python, etc.
Ex: Since C and C++ number sizes depend on processor architecture, C++ has types like int16_t and int32_t to enforce a size regardless of architecture. Python's built-in integers are arbitrary-precision, but NumPy has np.int16 and np.int32. Java's sizes are fixed by the language, with short for 16-bit and int for 32-bit integers.
It just happens that some higher-level languages hide this detail from the programmer and standardize on one default size for integers.
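The NumPy side of this is easy to check directly: dtypes pin an exact width the way int16_t/int32_t do in C++, while plain Python ints just grow as needed. A small sketch:

```python
import numpy as np

# NumPy dtypes have a fixed width, like int16_t / int32_t in C++:
print(np.dtype(np.int16).itemsize)  # 2 bytes
print(np.dtype(np.int32).itemsize)  # 4 bytes

# Plain Python ints are arbitrary-precision, so they never overflow:
print((2 ** 100).bit_length())      # 101 bits, no fixed width
```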
FP16 and int8 refer to how many bits are used for floating-point and integer numbers; FP16 is 16-bit floating point. The more bits, the better the precision, but the more RAM it takes. Programmers normally use 32- or 64-bit floats, so 16-bit floats have significantly reduced precision, but they take up half the space of fp32, which is the smallest floating-point format natively supported on most CPUs. Similarly, 8-bit integers have only 256 total possibilities and go from -128 to 127.
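Both the int8 range and the fp16/fp32 precision gap can be read straight out of NumPy:

```python
import numpy as np

# int8: 2**8 = 256 values, from -128 to 127.
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)  # -128 127

# Machine epsilon (smallest relative step) shows the precision gap:
print(np.finfo(np.float16).eps)  # ~0.000977, about 3 decimal digits
print(np.finfo(np.float32).eps)  # ~1.19e-07, about 7 decimal digits
```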