Ollama is very neat. Given how compressible the models are, is there any work being done on using them in some kind of compressed format, other than reducing the word size?
There are different levels of quantization available for different models (if that's what you mean :). E.g. here are the versions available for Llama 2: https://ollama.ai/library/llama2/tags, which go down to 2-bit quantization (which, surprisingly, still works reasonably well).
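To make the bit-depth reduction concrete, here's a rough numpy sketch of plain per-tensor uniform quantization (the k-quant formats Ollama actually ships use grouped scales and are more elaborate, so treat this purely as an illustration):

    import numpy as np

    def quantize(w, bits=2):
        # Keep the tensor's shape, but store each weight as a small integer
        # plus a single scale factor.
        levels = 2 ** (bits - 1) - 1          # e.g. 1 for 2-bit, 7 for 4-bit
        scale = np.abs(w).max() / levels      # per-tensor scale (real schemes use per-group scales)
        q = np.clip(np.round(w / scale), -levels - 1, levels).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(4096, 4096).astype(np.float32)
    q, s = quantize(w, bits=2)
    print(np.abs(w - dequantize(q, s)).mean())   # mean quantization error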
No, what I mean is that there seems to be quite a bit of sparseness in the weight matrices, and I was wondering whether that could somehow be used to shrink the model further. Quantization is a different kind of effect: it leaves the shape of the various elements as they are but reduces their bit depth.
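To illustrate what I mean, a minimal sketch of the sparse route, assuming near-zero weights can be pruned and the rest stored in CSR form (whether real LLM weights are sparse enough for this to pay off is exactly the open question):

    import numpy as np
    from scipy import sparse

    w = np.random.randn(4096, 4096).astype(np.float32)
    w[np.abs(w) < 1.5] = 0.0                  # pretend ~87% of the weights are prunable

    dense_bytes = w.nbytes
    csr = sparse.csr_matrix(w)                # only stores the non-zeros plus their indices
    sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
    print(dense_bytes, sparse_bytes)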
Ah, gotcha! I thought you probably meant something else. I've been wondering this too, and it's something I've been meaning to look at.
On a related note, it doesn't seem like many local runners are leveraging techniques like PagedAttention yet (see https://vllm.ai/), which takes inspiration from operating-system memory paging to reduce the memory requirements of serving LLMs.
It's not quite what you mentioned, but it might have a similar effect! Would love to know if you've seen other methods that might help reduce memory requirements... it's one of the biggest resource bottlenecks to running LLMs right now!
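The core idea is borrowed from virtual memory: split the KV cache into fixed-size blocks and keep a per-sequence block table, so memory is allocated as tokens arrive instead of reserving the maximum sequence length up front. A toy Python sketch of the bookkeeping (vLLM does this on the GPU with custom kernels; the names here are made up for illustration):

    BLOCK_SIZE = 16

    class PagedKVCache:
        def __init__(self):
            self.blocks = []            # physical blocks, each a list of (k, v) pairs
            self.block_tables = {}      # seq_id -> list of physical block indices

        def append(self, seq_id, k, v):
            table = self.block_tables.setdefault(seq_id, [])
            if not table or len(self.blocks[table[-1]]) == BLOCK_SIZE:
                self.blocks.append([])              # allocate a new block only when needed
                table.append(len(self.blocks) - 1)
            self.blocks[table[-1]].append((k, v))

        def kv_for(self, seq_id):
            # Gather a sequence's KV pairs by walking its block table.
            return [kv for b in self.block_tables[seq_id] for kv in self.blocks[b]]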
That's a clever one; I had not seen that yet, thank you.
The hint for me is that the models compress so well; that suggests the information content is much lower than the size of the uncompressed model indicates, which is a good reason to investigate which parts of the model are so compressible and why. I haven't looked at the raw data of these models, but maybe I'll give it a shot. Sometimes you can learn a lot about the structure (built in or emergent) of data just by staring at the dumps.
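Something like this is probably where I'd start: dump a layer to raw bytes and see what a generic compressor makes of it (random stand-in tensor below; the real test would use a layer pulled from the actual model file):

    import zlib
    import numpy as np

    # A high compression ratio on real weights would hint at structure
    # (sparsity, repeated values, low entropy) worth exploiting directly.
    w = np.random.randn(4096, 4096).astype(np.float16)   # stand-in for a real layer
    raw = w.tobytes()
    compressed = zlib.compress(raw, level=9)
    print(len(raw), len(compressed), len(compressed) / len(raw))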
That's quite interesting. I hadn't thought of sparsity in the weights as a way to compress models, although this is an obvious opportunity in retrospect! I started doing some digging and found https://github.com/SqueezeAILab/SqueezeLLM, although I'm sure there's newer work on this idea.
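From skimming it, the part that seems most relevant here is the dense-and-sparse decomposition: pull the few large-magnitude outlier weights into a sparse matrix kept at full precision and quantize the dense remainder. A rough sketch of that idea (hand-rolled, not SqueezeLLM's actual code, and their method also layers sensitivity-based non-uniform quantization on top):

    import numpy as np
    from scipy import sparse

    def dense_and_sparse(w, outlier_frac=0.005, bits=3):
        # Keep the top ~0.5% largest-magnitude weights exactly, in sparse form...
        thresh = np.quantile(np.abs(w), 1 - outlier_frac)
        outliers = np.where(np.abs(w) >= thresh, w, 0.0)
        remainder = w - outliers

        # ...and uniformly quantize everything else down to a few bits.
        levels = 2 ** (bits - 1) - 1
        scale = np.abs(remainder).max() / levels
        q = np.clip(np.round(remainder / scale), -levels - 1, levels).astype(np.int8)
        return sparse.csr_matrix(outliers), q, scale

    w = np.random.randn(1024, 1024).astype(np.float32)
    S, q, scale = dense_and_sparse(w)
    approx = S.toarray() + q.astype(np.float32) * scale
    print(np.abs(w - approx).mean())                      # reconstruction error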