Ollama is very neat. Given how compressible the models are, is there any work being done on using them in some kind of compressed format, other than reducing the word size?
There are different levels of quantization available for different models (if that's what you mean :). E.g. here are the versions available for Llama 2: https://ollama.ai/library/llama2/tags, which go down to 2-bit quantization (which, surprisingly, still works reasonably well).
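To make the bit-depth reduction concrete, here's a rough numpy sketch of plain per-tensor uniform quantization (the k-quant formats Ollama actually ships use grouped scales and are more elaborate, so treat this purely as an illustration):

    import numpy as np

    def quantize(w, bits=2):
        # Keep the tensor's shape, but store each weight as a small integer
        # plus a single scale factor.
        levels = 2 ** (bits - 1) - 1          # e.g. 1 for 2-bit, 7 for 4-bit
        scale = np.abs(w).max() / levels      # per-tensor scale (real schemes use per-group scales)
        q = np.clip(np.round(w / scale), -levels - 1, levels).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(4096, 4096).astype(np.float32)
    q, s = quantize(w, bits=2)
    print(np.abs(w - dequantize(q, s)).mean())   # mean quantization error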
No, what I mean is that there seems to be quite a bit of sparseness in the weight matrices, and I was wondering whether that could somehow be used to shrink the model further. Quantization is a different kind of effect: it leaves the shape of the various elements as they are but reduces their bit depth.
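To illustrate what I mean, a minimal sketch of the sparse route, assuming near-zero weights can be pruned and the rest stored in CSR form (whether real LLM weights are sparse enough for this to pay off is exactly the open question):

    import numpy as np
    from scipy import sparse

    w = np.random.randn(4096, 4096).astype(np.float32)
    w[np.abs(w) < 1.5] = 0.0                  # pretend ~87% of the weights are prunable

    dense_bytes = w.nbytes
    csr = sparse.csr_matrix(w)                # only stores the non-zeros plus their indices
    sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
    print(dense_bytes, sparse_bytes)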
Ah, gotcha! I thought you probably meant something else. I've been wondering this too, and it's something I've been meaning to look at.
On a related note, it doesn't seem like many local runners are leveraging techniques like PagedAttention yet (see https://vllm.ai/), which takes inspiration from operating-system memory paging to reduce the memory requirements of serving LLMs.
It's not quite what you mentioned, but it might have a similar effect! Would love to know if you've seen other methods that might help reduce memory requirements... it's one of the biggest resource bottlenecks to running LLMs right now!
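The core idea is borrowed from virtual memory: split the KV cache into fixed-size blocks and keep a per-sequence block table, so memory is allocated as tokens arrive instead of reserving the maximum sequence length up front. A toy Python sketch of the bookkeeping (vLLM does this on the GPU with custom kernels; the names here are made up for illustration):

    BLOCK_SIZE = 16

    class PagedKVCache:
        def __init__(self):
            self.blocks = []            # physical blocks, each a list of (k, v) pairs
            self.block_tables = {}      # seq_id -> list of physical block indices

        def append(self, seq_id, k, v):
            table = self.block_tables.setdefault(seq_id, [])
            if not table or len(self.blocks[table[-1]]) == BLOCK_SIZE:
                self.blocks.append([])              # allocate a new block only when needed
                table.append(len(self.blocks) - 1)
            self.blocks[table[-1]].append((k, v))

        def kv_for(self, seq_id):
            # Gather a sequence's KV pairs by walking its block table.
            return [kv for b in self.block_tables[seq_id] for kv in self.blocks[b]]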
That's a clever one; I had not seen that yet, thank you.
The hint for me is that the models compress so well; that suggests the information content is much lower than the size of the uncompressed model indicates, which is a good reason to investigate which parts of the model are so compressible and why. I haven't looked at the raw data of these models, but maybe I'll give it a shot. Sometimes you can learn a lot about the structure (built in or emergent) of data just by staring at the dumps.
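Something like this is probably where I'd start: dump a layer to raw bytes and see what a generic compressor makes of it (random stand-in tensor below; the real test would use a layer pulled from the actual model file):

    import zlib
    import numpy as np

    # A high compression ratio on real weights would hint at structure
    # (sparsity, repeated values, low entropy) worth exploiting directly.
    w = np.random.randn(4096, 4096).astype(np.float16)   # stand-in for a real layer
    raw = w.tobytes()
    compressed = zlib.compress(raw, level=9)
    print(len(raw), len(compressed), len(compressed) / len(raw))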
That's quite interesting. I hadn't thought of sparsity in the weights as a way to compress models, although this is an obvious opportunity in retrospect! I started doing some digging and found https://github.com/SqueezeAILab/SqueezeLLM, although I'm sure there's newer work on this idea.
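From skimming it, the part that seems most relevant here is the dense-and-sparse decomposition: pull the few large-magnitude outlier weights into a sparse matrix kept at full precision and quantize the dense remainder. A rough sketch of that idea (hand-rolled, not SqueezeLLM's actual code, and their method also layers sensitivity-based non-uniform quantization on top):

    import numpy as np
    from scipy import sparse

    def dense_and_sparse(w, outlier_frac=0.005, bits=3):
        # Keep the top ~0.5% largest-magnitude weights exactly, in sparse form...
        thresh = np.quantile(np.abs(w), 1 - outlier_frac)
        outliers = np.where(np.abs(w) >= thresh, w, 0.0)
        remainder = w - outliers

        # ...and uniformly quantize everything else down to a few bits.
        levels = 2 ** (bits - 1) - 1
        scale = np.abs(remainder).max() / levels
        q = np.clip(np.round(remainder / scale), -levels - 1, levels).astype(np.int8)
        return sparse.csr_matrix(outliers), q, scale

    w = np.random.randn(1024, 1024).astype(np.float32)
    S, q, scale = dense_and_sparse(w)
    approx = S.toarray() + q.astype(np.float32) * scale
    print(np.abs(w - approx).mean())                      # reconstruction error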