As a complete outsider: has ML research just become a phallus measuring contest to see who can stuff the most parameters into a model? In other words, who can acquire the most Nvidia cards? The model size seems to always be the headline in stuff I see on HN.
+1, also this is a teacher model. The implications are huge here, as AWS will likely spin this into an offering like they did with their other AI products. Building a model downstream of GPT-3 is difficult and usually yields suboptimal results; however, 20B is small enough that it would be easy to fine-tune it on a smaller dataset for a specific task.
You could then distill that fine-tuned model and end up with something that's a fraction of the size (6B parameters, for example, just under 1/3, would fit on consumer GPUs like 3090s). There are some interesting examples of this with smaller models like BERT/BART or PEGASUS in Hugging Face Transformers' seq2seq distillation examples.
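To make the distill step concrete, here's a rough sketch of logit distillation with Transformers/PyTorch. The checkpoints and hyperparameters are placeholders I picked for illustration (small GPT-Neo models so it actually runs on one GPU), not anything from this release; in practice the teacher would be the fine-tuned 20B model sharded across devices and the student a ~6B model.

```python
# Minimal knowledge-distillation sketch: train a small "student" LM to match the
# softened token distributions of a larger "teacher" LM, plus a plain LM loss.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "EleutherAI/gpt-neo-2.7B"  # placeholder for the 20B teacher
student_name = "EleutherAI/gpt-neo-125M"  # placeholder for the smaller student

tokenizer = AutoTokenizer.from_pretrained(teacher_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-style tokenizers have no pad token

teacher = AutoModelForCausalLM.from_pretrained(teacher_name).eval()
student = AutoModelForCausalLM.from_pretrained(student_name).train()

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)
temperature = 2.0  # soften distributions so the student sees more than the argmax
alpha = 0.5        # weight between distillation loss and ordinary LM loss

def distill_step(texts):
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    labels = batch["input_ids"].clone()
    labels[batch["attention_mask"] == 0] = -100  # ignore padding in the LM loss

    with torch.no_grad():
        teacher_logits = teacher(**batch).logits  # teacher is frozen

    out = student(**batch, labels=labels)
    student_logits = out.logits

    # KL divergence between softened teacher and student token distributions
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    loss = alpha * kl + (1 - alpha) * out.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

print(distill_step(["Distillation shrinks a fine-tuned teacher into a student."]))
```

Both checkpoints share the same tokenizer/vocab, which is what lets you compare logits position-by-position; if the student used a different vocab you'd have to distill on hidden states or generated text instead.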
As others pointed out, this paper tries to do more with fewer params.
But you've identified a trend that does actually describe large language models for the past few years (they've been getting bigger, and bigger has been better). Like microprocessors with their famous tick/tock cycle (https://en.wikipedia.org/wiki/Tick%E2%80%93tock_model), I think models might be seeing something similar emerge naturally (make models bigger --> make models better (shrink) --> make models bigger again --> make models better (shrink again)).
Also, most of this LLM stuff is probably not trained on Nvidia hardware -- at scale it's probably cost prohibitive, if not also hard to set up. Google's TPUs, MSFT/Amazon's equivalent custom hardware, or other specialized accelerators are more economical overall.