As a complete outsider: has ML research just become a phallus measuring contest to see who can stuff the most parameters into a model? In other words, who can acquire the most Nvidia cards? The model size seems to always be the headline in stuff I see on HN.
+1, also this is a teacher model. The implications are huge here, as AWS will likely spin this into an offering like they did with their other AI products. Building a model downstream of GPT-3 is difficult and usually yields suboptimal results; however, 20B is small enough that it would be easy to fine-tune it on a smaller dataset for a specific task.
You could then distill that fine-tuned model and end up with something that's a fraction of the size (6B parameters, for example, just under 1/3, would fit on consumer GPUs like 3090s). There are some interesting examples of this with smaller models like BERT/BART or PEGASUS in Hugging Face Transformers' seq2seq distillation examples.
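To make the distill step concrete, here's a rough sketch of logit distillation with Transformers/PyTorch. The checkpoints and hyperparameters are placeholders I picked for illustration (small GPT-Neo models so it actually runs on one GPU), not anything from this release; in practice the teacher would be the fine-tuned 20B model sharded across devices and the student a ~6B model.

```python
# Minimal knowledge-distillation sketch: train a small "student" LM to match the
# softened token distributions of a larger "teacher" LM, plus a plain LM loss.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "EleutherAI/gpt-neo-2.7B"  # placeholder for the 20B teacher
student_name = "EleutherAI/gpt-neo-125M"  # placeholder for the smaller student

tokenizer = AutoTokenizer.from_pretrained(teacher_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-style tokenizers have no pad token

teacher = AutoModelForCausalLM.from_pretrained(teacher_name).eval()
student = AutoModelForCausalLM.from_pretrained(student_name).train()

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)
temperature = 2.0  # soften distributions so the student sees more than the argmax
alpha = 0.5        # weight between distillation loss and ordinary LM loss

def distill_step(texts):
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    labels = batch["input_ids"].clone()
    labels[batch["attention_mask"] == 0] = -100  # ignore padding in the LM loss

    with torch.no_grad():
        teacher_logits = teacher(**batch).logits  # teacher is frozen

    out = student(**batch, labels=labels)
    student_logits = out.logits

    # KL divergence between softened teacher and student token distributions
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    loss = alpha * kl + (1 - alpha) * out.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

print(distill_step(["Distillation shrinks a fine-tuned teacher into a student."]))
```

Both checkpoints share the same tokenizer/vocab, which is what lets you compare logits position-by-position; if the student used a different vocab you'd have to distill on hidden states or generated text instead.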
As others pointed out, this paper tries to do more with fewer params.
But you've identified a trend that does actually describe large language models for the past few years (they've been getting bigger, and bigger has been better). Like microprocessors with their famous tick/tock cycle (https://en.wikipedia.org/wiki/Tick%E2%80%93tock_model), I think models might be seeing something similar emerge naturally (make models bigger --> make models better (shrink) --> make models bigger again --> make models better (shrink again)).
Also, most of this LLM stuff is probably not trained on Nvidia hardware -- at scale it's probably cost prohibitive, if not also hard to set up. Google's TPUs, MSFT/Amazon's equivalent custom hardware, or other specialized accelerators are more economical overall.