
As others pointed out, this paper tries to do more with fewer params.

But you've identified a trend that does describe large language models over the past few years (they've been getting bigger, and bigger has been better). Like the famous tick/tock cycle in microprocessors (https://en.wikipedia.org/wiki/Tick%E2%80%93tock_model), I think models might be seeing something similar emerge naturally (make models bigger --> make models better (shrink) --> make models bigger again --> make models better (shrink again)).

Also, most of this LLM stuff is probably not trained on Nvidia hardware -- at that scale it's likely cost prohibitive, if not also hard to set up. Google's TPUs, MSFT/Amazon's equivalent custom hardware, or other specialized accelerators are more economical overall.
