This blows my mind. How is it even possible to validate a model with 20B parameters? How do you even test something this complex and non-deterministic?
I assume some kind of infallible automated tooling is used to write tests that validate this monster. I would LOVE to see what that tooling looks like.
It _is_ deterministic (same input gives same output).
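To illustrate the point, here's a minimal (purely hypothetical) sketch: with greedy decoding (argmax at each step) and no sampling or dropout, the same input always produces the same output. The toy "model" below is just a fixed weight matrix standing in for a real forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))  # fixed stand-in for model parameters

def greedy_decode(token_ids, steps=5):
    out = list(token_ids)
    for _ in range(steps):
        hidden = np.zeros(16)
        hidden[out[-1] % 16] = 1.0       # trivial "embedding" of the last token
        logits = W @ hidden              # deterministic forward pass
        out.append(int(np.argmax(logits)))  # greedy: argmax, no sampling
    return out

assert greedy_decode([3, 7]) == greedy_decode([3, 7])  # same input -> same output
```

Randomness only enters if you ask for it (temperature/top-k sampling at inference, dropout during training).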
You typically don't "test" pairs of inputs/outputs for a model. Instead, you measure its performance by defining metrics, e.g. "what's the ROUGE-2 score on summarization after fine-tuning AlexaTM 20B with N examples from dataset Y?"
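As a rough sketch of what that kind of metric-based evaluation looks like (not Amazon's actual harness, and real evaluations would use a proper library such as rouge_score), you score each model summary against a reference and report an aggregate number. The eval pairs below are made up.

```python
from collections import Counter

def bigrams(text):
    tokens = text.lower().split()
    return [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

def rouge2_recall(reference, prediction):
    """Simplified ROUGE-2 recall: clipped bigram overlap / reference bigrams."""
    ref, pred = Counter(bigrams(reference)), Counter(bigrams(prediction))
    if not ref:
        return 0.0
    overlap = sum(min(count, pred[bg]) for bg, count in ref.items())
    return overlap / sum(ref.values())

# Hypothetical eval set: (reference summary, model summary) pairs.
eval_pairs = [
    ("the cat sat on the mat", "a cat sat on the mat"),
    ("stocks fell sharply on friday", "stocks dropped on friday"),
]
scores = [rouge2_recall(ref, pred) for ref, pred in eval_pairs]
print(f"mean ROUGE-2 recall: {sum(scores) / len(scores):.3f}")
```

You then compare that aggregate score against baselines or previous checkpoints rather than asserting any single input/output pair.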
You can test some aspects of ML models, like sync/parity testing (if you train on hardware A and run inference on hardware B, the results are not always bit-identical). But generally you test the code that embeds the model, not the model itself.
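A hedged sketch of that kind of parity test: run the same input through the model on two backends and check the outputs agree within a tolerance, since floating-point results can drift slightly across hardware. The run_model() helper and backend names here are hypothetical placeholders for a real forward pass.

```python
import numpy as np

def run_model(input_ids, backend):
    """Placeholder for a real forward pass on a given device/backend."""
    rng = np.random.default_rng(hash(tuple(input_ids)) % 2**32)
    logits = rng.normal(size=(len(input_ids), 32000))
    # Simulate tiny numeric drift on a different backend.
    return logits + (1e-6 if backend == "gpu" else 0.0)

input_ids = [101, 2023, 2003, 1037, 3231, 102]
cpu_logits = run_model(input_ids, backend="cpu")
gpu_logits = run_model(input_ids, backend="gpu")

# Exact equality would be too strict; assert closeness within a tolerance instead.
assert np.allclose(cpu_logits, gpu_logits, atol=1e-4), "backend outputs diverged"
```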
How do you define "validate"? These models aren't formally proven to work in all cases or anything. They're just tested on a load of data, and if it turns out they work pretty well, they get released.