Not sure how much scaling laws apply here, since this is a seq-to-seq model rather than an autoregressive causal model. It's interesting to see AlexaTM performing better than GPT-3 on SuperGLUE and SQuADv2, but it fails on chain-of-thought prompting, which is a bummer. So, is that because it's a different architecture, or because it's genuinely benefiting from the multilingual tokens? I wish they had compared this architecture head-to-head with a classic GPT-family model.
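For what it's worth, here's a rough sketch of the kind of side-by-side probe I'd want to see, using Hugging Face transformers with `google/flan-t5-base` and `gpt2` as small, publicly available stand-ins (I'm not assuming AlexaTM 20B or GPT-3 are on the Hub); the only point is that the same few-shot chain-of-thought prompt can be fed to an encoder-decoder model and a decoder-only model and the continuations compared directly:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoModelForCausalLM, AutoTokenizer

# One-shot chain-of-thought style prompt, shared by both models.
prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.\n"
    "Q: The cafeteria had 23 apples. They used 20 and bought 6 more. How many apples do they have?\n"
    "A:"
)

def complete(model, tokenizer, text, max_new_tokens=64):
    # Greedy decoding is enough for a quick qualitative comparison.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Encoder-decoder (seq-to-seq) stand-in for AlexaTM-style models.
s2s_tok = AutoTokenizer.from_pretrained("google/flan-t5-base")
s2s_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
print("seq2seq:", complete(s2s_model, s2s_tok, prompt))

# Decoder-only causal stand-in for GPT-style models.
lm_tok = AutoTokenizer.from_pretrained("gpt2")
lm_model = AutoModelForCausalLM.from_pretrained("gpt2")
print("causal :", complete(lm_model, lm_tok, prompt))
```

Obviously these tiny stand-ins won't reproduce the paper's numbers; the sketch is just the shape of the controlled comparison I'd like to see, with everything held fixed except the architecture.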