
Not sure how much scaling laws apply here, since this is a seq-to-seq model rather than an autoregressive causal model. It's interesting to see AlexaTM outperform GPT-3 on SuperGLUE and SQuADv2, but it fails at chain-of-thought prompting, which is a bummer. Is that because of the different architecture, or because it's positively leveraging multilingual tokens? I wish they had compared this architecture head-to-head with a classic GPT-family model.
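
For anyone unfamiliar with the seq-to-seq vs. causal distinction, here's a minimal sketch using Hugging Face transformers. t5-small and gpt2 are just stand-ins for illustration, not the actual AlexaTM or GPT-3 checkpoints:

    from transformers import (
        AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForCausalLM
    )

    # Seq-to-seq (encoder-decoder), like AlexaTM: the encoder reads the
    # full input bidirectionally; the decoder generates the output.
    t5_tok = AutoTokenizer.from_pretrained("t5-small")
    t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
    ids = t5_tok("translate English to German: Hello.", return_tensors="pt")
    print(t5_tok.decode(t5.generate(**ids)[0], skip_special_tokens=True))

    # Autoregressive causal LM (decoder-only), like GPT-3: one stack with
    # left-to-right attention; prompt and continuation share one context.
    gpt_tok = AutoTokenizer.from_pretrained("gpt2")
    gpt = AutoModelForCausalLM.from_pretrained("gpt2")
    ids = gpt_tok("Hello, my name is", return_tensors="pt")
    print(gpt_tok.decode(gpt.generate(**ids, max_new_tokens=10)[0]))

The practical upshot is that the two families are prompted and scaled differently, which is part of why a direct comparison would be informative.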

