
Agree with this take, though in an even broader way; they're optimizing for the leaderboards and benchmarks in general. Longer outputs lead to better scores on those. Even in this thread I see a lot of comments bring them up, so it works for marketing.

My take is that the leaderboards and benchmarks are still very flawed if you're using LLMs for any non-chat purpose. In the product I'm building, I have to use all of the big 4 models (GPT, Claude, Llama, Gemini), because for each of them there is at least one task that it performs much better than the other 3.


