Agree with this take, though in an even broader way: they're optimizing for leaderboards and benchmarks in general. Longer outputs lead to better scores on those. Even in this thread I see a lot of comments bringing them up, so it works for marketing.
My take is that the leaderboards and benchmarks are still very flawed if you're using LLMs for any non-chat purpose. In the product I'm building, I have to use all of the big 4 models (GPT, Claude, Llama, Gemini), because for each of them there is at least one task it performs much better on than the other 3.