
This is more or less a funnel to their Agentic Benchmark Checklist: https://arxiv.org/abs/2507.02825


Finally, a benchmark for benchmarks. And what's great is that they already benchmarked their benchmark benchmark.

(Apologies for the benchmark snark. I'm glad people are doing this research, thanks for sharing it.)



