Hacker News

The weirdness of LLMs is that they're so damn good at so many things, but then you see glaring gaps that instantly make them seem dumb. We desperately need benchmarks and evals that test these kinds of hard-to-pin-down cognitive abilities.


Absolutely. This is not a new observation, but another thing they struggle with is self-reporting calibrated confidence. When I've asked LLMs to classify/tag things along with a confidence score, the number seems essentially random, with no connection to the quality or difficulty of the classification.
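If you want to quantify that mismatch rather than eyeball it, one common approach is a binned reliability check (expected calibration error). A minimal sketch, with entirely hypothetical confidence/correctness data standing in for a real tagging run:

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """ECE: bin predictions by reported confidence, then take the
    size-weighted average of |accuracy - mean confidence| per bin.
    A well-calibrated model scores near 0."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf=1.0 into last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / total) * abs(accuracy - avg_conf)
    return ece

# Hypothetical run: the model reported ~0.9 confidence on everything
# but was only right 3 times out of 5 -- ECE comes out around 0.3.
confs = [0.9, 0.95, 0.9, 0.85, 0.9]
right = [1, 0, 1, 0, 1]
print(expected_calibration_error(confs, right))
```

Plotting per-bin accuracy against mean confidence (a reliability diagram) makes the same point visually: self-reported LLM confidences tend to cluster high regardless of how often the answer is actually correct.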




