> You can't do that for LLM output. That's true if you're just evaluating the fi... | Hacker News


jgraettinger1 6 months ago | on: AI agent benchmarks are broken

> You can't do that for LLM output.

That's true if you're just evaluating the final answer. However, wouldn't you evaluate the context -- including internal tokens -- built by the LLM under test? In essence, the evaluator's job isn't to do separate fact-finding, but to evaluate whether the LLM under test made good decisions given the facts at hand.

majormajor 6 months ago

I would if I were the developer, but if I'm the user being sold the product, or a third-party benchmarker, I don't think I'd have full access to that context if most of it is built inside the vendor's internal services.
