
> Discriminating good answers is easier than generating them.

This is actually very wrong. Consider, for instance, the fact that the people who grade your tests in school are typically more talented, capable, and trained than the people taking the test. This is true even when an answer key exists.

> Also, human labels are good but have problems of their own,

Granted, but...

> it isn’t like by using a “different intelligence architecture” we elide all the possible errors

Nobody is claiming this. We elide the specific, obvious problem that using a system to test itself gives you no reliable information. You need a control.



It isn’t actually very wrong. Your example is tangential, as graders in school have multiple roles: teaching the content and grading. That’s an implementation detail, not a counter to the premise.

I don’t think we should assume answering a test would be easy for a Scantron machine just because it is very good at grading them, either.


No. Graders having multiple roles is actually the implementation detail, since they're people, and they can't spend all day grading work. Scanning machines don't really grade work either, but I am happy to rely on them for checking that an answer matches a scheme verbatim. I'm not sure why you mention scanners answering tests either, since my original comment doesn't imply that.

There is no evidence that an LLM can reliably evaluate the semantic content of a sentence, even in cases where we all agree that the semantic content exists. The thread we are participating in demonstrates a particularly egregious failure, but there is no good reason to think that more subtle failures would not remain if we happen to patch this one. Even if LLMs were reliable, you can't evaluate a system with itself - that is basic science.


Trading control for convenience has always been the tradeoff in the recent AI hype cycle and the reason why so many people like to use ChatGPT.


Not "control", "a control". As in a control group, for a study.



