
I had an API integration written to convert an English-language security rule into an XML object designed to instruct a remote machine how to comply with the rule programmatically. In April 2023 we had about an 86% accept rate; that number has since declined to 31% with no changes to the prompt.
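
A rough sketch of what measuring an accept rate like that might look like; call_model, the SecurityRule root element, and the acceptance check are hypothetical stand-ins, not the actual integration:

    import xml.etree.ElementTree as ET

    def call_model(rule_text: str) -> str:
        """Hypothetical wrapper around whatever LLM API the integration uses;
        returns the raw XML string the model produced for the rule."""
        raise NotImplementedError

    def is_accepted(xml_text: str) -> bool:
        # Minimal acceptance check: the output must at least be well-formed XML
        # with the expected root element. A real pipeline would validate against
        # a schema and check the generated instructions semantically.
        try:
            root = ET.fromstring(xml_text)
        except ET.ParseError:
            return False
        return root.tag == "SecurityRule"  # hypothetical root element name

    def accept_rate(rules: list[str]) -> float:
        results = [is_accepted(call_model(rule)) for rule in rules]
        return sum(results) / len(results) if results else 0.0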


This is the kind of info I've been looking for. I ran some informal experiments that asked ChatGPT to mark essays along various criteria and analyzed how consistent the marking was. This was several months ago; GPT-4 performed quite well, but the data wasn't kept (it was just an ad-hoc application test written in Jupyter notebooks).

I'm certain it's now doing significantly worse on the same tests, but alas I have lost the historical data to prove it.
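
For what it's worth, a consistency check like that can be sketched roughly as below; score_essay is a hypothetical wrapper around the chat API, and the score parsing would depend entirely on the prompt:

    import statistics

    def score_essay(essay: str, criterion: str) -> float:
        """Hypothetical helper that asks the model to mark `essay` on `criterion`
        and parses a numeric score out of the reply."""
        raise NotImplementedError

    def marking_consistency(essay: str, criterion: str, runs: int = 10) -> tuple[float, float]:
        # Re-run the identical marking request and report mean and standard deviation;
        # a reliable marker should show a small spread across runs.
        scores = [score_essay(essay, criterion) for _ in range(runs)]
        return statistics.mean(scores), statistics.stdev(scores)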


I’m curious, how do y’all keep track of performance and reliability?

I ask because I think it's going to be a big challenge, so I built a service to record feedback / acceptance data: https://modelgymai.com/

If you think it could help, I'd love it if you'd try it out and let me know how it goes.
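
If anyone prefers to roll their own tracking, here's a minimal sketch of logging acceptance events locally so the rate can be recomputed over time; it's generic and not tied to any particular service, and the file path and fields are just placeholders:

    import csv
    import datetime

    LOG_PATH = "acceptance_log.csv"  # hypothetical location

    def record_result(prompt_id: str, model: str, accepted: bool) -> None:
        # Append one row per model response so accept rates can be recomputed
        # later per time window and per model version.
        timestamp = datetime.datetime.now(datetime.timezone.utc).isoformat()
        with open(LOG_PATH, "a", newline="") as f:
            csv.writer(f).writerow([timestamp, prompt_id, model, int(accepted)])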



