
I had an API integration written to convert an English-language security rule into an XML object designed to instruct a remote machine how to comply with the rule programmatically. In April 2023 we had about an 86% accept rate; that number has since declined to 31% with no changes to the prompt.
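
A rough sketch of what measuring an accept rate like that might look like; call_model, the SecurityRule root element, and the acceptance check are hypothetical stand-ins, not the actual integration:

    import xml.etree.ElementTree as ET

    def call_model(rule_text: str) -> str:
        """Hypothetical wrapper around whatever LLM API the integration uses;
        returns the raw XML string the model produced for the rule."""
        raise NotImplementedError

    def is_accepted(xml_text: str) -> bool:
        # Minimal acceptance check: the output must at least be well-formed XML
        # with the expected root element. A real pipeline would validate against
        # a schema and check the generated instructions semantically.
        try:
            root = ET.fromstring(xml_text)
        except ET.ParseError:
            return False
        return root.tag == "SecurityRule"  # hypothetical root element name

    def accept_rate(rules: list[str]) -> float:
        results = [is_accepted(call_model(rule)) for rule in rules]
        return sum(results) / len(results) if results else 0.0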


This is the kind of info I've been looking for. I ran some informal experiments that asked ChatGPT to mark essays along various criteria and analyzed how consistent the marking was. This was several months ago; GPT-4 performed quite well, but the data wasn't kept (it was just an ad-hoc application test written in Jupyter notebooks).

I'm certain it's now doing significantly worse on the same tests, but alas I have lost the historical data to prove it.
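
For what it's worth, a consistency check like that can be sketched roughly as below; score_essay is a hypothetical wrapper around the chat API, and the score parsing would depend entirely on the prompt:

    import statistics

    def score_essay(essay: str, criterion: str) -> float:
        """Hypothetical helper that asks the model to mark `essay` on `criterion`
        and parses a numeric score out of the reply."""
        raise NotImplementedError

    def marking_consistency(essay: str, criterion: str, runs: int = 10) -> tuple[float, float]:
        # Re-run the identical marking request and report mean and standard deviation;
        # a reliable marker should show a small spread across runs.
        scores = [score_essay(essay, criterion) for _ in range(runs)]
        return statistics.mean(scores), statistics.stdev(scores)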


I’m curious, how do y’all keep track of performance and reliability?

I ask because I think it's going to be a big challenge, so I built a service to record feedback / acceptance data: https://modelgymai.com/

If you think it could help, I'd love it if you'd try it out and let me know how it goes.
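
If anyone prefers to roll their own tracking, here's a minimal sketch of logging acceptance events locally so the rate can be recomputed over time; it's generic and not tied to any particular service, and the file path and fields are just placeholders:

    import csv
    import datetime

    LOG_PATH = "acceptance_log.csv"  # hypothetical location

    def record_result(prompt_id: str, model: str, accepted: bool) -> None:
        # Append one row per model response so accept rates can be recomputed
        # later per time window and per model version.
        timestamp = datetime.datetime.now(datetime.timezone.utc).isoformat()
        with open(LOG_PATH, "a", newline="") as f:
            csv.writer(f).writerow([timestamp, prompt_id, model, int(accepted)])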



