I had an API integration written to convert an English-language security rule into an XML object that instructs a remote machine how to comply with the rule programmatically. In April 2023 we had about an 86% accept rate; that number has since declined to 31% with no changes to the prompt.
This is the kind of info I've been looking for. Several months ago I ran some informal experiments that asked ChatGPT to mark essays against various criteria and analyzed how consistent the marking was. GPT-4 performed quite well at the time, but the data wasn't kept (it was just an ad-hoc application test written in Jupyter notebooks).
I'm certain it's now doing significantly worse on the same tests, but alas I have lost the historical data to prove it.
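For anyone repeating this kind of test, a minimal sketch of how to keep the history around so it isn't lost: mark the same essay several times per criterion, log every run to a JSONL file, and measure spread as consistency. The `grade_essay` stub here is an assumption standing in for the real ChatGPT API call; the filename and function names are hypothetical.

```python
import json
import statistics
from datetime import datetime, timezone
from pathlib import Path

LOG = Path("marking_runs.jsonl")  # hypothetical log file; append-only history

def grade_essay(essay: str, criterion: str) -> float:
    # Deterministic stub standing in for the actual model call.
    # Replace the body with a real API request in practice.
    return float(len(essay) % 10)

def run_trials(essay: str, criteria: list[str], n: int = 5) -> dict:
    """Mark the same essay n times per criterion and log every score."""
    results = {c: [grade_essay(essay, c) for _ in range(n)] for c in criteria}
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "scores": results,
    }
    # Append each run so historical data survives between sessions.
    with LOG.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return results

def consistency(scores: list[float]) -> float:
    """Population std dev of repeated marks: 0.0 means perfectly consistent."""
    return statistics.pstdev(scores)
```

With a log like this, comparing today's consistency numbers against April's is a one-liner over the JSONL file instead of a lost notebook.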