The point is that the Putnam was never a test/benchmark used by OpenAI or anyone else, so finding Putnam problems in the training set is not a smoking gun, nor is it cheating or nefarious, because nobody ever claimed otherwise.
This whole notion of the Putnam as a test being trained on is a fully invented grievance.
I've read the thread and I think it's not very coherent overall; I'm also not sure we disagree =)
I agree that having Putnam problems in OpenAI's training set is not a smoking gun. However, it's (almost) certain they are in the training set, and having them there would affect the model's performance on them. Hence research like this is important: it shows that the observed behavior of the models is to a large extent memorization, and not necessarily the generalization we would like it to be.
Nobody serious (like OAI) was using the Putnam problems to claim generalization. This is a refutation in search of a claim, and many people in the upstream thread are suggesting that OAI is doing something wrong by training on a benchmark.
OAI uses datasets like FrontierMath or ARC-AGI, which are actually held out, to evaluate generalization.
I would actually disagree with this.
To me, the ability to solve FrontierMath does imply the ability to solve Putnam problems too, only with the Putnam problems being easier: they have already been seen by the model, and they are also simpler problems. In the same way, Putnam problems with simple changes are one of the easier stops on the way to truly generalizing math models, with FrontierMath being one of the last stops on the way there.
This whole notion of the Putnam as a test being trained on is a fully invented grievance.
Read the entire thread in this context.