I've read the thread, and I don't think it's very coherent overall; I'm also not sure we actually disagree =)
I agree that having Putnam problems in OpenAI's training set is not a smoking gun. However, it's (almost) certain that they are in the training set, and having them there would affect the model's performance on them. Hence research like this is important: it shows that the observed behavior of the models is, to a large extent, memorization, and not necessarily the generalization we would like it to be.
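To make that concrete, here is a minimal sketch of the kind of memorization check such research performs: score a model on the original problems and on superficially perturbed variants, then compare. All names here (Problem, perturb, solve) are hypothetical placeholders, not any specific paper's or vendor's API.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Problem:
    statement: str  # problem text, e.g. a Putnam problem
    answer: str     # ground-truth final answer

def memorization_gap(
    problems: List[Problem],
    perturb: Callable[[Problem], Problem],  # e.g. rename variables, shift constants
    solve: Callable[[str], str],            # model call returning a final answer
) -> float:
    """Accuracy drop from original problems to perturbed variants.

    A large positive gap suggests success on the originals relied on
    having seen them (or near-duplicates) in training, rather than on
    generalizable problem solving.
    """
    def accuracy(ps: List[Problem]) -> float:
        return sum(solve(p.statement) == p.answer for p in ps) / len(ps)

    return accuracy(problems) - accuracy([perturb(p) for p in problems])
```

The design point is that the perturbation should preserve difficulty while breaking verbatim recall, so any accuracy drop is attributable to memorization rather than to the variants simply being harder.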
nobody serious (like OAI) was using the putnam problems to claim generalization. this is a refutation in search of a claim, yet many people in the upstream thread are suggesting that OAI is doing something wrong by training on a benchmark.
OAI uses datasets like frontiermath or arc-agi that are actually held out to evaluate generalization.
Actually, I would disagree with this.
To me, the ability to solve FrontierMath implies the ability to solve Putnam problems too, only with the Putnam problems being easier: they have already been seen by the model, and they are also simpler problems. In the same way, Putnam problems with simple changes are one of the easier stops on the way to truly generalizing math models, with FrontierMath being one of the last stops on that road.