Is it just an open secret that the models are currently being hardcoded for random benchmarks? Seems weird that people would be asking a chatbot Putnam problems :/
> Is it just an open secret that the models are currently being hardcoded for random benchmarks? Seems weird that people would be asking a chatbot Putnam problems :/
It's because people do keep asking these models math problems and then, when they get them right, citing it as evidence that they can actually do mathematical reasoning.
Since it's hard to determine what the models know, it's hard to determine when they're just spitting out something they were specifically trained on.
It certainly feels like certain patterns are hardcoded special cases, particularly to do with math.
"Solve (1503+5171)*(9494-4823)" reliably gets the correct answer from ChatGPT
"Write a poem about the solution to (1503+5171)*(9494-4823)" hallucinates an incorrect answer though
That suggests to me that they've papered over the model's inability to do basic math, but it's a hack that doesn't generalize beyond the simplest cases.
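For reference, the arithmetic itself is easy to check outside any model; whatever form the chatbot's answer takes, prose or poem, it should contain this value:

```python
# Quick sanity check of the expression from the prompts above,
# computed independently of any model output.
a = 1503 + 5171        # 6674
b = 9494 - 4823        # 4671
print(a * b)           # 31174254
```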
There are a few things that could be going on here that seem more likely than "hardcoded".
1. The part of the network that does complex math and the part that writes poetry are overlapping in strange ways.
2. Most of the models nowadays are assumed to be some mixture of experts. So it's possible that asking it to write the answer as a poem activates a different part of the model (a toy sketch of this routing idea is below).
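A toy sketch of the mixture-of-experts routing idea in point 2, with made-up weights and a top-1 gate; none of this reflects how any production model is actually wired, it just shows how two phrasings of the same request could land on different experts:

```python
import numpy as np

# Toy top-1 expert routing: different prompt embeddings can activate
# different "experts", so rephrasing a request (e.g. "as a poem") may
# genuinely exercise a different part of the network.
rng = np.random.default_rng(0)
d, n_experts = 8, 4
gate_w = rng.normal(size=(d, n_experts))   # gating weights (random, illustrative)

def route(embedding):
    logits = embedding @ gate_w
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.argmax(probs)), probs

math_prompt = rng.normal(size=d)                            # stand-in for the plain math prompt
poem_prompt = math_prompt + rng.normal(scale=0.8, size=d)   # same ask, phrased as a poem

print(route(math_prompt)[0], route(poem_prompt)[0])  # may pick different experts
```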
Watch for ChatGPT or Claude saying "analyzing", which means they have identified that they need to run a calculation and outsourced it to Python (ChatGPT) or JavaScript (Claude).
The poem framing probably causes them not to invoke those tools.
To be clear, I was testing with 4o; good to know that o1 has a better grasp of basic arithmetic. Regardless, my point was less to do with the model's ability to do math and more to do with OpenAI seeming to cover up its lack of ability.
“a poem about” reads to me at least like the solution need not be in the answer; maybe something like “a poem that includes the answer in the last stanza”
I've always assumed they removed it, because separating your test and train data is such a basic and fundamental part of ML training. And yet I never see papers even mention if/how they do this. And I wonder, if they do, how they guarantee with high reliability that their massive terabytes of data don't contain the answer.
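For what it's worth, published decontamination efforts tend to boil down to n-gram overlap checks along these lines (the n and threshold below are arbitrary, and as comments further down note, paraphrases and translations sail straight past this kind of filter):

```python
import re

def ngrams(text, n=8):
    # Normalize lightly: lowercase, strip punctuation, collapse whitespace.
    tokens = re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_contaminated(train_doc, benchmark_item, n=8, threshold=0.5):
    # Flag a training document if it shares too many n-grams with a benchmark item.
    bench = ngrams(benchmark_item, n)
    if not bench:
        return False
    overlap = len(bench & ngrams(train_doc, n)) / len(bench)
    return overlap >= threshold

# A translation or paraphrase of the same problem would not trip this check.
```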
I don't see any reason to assume they removed it unless they're very explicit about it. Model publishers have an extremely strong vested interest in beating benchmarks and I expect them to teach to the test if they can get away with it.
I think it's reasonable to assume that openAI is optimising for maximum hype at this point which may include wilfully overfitting for impactful benchmarks to generate positive reports.
When 4 came out they released a document that BOTH inflated scores by changing the exam conditions AND bragged about scoring worse than guessing on a multiple choice test.
I agree that OpenAI is somewhat sketchy about this, but they're sketchy about everything. In the past, though, they have admitted up front to data contamination (e.g. the original GPT-4 press release did not use BIG-bench as a benchmark due to data contamination). For the Putnam in particular: this is not a benchmark that they use. There is no reason to exclude it since it is not part of the "test set" in any meaningful sense.
First of all, Putnam is not in the test data, at least I haven't seen OpenAI claiming that publicly. Secondly, removing it from internet data is not 100% accurate: there are translations of the problems and solutions, and references, so a direct match is not enough. MMLU and other test-set benchmarks have shown more resilience in some previous research, though.
OpenAI is extremely cagey about what's in their test data set generally, but absent more specific info, they're widely assumed to be grabbing whatever they can. (Notably including copyrighted information used without explicit authorization -- I'll take no position on legal issues in the New York Times's lawsuit against OpenAI, but at the very least, getting their models to regurgitate NYT articles verbatim demonstrates pretty clearly that those articles are in the training set.)
> Putnam is not in the test data, at least I haven't seen OpenAI claiming that publicly
What exactly is the source of your belief that the Putnam would not be in the test data? Didn’t they train on everything they could get their hands on?
These models are trained in two steps: training a base model and then post-training it. The first step includes as much data as possible, everything the company can find. For the Llama models it's 15T tokens, which is ~40 TB of data. No one really puts in the effort to split this data into train/test/eval (and it's not really achievable either). It's just as much data as possible.
So it's like 99.9999999% wrong to assume that something public, such as the Putnam problems in this case, isn't in the training set. That's about it.
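Rough back-of-envelope for the scale mentioned above, assuming a ballpark ~3 bytes of raw text per token (the exact ratio depends on the tokenizer and the language mix):

```python
# 15T tokens at an assumed ~3 bytes of raw text per token.
tokens = 15e12
bytes_per_token = 3                                  # rough assumption, not a measured figure
print(f"{tokens * bytes_per_token / 1e12:.0f} TB")   # ~45 TB, same order as the ~40 TB above
```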
There are benchmarks that are decided beforehand, and similar sentences are removed even from the first stage of training. This is useful for tracking model performance and comparing different choices; e.g. see the section 'Contamination of downstream tasks' of [1].
Every decent AI lab does this, else the benchmark results couldn't be trusted. OpenAI publishes results for ~20 benchmarks [2] and it is safe to assume they have made a reasonable attempt to remove those from the training set.
The point is that the Putnam was never a test/benchmark used by OpenAI or anyone else, so there is no smoking gun if you find Putnam problems in the training set, nor is it cheating or nefarious, because nobody ever claimed otherwise.
This whole notion of the Putnam as a test that was trained on is a fully invented grievance.
I've read the thread and I think it's not very coherent overall; I'm also not sure if we disagree =)
I agree that having Putnam problems in OpenAI's training set is not a smoking gun, but it's (almost) certain they are in the training set, and having them would affect the model's performance on them too. Hence research like this is important, since it shows that the observed behavior of the models is memorization to a large extent, and not necessarily the generalization we would like it to be.
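To make "simple changes" concrete: the sketch below just renames variables in a problem statement, which leaves the mathematics untouched but breaks verbatim recall. It is only an illustration of the idea, not the paper's actual variation procedure:

```python
import re

def vary_problem(statement):
    """Rename standalone single-letter variables in a problem statement.

    Purely illustrative: a model that merely memorized the original
    statement/solution pair should be thrown off by surface changes
    that leave the underlying mathematics identical.
    """
    fresh = iter("uvwpqr")
    mapping = {}
    def rename(match):
        v = match.group(0)
        if v not in mapping:
            mapping[v] = next(fresh)
        return mapping[v]
    # Crude pattern for single-letter variables, for illustration only.
    return re.sub(r"\b[a-z]\b", rename, statement)

print(vary_problem("Find all functions f such that f(x + y) = f(x) + f(y)."))
# -> Find all functions u such that u(v + w) = u(v) + u(w).
```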
Nobody serious (like OAI) was using the Putnam problems to claim generalization. This is a refutation in search of a claim, and many people in the upstream thread are suggesting that OAI is doing something wrong by training on a benchmark.
OAI uses datasets like FrontierMath or ARC-AGI that are actually held out to evaluate generalization.
I actually would disagree with this.
To me, the ability to solve FrontierMath does imply the ability to solve Putnam problems too, only with Putnam problems being easier: they have already been seen by the model, and they are also simpler problems. In the same way, Putnam problems with simple changes are also one of the easier stops on the way to truly generalizing math models, with FrontierMath being one of the last stops on the way there.
Imagine you have someone polluting your training data every day. That's what happens when you scrape any tech forum today.
The short version is that LLM training data is the lowest quality data you are likely to see unless you engage in massive potential copyright infringement.
Yeah, people have a really hard time dealing with data leakage, especially on datasets as large as LLMs need.
Basically, if something appeared online or was transmitted over the wire, it should no longer be eligible to evaluate on. D. Sculley had a great talk at NeurIPS 2024 (the same conference this paper was in) titled Empirical Rigor at Scale – or, How Not to Fool Yourself.
Basically no one knows how to properly evaluate LLMs.
No, an absolutely massive number of people do. In fact they have been doing exactly what you recommend because, as you note, it's obvious and required for a basic, proper evaluation.
20 years ago in grad school we were doing a very early iteration of this, where we built Markov chains from Shakespeare's plays and wanted to produce a plausibly "Shakespearean" clause given a single word to start, and a bearish professor said "the more plausible it gets, the more I worry people might forget plausibility is all that it promises".
(There was also a much earlier piece of software that would generate semi-intelligible Kant or Hegel one sentence at a time, though that was through a series of a priori generation rules and a dictionary of stock phrases that was large for the time. I wonder whatever happened to that.)
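For anyone who never built one, a word-level Markov chain of the kind described above fits in a few lines; `corpus.txt` is just a placeholder for whatever plain-text dump of the plays you have:

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words observed to follow it."""
    words = text.split()
    chain = defaultdict(list)
    for w, nxt in zip(words, words[1:]):
        chain[w].append(nxt)
    return chain

def generate(chain, start, length=12, seed=0):
    # Walk the chain from a starting word, sampling observed successors.
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:
            break
        out.append(rng.choice(followers))
    return " ".join(out)

# corpus.txt is a placeholder path for a plain-text dump of the plays.
with open("corpus.txt", encoding="utf-8") as f:
    chain = build_chain(f.read())
print(generate(chain, "wherefore"))
```

As the professor warned, the output only ever promises plausibility, never correctness.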
"Overfitting" would be a bit more accurate term if the problem lies in the specific examples existing in its training set in various forms, places, languages etc but with the same values.
There are tests they are passing that, by design, they can't be hardcoded for. They still have all kinds of flaws and inconsistencies, but getting upset that they answer "2+2=4" because someone trained them on what the answer to 2+2 is supposed to be is silly.