Is it just an open secret that the models are currently being hardcoded for random benchmarks? Seems weird that people would be asking a chatbot Putnam problems :/
> Is it just an open secret that the models are currently being hardcoded for random benchmarks? Seems weird that people would be asking a chatbot Putnam problems :/
It's because people do keep asking these models math problems and then, when they get them right, citing it as evidence that they can actually do mathematical reasoning.
Since it's hard to determine what the models know, it's hard to determine when they're just spitting out something they were specifically trained on.
It certainly feels like certain patterns are hardcoded special cases, particularly to do with math.
"Solve (1503+5171)*(9494-4823)" reliably gets the correct answer from ChatGPT
"Write a poem about the solution to (1503+5171)*(9494-4823)" hallucinates an incorrect answer though
That suggests to me that they've papered over the model's inability to do basic math, but it's a hack that doesn't generalize beyond the simplest cases.
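For reference, the arithmetic itself is easy to check outside any model; whatever form the chatbot's answer takes, prose or poem, it should contain this value:

```python
# Quick sanity check of the expression from the prompts above,
# computed independently of any model output.
a = 1503 + 5171        # 6674
b = 9494 - 4823        # 4671
print(a * b)           # 31174254
```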
There are a few things that could be going on here that seem more likely than "hardcoded".
1. The part of the network that does complex math and the part that writes poetry are overlapping in strange ways.
2. Most of the models nowadays are assumed to be some mixture of experts. So it's possible that asking it to write the answer as a poem activates a different part of the model (a toy sketch of this routing idea is below).
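A toy sketch of the mixture-of-experts routing idea in point 2, with made-up weights and a top-1 gate; none of this reflects how any production model is actually wired, it just shows how two phrasings of the same request could land on different experts:

```python
import numpy as np

# Toy top-1 expert routing: different prompt embeddings can activate
# different "experts", so rephrasing a request (e.g. "as a poem") may
# genuinely exercise a different part of the network.
rng = np.random.default_rng(0)
d, n_experts = 8, 4
gate_w = rng.normal(size=(d, n_experts))   # gating weights (random, illustrative)

def route(embedding):
    logits = embedding @ gate_w
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.argmax(probs)), probs

math_prompt = rng.normal(size=d)                            # stand-in for the plain math prompt
poem_prompt = math_prompt + rng.normal(scale=0.8, size=d)   # same ask, phrased as a poem

print(route(math_prompt)[0], route(poem_prompt)[0])  # may pick different experts
```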
Watch for ChatGPT or Claude saying "analyzing", which means they have identified that they need to run a calculation and outsourced it to Python (ChatGPT) or JavaScript (Claude).
The poem framing probably causes them not to invoke those tools.
To be clear, I was testing with 4o; good to know that o1 has a better grasp of basic arithmetic. Regardless, my point was less to do with the model's ability to do math and more to do with OpenAI seeming to cover up its lack of ability.
“a poem about” reads to me at least like the solution need not be in the answer; maybe something like “a poem that includes the answer in the last stanza”
I've always assumed they removed it, because separating your test and train data is such a basic and fundamental part of ML training. And yet I never see papers even mention if/how they do this. And I wonder, if they do, how they guarantee with high reliability that their massive terabytes of data don't contain the answer.
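For what it's worth, published decontamination efforts tend to boil down to n-gram overlap checks along these lines (the n and threshold below are arbitrary, and as comments further down note, paraphrases and translations sail straight past this kind of filter):

```python
import re

def ngrams(text, n=8):
    # Normalize lightly: lowercase, strip punctuation, collapse whitespace.
    tokens = re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_contaminated(train_doc, benchmark_item, n=8, threshold=0.5):
    # Flag a training document if it shares too many n-grams with a benchmark item.
    bench = ngrams(benchmark_item, n)
    if not bench:
        return False
    overlap = len(bench & ngrams(train_doc, n)) / len(bench)
    return overlap >= threshold

# A translation or paraphrase of the same problem would not trip this check.
```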
I don't see any reason to assume they removed it unless they're very explicit about it. Model publishers have an extremely strong vested interest in beating benchmarks and I expect them to teach to the test if they can get away with it.
I think it's reasonable to assume that openAI is optimising for maximum hype at this point which may include wilfully overfitting for impactful benchmarks to generate positive reports.
When 4 came out they released a document that BOTH inflated scores by changing the exam conditions AND bragged about scoring worse than guessing on a multiple choice test.
I agree that OpenAI is somewhat sketchy about this, but they're sketchy about everything. In the past, though, they have admitted up front to data contamination (e.g. the original GPT-4 press release did not use BIG-bench as a benchmark due to data contamination). For the Putnam in particular: this is not a benchmark that they use. There is no reason to exclude it since it is not part of the "test set" in any meaningful sense.
First of all, Putnam is not in the test data, at least I haven't seen OpenAI claiming that publicly. Secondly, removing it from internet data is not 100% accurate: there are translations of the problems and solutions, and references, so a direct match is not enough. MMLU and other test-set benchmarks have shown more resilience in some previous research, though.
OpenAI is extremely cagey about what's in their test data set generally, but absent more specific info, they're widely assumed to be grabbing whatever they can. (Notably including copyrighted information used without explicit authorization -- I'll take no position on legal issues in the New York Times's lawsuit against OpenAI, but at the very least, getting their models to regurgitate NYT articles verbatim demonstrates pretty clearly that those articles are in the training set.)
> Putnam is not in the test data, at least I haven't seen OpenAI claiming that publicly
What exactly is the source of your belief that the Putnam would not be in the test data? Didn’t they train on everything they could get their hands on?
These models are trained in two steps: training a base model and then post-training it. The first step includes as much data as possible, everything the company can find. For the Llama models it's 15T tokens, which is ~40 TB of data. No one really puts in the effort to split this data into train/test/eval (and it's not really achievable either). It's just as much data as possible.
So it's like 99.9999999% wrong to assume that something public, such as the Putnam problems in this case, isn't in the training set. That's about it.
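Rough back-of-envelope for the scale mentioned above, assuming a ballpark ~3 bytes of raw text per token (the exact ratio depends on the tokenizer and the language mix):

```python
# 15T tokens at an assumed ~3 bytes of raw text per token.
tokens = 15e12
bytes_per_token = 3                                  # rough assumption, not a measured figure
print(f"{tokens * bytes_per_token / 1e12:.0f} TB")   # ~45 TB, same order as the ~40 TB above
```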
There are benchmarks that are decided beforehand, and similar sentences are removed even from the first stage of training. This is useful for tracking model performance and comparing different choices; e.g. see the section 'Contamination of downstream tasks' of [1].
Every decent AI lab does this, else the benchmark results couldn't be trusted. OpenAI publishes results for ~20 benchmarks [2] and it is safe to assume they have made a reasonable attempt to remove those from the training set.
The point is that the Putnam was never a test/benchmark used by OpenAI or anyone else, so there is no smoking gun if you find Putnam problems in the training set, nor is it cheating or nefarious, because nobody ever claimed otherwise.
This whole notion of the Putnam as a test that was trained on is a fully invented grievance.
I've read the thread and I think it's not very coherent overall; I'm also not sure if we disagree =)
I agree that having Putnam problems in OpenAI's training set is not a smoking gun, but it's (almost) certain they are in the training set, and having them would affect the model's performance on them too. Hence research like this is important, since it shows that the observed behavior of the models is memorization to a large extent, and not necessarily the generalization we would like it to be.
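To make "simple changes" concrete: the sketch below just renames variables in a problem statement, which leaves the mathematics untouched but breaks verbatim recall. It is only an illustration of the idea, not the paper's actual variation procedure:

```python
import re

def vary_problem(statement):
    """Rename standalone single-letter variables in a problem statement.

    Purely illustrative: a model that merely memorized the original
    statement/solution pair should be thrown off by surface changes
    that leave the underlying mathematics identical.
    """
    fresh = iter("uvwpqr")
    mapping = {}
    def rename(match):
        v = match.group(0)
        if v not in mapping:
            mapping[v] = next(fresh)
        return mapping[v]
    # Crude pattern for single-letter variables, for illustration only.
    return re.sub(r"\b[a-z]\b", rename, statement)

print(vary_problem("Find all functions f such that f(x + y) = f(x) + f(y)."))
# -> Find all functions u such that u(v + w) = u(v) + u(w).
```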
Nobody serious (like OAI) was using the Putnam problems to claim generalization. This is a refutation in search of a claim, and many people in the upstream thread are suggesting that OAI is doing something wrong by training on a benchmark.
OAI uses datasets like FrontierMath or ARC-AGI that are actually held out to evaluate generalization.
I actually would disagree with this.
To me, the ability to solve FrontierMath does imply the ability to solve Putnam problems too, only with Putnam problems being easier: they have already been seen by the model, and they are also simpler problems. In the same way, Putnam problems with simple changes are also one of the easier stops on the way to truly generalizing math models, with FrontierMath being one of the last stops on the way there.
Imagine you have someone polluting your training data every day. That's what happens when you scrape any tech forum today.
The short version is that LLM training data is the lowest quality data you are likely to see unless you engage in massive potential copyright infringement.
Yeah, people have a really hard time dealing with data leakage, especially on datasets as large as LLMs need.
Basically, if something appeared online or was transmitted over the wire, it should no longer be eligible to evaluate on. D. Sculley had a great talk at NeurIPS 2024 (the same conference this paper was in) titled Empirical Rigor at Scale – or, How Not to Fool Yourself.
Basically no one knows how to properly evaluate LLMs.
No, an absolutely massive number of people do. In fact they have been doing exactly what you recommend because, as you note, it's obvious and required for a basic, proper evaluation.
20 years ago in grad school we were doing a very early iteration of this, where we built Markov chains from Shakespeare's plays and wanted to produce a plausibly "Shakespearean" clause given a single word to start, and a bearish professor said "the more plausible it gets, the more I worry people might forget plausibility is all that it promises".
(There was also a much earlier piece of software that would generate semi-intelligible Kant or Hegel one sentence at a time, though that was through a series of a priori generation rules and a dictionary of stock phrases that was large for the time. I wonder whatever happened to that.)
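For anyone who never built one, a word-level Markov chain of the kind described above fits in a few lines; `corpus.txt` is just a placeholder for whatever plain-text dump of the plays you have:

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words observed to follow it."""
    words = text.split()
    chain = defaultdict(list)
    for w, nxt in zip(words, words[1:]):
        chain[w].append(nxt)
    return chain

def generate(chain, start, length=12, seed=0):
    # Walk the chain from a starting word, sampling observed successors.
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:
            break
        out.append(rng.choice(followers))
    return " ".join(out)

# corpus.txt is a placeholder path for a plain-text dump of the plays.
with open("corpus.txt", encoding="utf-8") as f:
    chain = build_chain(f.read())
print(generate(chain, "wherefore"))
```

As the professor warned, the output only ever promises plausibility, never correctness.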
"Overfitting" would be a bit more accurate term if the problem lies in the specific examples existing in its training set in various forms, places, languages etc but with the same values.
There are tests they are passing that, by design, they can't be hardcoded for. They still have all kinds of flaws and inconsistencies, but getting upset that they answer "2+2=4" because someone trained them on what the answer to 2+2 is supposed to be is silly.