
Okay, then what about elite-level Codeforces performance? Those problems weren't even constructed until after the model was made.

The real problem with all of these theories is that most of these benchmarks were constructed after the models' training data cutoff dates.

A sudden performance improvement on a new model release is not suspicious. Any model release that is much better than a previous one is going to be a “sudden jump in performance.”

Also, OpenAI is not reading your emails, certainly not with less than a month of lead time.



o1 has a ~1650 rating; at that level, many or most of the problems you solve are transplants of relatively well-known problems.

Since o1 on Codeforces just tried hundreds or thousands of solutions, it's not surprising that it can solve problems where the task is really about finding a relatively simple correspondence to a known problem and regurgitating an algorithm.

In fact, when you run o1 on "non-standard" Codeforces problems, it almost always fails.

See for example this post running o1 multiple times on various problems: https://codeforces.com/blog/entry/133887

So the thesis, that this is about recognizing a problem with a known solution rather than actually coming up with a solution yourself, seems to hold: o1 fails even on low-rated problems that require more than fitting templates.
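
To make the template-fitting point concrete, here is a hypothetical illustration (my own sketch, not taken from the linked post): a typical problem in the ~1400-1600 range is often a story wrapped around a stock pattern, e.g. "longest contiguous segment whose sum doesn't exceed k", which is just the standard two-pointer / sliding-window template:

    # Hypothetical example of the kind of "known template" meant above.
    # Problem shape: longest contiguous subarray of non-negative ints with sum <= k.
    def longest_segment_at_most_k(a: list[int], k: int) -> int:
        best = 0
        window_sum = 0
        left = 0
        for right, value in enumerate(a):
            window_sum += value
            # Shrink the window from the left until the constraint holds again.
            while window_sum > k and left <= right:
                window_sum -= a[left]
                left += 1
            best = max(best, right - left + 1)
        return best

    print(longest_segment_at_most_k([2, 1, 4, 1, 1, 3], 5))  # -> 3 (the segment [1, 1, 3])

Recognizing that the story reduces to this pattern is most of the work; the code itself is boilerplate.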


o3 is what I'm referring to, and it is rated 2700.


It's extremely unlikely that o3 hit 2700 in live contests, since such a rapid rating increase would have been noticed by the community. It clearly wasn't run live, and I can't find anything online, including in their video, detailing how contamination was avoided, nor any details about the methodology (number of submissions being the big one; in contests you can also get 'hacked', especially at a high level), problem selection, etc.

Additionally, people weren't able to straightforwardly replicate the o1-mini results in live contests, often getting ratings between 700 and 1200, which raises questions about the methodology.

Perhaps o3 really is that good, but I don't see how you can claim what you claimed for it: we have no idea whether the problems had never been seen, and the fact that people measure much lower Elo scores for o1/o1-mini with proper methodology raises even more questions, let alone conclusively proves these are truly novel tasks the model has never seen.
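
For a rough sense of how large the claimed jump from ~1650 to 2700 would be, here is a back-of-the-envelope sketch (mine, not from any of the linked material) using the standard Elo-style win-probability formula that Codeforces' rating system is built around (the exact rating mechanics differ):

    # Probability that a player rated r_a outperforms a player rated r_b
    # under the standard Elo-style formula.
    def win_probability(r_a: float, r_b: float) -> float:
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

    print(round(win_probability(2700, 1650), 4))  # ~0.9976

A 2700-rated contestant is expected to outperform a 1650-rated one essentially every time, which is why a jump of that size not showing up in live contests is notable.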


Can you give an example of one of these problems that 'wasn't even constructed until after the model was made'?

I'd like to see if it's truly novel and unique, the first problem of its type ever constructed by mankind, or if it's similar to existing problems.


Sorry, I thought the whole point of this thread was that models can’t handle problems when they are “slightly varied”. Mottes and baileys all over the place today.


The point is that it's not consistent on variations unless it finds a way to connect them to something it already knows. The fact that it sometimes succeeds on variations (on Codeforces the models are allowed multiple tries, sometimes a ridiculous number, to be useful) doesn't matter.

The fact that it's no longer consistent once you vary the terminology indicates it's fitting a memorized template instead of reasoning from first principles.



