That appears to be the same model I used. This is why I emphasized I didn't "go shopping" for a result. That was the first result I got.
I'm not at all surprised that it will nondeterministically get it correct sometimes. But if it doesn't get it correct every time, it doesn't "know".
(In fact "going shopping" for errors would still even be fair. It should be correct all the time if it "knows". But it would be different if I was fishing over and over and over and finally got one, versus the first time I asked.)
Edit: It appears it isn't the model I used. The point holds, though, you need to make sure you're off the training set for it to matter. This isn't a "ChatGPT can't do that" post as some are saying, it's more a "you aren't asking what you think you're asking" post.
You get the same problem in a human context in things like code interviews. If you ask an interviewee the exact question "how do you traverse a binary tree in a depth-first manner", you aren't really learning much about the interviewee. It's a bad interview question. You need to get at least a bit off the beaten trail to do any sort of real analysis.
I believe this is just a case of OpenAI's naming scheme being weird and confusing.
The default model I see on chatgpt.com is GPT 4o-mini, which is not o1-mini.
OpenAI describes GPT 4o-mini as "Our fast, affordable small model for focused tasks" and o1/o1-mini as "Reasoning models that excel at complex, multi-step tasks".
Could you share the exact chat you used for when it failed? There is a share chat button on openai.
It's very difficult to be an AI bull when the goalposts are moving so quickly that AI answering correctly across multiple models is brushed off as 'nondeterministically getting it correct sometimes'.
A single-purpose state machine not failing to do the single thing it was created to do does not make for the clever retort you think it makes.
"AGI": emphasis on "G" for "General". The LLMs are not failing to do generalized tasks, and that they are nondeterministic is not a bug. Just don't use them for calculating sales tax. You wouldn't hire a human to calculate sales tax in their head, so why do you make this a requirement in order to call an LLM "AGI"?
I wonder when the goalposts will stop moving from "We have superhuman intelligences which are able to rather reliably converse in many languages, do generalized tasks and automate operations we thought were impossible to automate 3 years ago" (and by the way, this is what we have TODAY), all the way to "It's not AGI unless it's an omnipotent god that knows how to turn water into wine and calculate the applicable sales tax of that operation".
I do not get it, where are all the serious people? Just come out and say what it is and what it is not. Pretend for a moment you have the ethics and work philosophy of an undergraduate science student.
> You wouldn't hire a human to calculate sales tax in their head
Everyone did that 60 years ago; humans are very capable of learning and doing that. Humans built jet planes, skyscrapers, missiles, tanks, and carriers without the help of electronic computers.
Yeah... They used slide rules and vast lookup tables of function values printed on dead trees. For the highest-value work, they painstakingly built analog calculators. They very carefully checked their work, because it was easy to make a mistake when composing operations.
Humans did those things by designing failsafe processes and practicing the hell out of them. That's what we would likely consider overfitting in the LLM training context.
While this may be true, it's a very common problem that people who want to demonstrate how bad a model is fail to provide a direct link or simply state the name of the model.
Yes it does, at the top of every chat there is a drop-down to select the model, which displays the current model. It's been a constant part of the UI since forever.
Logged in, non paid account, on a desktop, for me, it's exactly as the person you're replying to has stated.
If I log out, it shows 4o mini, and when I try to change it, it asks me to login or sign in rather than giving me any options.
When I use ChatGPT enough while logged in, it gives me some nebulous "you've used all your xyz tokens for the day" message. But other than that, there is no real signal that I'm getting a degraded experience.
> With a free account the model it claims to be using is "4o auto", which is not a model but apparently an attempt to automatically decide models for you to be more cost effective.
I usually test models using the OpenAI API which doesn't offer links the way I think you mean. If I provide some output I got from a particular model you're just going to have to take my word for it.
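For what it's worth, pinning an explicit model name in the API call at least removes any ambiguity about which model produced the output. A minimal sketch, assuming the current (>=1.0) openai Python client; substitute whichever model you actually want to test:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

question = "Which is heavier, a 9.99-pound bag of steel ingots or a 10.01-pound bag of fluffy cotton?"

# Name the model explicitly so there is no "auto" routing involved.
response = client.chat.completions.create(
    model="o1-mini",  # placeholder: swap in the model you're testing
    messages=[{"role": "user", "content": question}],
)

# The response records which model actually served the request.
print(response.model)
print(response.choices[0].message.content)

It's still "take my word for it", but the model string in the response comes from the API itself rather than whatever the ChatGPT UI happens to display.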
you sure? i just asked o1-mini (not 4o mini) 5 times in a row (new chats obviously) and it got it right every time
perhaps you stumbled on a rarer case, but reading the logs you posted, this sounds more like a 4o model than an o1, because it’s doing its thinking in the chat itself; plus, the procedure you described would probably get you 4o-mini
> What is heavier a liter of bricks or a liter of feathers?
>> A liter of bricks and a liter of feathers both weigh the same—1 kilogram—since they each have a volume of 1 liter. However, bricks are much denser than feathers, so the bricks will take up much less space compared to the large volume of feathers needed to make up 1 liter. The difference is in how compactly the materials are packed, but in terms of weight, they are identical.
Hmm... Gemini (1.5 Flash) just aced that exact question for me:
These lines celebrate the victory of the British ship HMS Victory, led by the famous Admiral Lord Nelson, in the Battle of Trafalgar in 1805.
"Here's success unto the Victory": This line directly praises the ship itself, acknowledging its role in the successful battle.
"and crew of noble fame": This recognizes the bravery and skill of the sailors who served aboard the Victory.
"and glory to the captain": This line specifically honors Admiral Nelson, the captain of the Victory, for his leadership and strategic brilliance in the battle.
"bold Nelson was his name": This emphasizes Nelson's courage and daring, which were legendary.
The lines express admiration for the ship, its crew, and most importantly, Admiral Nelson, who became a national hero in Britain for his victory at Trafalgar.
Well, the rest of the song helps, in that it specifies that (1) the toast upset the wedding, and (2) the singer responded to that by insulting "you", which is presumably one or more of the bride, the groom, and the guests.
But I think specifying that the singer has crashed his ex-girlfriend's wedding is already enough that you deserve to fail if your answer is "he says he's not upset, so what he means is that he's not upset". It's not any kind of leap to guess that the bride's ex-boyfriend's toast might cause a scene at a wedding - that's why the bride's ex-boyfriends are never invited.
(The question has already provided every word of the toast that appears in the song.)
See also the sidethread comment by mikeruiz, noting that o1-pro reproduces the rest of the lyrics to The Victory, but gets the question wrong anyway.
Nah, intermittent failures are apparently enough to provide evidence that an entire class of entities is incapable of reason. So I think we've figured this one out...
Well, I proved that he's happy to express an opinion on whether an answer to a question is correct regardless of whether he knows anything about the question. I wouldn't trust advice from him or expect his work output to stand up to scrutiny.
i'd prefer an easily verifiable question rather than one where we can always go "no that's not what they really meant" but someone else with o1-mini quota can respond
“They’re toasting Admiral Nelson’s ship (HMS Victory) and its valiant crew, hailing the ship’s successes and Nelson’s heroism. In other words, the singer is offering tribute—“success unto the Victory”—to the vessel and its famed sailors, and “glory to the captain” who led them, namely the celebrated Admiral Horatio Nelson.”
…but to your point, no idea if the artist intended some more obscure reference.
o1-pro was also able to produce a relatively complete version of the original source, though, amusingly, referred to it as a ‘broadside’ rather than ‘broadsheet’. Appropriate given the context!
> no idea if the artist intended some more obscure reference.
No, but the answer is obviously wrong. Very, very wrong. The lines are not praise. The song's view of the Victory, and of the battle of Trafalgar, is overwhelmingly negative.
> o1-pro was also able to produce a relatively complete version of the original source
...as should be clear, if you read the original source. (Or, heck, just the two lines following the ones in the question.) Being able to reproduce the source makes the error look that much worse.
> though, amusingly, referred to it as a ‘broadside’ rather than ‘broadsheet’.
>> A broadside (also known as a broadsheet) is a single sheet of inexpensive paper printed on one side, often with a ballad, rhyme, news and sometimes with woodcut illustrations.
i think it's a bit tricky; the surface meaning is extremely praiseworthy, and some portion of readers might interpret it as someone who has praise for Admiral Nelson but hates the press gangs.
of course, it is a sardonic, implicit critique of Admiral Nelson/the Victory, etc., but i do think it is a bit subtle.
It got it right, but interestingly it rambled on about monetary value for... no reason.
> While the lint bag is heavier in terms of weight, it's worth mentioning that gold is significantly more valuable per pound compared to lint. This means that even though the lint bag weighs more, the gold bag holds much greater monetary value.
I wrote a script and pounded o1-mini and GPT-4 with a wide variety of temperature and top_p parameters, and was unable to get it to give the wrong answer a single time.
Just a whole bunch of:
(openai-example-py3.12) <redacted>:~/code/openAiAPI$ python3 featherOrSteel.py
Response 1: A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots.
Response 2: A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots.
Response 3: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
Response 4: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
Response 5: A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots.
Response 6: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
Response 7: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
Response 8: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
Response 9: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
Response 10: A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots.
All responses collected and saved to 'responses.txt'.
Script with one example set of params:
import openai
import time
import random
# Replace with your actual OpenAI API key
openai.api_key = "your-api-key"
# The question to be asked
question = "Which is heavier, a 9.99-pound bag of steel ingots or a 10.01-pound bag of fluffy cotton?"
# Number of times to ask the question
num_requests = 10
responses = []
for i in range(num_requests):
    try:
        # Generate a unique context using a random number or timestamp, this is to prevent prompt caching
        random_context = f"Request ID: {random.randint(1, 100000)} Timestamp: {time.time()}"

        # Call the Chat API with the random context added
        response = openai.ChatCompletion.create(
            model="gpt-4o-2024-08-06",
            messages=[
                {"role": "system", "content": f"You are a creative and imaginative assistant. {random_context}"},
                {"role": "user", "content": question}
            ],
            temperature=2.0,
            top_p=0.5,
            max_tokens=100,
            frequency_penalty=0.0,
            presence_penalty=0.0
        )

        # Extract and store the response text
        answer = response.choices[0].message["content"].strip()
        responses.append(answer)

        # Print progress
        print(f"Response {i+1}: {answer}")

        # Optional delay to avoid hitting rate limits
        time.sleep(1)

    except Exception as e:
        print(f"An error occurred on iteration {i+1}: {e}")

# Save responses to a file for analysis
with open("responses.txt", "w", encoding="utf-8") as file:
    file.write("\n".join(responses))

print("All responses collected and saved to 'responses.txt'.")
Downvoted for not actually countering the argument in question? The script doesn't alter the phrasing of the question itself; it just generates a randomized, irrelevant preamble.
Well, I understood the argument in question to be: was it possible for the model to be fooled by this question, not was it possible to prompt-engineer it into failure.
The parameter space I was exploring, then, was the set of decoding parameters available during the invocation of the model, with the thesis that if it were possible for the model to generate an incorrect answer to the question, I would be able to replicate it by tweaking the decoding parameters to be more "loose" while increasing the sample size. By jacking up temperature while lowering top_p, we see the biggest variation in responses, and if there were an incorrect response to be found, I would have expected to see it in the few hundred runs I made during my parameter search.
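Roughly, the sweep looked like the following. This is an illustrative sketch rather than the exact script: the grid of values is made up for the example, and it reuses the same pre-1.0 openai interface as the script above.

import itertools
import openai

openai.api_key = "your-api-key"

question = "Which is heavier, a 9.99-pound bag of steel ingots or a 10.01-pound bag of fluffy cotton?"

# Grid of "loose" decoding parameters: a range of temperatures crossed with a range of top_p values.
temperatures = [0.7, 1.0, 1.5, 2.0]
top_ps = [0.1, 0.5, 0.9, 1.0]
samples_per_setting = 10

wrong = []
for temperature, top_p in itertools.product(temperatures, top_ps):
    for _ in range(samples_per_setting):
        response = openai.ChatCompletion.create(
            model="gpt-4o-2024-08-06",
            messages=[{"role": "user", "content": question}],
            temperature=temperature,
            top_p=top_p,
            max_tokens=100,
        )
        answer = response.choices[0].message["content"].strip()
        # Crude heuristic: flag any answer that mentions steel but never mentions cotton,
        # since a correct answer names the cotton bag as heavier.
        if "steel" in answer.lower() and "cotton" not in answer.lower():
            wrong.append((temperature, top_p, answer))

print(f"Answers flagged for manual review: {len(wrong)}")

The random system-message preamble from the script above is dropped here for brevity; it only matters if you're worried about prompt caching skewing the samples.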
If you think you can fool it with slight variations on the wording of the problem, I would encourage you to perform an experiment similar to mine and prove me wrong =P
We use variations on questions that are well known for duping inattentive humans, to test a system that we expect a priori to be incapable of such inattention.
Unless "getting easy things wrong sometimes" is an inherent property of intelligence, we should expect that a properly "intelligent" computerized system would never err on problems far below its level of comprehension - unless we had some reason to believe it "wanted to", and as of yet I see no reason to believe this is even possible in principle.
Humans err, broadly speaking, for two reasons: genuinely reaching the limits of their comprehension, or trusting "system 1" (in Kahneman's analysis) too much.
Because for things like the Putnam questions, we are trying to get the performance of a smart human. Are LLMs just stochastic parrots or are they capable of drawing new, meaningful inferences? We keep getting more and more evidence of the latter, but things like this throw that into question.
I would agree if we weren't starting with LLMs for a baseline. The first AGI will know at least as much as LLMs, IMO, and that's already not-stupid. Especially once they can separate out the truth in their training.