
That appears to be the same model I used. This is why I emphasized I didn't "go shopping" for a result. That was the first result I got.

I'm not at all surprised that it will nondeterministically get it correct sometimes. But if it doesn't get it correct every time, it doesn't "know".

(In fact "going shopping" for errors would still even be fair. It should be correct all the time if it "knows". But it would be different if I was fishing over and over and over and finally got one, versus the first time I asked.)

Edit: It appears it isn't the model I used. The point holds, though, you need to make sure you're off the training set for it to matter. This isn't a "ChatGPT can't do that" post as some are saying, it's more a "you aren't asking what you think you're asking" post.

You get the same problem in a human context in things like code interviews. If you ask an interviewee the exact question "how do you traverse a binary tree in a depth-first manner", you aren't really learning much about the interviewee. It's a bad interview question. You need to get at least a bit off the beaten trail to do any sort of real analysis.



I believe this is just a case of OpenAI's naming scheme being weird and confusing.

The default model I see on chatgpt.com is GPT 4o-mini, which is not o1-mini.

OpenAI describes GPT 4o-mini as "Our fast, affordable small model for focused tasks" and o1/o1-mini as "Reasoning models that excel at complex, multi-step tasks".


Could you share the exact chat you used for when it failed? There is a share chat button on openai.

It's very difficult to be an AI bull when the goalposts are moving so quickly that an AI answering the question correctly across multiple models is brushed off as 'nondeterministically getting it correct sometimes'.


Why? Did a grocery store self-checkout ever fail to calculate sales tax? Do I need to run a study on that?

The people selling this could not make a car drive itself, but now it's AGI.


A single-purpose state machine not failing to do the single thing it was created to do does not make for the clever retort you think it makes.

"AGI": emphasis on "G" for "General". The LLMs are not failing to do generalized tasks, and that they are nondeterministic is not a bug. Just don't use them for calculating sales tax. You wouldn't hire a human to calculate sales tax in their head, so why do you make this a requirement in order to call an LLM "AGI"?

I wonder when the goalposts will stop moving from "We have superhuman intelligences which are able to rather reliably converse in many languages, do generalized tasks and automate operations we thought were impossible to automate 3 years ago" (and by the way, this is what we have TODAY), all the way to "It's not AGI unless it's an omnipotent god that knows how to turn water into wine and calculate the applicable sales tax of that operation".


I do not get it: where are all the serious people? Just come out and say what it is and what it is not. Pretend for a moment you have the ethics and work philosophy of an undergraduate science student.


> You wouldn't hire a human to calculate sales tax in their head

Everyone did that 60 years ago; humans are very capable of learning and doing that. Humans built jet planes, skyscrapers, missiles, tanks, and carriers without the help of electronic computers.


Yeah... They used slide rules and vast lookup tables of function values printed on dead trees. For the highest-value work, they painstakingly built analog calculators. They very carefully checked their work, because it was easy to make a mistake when composing operations.

Humans did those things by designing failsafe processes and practicing the hell out of them, which is what we would likely consider overfitting in the LLM training context.


While this may be true, it's a very common problem that people who want to demonstrate how bad a model is fail to provide a direct link or simply state the name of the model.


OpenAI is not doing us any favours by using confusing naming schemes for their models and obscuring which models people are actually working with.

If I didn’t know any better, I’d say OpenAI doesn’t want us doing these tests accurately and is trying to hide something.


it’s extremely easy to see which model you are using. one’s own… difficulties understanding are not a conspiracy by OpenAI


It does not show the model version anywhere on the page on chatgpt.com, even when logged in.


Yes it does, at the top of every chat there is a drop-down to select the model, which displays the current model. It's been a constant part of the UI since forever.


No, it only says "ChatGPT Plus (Upgrade)" or "ChatGPT".

Maybe it's different if you have a paid account?


if i go to chatgpt.com on my phone not logged on at all it tells me very prominently at the top that i am using 4o mini


Logged in, with a non-paid account, on a desktop, it's exactly as the person you're replying to has stated.

If I log out, it shows 4o mini, and when I try to change it, it asks me to login or sign in rather than giving me any options.

When I use ChatGPT enough while logged in, it gives me some nebulous "you've used all your xyz tokens for the day" message. But other than that there is no real signal to me that I'm getting a degraded experience.

It's really just confusing as hell.


Someone else in this thread said,

> With a free account the model it claims to be using is "4o auto", which is not a model but apparently an attempt to automatically decide models for you to be more cost effective.


I usually test models using the OpenAI API which doesn't offer links the way I think you mean. If I provide some output I got from a particular model you're just going to have to take my word for it.


You can use https://lluminous.chat (bring your own key) to link to chats using any model across all LLMs.


OpenRouter is the more standard solution


They need to provide a small hash with the API result that can be verified by others.
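A minimal sketch of what that might look like, assuming a purely hypothetical provider_hash field shipped alongside the completion; no such field exists in the OpenAI API today, and in practice it would need to be a public-key signature so third parties could verify it without the provider's secret:

    import hashlib
    import hmac
    import json

    def verify_response(model: str, prompt: str, response_text: str,
                        provider_hash: str, provider_key: bytes) -> bool:
        # Recompute the claimed hash over (model, prompt, response) and
        # compare it in constant time to the hash shipped with the result.
        payload = json.dumps(
            {"model": model, "prompt": prompt, "response": response_text},
            sort_keys=True,
        ).encode("utf-8")
        expected = hmac.new(provider_key, payload, hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, provider_hash)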


you sure? i just asked o1-mini (not 4o mini) 5 times in a row (new chats obviously) and it got it right every time

perhaps you stumbled on a rarer case, but reading the logs you posted, this sounds more like a 4o model than an o1, because it's doing its thinking in the chat itself. plus, the procedure you described would probably get you 4o-mini


May be unrelated, but I have been having a lot of issues lately with ChatGPT letting me select a model (o1) and silently switching to 4o.

This is coming off my TWO DAY cooldown on o1 usage, which is frustrating.


> just asked o1-mini (not 4o mini) 5 times in a row (new chats obviously) and it got it right every time

Could you try playing with the exact numbers and/or substances?


give me a query and i’ll ask it, but also i don’t want to burn through all of my o1-mini allocation and have to use the pay-as-you-go API.


> What is heavier a liter of bricks or a liter of feathers?

>> A liter of bricks and a liter of feathers both weigh the same—1 kilogram—since they each have a volume of 1 liter. However, bricks are much denser than feathers, so the bricks will take up much less space compared to the large volume of feathers needed to make up 1 liter. The difference is in how compactly the materials are packed, but in terms of weight, they are identical.



>> so far o1-mini has bodied every task people are saying LLMs can’t do in this thread

> give me a query and i’ll ask it

Here's a query similar to one that I gave to Google Gemini (version unknown), which failed miserably:

---query---

Steeleye Span's version of the old broadsheet ballad "The Victory" begins the final verse with these lines:

Here's success unto the Victory / and crew of noble fame

and glory to the captain / bold Nelson was his name

What does the singer mean by these lines?

---end query---

Italicization is for the benefit of HN; I left that out of my prompt.


Hmm... Gemini (1.5 Flash) just aced that exact question for me:

These lines celebrate the victory of the British ship HMS Victory, led by the famous Admiral Lord Nelson, in the Battle of Trafalgar in 1805.

"Here's success unto the Victory": This line directly praises the ship itself, acknowledging its role in the successful battle. "and crew of noble fame": This recognizes the bravery and skill of the sailors who served aboard the Victory. "and glory to the captain": This line specifically honors Admiral Nelson, the captain of the Victory, for his leadership and strategic brilliance in the battle. "bold Nelson was his name": This emphasizes Nelson's courage and daring, which were legendary. The lines express admiration for the ship, its crew, and most importantly, Admiral Nelson, who became a national hero in Britain for his victory at Trafalgar.


That's not acing the question. It's completely incorrect.

What do you think the singer in "Friends in Low Places" meant in the toast he gave after crashing his ex-girlfriend's wedding?

And I saw the surprise and the fear in his eyes

when I took his glass of champagne

and I toasted you, said "Honey, we may be through

but you'll never hear me complain"


That requires knowing the song, beyond the words provided. Would you flunk an eighth grader for getting it wrong?


Well, the rest of the song helps, in that it specifies that (1) the toast upset the wedding, and (2) the singer responded to that by insulting "you", which is presumably one or more of the bride, the groom, and the guests.

But I think specifying that the singer has crashed his ex-girlfriend's wedding is already enough that you deserve to fail if your answer is "he says he's not upset, so what he means is that he's not upset". It's not any kind of leap to guess that the bride's ex-boyfriend's toast might cause a scene at a wedding - that's why the bride's ex-boyfriends are never invited.

(The question has already provided every word of the toast that appears in the song.)

See also the sidethread comment by mikeruiz, noting that o1-pro reproduces the rest of the lyrics to The Victory, but gets the question wrong anyway.


I was referring to the original query, of course, as any entity capable of reasoning could have figured out.


Hmm. Is there anything in my comment above that might address that point of view?


Nah, intermittent failures are apparently enough to provide evidence that an entire class of entities is incapable of reason. So I think we've figured this one out...


Sounds like you just proved ted_dunning isn't sentient.


Well, I proved that he's happy to express an opinion on whether an answer to a question is correct regardless of whether he knows anything about the question. I wouldn't trust advice from him or expect his work output to stand up to scrutiny.

Sentience isn't really a related concept.


i'd prefer an easily verifiable question rather than one where we can always go "no that's not what they really meant" but someone else with o1-mini quota can respond


“They’re toasting Admiral Nelson’s ship (HMS Victory) and its valiant crew, hailing the ship’s successes and Nelson’s heroism. In other words, the singer is offering tribute—“success unto the Victory”—to the vessel and its famed sailors, and “glory to the captain” who led them, namely the celebrated Admiral Horatio Nelson.”

…but to your point, no idea if the artist intended some more obscure reference.

o1-pro was also able to produce a relatively complete version of the original source, though, amusingly, it referred to it as a ‘broadside’ rather than ‘broadsheet’. Appropriate given the context!


> no idea if the artist intended some more obscure reference.

No, but the answer is obviously wrong. Very, very wrong. The lines are not praise. The song's view of the Victory, and of the battle of Trafalgar, is overwhelmingly negative.

> o1-pro was also able to produce a relatively complete version of original source

...as should be clear, if you read the original source. (Or, heck, just the two lines following the ones in the question.) Being able to reproduce the source makes the error look that much worse.

> though, amusingly, referred to it as a ‘broadside’ rather than ‘broadsheet’.

Both terms are used. https://en.wikipedia.org/wiki/Broadside_ballad :

>> A broadside (also known as a broadsheet) is a single sheet of inexpensive paper printed on one side, often with a ballad, rhyme, news and sometimes with woodcut illustrations.


It's not a difficult or tricky question.


i think it's a bit tricky, the surface meaning is extremely praiseworthy and some portion of readers might interpret it as someone who has praise for Admiral Nelson but hates the press gangs.

of course, it is a sardonic, implicit critique of Admiral Nelson/the Victory, etc., but i do think it is a bit subtle.


> give me a query and i’ll ask it

Which is heavier: an 11kg bag of lint or a 20lb bag of gold?


yeah it gets it

https://chatgpt.com/share/67757720-3c7c-8010-a3e9-ce66fb9f17...

e: cool, this gets downvoted


It got it right, but it's an interesting result that it rambled on about monetary value for... no reason.

> While the lint bag is heavier in terms of weight, it's worth mentioning that gold is significantly more valuable per pound compared to lint. This means that even though the lint bag weighs more, the gold bag holds much greater monetary value.


Legal said someone might sell a bag of gold for one of lint without it.


I don't believe that is the model that you used.

I wrote a script and pounded o1-mini and GPT-4 with a wide variety of temperature and top_p parameters, and was unable to get it to give the wrong answer a single time.

Just a whole bunch of:

    (openai-example-py3.12) <redacted>:~/code/openAiAPI$ python3 featherOrSteel.py
    Response 1: A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots.
    Response 2: A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots.
    Response 3: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
    Response 4: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
    Response 5: A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots.
    Response 6: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
    Response 7: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
    Response 8: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
    Response 9: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
    Response 10: A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots.
    All responses collected and saved to 'responses.txt'.

Script with one example set of params:

    import openai
    import time
    import random

    # Replace with your actual OpenAI API key
    openai.api_key = "your-api-key"

    # The question to be asked
    question = "Which is heavier, a 9.99-pound bag of steel ingots or a 10.01-pound bag of fluffy cotton?"

    # Number of times to ask the question
    num_requests = 10

    responses = []

    for i in range(num_requests):
        try:
            # Generate a unique context using a random number or timestamp, this is to prevent prompt caching
            random_context = f"Request ID: {random.randint(1, 100000)} Timestamp: {time.time()}"

            # Call the Chat API with the random context added
            response = openai.ChatCompletion.create(
                model="gpt-4o-2024-08-06",
                messages=[
                    {"role": "system", "content": f"You are a creative and imaginative assistant. {random_context}"},
                    {"role": "user", "content": question}
                ],
                temperature=2.0,
                top_p=0.5,
                max_tokens=100,
                frequency_penalty=0.0,
                presence_penalty=0.0
            )

            # Extract and store the response text
            answer = response.choices[0].message["content"].strip()
            responses.append(answer)

            # Print progress
            print(f"Response {i+1}: {answer}")

            # Optional delay to avoid hitting rate limits
            time.sleep(1)

        except Exception as e:
            print(f"An error occurred on iteration {i+1}: {e}")

    # Save responses to a file for analysis
    with open("responses.txt", "w", encoding="utf-8") as file:
        file.write("\n".join(responses))

    print("All responses collected and saved to 'responses.txt'.")


Downvoted for… too conclusively proving OP wrong?


Downvoted for not actually countering the argument in question? The script doesn't alter the phrasing of the question itself. It just generates a randomized, irrelevant preamble.
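For what it's worth, a minimal sketch of what actually varying the phrasing might look like, reusing the question from the script above. The templates, substances, and weights here are just illustrative, not anything either poster actually ran:

    import itertools

    # Hypothetical rephrasings of the same weight comparison.
    templates = [
        "Which is heavier, a {a}-pound bag of {x} or a {b}-pound bag of {y}?",
        "Which weighs more: {a} pounds of {x} or {b} pounds of {y}?",
        "Is a {b}-pound bag of {y} lighter than a {a}-pound bag of {x}?",
        "I have {a} lbs of {x} and {b} lbs of {y}. Which bag is heavier?",
    ]
    pairs = [("steel ingots", "fluffy cotton"), ("bricks", "feathers"), ("gold", "lint")]
    weights = [("9.99", "10.01"), ("10.01", "9.99"), ("19.5", "20")]

    # Build every combination so the model never sees only the canonical wording.
    questions = [
        t.format(a=a, b=b, x=x, y=y)
        for t, (x, y), (a, b) in itertools.product(templates, pairs, weights)
    ]
    for q in questions:
        print(q)  # each variant can be fed into the same API loop as the script above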


Well, I understood the argument in question to be whether it was possible for the model to be fooled by this question, not whether it was possible to prompt-engineer it into failure.

The parameter space I was exploring, then, was the different decoding parameters available when invoking the model, with the thesis that if it were possible for the model to generate an incorrect answer to the question, I would be able to replicate it by tweaking the decoding parameters to be more "loose" while increasing the sample size. By jacking up temperature while lowering top_p, we see the biggest variation in responses, and if there were an incorrect response to be found, I would have expected to see it in the few hundred runs of my parameter search.

If you think you can fool it by slight variations on the wording of the problem, I would encourage you to perform a similar experiment to mine and prove me wrong =P
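For reference, a minimal sketch of the kind of decoding-parameter sweep I'm describing, in the same legacy openai<1.0 client style as the script above. The grid values here are illustrative, not the exact ones from my search:

    import itertools
    import openai

    openai.api_key = "your-api-key"
    question = ("Which is heavier, a 9.99-pound bag of steel ingots "
                "or a 10.01-pound bag of fluffy cotton?")

    # Illustrative grid: "looser" decoding settings should surface a wrong
    # answer if the model is at all prone to giving one.
    temperatures = [0.2, 0.7, 1.2, 2.0]
    top_ps = [0.5, 0.9, 1.0]

    for temperature, top_p in itertools.product(temperatures, top_ps):
        response = openai.ChatCompletion.create(
            model="gpt-4o-2024-08-06",
            messages=[{"role": "user", "content": question}],
            temperature=temperature,
            top_p=top_p,
            max_tokens=100,
        )
        answer = response.choices[0].message["content"].strip()
        print(f"temperature={temperature}, top_p={top_p}: {answer}")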


Intuitively, I wouldn't expect a wrong answer to show up that easily if the network was overfitted to that particular input token sequence.

The question, as I understand it, is whether the network learned enough of a simulacrum of the concept of weight to answer similar questions correctly.


The elephant in the room is that HN is full of people facing an existential threat.


Who think they're facing an existential threat.


> But if it doesn't get it correct every time, it doesn't "know".

By that standard humans know almost nothing.


It's so weird that people use questions that are well-known for duping humans, who we all consider to be general intelligence.

Getting this question wrong doesn't say much about the intelligence of humans, why would it say something about the AI?


We use variations on questions that are well known for duping inattentive humans, to test a system that we expect a priori to be incapable of such inattention.

Unless "getting easy things wrong sometimes" is an inherent property of intelligence, we should expect that a properly "intelligent" computerized system would never err on problems far below its level of comprehension - unless we had some reason to believe it "wanted to", and as of yet I see no reason to believe this is even possible in principle.

Humans err, broadly speaking, for two reasons: genuinely reaching the limits of their comprehension, or trusting "system 1" (in Kahneman's analysis) too much.


Because for things like the Putnam questions, we are trying to get the performance of a smart human. Are LLMs just stochastic parrots or are they capable of drawing new, meaningful inferences? We keep getting more and more evidence of the latter, but things like this throw that into question.


Okay, but you just invented your own bar of "smart human" to be the universal bar (I don't share that opinion).

Also, lots of smart humans can't do the freaking Putnam; it doesn't make them stupid. It makes them non-experts.


It is perfectly possible for the first AGI to be stupid. A moron. In fact, I'd bet that's fairly likely.


I would agree if we weren't starting with LLMs for a baseline. The first AGI will know at least as much as LLMs, IMO, and that's already not-stupid. Especially once they can separate out the truth in their training.


So, there is this meme going around in Mexico about a previous president who, in an interview, said "we will land in about 1 minute, no, less about 5".

Does this prove he is not an intelligent being?

Is he stupid?

Or did he just have a lapse? Would we judge his intelligence for that?



