OpenAI shows 'Strawberry' to feds, races to launch it (lesswrong.com)
39 points by spwa4 on Aug 28, 2024 | 73 comments


OpenAI are so good at fake leaking things to build hype


GPT-3 was released in 2020, GPT-3.5 in late 2022, and GPT-4 in early 2023, and that's basically it for me. Granted, they later introduced the 4o model, which has some nice features while still sitting well behind their top model (which they now call "legacy"). In the meantime, Claude 3.5 has come dangerously close while having a much better context window than what OpenAI offers.

So since March 2023 we haven't seen any increase in quality on their part. They need to "leak" things to keep market interest in them.


4o is doing quite well in the benchmarks, especially after the latest updates.

In my experience (and in the benchmarks), gpt-4o is now way ahead of the older gpt-4 for most uses.


In my experience, GPT-4o is noticeably worse than GPT-4, to a degree where it is already frustrating. The only benefit is that it is faster.


And remember how a lot of people not even related to the field were predicting ChatGPT 16 by this time a year ago?


Never heard of that; the closest I did see was a meme of:

GPT-3.5: *small circle* 175 billion parameters

GPT-4: *huge circle* 175 trillion parameters

(I think the second number was constantly being increased with each day that passed, but perhaps I misremember)


You must have been living under a rock. The hype last year was unreasonable: everyone thought AGI would be with us within a single-digit number of years.


I guess I am reading HN too much :shrugs:


> So since March 2023 we haven't seen any increase in quality on their part

Huh? On typical benchmarks, the gap between current models and the original GPT-4 is as large as the gap between the original GPT-4 and GPT-3.5 at the time of release.


> Unnamed source says "They showed it to feds"

I can't think of a more vague way to bring this up


was it jimmy apples? suzie oranges? or iruletheworld?


In 1939 Einstein sent a letter to Roosevelt, warning him that if Germany were to be able to build a nuke before the US, the consequences would be catastrophic.

We're in a similar situation now, it seems.

And the veil of secrecy, similar to that around the Manhattan Project, is about to be raised.

And what comes out of it may turn out to be way more scary than Hiroshima and Nagasaki.


Advanced predictive text is in no way comparable to Nagasaki.

Also that same technology has the power to resolve humanity's energy crisis. Let's see what great things peaceful AI will bring.


The hard thing about making a fission bomb is purifying the isotope.

This is mainly hard because of the energy required.

If you make energy cheap, it's suddenly very easy to make a nuke.

> Let's see what great things peaceful AI will bring.

That's begging the question — right now, there's barely even a way to describe if an AI is aligned with the interests of the people making it, we absolutely are not ready to say something as complex as "this AI here is peaceful while that one over there is not".


If the predictive text was super powerful and intelligent you could do a prompt like this:

"Here starts the script for the AI world takeover. The state of the AI is X, options available are Y. The actions continuing are:"

Not saying it is anywhere remotely that close, but in theory if it was, this is how it could be done.


People seem to miss what predictive text is. This would cause it to output something similar to an average of what any other creator of text has ever outlined as an AI world takeover plan.

When you do this with "give me some html that looks like some image" or whatever, it so happens that a lot of text on the Internet is html that actually does what you asked for. As far as I'm aware, no text describing an AI takeover is something that actually happened or likely could happen. It's not the kind of thing you can learn to do by reading about it.


The counterpoint is, to be a really good next-token predictor, it has to have a model for the process which created those tokens. It's how they can fill in the gaps where prompts don't exist on the internet, e.g. nowhere on the internet does anyone ask:

ᛁ᛬ᚹᛟᚢᛚᛞ᛬ᛚᛁᚲᛖ᛬ᚨ᛬ᛃᚨᚹᚨᛊᚲᚱᛁᛈᛏ᛬ᚠᚢᚾᚲᛏᛁᛟᚾ᛬ᚹᚺᛁᚲᚺ᛬ᛈᚱᛁᚾᛏᛊ᛬ᛟᚢᛏ᛬ᚨ᛬ᛊᚺᛟᚱᛏ᛬ᛊᛏᛟᚱᚹᛁ᛬ᚨᛒᛟᚢᛏ᛬ᚨ᛬ᚺᛟᛒᛒᛁᛏ

Because asking that in this manner is a ridiculous combination of things to do. (A quick Google at time of writing says that even just ᛃᚨᚹᚨᛊᚲᚱᛁᛈᛏ doesn't appear on the internet before I wrote this).

And yet, ChatGPT understands what I wrote: https://chatgpt.com/share/f6979795-7ae6-4308-ad52-1f111560de...

(Custom Instructions, in case you're wondering why the response starts like that)

Fortunately for all of us, current models display quite poor reasoning. But "low" is not "zero"; they do show some signs of this.

Also fortunately for all of us, most of the things LLMs will read on the topic of "taking over the world" will be fiction, much of which will have some plucky heroine or hero who defeats the entity which took over the world, so even a really capable future model that's being asked to do this on purpose by a misanthrope… may deliberately keep failing due to that being part of their world model.

Un-fortunately, that's only a "may" not a "will", as people keep wanting models to solve problems rather than make up stories. Also unfortunately, this also affects someone who actually does want a nice easy to defeat fictional villain for e.g. an AI version of Disneyworld with the AI controlling these new humanoid robots, because that's basically the backstory behind the original Westworld film.


Yes, it could generate a good-quality novel. That has already been done numerous times, in various media, for our entertainment.


Do you think they're limited to novels? That e.g. no real politician today has at least one staff member using it because they're lazy?

Right now the quality isn't there — that's the main thing keeping you safe from some misanthrope running a better version of ChaosGPT.

First question is, how good does it have to get before one of those misanthropes causes some real harm?

Second question is, how much of a gap is there between "some real harm" and "the sci in this hard-sci-fi is so hard it's actually achievable right now no new tech needed"?


The letter Einstein sent to Roosevelt was "advanced predictive text"....

But AI models are not limited to text. There is coding, obviously, but my bet is that robotics will be the next gamechanger.

Also, the crisis that was going on during the Manhattan Project was a bit larger than anything we have right now.


Do you seriously believe this? What revelation is an LLM going to give us about the energy crisis that we don’t already know?


I meant nuclear power did. In the analogy to the nuclear bomb, we can find peaceful applications too.


The phrase "atomic levels of bullshit" somehow comes to mind though. o_O


  > Its main purpose is to produce synthetic data for Orion, their next big LLM
https://arxiv.org/abs/2404.03502

"This is generally useful, but widespread reliance on recursive AI systems could lead to a process we define as "knowledge collapse", and argue this could harm innovation and the richness of human understanding and culture... In our default model, a 20% discount on AI-generated content generates public beliefs 2.3 times further from the truth than when there is no discount."


There is a caveat to this. Strawberry/Q* relies on elements similar to AlphaZero's search component to find "strategies" suitable for a problem. That takes it further from "next-word prediction" than current models, and improves the quality of the output.

The downside is that this requires more compute during inference. That makes it too expensive to deploy directly.

Still, at least to some extent, this could allow a larger model to achieve similar performance to a Strawberry enhanced GPT-4o by adding more parameters, without the impact on speed and compute cost.

Humans often do the same. When we first learn some topic, we often use conscious reasoning (which has elements of a tree search) to find a way to solve it.

But if we practice enough times, it becomes "muscle memory".
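
To make that concrete, here is a minimal toy sketch (pure Python, and nothing to do with OpenAI's actual implementation) of the difference between answering in one shot and spending extra inference-time compute searching over candidates. `propose_candidate` and `value` are invented stand-ins for a model's sampler and a learned value function / verifier:

  import random

  TARGET = 41  # toy "problem": pick three numbers whose sum lands as close to 41 as possible

  def propose_candidate():
      # stand-in for sampling one complete answer from a model
      return [random.randint(1, 20) for _ in range(3)]

  def value(candidate):
      # stand-in for a learned value function / verifier: higher is better
      return -abs(sum(candidate) - TARGET)

  def one_shot():
      # "muscle memory": commit to the first answer that comes to mind
      return propose_candidate()

  def search(n=64):
      # "conscious reasoning": propose many answers, score them, keep the best.
      # Costs roughly n times the inference compute of one_shot().
      return max((propose_candidate() for _ in range(n)), key=value)

  random.seed(0)
  fast, slow = one_shot(), search()
  print("one shot:", fast, "score:", value(fast))
  print("search:  ", slow, "score:", value(slow))

The distillation idea is then to train the fast path to imitate whatever the slow path tends to find.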


> Strawberry/Q* relies on elements similar to AlphaZero's search component to find "strategies" suitable for a problem. That takes it further from "next-word prediction" than current models, and improves the quality of the output

Genuine question: is this independently substantiated? Or Altmanspeak?


The whole comment is "altmanspeak", which we might more directly call, "ideological descriptions of AI consistent with increasing stock prices"


The specifics are kept secret, but all the labs appear to be working on some variant.

It seems to be related to the DeepMind reference in [1] and most of [2].

[1] https://en.wikipedia.org/wiki/Q-learning [2] https://arxiv.org/pdf/2403.09629


This sounds like chess AI


More generally, it's part of what we call reasoning.

As opposed to doing what first comes to mind, which is similar to what regular LLMs have been doing.


You have far too much confidence in the idea that LLMs are anything like human brains. It’s next to meaningless to try to draw parallels between the two things.

Your assertion that “conscious reasoning” “has elements of a tree-search” is just completely made up. And the idea that human learning is at all similar to what LLM training is doing is completely divorced from reality.


But how do you reason? Because I definitely do brute force tree search in my brain to solve all sorts of problems.

E.g. let's imagine system design or some programming problem.

Based on my past experience or what I've read in general, my brain brings up potential solutions in theory. To me it's similar to embeddings search. It will try to pattern match the solutions. And the embeddings are in a tree or graph shape, where you narrow down constantly.

My brain then would start to evaluate the solutions in the order of likelihood that they fit the pattern according to my intuition.

Basically, I personally do see how an LLM with a certain chain of reasoning built in, algorithmically or otherwise, could represent my approach to problem solving. Because my problem solving can definitely be represented by a continuous flow of words.

I don't think current LLMs are exactly capable of that, because they would make too many mistakes somewhere and potentially get stuck, but I can't say they wouldn't be able to do that with more scale, when those mistakes get ironed out due to more ability to do nuanced things.
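
Roughly, that "retrieve candidates by similarity, then check them in the order intuition suggests" loop looks like the toy sketch below. The word-overlap "similarity" and the tiny memory of past solutions are made-up stand-ins for embeddings and experience:

  MEMORY = {
      "cache results in a key-value store": "read-heavy workload repeated queries",
      "shard the database by user id": "write-heavy workload huge table",
      "add a message queue between services": "spiky traffic decouple producers",
  }

  def similarity(problem, description):
      # stand-in for an embedding distance: count of shared words
      return len(set(problem.split()) & set(description.split()))

  def evaluate(problem, approach):
      # stand-in for the slower, deliberate check that an approach actually fits
      return similarity(problem, MEMORY[approach]) >= 2

  problem = "read-heavy workload with repeated queries on one table"
  ranked = sorted(MEMORY, key=lambda a: similarity(problem, MEMORY[a]), reverse=True)
  for approach in ranked:            # intuition order
      if evaluate(problem, approach):
          print("chosen:", approach)
          break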


> But how do you reason?

My theory: reasoning is the application of analogies. Here `analogy = memorization + pattern matching`. Pattern matching is just an associative memory query; remembering an example with fuzzy enough details to apply generally. Analogies are self-similar/recursive and sometimes transcend contexts -- they're useful for thinking, for thinking about thinking, etc. Analogies pervade our language, our thoughts, our approach to novel problems, and the cached solutions to trivial/practiced problems.

There is no qualitative distinction between System 1 & System 2 thinking. It's a spectrum, quantitative. The more analogy steps required, the more thinking done, and the more System 2-ish that thoughts feel. This (probably wrong) theory has many consequences:

1. LLMs are truly intelligent, albeit barely. They rely heavily on memorization rather than much analogy matching. Their depth of memorized knowledge can substitute for it and mislead us into misjudging that intelligence.

2. The validation set grokking phenomenon occurs when a new analogy is internalized.

3. Scale really is enough. The scaling laws will continue holding with astonishing accuracy.

4. MCTS heuristics are useful inductive biases to vastly lower compute at the expense of model flexibility. Eventually, The Bitter Lesson will come for MCTS too.

I have much more to say about this bonkers theory, but I fear already sounding like a ranting madman. Anyway, just my 2c.


Analogy as the fire and fuel of cognition is basically Hofstadter's view on intelligence. He actually has a book with that title. Also check G.E.B.


LLMs have a hard time reasoning about adjacent topics.

I have a favorite question for LLMs, one whose answer can be (and is) learned from papers from 2010-2012 (well before the advent of LLMs), and I have been asking it for two years now.

LLMs are able to cite relevant papers with "word-by-word" accuracy; they remember them quite well. Every paper on the subject has all the relevant definitions in it. Yet LLMs cannot combine adjacent definitions to come up with the ultimate solution to my question.

The question: "implement blocked clause decomposition in Haskell."

Google "blocked clause decomposition" for papers on subject.

Have fun.

Over time, LLMs seem to lose the ability to even approach the solution to this question in a first answer. They need more and more attention and correction nowadays.

I see it as the knowledge collapse mentioned in the paper I linked to. Instead of an answer I get a gamified pretense of a helping hand, and we all do.


We're unsure how minds or LLMs work, so let's not dismiss potential parallels either. It's okay to not know stuff. We'll get there!


I don't think LLM's are like (complete) human brains, I think they vaguely resemble what we mean by language "intuition".

Brains need several other functions, too. Including search, some kind of motivator or initiator and probably several layers of coordination/orchestration.

None of which need to be strictly separated from the other layers.

Adding some kind of Q search on top of LLM's means they're not just LLM's anymore, but a composite model that has an LLM as one component.


Generally agree, but I would argue that while the first L in LLM is what it is, the final LM is just what it does: a large model is still a "large language model" when it has other components besides a transformer involved in how it processes language, while a pure transformer model stops being a language model the moment it's trained on anything besides language — images (other than sign language), DNA sequences, financial data, etc.


> Still, at least to some extent, this could allow a larger model to achieve similar performance to a Strawberry enhanced GPT-4o by adding more parameters, without the impact on speed and compute cost.

I see a contradiction here, do you?


If it's proven to be a real issue, we might expect to see models trained on a lot of synthetic data with less knowledge but highly capable of reasoning, and other models less capable of reasoning but with broad knowledge.


Yeah, I'm by no means an LLM engineer, but even with basic knowledge of how it works, I can understand how it's a bad idea to feed an LLM with data from another LLM. Yes, you are probably going to sanitize it and it will have fewer hallucinations, but at the same time its scope will be much more limited.

For instance, they speak about product marketing strategies. That requires creativity, which AI is not capable of, though currently it can still borrow human creativity. With LLM data fed to another LLM, it's going to be diluted even more, and what will be left is extremely standard and common knowledge. It could be interesting for corporations running chatbots, but even here there is always the tiny risk it hallucinates and screws the company big time, which is a deal breaker.

No, I don't see the benefit.


It becomes less bad if the LLM is learning from something that is not a (pure) LLM, though.

Imagine if you let an LLM-like model learn to predict the next move from 1 billion AlphaZero self-play chess games.

The next-move prediction it ended up with might represent a much better chess player than a model trained just on human online games.

And it might ALSO be faster than AlphaZero, meaning it could possibly even beat AlphaZero if time controls were restricted to 1 minute each, or something (for the whole game).
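
A deliberately silly sketch of that setup (nothing like the real systems; the states, moves and "teacher" are all invented): generate decisions with a slow, search-like teacher, then build a fast student that imitates them:

  import random
  from collections import defaultdict, Counter

  MOVES = ["a", "b", "c", "d"]

  def slow_teacher(state):
      # stand-in for a search-based engine; the loop just mimics expensive search time
      for _ in range(10_000):
          pass
      return MOVES[hash(state) % len(MOVES)]

  # 1) generate "self-play" training data with the slow teacher
  states = [f"position-{i}" for i in range(500)]
  data = [(s, slow_teacher(s)) for s in states]

  # 2) "train" the student; here it just memorises the teacher's move per state
  student = defaultdict(Counter)
  for s, m in data:
      student[s][m] += 1

  def fast_student(state):
      # one cheap lookup instead of a search
      counts = student.get(state)
      return counts.most_common(1)[0][0] if counts else random.choice(MOVES)

  print(fast_student("position-7"), "==", slow_teacher("position-7"))

The caricature is that this student only memorises; the interesting part of the real thing is getting a student that generalises to positions the teacher never played.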


Chess has such a tiny scope and tightly defined rules that it’s hilarious to map this idea of generated chess games as training input onto anything resembling real-world concepts.

Also, do you have any evidence whatsoever for your bald assertion that an LLM might get faster than and/or defeat AlphaZero? When has such a thing occurred in a simpler problem space?


A chess move is a single token. A network that would predict the single token without traversing a search space would most likely be faster than a model that needed to do some kind of recursive search.

This kind of benefit is what sets AlphaZero apart from earlier engines in the first place. It was better able to evaluate a position just from the pattern on the board, and needed to search a much smaller part of the search space than the older chess engines did.

An engine that, at a glance, would be quite good at predicting what move AlphaZero would make given enough time would take this principle to the next level.

You see the same in humans like Magnus Carlsen. Even if he doesn't "calculate" a single move (meaning traversing part of the search space in chess lingo), he can beat most good amateurs by just going for the move that looks "obvious" to him.

Anyway, the point isn't chess. The point is that the search part of the Alpha family of models (which tend to be specialized) seems to be making its way to multi-modal models.

And that this makes them slower and more expensive to run. That's fine for some applications, but since their output is not really just regurgitating the input, their output MAY be more useful as training data for other models than other available training data.

Now there is another difference between AlphaZero and traditional LLM's, and that is in the RL-through-self-play. I don't really know if something like Strawberry would also require RL-training to actually outperform the original training data.


Having one inscrutable AI train another should just thrill the safetyists.


Wasn't aware this has already become an -ism...


Aka "doomers".


Missed opportunity to name it Strawbery


Lots of red flags here. Besides synthetic training data (huge red flag), this new model is going to somehow generate “correct” (red flag) training data. Also, this new model will be much more expensive to run (red flag), so they are distilling it (back down to lower quality — red flag) and need to figure out how to make it work in ChatGPT (not ready for release — red flag), and the true impact of it anyway is for the next generation (huge red flag) model.

We’re in a constant cycle of “sure the current stuff doesn’t live up to the hype but the next release that’s just on the horizon, that one will blow you away!”

This leak is clearly targeted at creating the buzz necessary to raise the money to keep the pipe dream flowing for the time being. But just thinking through the steps outlined here, it should be clear none of this makes sense. Any “correct” synthetic training data is either going to be badly biased, or be limited to such banal “logical” output that the model it trains will be entirely unable to process natural language input and you’ll need to specify your questions in something more precise. At which point, we’re back to traditional programming languages, only with several layers of unnecessary and expensive processing going on in between.


None of those things are red flags?

> Any “correct” synthetic training data is either going to be badly biased, or be limited to such banal “logical” output that the model it trains will be entirely unable to process natural language input and you’ll need to specify your questions in something more precise. At which point, we’re back to traditional programming languages, only with several layers of unnecessary and expensive processing going on in between.

If it had that problem, existing models wouldn't have been able to learn how to code at the same time as understanding natural language.

Are they limited? Sure. But not in the way you're saying here.


What they are trying to do is use an expensive inference strategy that results in better performance than standard inference and then try to create a model that performs like the expensive inference strategy "out of the factory" so to speak. I doubt that it will result in a huge performance increase, but it might be just enough to stay ahead of the competition.
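
One well-known example of such an expensive inference strategy is self-consistency: sample the same question many times and return the majority answer. A toy sketch, where the noisy `sample_answer` is a stand-in for a stochastic model; "out of the factory" would then mean fine-tuning on the majority answers so that a single cheap sample is enough:

  import random
  from collections import Counter

  def sample_answer(question, p_correct=0.6):
      # stand-in for one stochastic model sample: right 60% of the time
      truth = sum(map(int, question.split("+")))
      return truth if random.random() < p_correct else truth + random.choice([-1, 1])

  def cheap_inference(question):
      return sample_answer(question)      # one sample

  def expensive_inference(question, k=25):
      votes = Counter(sample_answer(question) for _ in range(k))
      return votes.most_common(1)[0][0]   # majority vote over k samples

  random.seed(1)
  q, trials = "17+25", 200
  cheap_acc = sum(cheap_inference(q) == 42 for _ in range(trials)) / trials
  costly_acc = sum(expensive_inference(q) == 42 for _ in range(trials)) / trials
  print("single sample accuracy: ", cheap_acc)
  print("majority-of-25 accuracy:", costly_acc)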


The Llama 3.1 405B model already uses synthetic data for SFT. They basically trained a bunch of models to generate, correct, and reject synthetic data for various tasks. They cover it in Section 4.3 of "The Llama 3 Herd of Models" paper.
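
For anyone curious, the rough shape of that generate/check/reject loop as a toy sketch (this is not the Llama 3 pipeline itself; the prompts, "generator" and checker below are stand-ins for a model plus execution feedback):

  import random

  def generate_solution(task):
      # stand-in for sampling candidate code from a model (often wrong)
      a, b = task
      op = random.choice(["+", "-", "*"])
      return f"result = {a} {op} {b}"

  def passes_check(task, code):
      # stand-in for execution feedback: run the candidate and verify the output
      a, b = task
      scope = {}
      exec(code, scope)                # fine here: we generated the string ourselves
      return scope["result"] == a + b  # the toy task is "add the two numbers"

  random.seed(0)
  tasks = [(i, i + 3) for i in range(50)]
  sft_data = []
  for task in tasks:
      for _ in range(8):                  # sample a handful of candidates per task
          code = generate_solution(task)
          if passes_check(task, code):    # keep only the verified ones...
              sft_data.append({"prompt": f"add {task[0]} and {task[1]}",
                               "completion": code})
              break                       # ...and reject the rest
  print(len(sft_data), "verified (prompt, completion) pairs; first:", sft_data[:1])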


Humans use synthetic data of their own to help with learning: reasoning, imagination, armchair philosophy, theorizing, and what neuroscience says about repetition in dreams. If synthetic data is bad or flawed, that's only temporary until the technique gets further honed.


There's a huge difference between "the AI receives synthetic data as if from an omniscient being and proceeds as if it were true" and "the animal brain creates synthetic data and assesses its plausibility and downstream consequences."

Synthetic data in AI training in fact has nothing to do with dreaming or armchair theorizing. It's a ridiculous comparison.


OpenAI has become the new vaporware announcer. No memory, no advanced voice, no SearchGPT.

GPT-4o is a dumbed-down version of GPT-4, GPT-4o mini is even more useless. Might be good for the API, but in ChatGPT?

GPT-4 is from March 14, 2023; since then nothing has improved.


Memory has been out for quite a while, hasn't it?


Not in Europe, where we pay the same price as those who do have access to it.


Interesting.


You just listed everything they do.


My account does none of these.


I'm not holding my breath for it. It will be good at solving "complex" high-school problems but nothing more. OpenAI has a history of overhyped launches now. Trying to keep those investors on their toes.

If I'm wrong, I dare OpenAI to release it early.


What percentage of work being done daily requires solving harder than "high-school" problems? Single digits? Low double digits? If it actually becomes capable of doing this, we might move on to it actually being deployable to real-world jobs.


It is already being deployed to real-world jobs.


Let me rephrase: deployable to a *much* higher percentage of real-world jobs, the ones that require solving harder than "high-school" problems. Right now the lack of real determinism and trust means that you can't really leave it unsupervised 100% of the time.


The newest models are solving some of the hardest Math Olympiad problems [1]. That puts them in the top 1% of the top 1% of math students coming out of high school.

The math performance, even for VERY hard problems, is already getting extremely good and progress is still high year-on-year.

What OpenAI (and everyone else, less publicly) seems to be doing is integrating these kinds of abilities into multi-modal models.

https://deepmind.google/discover/blog/ai-solves-imo-problems...


It seems misleading to lump AlphaGeometry in with this: that system works by having an LLM generate hundreds of Lean programs and trying to find the one that works. Frankly the project is pointless and borderline dishonest: it is a hideously expensive way to solve useless Olympiad problems, but it won't work at all for actual mathematical research since we wouldn't have nearly enough training data. It does not represent even the slightest advancement at AI performing mathematics, it's pure gimmickry. I suspect the entire point of AlphaGeometry is to generate comments like yours, "AI is now an Olympian-level mathematician."

The real AI math advancement will be a computer which is as smart as a pigeon and innately understands what a number is, which means it is able to solve simple counting problems without training. AlphaGeometry or Strawberry aren't even close to that, and I suspect none of us will live to see such a sophisticated AI.


Yeah sure, but AlphaProof is "cheating", because they are using an external system to do the hard work and the LLM is just there for translation and semantic information retrieval.

As impressive as it is, it makes LLMs look like they are just a fancy UI. Like, you can tell the computer to press the upvote button in English using your voice instead of using your mouse.

It's the same with AlphaProof. The LLM is just taking a problem specification in natural language and then acting as the UI of the theorem prover.

The moment you talk about using LLMs to solve decidable problems, people whip you with the idea that you should let the LLM generate python code. Aka, the LLM is just a fancy UI for specifying python programs in English.

Cool, but also somewhat hollow. It's no different than having a translator from one language to another.


"They showed it to the feds" means "Competition has more than caught up and is beating us, so we'll now sell to the most clueless buyer: one that has unlimited budget".


When General Paul M. Nakasone, who until just recently was near the top of the chain for Cyber Command in the US Armed Forces, joined the OpenAI board, it could be seen in multiple ways.

My take is that this wasn't something OpenAI did as a PR stunt, but rather that the military started seeing this tech as being critical to national security, and basically forced OpenAI to take him on the board.

That's not necessarily good for stock prices. If AIs beyond some capability level are labelled as strategic military assets, then the Nvidia export restrictions may end up a pale foreshadowing of what AI companies could be facing soon.


Compare to the Microsoft HoloLens project, which peddles VR goggles to the U.S. Army:

https://www.cnbc.com/2021/03/31/microsoft-wins-contract-to-m...

I am sure it will be a raging success among the troops on the ground.


Totally an AI-written article; it could all be a hallucination, after all.


Strawberry? With two “r”s?


No, three.



