Yeah, I'm by no means an LLM engineer, but even with basic knowledge of how it works, I can understand why it's a bad idea to feed an LLM data from another LLM. Sure, you'll probably sanitize it and it will hallucinate less, but at the same time its scope will be much more limited.
For instance, they talk about product marketing strategies. That requires creativity, which AI is not capable of, though right now it can still borrow human creativity. With LLM data fed to another LLM, that gets diluted even further, and what's left is extremely standard, common knowledge. It could still be interesting for a corporation running a chatbot, but even there, there's always a small risk it hallucinates and screws the company big time, which is a deal breaker.
It becomes less bad if the LLM is learning from something that is not a (pure) LLM, though.
Imagine if you let an LLM-like model learn to predict the next move from 1 billion AlphaZero self-play chess games.
The next-move predictor it ended up with might represent a much better chess player than a model trained only on human online games.
And it might ALSO be faster than AlphaZero, meaning it could possibly even beat AlphaZero under tight time controls, say 1 minute per side for the whole game.
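The distillation idea above can be sketched in miniature. This is a toy, not AlphaZero: Nim stands in for chess, exhaustive minimax stands in for MCTS self-play, and a plain dict stands in for the student policy network. The point it illustrates is the one being argued: once the teacher's search results are recorded as (position, move) pairs, the student answers with a single lookup instead of a recursive search.

```python
from functools import lru_cache

MOVES = (1, 2, 3)  # Nim: take 1-3 stones; whoever takes the last stone wins

@lru_cache(maxsize=None)
def search_value(n):
    """Teacher: full game-tree search (standing in for AlphaZero's search).

    Returns +1 if the player to move from a pile of n stones wins
    with perfect play, -1 otherwise.
    """
    if n == 0:
        return -1  # the previous player took the last stone and won
    return max(-search_value(n - m) for m in MOVES if m <= n)

def teacher_move(n):
    """Pick the move the search says is best (first best on ties)."""
    return max((m for m in MOVES if m <= n),
               key=lambda m: -search_value(n - m))

# "Self-play data": record the teacher's chosen move for many positions...
dataset = {n: teacher_move(n) for n in range(1, 200)}

# ...and "train" a student on it. A dict lookup plays the role of the
# distilled policy net's single forward pass: no search at inference time.
student = dict(dataset)

assert all(student[n] == teacher_move(n) for n in range(1, 200))
```

A real policy network would also generalize to unseen positions rather than memorize, but the speed argument is the same: the student's cost per move is constant, while the teacher's grows with the depth of the search.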
Chess has such a tiny scope and such tightly defined rules that it's hilarious to map this idea of generated chess games as training input onto anything resembling real-world concepts.
Also, do you have any evidence whatsoever for your bald assertion that an LLM might get faster than and/or defeat AlphaZero? When has such a thing occurred in a simpler problem space?
A chess move is a single token. A network that predicts that single token without traversing a search space would most likely be faster than a model that needs to do some kind of recursive search.
This kind of benefit is what sets AlphaZero apart from earlier engines in the first place. It was better able to evaluate a position just from the pattern on the board, and needed to search a much smaller part of the search space than the older chess engines did.
An engine that could, at a glance, predict what move AlphaZero would make given enough time would take this principle to the next level.
You see the same in humans like Magnus Carlsen. Even if he doesn't "calculate" a single move (meaning traversing part of the search space in chess lingo), he can beat most good amateurs by just going for the move that looks "obvious" to him.
Anyway, the point isn't chess. The point is that the search part of the Alpha family of models (which tend to be specialized) seems to be making its way to multi-modal models.
And that this makes them slower and more expensive to run. That's fine for some applications, but since their output is not really just regurgitating the input, it MAY be more useful as training data for other models than other available training data.
Now there is another difference between AlphaZero and traditional LLMs, and that is the RL-through-self-play. I don't really know whether something like Strawberry would also require RL training to actually outperform its original training data.
No, I don't see the benefit.