> Microsoft used LLMs to write millions of short stories and textbooks in which one thing builds on another. The result of training on this text, Bubeck says, is a model that fits on a mobile phone but has the power of the initial 2022 version of ChatGPT.

I thought training LLMs on content created by LLMs was ill-advised, but this would suggest otherwise.



Look into Microsoft's Phi papers. The whole idea is that if you train models on higher quality data (e.g. textbooks instead of blogspam), you get higher quality results.

The exact training is proprietary but they seem to use a lot of GPT-4 generated training data.

On that note... I've often wondered if broad memorization of trivia is really a sensible use of precious neurons. It seems like a system trained on a narrower range of high quality inputs would be much more useful (to me) than one that memorized billions of things I have no interest in.

At least at the small model scale, the general knowledge aspect seems to be very unreliable anyways -- so why not throw it out entirely?


The trivia encode information about many things: grammar, vocabulary, slang, entity relationships, metaphor, and more. Chiefly, though, they also constitute models of human thought and behaviour. If all you want is a fancy technical encyclopedia, then by all means chop away at the training set, but if you want something you can talk to, then you'll need to keep the diversity.


> you’ll need to keep the diversity.

You can get diverse low-quality data from the web, but for diverse high-quality data the organic supply is exhausted. The only way forward is to generate it, and you can maintain a good distribution through structured randomness. For example, just sample 5 random words from the dictionary and ask the model to compose a piece of text using them. It will be more diverse than web text.
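A minimal sketch of that structured-randomness idea in Python; the word-list path and the prompt wording are just assumptions for illustration:

    import random

    def seeded_prompt(vocab, k=5, rng=random):
        # Sample k random words and ask the model to weave them together.
        words = rng.sample(vocab, k)
        return ("Write a short, self-contained piece of text that "
                "naturally uses all of these words: " + ", ".join(words))

    # Load a word list; /usr/share/dict/words is one common location.
    with open("/usr/share/dict/words") as f:
        vocab = [w.strip() for w in f if w.strip().isalpha()]

    print(seeded_prompt(vocab))  # feed this to whatever LLM generates the data

Because the seed words are drawn uniformly from the whole vocabulary, the generated pieces cover combinations that rarely co-occur in scraped web text.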


Not exhausted, just not currently being collected. Generating via existing models is fine for distilling a better training set or refining existing low-quality samples, but it won't break out of distribution without some feedback mechanism. That's why simulation is promising, though it's pretty narrow at the moment. There's still a lot of space to fill in the point cloud, so coming up with novel data collection methods is important. I think this is off topic though; my original contention was that if you take too thin a slice, you won't get a very useful model.


You're not just memorizing text, though. Each piece of trivia represents a coherent part of reality. Think of it as being highly compressed.


From what I've seen, Phi does well on benchmarks but poorly in real-world scenarios. They also made some odd decisions regarding the network structure, which means the memory requirements for larger contexts are really high.


> I've often wondered if broad memorization of trivia is really a sensible use of precious neurons.

I agree if we're talking about maximizing raw reasoning and logical inference abilities, but the problem is that the ship has sailed: people expect LLMs to have domain knowledge (even more than expert users are clamoring for LLMs to have better logic).

I bet a model with actual human “intelligence” but no Google-scale encyclopedic knowledge of the world it lives in would be rated less favorably by the masses than what we have now.


Synthetic data (data from some kind of generative AI) has been used in some form or another for quite some time [0]. The license for LLaMA 3.1 was updated to specifically allow its use for generating synthetic training data. OpenAI, famously, has a ToS clause forbidding the use of its models to generate training data for competing models, but it isn't enforced at the moment. It's pretty typical to look through a model card or paper and see an LLM or other generative AI used for some form of synthetic data generation somewhere in the development process: various stages of data prep, training, evaluation, etc.

Phi is another really good example, but that's already covered in the article.

[0] - https://www.latent.space/i/146879553/synthetic-data-is-all-y...


As others point out, it's essentially distillation of a larger model into a smaller one. But you're right, it doesn't work very well. Phi's performance is high on benchmarks but not nearly as good in actual real-world usage. It's extremely overfit on a narrow range of topics in a narrow format.


I would guess that correctly aligned and/or finely filtered synthetic data coming from LLMs can be good.

Mode collapse theories (and the simplified models used as proof of existence of that problem) assume the affected LLMs are trained on poor-quality batches of LLM-generated text scraped from the internet (i.e. Reddit or other social networks).
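As a rough sketch of what "finely filtered" could mean in practice, here score_fn is a stand-in for whatever judge you trust (a reward model, an LLM-as-judge call, or simple heuristics like length and perplexity):

    import hashlib

    def filter_synthetic(samples, score_fn, threshold=0.7):
        # Deduplicate exact copies and keep only samples the scorer rates
        # highly; both steps guard against the degenerate feedback loops
        # the mode-collapse papers describe.
        seen, kept = set(), []
        for text in samples:
            digest = hashlib.sha256(text.encode()).hexdigest()
            if digest in seen:
                continue
            seen.add(digest)
            if score_fn(text) >= threshold:
                kept.append(text)
        return kept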


That's the number one way of getting mad LLM disease. Feeding LLMs to LLMs.


Generally (not just for LLMs) this is called student-teacher training and/or knowledge distillation.
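To make the terminology concrete: the classic knowledge-distillation loss (Hinton et al., 2015) trains the student on the teacher's softened output distribution. A rough PyTorch sketch follows; note that training on LLM-generated text, as discussed upthread, is the sequence-level variant rather than this logit-matching form:

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, t=2.0):
        # The student is trained to match the teacher's softened distribution;
        # a temperature t > 1 exposes the teacher's "dark knowledge" about
        # near-miss classes.
        soft_targets = F.softmax(teacher_logits / t, dim=-1)
        log_probs = F.log_softmax(student_logits / t, dim=-1)
        # kl_div expects log-probabilities as input; t*t rescales gradients
        # back to the magnitude of the hard-label loss.
        return F.kl_div(log_probs, soft_targets, reduction="batchmean") * (t * t)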


It reminds me of when I take notes from a textbook then intensively review my own notes


And then when it comes time for the test, I end up hallucinating answers too.


There have been efforts to train small LLMs on bigger LLMs' outputs. Ever since Llama came out, the community has been creating custom fine-tunes this way using ChatGPT.


I think it can be a tradeoff to get to smaller models: use larger models trained on the whole internet to produce output that trains the smaller model.
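In the LLM setting that's usually done as sequence-level distillation: sample text from the big model and treat it as ordinary training data for the small one. A sketch with Hugging Face transformers, where "big-teacher-model" and the prompts are placeholders:

    import json
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # "big-teacher-model" stands in for any large causal LM checkpoint.
    tok = AutoTokenizer.from_pretrained("big-teacher-model")
    teacher = AutoModelForCausalLM.from_pretrained("big-teacher-model")

    prompts = ["Explain photosynthesis to a 10-year-old.",
               "Write a short story about a lighthouse."]

    # Sample from the teacher and save the text as training data
    # for fine-tuning the student.
    with open("synthetic_train.jsonl", "w") as out:
        for p in prompts:
            ids = tok(p, return_tensors="pt")
            gen = teacher.generate(**ids, max_new_tokens=256,
                                   do_sample=True, temperature=0.8)
            text = tok.decode(gen[0], skip_special_tokens=True)
            out.write(json.dumps({"text": text}) + "\n")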


> Microsoft used LLMs to write millions of short stories and textbooks

Millions? Where are they? Where are they used?


Model developers don't usually release training data like that.



