You can get diverse low-quality data from the web, but for diverse high-quality data, the organic content is exhausted. The only way to get more is to generate it, and you can maintain a good distribution through structured randomness. For example, just sample 5 random words from the dictionary and ask the model to compose a piece of text from them. The result will be more diverse than web text.
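To make the "5 random words" idea concrete, here is a rough sketch of how the prompt side might look. The word list, the function name `make_prompt`, and the prompt wording are all placeholders I made up for illustration; a real setup would presumably load a full dictionary and pass the prompt to whatever model is being used.

```python
import random

# Hypothetical stand-in vocabulary; in practice this would be a full
# dictionary word list, not a dozen hand-picked words.
WORDS = [
    "lantern", "glacier", "ledger", "orchard", "violin", "turbine",
    "marsh", "quorum", "saddle", "ember", "lattice", "harbor",
]

def make_prompt(words, k=5, rng=None):
    """Sample k distinct seed words and build a generation prompt from them."""
    rng = rng or random.Random()
    chosen = rng.sample(words, k)  # sampling without replacement
    prompt = (
        "Write a short passage that uses all of these words: "
        + ", ".join(chosen)
        + "."
    )
    return chosen, prompt

if __name__ == "__main__":
    # A fixed seed makes the sampled words reproducible across runs.
    chosen, prompt = make_prompt(WORDS, k=5, rng=random.Random(0))
    print(prompt)
```

The randomness only controls the seed words; what the model writes around them still comes from the model.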
It's not exhausted, just not currently being collected. Generating via existing models is OK for distilling a better training set or refining existing low-quality samples, but it won't break out of distribution without some feedback mechanism. That's why simulation is promising, though it's pretty narrow at the moment. There's still a lot of space to fill in the point cloud, so coming up with novel data collection methods is important. I think this is off topic, though; my original contention was that if you take too thin a slice, you won't get a very useful model.