If I had the funds I'd run the whole training set (GPT-4 reportedly used ~13 trillion tokens) through an LLM to mine factual statements, then do reconciliation, or better yet, save a summary description of the diverse results. We'd end up with a universal KB. Even for controversial topics it would at least model the distribution of opinions, and it could confirm whether a given statement appears anywhere in the database.
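A minimal sketch of what I mean by mining plus reconciliation, assuming an OpenAI-compatible client; the model name, prompt, and extract_triples()/reconcile() helpers are just illustrative, not a real pipeline:

```python
import json
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Extract factual (subject, relation, object) triples from the text below. "
    "Reply with a JSON list of 3-element lists, nothing else.\n\nTEXT:\n{chunk}"
)

def extract_triples(chunk: str) -> list[tuple[str, str, str]]:
    # Ask the model to pull out candidate facts from one corpus chunk.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice of extractor model
        messages=[{"role": "user", "content": PROMPT.format(chunk=chunk)}],
    )
    try:
        return [tuple(t) for t in json.loads(resp.choices[0].message.content)]
    except (json.JSONDecodeError, TypeError):
        return []  # skip chunks the model failed to format as JSON

def reconcile(corpus_chunks: list[str]) -> Counter:
    # Count how often each normalized triple is asserted across the corpus.
    # Rather than forcing a single "truth", the counts model the distribution
    # of claims, so a lookup can report both support and disagreement, and a
    # count of zero means the statement simply isn't in the database.
    counts: Counter = Counter()
    for chunk in corpus_chunks:
        for s, r, o in extract_triples(chunk):
            counts[(s.lower().strip(), r.lower().strip(), o.lower().strip())] += 1
    return counts
```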
Besides mining KB triplets, I'd also use the LLM with contextual material to generate Wikipedia-style articles based on external references. It could write 1000x more articles than Wikipedia has, covering all known names and concepts, producing trillions of high-quality synthetic tokens. These would be added to the pre-training mix.
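An equally rough sketch of the article-generation side: given an entity name and a few retrieved reference snippets, ask the model for a grounded, Wikipedia-style article. The prompt, the model name, and the retrieve_refs() helper in the usage comment are all hypothetical:

```python
from openai import OpenAI

client = OpenAI()

def write_article(entity: str, references: list[str]) -> str:
    # Number the reference snippets so the model can cite them inline.
    context = "\n\n".join(f"[{i + 1}] {ref}" for i, ref in enumerate(references))
    prompt = (
        f"Write a concise, neutral, Wikipedia-style article about '{entity}'. "
        "Use only the numbered references below and cite them inline as [n].\n\n"
        f"{context}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model, swap for whatever is available
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content or ""

# e.g. synthetic_corpus = [write_article(name, retrieve_refs(name)) for name in entities]
# ...and the resulting articles would be mixed into the pre-training data.
```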