Who wants to learn from anything less than the best sources?
I've often thought that a search engine that indexes only the highest quality, probably hand-curated, sources would be highly desirable. I'm not really interested in learning from everyone about, for example, physics or history or climate change or the invasion of Ukraine; I only want the best. I'm not missing out, practically: there is far more than enough of the 'best' to consume all my time; there's a large opportunity cost to reading other things. Choosing the 'best' is somewhat subjective, but it is far better than arbitrarily or randomly choosing sources.
LLMs, used for knowledge discovery and retrieval, would seem to benefit from the same sources.
Diversity and quantity are important for LLM training.
A search engine can index more than just "the best sources", and show results from the tail when no relevant matches are in the best sources.
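A minimal sketch of that fallback behavior, with hypothetical index objects and a hypothetical search() method standing in for the real machinery:

    def tiered_search(query, curated_index, full_index, min_results=5):
        """Prefer hits from the hand-curated index; fall back to the full index."""
        hits = curated_index.search(query)
        if len(hits) >= min_results:
            return hits
        # Not enough coverage in the curated slice: pad from the tail,
        # keeping curated results ranked first and skipping duplicates.
        seen = {h.url for h in hits}
        for h in full_index.search(query):
            if h.url not in seen:
                hits.append(h)
            if len(hits) >= min_results:
                break
        return hits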
I would agree with a softer restatement of your thesis, though: I am sure there is a lot of diminishing marginal utility in indexing broadly, especially as the web keeps getting more and more full of spam and nonsense.
For pre-training LLMs, the quality/quantity/diversity story is more nuanced. They do seem to benefit a lot from quantity. For a fixed training budget, the choice between training on the same high-quality documents for more epochs or training on lower-quality but unseen data is an interesting area of research. Empirically, the returns from additional epochs on the same data start to diminish after about the fourth pass. All the research I've read tends to have an all-or-nothing flavor to data selection: either a document makes it in and gets processed the same number of times as everything else, or it doesn't get in at all. There is probably some juice in the middle ground, where high-quality data gets 4x'ed, bad data is still eliminated, but the lesser-but-not-terrible data gets in once.
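One way to picture that middle ground (purely illustrative thresholds, and a made-up 0..1 quality score):

    def epochs_for(quality_score):
        """Map a 0..1 quality score to how many times a document is trained on."""
        if quality_score >= 0.8:
            return 4   # high quality: repeat, but returns diminish past ~4 epochs
        if quality_score >= 0.4:
            return 1   # lesser but not terrible: include once
        return 0       # bad data: eliminated entirely

    def build_training_stream(documents):
        """Yield each document as many times as its quality tier allows."""
        for doc in documents:
            for _ in range(epochs_for(doc["quality"])):
                yield doc["text"]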
This might be a naive question but how does one determine what "best" is for multiple subjects?
Even in your example, physics and mathematics could be curated for "best" when dealing with equations and foundational knowledge that has been hardened over decades. But for history, climate change, or the invasion of Ukraine, isn't "best" sensitive to bias, manipulation, and interpretation? These are not exact sciences.
> how does one determine what "best" is for multiple subjects?
Perhaps invert the question - how to recognize "not-best"? If it's on a consensus list of common misconceptions, it's not-best. Science textbooks, web and outreach content, are thus often not-best. If the topic isn't the author's direct research or professional focus, it's likely not-best. People badly underestimate how rapidly expertise degrades as you blur from focus to subfield, let alone to broader field. Journalism is pervasively not-best. If the author won't be embarrassed by serving not-best, it likely is. Beware communities where avoiding not-best embarrassment isn't a dominating incentive.
> not exact sciences.
Most content fails even the newspaper test: any professional familiar with the topic will recognize that it's wrong. This applies as much to science and engineering as to anything else. Not-best.
"Soft" fields do have challenges. Subcultures with incompatible "this work is great/trash" evaluations. Integration of diverse perspectives in general.
But note that agreement and uncertainty are often poorly characterized. A description of "A, B, and C" rather than "A. And also B and C, orders-of-magnitude down." "B vs C!" rather than "A. And A.B vs A.C." Leaving out the important insight, the foundational context, is common. And sloppy argumentation. Not-best. Basically, there's opportunity for very atypically extensive pruning of not-best before becoming constrained by uncertainty rather than by effort.
Once you eliminate the not-best, whatever remains, however imperfect, is... far less wretched than usual.
Perhaps you could limit training to peer-reviewed sources only. The peer review process is imperfect, but it is perhaps the closest thing we have to flagging something as the "best" answer for a particular topic.
History (with a possible exception for scientific history), politics, and current affairs fall, I would say, outside the scope of "scientific knowledge". I do not think it is possible to avoid bias in those topics.
A significant question is what the cutoff point would be for a model based on "scientific knowledge". Should subjects like economics, philosophy, etc. be included as scientific knowledge, or should it be limited to "hard" sciences only?
Peer review isn't all that, and a lot of subjects are censored; even disinformation is accepted if it flatters the ideological inclinations of the publication.
See: Proximal Origins from Nature
You have to spend quite a lot of time thinking about quality and values. It becomes impossible as the size of the “best slice” you’re seeking gets smaller (top half is much easier than top ten percent, etc.).
If your values are “everyone should agree with my opinions”, you’ll have a garbage, biased data set. There are other values, though. Bias-free is also impossible, because having a definition of a perfectly neutral bias is itself a very strong bias.
"Best" will be chosen by the creators of software for specific application uses. Medical software will use the "best" medical LLM under the hood. Programming software (Copilot et. all) will use the "best" programming LLM. General purpose language models will probably still be used by the public when doing internet searches. Or, an idea that just popped into my head, use a classifier to determine which model can most accurately answer the user's query, and send the query off to that model for a response.
> how does one determine what "best" is for multiple subjects?
Also, "best" depends on audience and use-case.
Imagine a horribly tone-deaf LLM-powered Sesame Street episode about the importance of recycling, illustrated by supply-demand graphs and Kekulé structures of plastic polymers.
On the other hand, given the query "facial recognition in paper wasps", Perplexity just gave not only an answer in accord with my prior understanding derived from reading in the field, but also surfaced a paper [1] published two months ago that I hadn't yet seen.
I expected less, and I suspect a researcher could easily find gaps. But from the perspective of an amateur autodidact, that's still a fairly impressive result.
> Even a subscription model will eventually skew towards placating the masses with "dumbed down" content.
Accuracy and simplicity are not the same. I can see that most people won't want to read the Stanford Encyclopedia of Philosophy's take on Plato. But anyone can read the Associated Press rather than someone's misinfo on the topic. Cut out the latter.
That makes sense, but the principle of "more data = more better" suggests that maybe training an LLM on all the possible data and then fine-tuning it to only spit out the best answers might be better than training it only on the best data to begin with.