Yes, there's a wide variety of use cases that require different ratios of accuracy/speed. If you require 3 responses to be accurate, you have to multiply all 3 response accuracy probabilities, and as you've shown, this can reduce overall accuracy quite a bit. Of course, this does make the assumption that those 3 responses are independent of one another.
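For a concrete (hypothetical) number: at 90% per-response accuracy, three independent responses compound like this:

```python
# Sketch of the compounding-accuracy point above. The 90% figure is
# an assumption for illustration, not a measured number.
per_response_accuracy = 0.9
n_responses = 3

# Probability that all n independent responses are accurate:
all_accurate = per_response_accuracy ** n_responses
print(f"{all_accurate:.3f}")  # 0.729
```

So three 90%-accurate responses together are right only ~73% of the time, and the independence assumption usually makes the real number look better than it is.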
One thing I considered some months ago that was very similar to what you guys have done, but at a higher abstraction layer:
1. Consult many models (or a single model with higher temp) with the same prompt
2. Intelligently chunk the outputs (by entity, concept, subject, etc.)
3. Put each chunk into a semantic bucket (similar chunks live in the same bucket)
4. Select winning buckets for each chunk.
4a. Optionally push the undervoted chunks back into the model contexts for followup: is this a good idea, does it fit with what you recommended, etc.
4b. Do the whole chunk/vote process again
5. Fuse outputs. Mention outliers.
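The steps above (minus the optional 4a/4b follow-up loop) can be sketched with toy chunking and bucketing. Every function here is made up for illustration: a real system would chunk by entity/concept and bucket with embeddings plus clustering so paraphrases land together.

```python
from collections import defaultdict

def chunk(output: str) -> list[str]:
    # Toy chunker: one chunk per sentence (step 2). A real system would
    # chunk by entity, concept, subject, etc.
    return [s.strip() for s in output.split(".") if s.strip()]

def bucket_key(chunk_text: str) -> str:
    # Toy semantic bucket: normalized bag of words (step 3). A real system
    # would use embeddings so genuine paraphrases share a bucket.
    return " ".join(sorted(chunk_text.lower().split()))

def fuse(outputs: list[str]) -> tuple[list[str], list[str]]:
    buckets = defaultdict(list)
    for out in outputs:
        for c in chunk(out):
            buckets[bucket_key(c)].append(c)
    majority = len(outputs) / 2
    winners, outliers = [], []
    for members in buckets.values():
        # Step 4: a bucket "wins" if a majority of models produced it.
        (winners if len(members) > majority else outliers).append(members[0])
    return winners, outliers

outputs = [
    "Paris is the capital. The Seine runs through it.",
    "Paris is the capital. It has 12 million people.",
    "Paris is the capital. The Seine runs through it.",
]
winners, outliers = fuse(outputs)
print(winners)   # chunks a majority of models agree on
print(outliers)  # minority chunks (step 5: fuse winners, mention outliers)
```

The interesting (and token-hungry) parts are exactly the ones stubbed out here: real chunking, real semantic bucketing, and the 4a/4b loop of pushing undervoted chunks back for follow-up.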
Token spend is heavy here, since this relies on LLMs to make decisions instead of the underlying math you guys went with. IMO, the solution y'all have reached is far more elegant than my idea.
I like the direction you're going with this strategy. There are many approaches, nuances, edge cases, and clever tricks to each of these steps, even without taking into account token probability distributions. Very powerful to get it right.
Great question. What I can say is we experimented a _ton_. If you take a basic approach and simply ask the same prompt of a bunch of LLMs and then ask another LLM to combine the results, you'll get a pretty poor answer. At best, you'll get a response that is the average of the ensemble, which by definition is going to be worse than the best model of the ensemble. Of course, you're going to want a mechanism to choose the ensemble effectively. At worst, you'll regurgitate the worst model of the ensemble. And you'll have the added expense and potential latency, too. Not a good solution at all.
We didn't experiment with different ensemble mechanisms rigorously enough for a research paper. We will, though.
Majority voting was actually how we started, and we came up with great mechanisms for stopping early, saving token costs and time, along with other interesting things we could do with that simple mechanism. The issue we had was that the orchestration could already choose a model beforehand that was almost as good (according to simpler benchmarks than HLE that we ran at the time) as majority voting could pick after the responses were complete. And we tried many voting mechanisms, such as all models in the ensemble voting on all others.
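The early-stopping idea can be sketched generically. This is not their actual mechanism, just the textbook version: stop querying models as soon as the current leader can no longer be overtaken by the votes still outstanding.

```python
from collections import Counter

def majority_vote_early_stop(sample_answer, n_total):
    """Query models one at a time; stop as soon as the current leader
    cannot be caught by the remaining unqueried models."""
    counts = Counter()
    for i in range(n_total):
        counts[sample_answer(i)] += 1
        leader, lead_count = counts.most_common(1)[0]
        remaining = n_total - (i + 1)
        runner_up = max((c for a, c in counts.items() if a != leader), default=0)
        if runner_up + remaining < lead_count:
            return leader, i + 1  # answer, responses actually paid for
    leader, _ = counts.most_common(1)[0]
    return leader, n_total

# Toy ensemble of 5 "models": the first three already agree.
answers = ["42", "42", "42", "17", "42"]
result, used = majority_vote_early_stop(lambda i: answers[i], len(answers))
print(result, used)  # "42" after 3 of 5 responses
```

The saving compounds when the models agree early and the per-response cost is high, which is presumably where the token/time savings mentioned above come from.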
An ablation study would be great to do now, with many other ideas we've played with. We have better benchmarks than we did just a few months ago, and it would be great to understand the tradeoffs of different approaches so that there could be alternative options for different use cases.
I want to clarify what Ken meant by "entropy in the output token probability distributions." Whenever an LLM outputs a token, it's choosing that token out of all possible tokens. Every possible output token has a probability assigned by the model (typically exposed as a log probability). This is a probability distribution (the output token probabilities sum to 1). Entropy is a measure of uncertainty and can quantify whether a token probability distribution is certain (1 token has a 99.9% probability, and the rest share the leftover 0.1% probability) or uncertain (every token has roughly the same probability, so it's pretty much random which token is selected). Low entropy is the former case, and high entropy is the latter.
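A minimal sketch of the two extremes, assuming a made-up 50,000-token vocabulary:

```python
import math

def entropy_bits(probs: list[float]) -> float:
    # Shannon entropy (in bits) of a token probability distribution.
    return -sum(p * math.log2(p) for p in probs if p > 0)

vocab = 50_000  # assumed vocabulary size, for illustration only

# Certain: one token at 99.9%, the rest sharing the leftover 0.1%.
rest = 0.001 / (vocab - 1)
certain = [0.999] + [rest] * (vocab - 1)

# Uncertain: every token equally likely.
uniform = [1 / vocab] * vocab

print(f"low entropy:  {entropy_bits(certain):.3f} bits")
print(f"high entropy: {entropy_bits(uniform):.3f} bits")
```

The uniform case comes out at log2(50,000) ≈ 15.6 bits, while the 99.9% case is a small fraction of a bit.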
There is interesting research in the correlation of entropy with accuracy and hallucinations:
Wow, if it is that easy to detect hallucinations, are the big models or rigs (agentic scaffolds) building in any self-correcting behaviour? Or possibly switching to an "I don't know" mode so the model can ask the human for help understanding?
Maybe this insight is why I feel hallucinations are much rarer in the last 12 months on top models. Are they being detected before they get sent out?
I wouldn't say it's easy to detect hallucinations. Understanding output token probability distributions is only part of a solution, and we still aren't perfect. Just better than individual models.
Hallucinations may seem rarer for a few reasons. First, models are more accurate with certain prompts. Second, models are more convincing when they do hallucinate. They may get an overall idea, but hallucinate the details. Hallucinations are still a major problem and are fundamental to the way modern LLMs work.
Is the difficulty that in high entropy situations, you can’t really tell whether it’s because the model is uncertain, or because the options are so semantically similar that it doesn’t matter which one you choose? Like pure synonyms.
If 2 (or more) tokens are synonymous with each other with high probabilities (49.9% each for a total of 99.8%), that's still low entropy. Not as low as a singular high-probability token, but low enough for us to consider this a low-entropy token distribution.
You can't look at a single token distribution, though. There are many legitimate high-confidence, high-accuracy cases in which many tokens could come next. For example, the first token of a paragraph. You need to look at pools of entropies over segments of the output or the whole output sequence.
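A toy illustration of both points, using a hypothetical 4-token vocabulary: the near-synonym split is still low entropy, a genuinely open choice (like the first token of a paragraph) is high entropy, and an aggregate over the segment (here just a mean, one of many possible pooling choices) is what you'd actually inspect:

```python
import math

def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical per-token distributions, for illustration only.
# Token 1: two near-synonyms at 49.9% each -> still low entropy (~1 bit).
# Token 2: one confident token -> near-zero entropy.
# Token 3: wide-open choice -> maximal entropy (2 bits over 4 tokens).
sequence = [
    [0.499, 0.499, 0.001, 0.001],
    [0.997, 0.001, 0.001, 0.001],
    [0.25, 0.25, 0.25, 0.25],
]

per_token = [entropy_bits(d) for d in sequence]
mean_entropy = sum(per_token) / len(per_token)
print([round(h, 2) for h in per_token], round(mean_entropy, 2))
```

With a real ~50,000-token vocabulary the synonym case stays near 1 bit while a truly uncertain distribution would be ~15.6 bits, so the two are easy to separate; the hard part, as noted above, is deciding which aggregate over which segments correlates with hallucination.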
Although there's a correlation between uncertainty and hallucinations or inaccuracies, there's no guarantee. This is a challenging area that we're monitoring the latest literature for and contributing where we can.
Buddy… your son gets a top post on HN in which he clearly mentions you, yet you feel the need to make an account just to correct him in the first comment? Can’t you send him a message and let him correct it?
You're right! I could've phrased my comment better. Ken actually wanted to edit his post, but it was too late. So he asked me to write a response explaining what he meant. Of course, he could've commented too. I was just trying to be helpful to him and others wanting an explanation.