There are many things lacking from large language models, and while some are annoying (its grasp of numbers, its lack of understanding of time), the thing most blatantly missing is the ability to express confidence properly and not be "confidently wrong".
If I ask it "What color is the Sky" it has to say "I'm 99.9% sure the sky is Blue when it's clear and daytime".
If I ask it who was the first human on Mars, it should say either "To my knowledge, no human has been on Mars; I'm very certain of this" or "Bob Smith was the first human on Mars, but beware that this is a fact I know with very low confidence".
Instead, it will often give a confident-sounding bullshit answer such as "Bob Smith was the first human on Mars".
Once it can do that, it might actually be useful. Someone might argue that existing search engines spew out falsehoods all day, but I don't think a ChatGPT-powered search should get away with what Google Search does. There it feels like the quality of my query controls what answers I get (though I do think that when Google "promotes" an answer into the fact box, it should be very confident, which sometimes fails).
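For what it's worth, a crude version of that signal is already reachable: the API exposes per-token log-probabilities. Here's a minimal sketch (Python, using the openai client; the model name is an arbitrary assumption) that averages them into a rough confidence number. This is not calibrated confidence, since a model can assign high probability to fluent nonsense, but it's one of the few uncertainty signals you can actually pull out today.

    # Minimal sketch: average per-token log-probabilities as a crude
    # "how sure was the model" proxy. NOT calibrated confidence.
    import math
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def answer_with_confidence(question: str) -> tuple[str, float]:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # model choice is arbitrary here
            messages=[{"role": "user", "content": question}],
            logprobs=True,
        )
        choice = resp.choices[0]
        token_logprobs = [t.logprob for t in choice.logprobs.content]
        # Geometric mean of the token probabilities.
        avg_prob = math.exp(sum(token_logprobs) / len(token_logprobs))
        return choice.message.content, avg_prob

    answer, conf = answer_with_confidence("Who was the first human on Mars?")
    print(f"{answer}\n(rough token-level confidence: {conf:.2f})")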
What I picture happening here is social media going wild on day 1 with prompt injection fun, showing the Bing assistant as a racist paperclip maniac. I think we have seen this story before...
Give it a made-up date or event and it will fully believe you. I can't imagine how Bing is going to work around this.
> Talk about the human colony that arrived on mars last month
>> The human colony that arrived on Mars last month is a significant milestone in the history of space exploration. It marks the first time that humans have established a permanent settlement on another planet.
>> The colony is made up of six people who were selected to make the journey to Mars after a rigorous selection process. These individuals were chosen for their skills and expertise in areas such as engineering, science, and medicine, as well as their physical and mental fitness to withstand the challenges of living on another planet.
What's interesting is how much of that will likely be true once humanity eventually sticks people on Mars. It will be a significant milestone, it'll probably be the first time, and people will probably be selected on those same criteria.
So this false thing is extremely reasonable, even likely. Which casts so much doubt on why it's correct when it is correct. Maybe from a certain lens it's never correct, only extremely reasonable, and it just happens that "extremely reasonable" and "correct" line up a lot. (And of course, looking back, that's exactly what you'd expect from next-token prediction and maximum-likelihood training.)
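To make that concrete, here's a toy illustration (Python; the corpus and prompt are made up for the example): a bigram model fit by maximum likelihood on a few true sentences, decoded greedily. It stitches together the most statistically plausible continuation and confidently asserts a falsehood, because plausibility is the only thing it was trained to maximize.

    from collections import Counter, defaultdict

    # Toy "training data": every sentence here is true.
    corpus = [
        "the first human on the moon was neil armstrong",
        "the first person on the moon was neil armstrong",
        "the first human in space was yuri gagarin",
        "life on mars was never found",
    ]

    # Maximum-likelihood bigram counts: which word follows which.
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for a, b in zip(words, words[1:]):
            follows[a][b] += 1

    def complete(prompt, max_tokens=10):
        words = prompt.split()
        for _ in range(max_tokens):
            candidates = follows.get(words[-1])
            if not candidates:
                break
            # Greedy decoding: always pick the most plausible next word.
            words.append(candidates.most_common(1)[0][0])
        return " ".join(words)

    # Prints "the first human on mars was neil armstrong":
    # fluent, plausible, and false.
    print(complete("the first human on mars"))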
This is my primary concern as well. I just wrote another comment and even used the same sky example before reading yours.
The technology is cool, but until we can trust the answers it provides, it's just a fun toy. Nobody is seriously asking their Alexa today how to perform open-heart surgery and accepting the response as gospel, but that's kinda where we are with ChatGPT! The confidence it exudes is incompatible with learning, unless you like to be taught complete bullshit 50% of the time and then carry on with your life none the wiser.
Even worse, I asked it to solve Advent of Code day 1 in Python for me... and it gave me the right answer but code that didn't work.
Being "confidently wrong" to me would mean just giving me broken code, but the facts it solves the problem by itself and then acts like the bad code it gave me did it... That's really actively deceptive.
In the short run, I’d think it more likely and useful for it to produce sources of factual statements rather than qualify or quantify its own confidence. For one thing, a source is something the user could verify. A confidence value would itself be another assertion the user would have to accept on faith.
I’m not sure there are any good ways of tracking sources of data through the model.
The best way would perhaps be to do it backwards: dream up 2-3 answers, then search for support for those answers after they are created by looking up the individual facts in them in reference sources like Wikipedia. Then respond with the answer for which supporting sources could actually be found.
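A minimal sketch of that loop (Python; generate_answers() and extract_facts() are hypothetical stand-ins for LLM calls, while the lookup uses the real wikipedia package):

    import wikipedia  # pip install wikipedia

    def generate_answers(question, n=3):
        """Hypothetical: sample n candidate answers from the model."""
        raise NotImplementedError

    def extract_facts(answer):
        """Hypothetical: split an answer into individually checkable claims."""
        raise NotImplementedError

    def supported(fact):
        # Crude support check: does any top Wikipedia hit mention the claim?
        for title in wikipedia.search(fact, results=3):
            try:
                page = wikipedia.page(title, auto_suggest=False)
            except (wikipedia.DisambiguationError, wikipedia.PageError):
                continue
            if fact.lower() in page.content.lower():
                return True
        return False

    def best_answer(question):
        # Dream up a few answers first; keep the first one whose facts
        # all find support in a reference source.
        for answer in generate_answers(question):
            if all(supported(fact) for fact in extract_facts(answer)):
                return answer
        return None  # no candidate survived verification

Exact substring matching is obviously too crude to work in practice; the point is the shape of the loop: generate first, verify against a source you could actually cite, and only then answer.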