
I am interested in the stats error here. Suppose there are 100 balls in a bag, 10 of each rainbow color and 30 blank, and I tell you that 3 of the green balls have property X, while 10 of the balls in the bag have property X in total.

You blindly draw a ball from the bag; do you know anything about the likelihood of that ball having property X? You now inspect it and find it’s green. Do you now know anything different about that likelihood? If I offered you “pay $100 to win $900 if you drew a property X ball, with the option to double your wager after seeing the color”, you would be happy indeed to see that you’d drawn a green ball and would double your wager.

Regardless of the colors, ratios, and desirability of property X, it seems to me like you do know something new about the likelihood after learning the color.

It also seems to hold if there are 300 million balls, 10 balls with property X, and 3 of those are green; you just also have strong cause to believe "this specific ball does not have property X" overall (as in the people examples you gave).
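A quick check of the bag example in Python (numbers from the comment above; I'm reading "pay $100 to win $900" as a $900 payout on the $100 stake, i.e. net +$800 on a win):

```python
# Bag from the comment: 100 balls, 10 green, 10 with property X,
# 3 of the green balls have X.
total, green, x_total, x_green = 100, 10, 10, 3

p_x = x_total / total               # 0.10 before looking
p_x_given_green = x_green / green   # 0.30 after seeing green

# Wager: stake $100, payout $900 on an X ball (net +800 on a win,
# -100 on a loss). Expected value before and after seeing the color:
ev_before = p_x * 800 - (1 - p_x) * 100                         # -10.0
ev_after = p_x_given_green * 800 - (1 - p_x_given_green) * 100  # 170.0
print(ev_before, ev_after)
```

Seeing green flips the expected value from negative to positive, which is why you'd happily double the wager.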



I'm sure there is a better term for it, but it's a kind of unwarranted generalization with respect to population sizes, e.g.:

    - Nearly all murderers are men. 
    - Bob is a man.
    - Therefore Bob is probably a murderer.
There's obviously a problem there. The population of murderers is an infinitesimal fraction of the male population. While you can say it's more likely that any given male is a murderer than any given female, it's not really useful for making judgements about individuals.
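With made-up magnitudes (purely illustrative, not real crime statistics), Bayes' theorem shows how small the posterior stays:

```python
# Illustrative numbers only: say 90% of murderers are men, murderers
# are 1 in 20,000 of the population, and the population is ~50% male.
p_murderer = 1 / 20_000
p_man = p_woman = 0.5

# Bayes: P(murderer | man) = P(man | murderer) * P(murderer) / P(man)
p_murderer_given_man = 0.9 * p_murderer / p_man      # ~9e-05
p_murderer_given_woman = 0.1 * p_murderer / p_woman  # ~1e-05

# A given man is 9x more likely to be a murderer than a given woman,
# yet both probabilities are tiny: useless for judging individuals.
print(p_murderer_given_man, p_murderer_given_woman)
```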


Sure, but that's just a logic error. If you switch it around so we know that Bob is a murderer, then we can say that he's probably a man (and that's actually right).


> - Therefore Bob is probably a murderer.

You are mixing logic and probability.

"It's not really useful for making judgements about individuals" is not the same as "tells you nothing".


The related mistake here is to jump from "not really useful for making judgements about individuals" to "not useful for making judgements about averages". Sometimes you are trying to predict about an individual. Sometimes you are trying to predict a population average.

Sticking with medical examples, the first matters for "does this guy have this disease?" The second matters for "how many doses should we order?"

Or, more controversially, "how many of the doctors we train today will still be working in 30 years?" Which came up as a question regarding the increasing proportion of women in medical school. And many people were loudly offended that this might be something to take into consideration. Because they read it as the first kind of question, "will this individual...". But that's different.


That error can creep into diagnostics though. You can be at a fivefold increased risk for a very rare disease and it's still overwhelmingly unlikely that you have the disease.

Sometimes testing positive on a fairly accurate test for a very rare condition means you are still more likely to not have the condition, and even doctors sometimes screw that logic up.
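To make that concrete (invented numbers, not from any real disease): suppose a baseline rate of 1 in 10,000, a group at 5x that risk, and a test that is 99% sensitive and 99% specific. A quick Bayes calculation:

```python
# Hypothetical numbers: baseline 1/10,000; the "5x risk" group has
# prior 5/10,000; the test is 99% sensitive and 99% specific.
prior = 5 / 10_000
sensitivity, specificity = 0.99, 0.99

# P(positive) = P(pos | sick) P(sick) + P(pos | healthy) P(healthy)
p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)

# Bayes: P(sick | positive)
posterior = sensitivity * prior / p_positive
print(round(posterior, 3))  # 0.047: even at 5x risk with a positive
                            # test, you are ~95% likely to be healthy
```

Almost all the positives come from the huge healthy population's 1% false-positive rate, which is exactly the logic people screw up.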


No, that's the first error. The usual base rate problem. I agree that it's a common mistake, but it's not the one I was trying to point out.

Knowing that this very rare disease affects 1 in 10_000 men, and 1 in 100_000 women, doesn't tell you that much about how to perform diagnosis on any one individual. But it does tell you accurately that the male ward probably needs 10x as many beds as the female ward.
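The population-level half of that is simple arithmetic (the one-million catchment per sex is an assumption for illustration; the rates are from the comment above):

```python
# Rates from the comment: 1 in 10_000 men, 1 in 100_000 women.
# The catchment size of one million per sex is purely illustrative.
men = women = 1_000_000
expected_male_cases = men / 10_000       # 100.0
expected_female_cases = women / 100_000  # 10.0
print(expected_male_cases / expected_female_cases)  # 10.0, hence ~10x the beds
```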


No. Not what I was saying ... in my example the male ward needs no more beds than the female ward because the disease is ridiculously rare. I think we are inadvertently talking past each other.

Statistics are hard. At least for me.

Edit: also I wasn’t trying to respond directly/refute your comment. Just chiming in with a related concept.


OK, maybe the beds are a bad example (maybe most of their occupants would be under observation... or there would be fewer than 1 per hospital). Let's say it's 90% fatal, and an outbreak occurs in a country where men are always buried in black coffins, and women in white.

Then we know what mix of coffins to load on the UN plane. But as a doctor on the ground, after your 99% accurate test comes back, you certainly need more information, and knowing the sex of the patient is not much help. The doctor, and the guy loading the plane, are asking very different questions.


Yeah, it certainly still tells you something, but most people's intuitions about probability mislead them about the amount of information contained. It's like how people can be led to believe they have a very rare disease if they test positive for it. Just because most people who have the disease test positive doesn't mean you can ignore the prior probability that the disease is extremely rare in the first place, so you still likely don't have it. https://en.wikipedia.org/wiki/Base_rate_fallacy


That's the prototypical example used in all Bayes' theorem explanations :D.


>Regardless of the colors, ratios, and desirability of property X, it seems to me like you do know something new about the likelihood after learning the color.

Only if you're given the marginal probability ahead of time ;-). Marginals are usually intractable to compute, and people tend to vastly underestimate them. Priors have a similar issue: base rates for a lot of "interesting" conclusions are often really, really low (which is, in a way, why we'd be interested in estimating the posterior!).

Let's estimate the most controversial thing possible: P(criminal | black guy) = P(black guy | criminal) * P(criminal) / P(black guy).

Problem is, the prior (in all the cases I've checked) is a lot smaller than the marginal (there's usually at least two orders of magnitude difference!), so even a seemingly high likelihood "washes out" in the posterior.
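Abstracting away from the charged example, the washout is just arithmetic; here with invented magnitudes matching the "two orders of magnitude" gap described above:

```python
# Invented magnitudes: a high likelihood, and a prior two orders of
# magnitude below the marginal, as described in the comment.
likelihood = 0.9   # P(evidence | hypothesis)
prior = 0.001      # P(hypothesis)
marginal = 0.1     # P(evidence); prior/marginal = 0.01

posterior = likelihood * prior / marginal
print(posterior)  # ~0.009: the high likelihood washes out
```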


In your examples, the probability of a random green ball having property X is always higher than that of a random ball. If the incidence of X among green balls is higher than the base rate, then learning that a random ball is green makes it more likely to have X.



