
I don't trust that this number proves what you say it does. It's a meta point, but I find myself trusting such "data" less and less.

"Do you have data?" or "Do you have a peer-reviewed paper about it?" is an Internet meme at this point. Bringing some numbers or DOIs is table stakes now, in anything but the dumbest of discussions.

These values - "0.076 per 1000 miles", "13158 miles on average" - have a history behind them. They're pulled out of a set of test drives. But how did the drives look? Was it a collection of similar drives, or a bunch of longer drives with significant disengagement plus a lot of short and easy drives to lower the average? To understand the true meaning of the numbers, you'd need to see (among other things) the distribution of distances covered by test drives.
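
To make that concrete, here's a toy sketch. The numbers are made up by me, not taken from Waymo or any report, and the function names are just for illustration. It builds two hypothetical sets of test drives that give the same aggregate rate while looking nothing alike underneath:

```python
from statistics import median

# Hypothetical test drives as (distance_miles, disengagements) pairs.
# These numbers are invented for illustration, not taken from any report.
similar_drives = [(2000, 1)] * 5
mixed_drives = [(2500, 2), (2500, 3)] + [(100, 0)] * 50


def rate_per_1000_miles(drives):
    """Pooled rate: total disengagements over total distance, times 1000."""
    total_distance = sum(distance for distance, _ in drives)
    total_disengagements = sum(count for _, count in drives)
    return 1000 * total_disengagements / total_distance


def median_distance(drives):
    """Median length of a single drive, in miles."""
    return median(distance for distance, _ in drives)


for name, drives in [("similar", similar_drives), ("mixed", mixed_drives)]:
    print(name, rate_per_1000_miles(drives), median_distance(drives))
```

Both sets come out at 0.5 disengagements per 1000 miles, yet the median drive is 2000 miles in one and 100 miles in the other. In the mixed set, the two long drives alone run at 1.0 per 1000 miles; the fifty short, clean drives cut that in half.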

I wish we had a way to require and embed context with the data, inline with the data. So that when I see "0.076 per 1000 miles", I know it's:

  SELECT 1000.0  -- float literal avoids integer division
       * (SELECT sum(disengagements) FROM drives)
       / (SELECT sum(distance) FROM drives)
And so that I could click around and turn it into e.g. a probability distribution of:

  SELECT distance FROM drives
Or view a cumulative distribution of disengagements per distance:

  SELECT t1.distance, SUM(t2.disengagements)
  FROM (SELECT DISTINCT distance FROM drives) t1  -- one row per distance, so ties aren't double-counted
  INNER JOIN drives t2 ON t1.distance >= t2.distance
  GROUP BY t1.distance
  ORDER BY t1.distance
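
For anyone who wants to poke at these queries, here's a minimal sketch that runs them against an in-memory SQLite database. The `drives` table layout and its five rows are my assumptions, made up for the example. I also join against the distinct distances, so that two drives of the same length don't inflate the running total.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: one row per test drive (hypothetical, not a real data set).
conn.execute("CREATE TABLE drives (distance REAL, disengagements INTEGER)")
conn.executemany(
    "INSERT INTO drives VALUES (?, ?)",
    [(100.0, 0), (100.0, 0), (300.0, 0), (2000.0, 2), (2500.0, 3)],
)

# The point summary: disengagements per 1000 miles over all drives.
rate = conn.execute(
    """
    SELECT 1000.0
         * (SELECT sum(disengagements) FROM drives)
         / (SELECT sum(distance) FROM drives)
    """
).fetchone()[0]

# The raw distances behind it, ready to be plotted as a distribution.
distances = [
    row[0] for row in conn.execute("SELECT distance FROM drives ORDER BY distance")
]

# Running total of disengagements over drives up to each distance.
cumulative = conn.execute(
    """
    SELECT t1.distance, SUM(t2.disengagements)
    FROM (SELECT DISTINCT distance FROM drives) t1
    INNER JOIN drives t2 ON t1.distance >= t2.distance
    GROUP BY t1.distance
    ORDER BY t1.distance
    """
).fetchall()

print(rate)
print(distances)
print(cumulative)
```

With these toy rows the summary is 1.0 per 1000 miles, while the cumulative view shows that every disengagement happened on the two longest drives.
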
Now, the problem is that this data is often not available - people release aggregate summaries instead of full data sets (whether that's the case with Waymo, I don't know - I didn't check, because I don't care about this particular case). But it should be made available, because point summaries of distributions are the easiest way to derive bullshit conclusions (and/or intentionally mislead people). Conversely, I don't trust point summaries unless I either a) review the underlying data (or at least know the shape of its distribution), or b) can trust that the person giving me the summary reviewed the underlying data (or knows the shape of its distribution) and isn't trying to mislead.

So there are two problems here: getting people to release more data, and creating tech infrastructure so that this data can be explored inline (preferably with a nice UX that doesn't require the user to type in SQL).

</end-of-meta-rant>


