
Unless the data is completely random, it's not crazy to say that it can be reconstructed from a reduced version.

If you have a million points that largely fall along a line in 3-dimensional space and you project them into 2 dimensions, you can easily recover the lost dimension, with an error proportional to how far the points deviate from the line. And that error may not even matter, depending on the kind of data and the margins of error you're working with.
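A minimal sketch of that claim, assuming the reduction is a plain PCA (an assumption for illustration; the thread doesn't say which method was actually used). The helper `reconstruct_via_pca`, the noise level, and the line direction are all made up for the example, and the point count is scaled down from a million to keep it quick:

```python
import numpy as np

def reconstruct_via_pca(points, k):
    """Project points onto their top-k principal directions, then map back."""
    mean = points.mean(axis=0)
    centered = points - mean
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]
    reduced = centered @ basis.T        # the k-dimensional representation
    restored = reduced @ basis + mean   # back in the original space
    return reduced, restored

rng = np.random.default_rng(0)
n = 100_000          # hypothetical size, smaller than the million in the comment
noise_scale = 0.01   # hypothetical deviation of the points from the line

# Points scattered tightly around a line through the origin in 3-D.
direction = np.array([1.0, 2.0, 3.0]) / np.sqrt(14.0)
t = rng.uniform(-10.0, 10.0, size=n)
points = t[:, None] * direction + rng.normal(scale=noise_scale, size=(n, 3))

reduced, restored = reconstruct_via_pca(points, k=2)

# Root-mean-square distance between each original point and its reconstruction.
rms_error = np.sqrt(np.mean(np.sum((points - restored) ** 2, axis=1)))
print(f"RMS reconstruction error: {rms_error:.4f} (noise scale {noise_scale})")
```

The reconstruction error ends up on the order of the noise, not on the order of the spread along the line, which is the "losses relative to the deviation" part. Note that the recovery relies on keeping the basis and the mean alongside the reduced coordinates.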



This is actually a nice illustration of the central problem with this argument: the more personally identifiable a piece of information is, the less recoverable it'll be, and vice versa. If all of the data points lie on a line in some n-dimensional space, then obviously all of them can easily be recovered, but knowing all those things about a person doesn't actually tell you any more about them than knowing just one of those things. Conversely, if the data points are very random, then it'll only take a handful of them to uniquely identify a person and find the entry in the original data set with all their other information, but dimensionality reduction will have to throw that data away. You simply won't be able to recover that information from the model. (We actually know from the literature on de-anonymization that a lot of data falls into the second category.)
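To make the "handful of points" half of that concrete, here is a toy sketch with entirely synthetic records: every attribute is drawn independently and uniformly at random, which is an assumption for illustration and not a model of any real data set. The helper `fraction_unique` and all the sizes are hypothetical:

```python
import random
from collections import Counter

def fraction_unique(records, k):
    """Fraction of records whose first k attributes match no other record."""
    counts = Counter(rec[:k] for rec in records)
    return sum(1 for rec in records if counts[rec[:k]] == 1) / len(records)

random.seed(0)
n_people, n_attrs, n_values = 10_000, 12, 10  # hypothetical sizes

# Each "person" is a tuple of independent, uniformly random attribute values.
records = [
    tuple(random.randrange(n_values) for _ in range(n_attrs))
    for _ in range(n_people)
]

uniqueness = {k: fraction_unique(records, k) for k in (1, 2, 4, 6, 8)}
for k, frac in uniqueness.items():
    print(f"{k} known attributes -> {frac:.1%} of people uniquely identified")
```

With uncorrelated attributes like these, a few known values are enough to single out almost everyone, and by construction there is no structure that a lower-dimensional model could use to predict the remaining attributes.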


Except that that toy example bears no resemblance to the actual situation.


How many dimensions were they working with, and how much variance and correlation were there in the features? What's the margin of error for the end product?


I don't know precisely, but it's pretty obvious that there'd be no way to reconstruct personally identifying info from it.



