
I don't totally understand this comment. Random data can get you more scale than production data, in that it can just be made up. All the load and E2E testing can be done with test data, no problem.

This idea of data being statistically significant has come up, but that's also easy to replicate with random data once you know the distributions of the data. In practice, those distributions rarely change, especially for demographic data. I don't think I've seen a case where this has been a problem, and I'd be interested to learn about one.
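To make that concrete, here's a rough sketch of what I mean by replicating a distribution. It assumes a single categorical column; the `observed` list and both function names are made up for illustration, not taken from any particular tool.

```python
import random
from collections import Counter

def fit_categorical(values):
    """Estimate category frequencies from an observed column."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}

def sample_categorical(dist, n, seed=None):
    """Draw n synthetic values with the same category frequencies."""
    rng = random.Random(seed)
    categories = list(dist)
    weights = [dist[c] for c in categories]
    return rng.choices(categories, weights=weights, k=n)

# Hypothetical stand-in for a small production column.
observed = ["US"] * 60 + ["CA"] * 25 + ["MX"] * 15

dist = fit_categorical(observed)
# Scale 100 observed rows up to 100,000 synthetic rows.
synthetic = sample_categorical(dist, 100_000, seed=42)
```

This only preserves the marginal frequencies of one column. Correlations between columns would need more than this.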



The ideal scenario is that you're able to augment your existing data with more data that looks just like it. The matter of statistical significance really depends on the use case. For load testing, it's probably not as important as it is for something like feature testing, debugging, or analytical queries.

Even if you know the distribution of the data (which imo can be fairly difficult to work out), replicating it can also be tricky. If you know that a gender column is split 30/70 male to female, how do you create names for the 30% who are male? How about the female names? Is every name distinct, or do you repeat names? Does it matter? In some cases it does and in others it doesn't.
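Here's a minimal sketch of the kind of choices involved, assuming an exact 30/70 split and two tiny, made-up name pools. `make_people`, `unique_names`, and the name lists are all hypothetical, just to show where the decisions show up.

```python
import random

# Hypothetical name pools; a real generator would use far larger lists.
MALE_NAMES = ["James", "Omar", "Wei", "Mateo"]
FEMALE_NAMES = ["Maria", "Aisha", "Yuki", "Chloe"]

def make_people(n, male_share=0.30, unique_names=False, seed=None):
    """Build n rows with an exact gender split and gender-matched names."""
    rng = random.Random(seed)
    n_male = round(n * male_share)
    genders = ["male"] * n_male + ["female"] * (n - n_male)
    rng.shuffle(genders)

    pools = {"male": MALE_NAMES, "female": FEMALE_NAMES}
    used = {"male": 0, "female": 0}
    rows = []
    for gender in genders:
        pool = pools[gender]
        if unique_names:
            # Cycle through the pool and add a counter suffix so no name repeats.
            index = used[gender]
            name = f"{pool[index % len(pool)]}-{index}"
            used[gender] += 1
        else:
            # Repeats allowed: draw from the pool at random.
            name = rng.choice(pool)
        rows.append({"gender": gender, "name": name})
    return rows

repeated = make_people(1000, seed=1)
distinct = make_people(1000, unique_names=True, seed=1)
```

Repeated names are usually fine for load testing. If the name column feeds a unique index or a join key, you need the distinct variant, and then the names stop looking realistic.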

What we've seen is that it's really use-case specific. There are some tools that can help, but there isn't a complete tool set. That's what we're trying to build over time.



