
I haven't read the text, but data drift refers to how, after you deploy a machine learning model, the input data changes over time into something the model was never trained or tested on. For instance, say you build a gradient boosting forecasting model that does a great job of predicting tomorrow's earnings. At training time, the earnings might be in the $1000-per-day range. A year later, they might be in the $100k range. The model has never seen numbers that high, so it doesn't handle them well. That is data drift.
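This bites tree-based models in particular, because their predictions are averages of training targets in the leaves, so they can't extrapolate past the range they were fit on. A minimal sketch (the numbers are made up for illustration):

    # A gradient boosting model can't extrapolate beyond the
    # target range it saw during training.
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    # Training period: earnings around $1000/day, slowly rising.
    X_train = np.arange(365).reshape(-1, 1)   # day index
    y_train = 1000.0 + 2.0 * X_train.ravel()  # roughly $1000-$1730/day

    model = GradientBoostingRegressor().fit(X_train, y_train)

    # A year later, true earnings are ~$100k/day, but the prediction
    # stays capped near the maximum seen in training.
    print(model.predict([[730]]))  # ~1700, nowhere near 100000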


Right. Can you share how such issues are handled in the ML pipeline?


The most common solution is to retrain frequently on the latest data. A forecasting model might retrain every week on a window that includes the last week's data, and might even drop older data, for instance anything older than a year.
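A minimal sketch of that rolling-window retrain, assuming a pandas DataFrame indexed by date and a hypothetical train_model() routine of your own:

    # Weekly job: refit on a one-year rolling window of the data.
    import pandas as pd

    def retrain(df: pd.DataFrame, today: pd.Timestamp):
        # Drop training data older than a year, keep the rest.
        window = df[df.index >= today - pd.Timedelta(days=365)]
        return train_model(window)  # train_model is your own fit routine

Run it from whatever scheduler you already have (cron, Airflow, etc.) so the model always reflects recent data.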

It's best to transform your target variable, like "number of orders", into something like "number of orders per customer per day". Then, in your pipeline, you feed in the latest estimate of your number of customers (e.g. the average over the last two weeks). That's way more robust over time.
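Something like this, as a sketch (column names are just placeholders, and the per-customer forecast is whatever your model outputs):

    import pandas as pd

    df = pd.DataFrame({
        "orders":    [5000, 5200, 11000],
        "customers": [1000, 1040,  2200],
    })

    # Train on a scale-free target instead of raw order counts.
    df["orders_per_customer"] = df["orders"] / df["customers"]

    # At prediction time, scale back up using a recent estimate of
    # the customer count, e.g. the average of the last two weeks.
    recent_customers = df["customers"].tail(14).mean()
    forecast_per_customer = 5.1  # whatever the model predicts
    forecast_orders = forecast_per_customer * recent_customers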


Makes sense. We need to continuously monitor the performance of the deployed model in the field against our preexisting statistical knowledge of the data, and schedule regular "model updates" accordingly.
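One way to automate that monitoring is a two-sample test comparing live inputs against the training distribution. A minimal sketch with SciPy (the alert threshold is arbitrary):

    # Flag drift in one input feature with a two-sample KS test.
    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)
    train_feature = rng.normal(1000, 100, size=5000)  # training-time values
    live_feature = rng.normal(1500, 100, size=500)    # recent production values

    stat, p_value = ks_2samp(train_feature, live_feature)
    if p_value < 0.01:  # arbitrary alert threshold
        print("drift detected: schedule a retrain")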



