
Finally! Somebody who is actually talking about the contents :-)

Could you clarify a little bit what is meant by "Concept/Data Drift"? Any examples/links you can point us to? Wikipedia (https://en.wikipedia.org/wiki/Concept_drift) describes it, but without a specific example to walk through I am not really "getting" it.



It is mentioned in the text in a couple of places:

https://ppml.dev/design-code.html#data-debt

https://ppml.dev/troubleshooting-code.html#troubleshooting-d...

Probably a very oversimplified example below, because data doesn't usually drift in this obvious way; it's usually more subtle and happens over a longer period.

My model is learning on my business data of orders over time.

People keep ordering every day, but usually in small amounts.

But today we got a new customer, and they put in monthly orders which are 1000x larger than all the others combined.

They are going to keep making orders for the next year or so, at which point they stop ordering from us.

Two data drift “episodes” here:

1. When we get the new customer. We've now got an outlier; they aren't like all the other customers. How will the model react to this when being trained? Will it skew the output? Do we exclude the new customer from the training data? Or do we change the model to account for them?

2. When that customer stops ordering after a year. Now the outlier is gone, but maybe we changed some model settings and tweaked it a bit to account for it. Now we need to account for that customer not being around anymore.
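To make episode 1 concrete, here's a minimal sketch of how a single outlier customer can blow up the summary statistics a model might rely on (all numbers made up):

    import numpy as np

    rng = np.random.default_rng(0)
    regular = rng.normal(loc=50, scale=10, size=100)   # ~100 small daily orders
    print(regular.mean(), regular.std())               # roughly 50 and 10

    # one new customer whose order is 1000x all the others combined
    with_outlier = np.append(regular, 1000 * regular.sum())
    print(with_outlier.mean(), with_outlier.std())     # both explode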

Data drift is a big PITA.


I got the obvious way. What I was asking about is how you identify drift in the data in the first place. The model has been deployed after the training/test datasets passed. Presumably, with drift in the input, the model's predictions will not be "good" anymore. How do you disambiguate this case from the model itself being wrong for other reasons?


It's a continuous process of checking that the training data is from the same "distribution", usually through automated pipelines running against the ingested training data (i.e. once you've got the new data fully processed and ready for training, but prior to actually training the model).

In the pipelines you do some checks for statistical outliers/differences: check the current training data against historical versions of the training set, and if anything goes beyond some specified tolerances, highlight it for manual testing/checks.

Using the toy example from before, something like checking the sum of orders per customer in a month compared to the last N months. If the maximum per-customer order sum this month is 100x higher than in any previous month, then something has significantly changed in the data; it may affect training, and we need to investigate.
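Roughly, that pipeline check might look like this (a sketch: the column names, the 12-month window, and the 100x threshold are all placeholders):

    import pandas as pd

    def check_order_drift(orders: pd.DataFrame,
                          n_months: int = 12,
                          factor: float = 100.0) -> bool:
        # flag the current month if its max per-customer order sum is
        # `factor` times higher than in any of the previous `n_months`
        monthly_max = (
            orders.groupby([orders["date"].dt.to_period("M"), "customer_id"])["amount"]
                  .sum()                     # per-customer order sum per month
                  .groupby(level=0).max()    # max per-customer sum per month
                  .sort_index()
        )
        current = monthly_max.iloc[-1]
        history = monthly_max.iloc[-(n_months + 1):-1]
        return bool(current > factor * history.max())  # True => investigate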

If you've identified some statistical changes/differences, that's usually where someone needs to investigate in more depth. Train a dev model on the brand-new training data and pass multiple unseen test datasets through it. What happens? (See the sketch after this list.)

* Is global test accuracy up or down?

* Is robustness affected?

* Is the accuracy degrading for specific classes?

* How does this compare to drifts we've seen before?
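A sketch of that kind of check, using synthetic data as a stand-in for the new training set and an unseen test set (everything here is illustrative):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    # stand-ins for the new training data and an unseen test set
    X, y = make_classification(n_samples=2000, n_classes=3,
                               n_informative=5, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    dev_model = GradientBoostingClassifier().fit(X_train, y_train)

    # global accuracy plus per-class precision/recall, to spot classes
    # that degrade quietly even when the overall number looks fine
    print(classification_report(y_test, dev_model.predict(X_test)))

You'd run the same report for the currently deployed model on the same test sets and compare.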

Then you make decisions about whether you need to:

* exclude parts of the new training data?

* tweak some model hyperparameters?

* tweak the architecture of the model?

There's no single right answer on what to do at this point. This is the difficult and expensive bit of machine learning. It requires a lot of continuous experimentation even after you've got something running initially.


Nice. It is these sorts of issues that made me realize that ML Engineering/MLOps is a very different kind of beast, where statistics and the coupling of input data to the model play a very significant part. Awareness of the data domain is vital.


I haven't read the text, but data drift refers to how, after you deploy a machine learning model, the input data changes over time into something the model wasn't tested on. For instance, say you create a gradient-boosting forecasting model that does a great job of predicting tomorrow's earnings. At the time of training, the earnings might be in the $1000-per-day range, but a year later they might be in the $100k range. The model has never seen numbers this high before, so it doesn't know how to handle them well. That is data drift.
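You can see this failure mode directly with tree-based models, which can't produce predictions outside the target range they were trained on. A toy sketch (made-up data):

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    # train on a year of earnings in the ~$1000-3000/day range
    X_train = np.arange(365).reshape(-1, 1)   # day number
    y_train = 1000 + 5 * np.arange(365)       # slowly growing earnings

    model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

    # a year later, ask about inputs the model has never seen
    print(model.predict(np.array([[400], [700]])))
    # predictions stay near the training maximum (~2800): trees can't
    # extrapolate beyond the range they saw during training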


Right. Can you share how such issues are handled in the ML pipeline?


The most common solution is to frequently retrain on the latest data. A forecasting model might retrain every week, including the last week's data, and might even drop older data, for instance training data older than a year.
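The rolling-window part is often just a date filter in front of the training step; a sketch, assuming a pandas DataFrame with a "date" column:

    import pandas as pd

    def training_window(orders: pd.DataFrame, days: int = 365) -> pd.DataFrame:
        # keep only the most recent `days` of data for (re)training
        cutoff = orders["date"].max() - pd.Timedelta(days=days)
        return orders[orders["date"] >= cutoff]

    # run weekly from a scheduler:
    # recent = training_window(orders)  # then build features and retrain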

It's best to transform your target variables, like "number of orders", to "number of orders per customer per day" or something like that. Then, in your pipeline, you feed in the latest estimate of your number of customers (e.g. the average of the last two weeks). That's way more robust over time.
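Roughly, that transformation might look like this (assuming a hypothetical "orders" DataFrame with date, order_id and customer_id columns; the model and its features are left out):

    import pandas as pd

    # per-day order counts and distinct-customer counts
    daily = orders.groupby(orders["date"].dt.date).agg(
        n_orders=("order_id", "count"),
        n_customers=("customer_id", "nunique"),
    )
    # train on the per-customer rate instead of the raw total
    daily["orders_per_customer"] = daily["n_orders"] / daily["n_customers"]

    # at prediction time, scale the per-customer forecast back up with a
    # recent customer-count estimate (average of the last two weeks)
    recent_customers = daily["n_customers"].tail(14).mean()
    # forecast_total = model.predict(features) * recent_customers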


Makes sense. We need to continuously monitor the performance of the model deployed in the field, using our preexisting statistical knowledge of the data, and then schedule regular "model updates" accordingly.



