Launch HN: Dioptra (YC W22) – Improve ML models by improving their training data
55 points by farahg on June 21, 2022 | 19 comments
Hi HN! We're Pierre, Jacques, and Farah from Dioptra (https://dioptra.ai). Dioptra tracks ML metrics to identify model error patterns and suggest the best data curation strategy to fix them.

We've seen a paradigm shift in ML in recent years: the "code" has become a commodity, since many powerful ML models are open source today. The real challenge is to grow and curate quality data. This raises the need for new data-centric tools: IDEs, debuggers, monitoring. Dioptra is a data-centric tool that helps debug models and fix them by systematically curating and growing the best data, at scale.

We experienced this problem first-hand while deploying and retraining models. Once a model was in production, maintenance was a huge pain. First, it was hard to assess model performance. Accessing the right production data for diagnosis was complicated. We had to build custom scripts to connect to DBs, download production data (Compliance, look the other way!) and analyze it.

Second, it was hard to translate the diagnosis into concrete next steps: finding the best data to fix and retrain the model. It required another set of scripts to sample new data, label it and retrain. With a large enough labeling budget, we were able to improve our models, but it wasn't optimal: labeling is expensive, and random data sampling doesn't yield the best results. And since the process relied on our individual domain expertise (aka gut feelings), it was inconsistent from one data scientist to the next and not scalable.

We talked to a couple hundred ML practitioners who helped us validate and refine our thinking (we thank every single one of them!). For example, one NLP team had to read more than 10 long legal contracts per week per person. The goal was to track any model errors. Once a month, they synthesized an Excel sheet to detect patterns of errors. Once detected, they had to read more contracts to build their retraining dataset! There were multiple issues with that process. First, the assessment of errors was subjective, since it depended on individual interpretations of the legal language. Second, the sourcing of retraining data was time-consuming and anecdotal. Finally, they had to spend a lot of time coaching new members to minimize subjectivity.

Processes like this highlight how model improvement needs to be less anecdotal and more systematic. A related problem is lack of tooling, which puts a huge strain on ML teams that are constantly asked to innovate and take on new projects.

Dioptra computes a comprehensive set of metrics to give ML teams a full view of their model and detect failure modes. Teams can objectively prioritize their efforts based on the impact of each error pattern. They can also slice and dice to root-cause errors, zero in on faulty data, and visualize it. What used to take days of reading can now be done in a couple hours. Teams can then quality check and curate the best data for retraining using our embedding similarity search or active learning techniques. They can easily understand, customize and systematically engineer their data curation strategy with our automation APIs in order to get the best model at each iteration and stay on top of the latest production patterns. Additionally, Dioptra fits within any ML stack. We have native integrations with major deep learning frameworks.
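To give a flavor of what embedding similarity search for curation can look like, here is a simplified sketch (not our actual API; the embedding arrays, the failure example and the labeling budget are placeholders), using scikit-learn to mine the unlabeled production samples closest to a known failure case:

    # Simplified sketch of embedding similarity search for data curation.
    # Placeholders: `pool_embeddings` are embeddings of unlabeled production
    # samples, `failure_embedding` is the embedding of a known failure case.
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def mine_similar(pool_embeddings: np.ndarray,
                     failure_embedding: np.ndarray,
                     budget: int = 100) -> np.ndarray:
        """Return indices of the `budget` pool samples closest to the failure case."""
        nn = NearestNeighbors(n_neighbors=budget, metric="cosine")
        nn.fit(pool_embeddings)
        _, indices = nn.kneighbors(failure_embedding.reshape(1, -1))
        return indices[0]  # candidates to send for labeling / retraining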

Some of our customers reduced their data ops costs by 30%. Others improved their model accuracy by 20% in one retraining cycle thanks to Dioptra.

Active learning, which has been around for a while but was relatively obscure until recently, makes intentional retraining possible. This approach has been validated by ML organizations like Tesla, Cruise and Waymo. Recently, other companies like Pinterest have started building similar infrastructure. However, it is costly to build and requires specialized skills. We want to make it accessible to everybody.

We created an interactive demo for HN: https://capture.navattic.com/cl4hciffr2881909mv2qrlsc9g

Please share any feedback and thoughts. Thanks for reading!



This looks like a dashboard version of my own Python scripts that I use to evaluate training data.

*reads more into the blurb*

Use our APIs to improve your models! Blech, not interested anymore. I can't hand a 3rd-party server access to our data for legal reasons. Your startup can't afford to jump through the hoops to sell to my employer either, not that I would even recommend the app right now without trialing it myself.

I need something I can afford to purchase individually through my manager's discretionary budget (or myself) and run on our own servers. Most ML startups are SaaS and fail this test for me.


I don't know how tight your legal restrictions are, but we work from metadata only. We don't need your text / images / audio. We just need their embeddings and a few other things. And we are working on a self-hosted version as well. Out of curiosity, what would you expect in terms of pricing?


I don't know how much I would pay because I have not researched your product, sorry for being unhelpful. Just to throw a useless number out, if there was a standalone version licensed for individual use I would impulse pay $100 to run it on my workstation and give it a try. Because while I do have scripts to automate a lot of my data analysis it's always nice to have a turnkey solution that automates it even more, and there is the promise that next year I could buy a new version that stays up to date with the integrations or adds more features. I can't imagine that is enough money to fund your development, but... there is no chance anyone in our org will jump through the hoops to adopt and purchase this as part of your enterprise sales when we already have tooling that does what you do, just not as polished. This is code we own and can extend ourselves, which I'm guessing is not something we can do with a SaaS offering.

I'm afraid no information at all about our data or models can go to a 3rd-party server. Practically speaking, everything has to be able to run on firewalled servers that are only accessible on the company network. Exceptions would have to go through the legal department and at least 6 months of meetings where you would be repeatedly told no, unless it is vital to the product, which a dashboard isn't, even if it could theoretically be allowed.


Thanks for sharing


(Not OP) Out of curiosity, how much is your manager's discretionary budget?

I feel like any own-infrastructure per-seat license is going to be way beyond that. Maybe as a marketplace app. [1]

1. https://aws.amazon.com/marketplace/


I agree. I'm not sure how it is at other companies. At both my current company and my previous one, line managers had a misc budget of $2000 before needing oversight, with bigger numbers as you went up the leadership chain, but I never interacted with those types frequently enough to ask.

This was for one-time purchases only. Recurring charges, i.e. SaaS offerings, would need to go up to the division head for approval, which effectively means no unless it was essential stuff like Jira/Bitbucket, Office, etc.


I feel like your starting assumption already diverges from my world.

> “code” has become a commodity: many powerful ML models are open source today. The real challenge is to grow and curate quality data.

The main recent improvements in translation and speech recognition were all new mathematical methods that enable us to use uncurated, random data and still get good results. CTC loss allows using unaligned text as ground truth. wav2vec allows using raw audio without transcripts for pre-training. OPUS is basically a processing pipeline for generating bilingual corpora. Word embeddings allow using random monolingual text for pre-training. We've also seen a lot of one-shot and zero-shot methods. Plus, XLS-R was all about transfer learning to reuse knowledge from cheap, abundant data for resource-constrained environments.
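To make the CTC point concrete, here is a tiny PyTorch sketch (shapes and sizes are made up): the loss only takes the unaligned target sequence, and the frame-level alignment is marginalized out inside the loss rather than provided as ground truth.

    # CTC loss needs only the target token sequence, not a frame-by-frame
    # alignment between audio and text. Shapes below are made up.
    import torch
    import torch.nn as nn

    T, N, C = 50, 4, 28  # time steps, batch size, classes (class 0 = blank)
    log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=2)
    targets = torch.randint(1, C, (N, 10), dtype=torch.long)  # unaligned transcripts
    input_lengths = torch.full((N,), T, dtype=torch.long)
    target_lengths = torch.full((N,), 10, dtype=torch.long)

    loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
    loss.backward()  # alignment is never specified by the annotator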

My prediction for the future of ML would be that we'll soon need so little training data that a single person can label it on a weekend.

On the other hand, I know first-hand that almost nobody can use the "open source" ML model and deploy it effectively in production. Doing so requires sparsity, quantization, conversions, and in many cases l33t C++ skillz to implement optimized fused SSE ops so that a model will run with decent speed on cheap CPU hardware rather than mandating expensive GPU servers.


I don't think our assumptions are so far apart. The methods you mentioned made it from research to the open source community fairly quickly. In fact, most companies rely on this kind of open research to develop their models. In a lot of use cases, it has become more about finding the right data than improving the model code. (I like Andrew Ng's thoughts on this: https://datacentricai.org/) At the same time, there are still a lot of unsolved engineering challenges with the code when it comes to productionizing models, especially for real-time speech transcription.

And we agree with your prediction. That's why we started Dioptra: to come up with a systematic way to curate high quality data so you can annotate just the data that matters.


How do you compare to Aquarium, HumanLoop, Cord.Tech, Lightly, https://prodi.gy/, and probably SnorkelAI (who is probably considered the OG with their paper https://arxiv.org/abs/1711.10160)?

P.S. Not trying to compare the product or the company/team, but I was hoping for a more technical understanding. The only reason I even mentioned those names is that they were all launched here on HN.

I generally term this space "algorithmic labeling", and there are many approaches here: https://paperswithcode.com/paper/machine-learning-algorithms... . Are there any kinds of algorithms or domains that work well for you... or don't work well?


Thanks for the question and the papers. Like some of those companies, we are believers in the data-centric approach to ML. But labeling is not our focus (unlike Snorkel, HumanLoop, Prodigy or Cord.Tech). We focus on diagnosing models and mining the best data to improve them. So there are more similarities with Aquarium or Lightly.

There is a great talk from Tesla [1] on what they call the "Data Engine" (which probably inspired some of us :)). One of the things we took from it was that in order to truly close the loop on the ML data flywheel, we needed to turn production into a reliable data source. It had to become accessible, understandable and minable. To achieve this we took the approach of combining ML observability with active learning mining frameworks. Combining both is important in our view because observability tells you how the model behaves in the real world and active learning finds the right samples to fix / improve the model on real-world data. They go hand in hand.

Technically, it means that we integrate with serving and labeling platforms. We ingest data both in streaming and batch. We can mine on production streams, including on device (for IoT use cases where accessing data is a challenge). We have an extensive set of metrics to understand model behavior in the wild and solve use cases like data drift (detecting it, triggering mining and sending the data for labeling/retraining). And we are geared toward automation.

Regarding which data sampling works well and which doesn't, we found that it's not one-size-fits-all. Combining uncertainty sampling and diversity sampling is very powerful in a lot of use cases and can sometimes match random sampling with 10x less data. But model-based sampling strategies can also underperform on drifted datasets (essentially, a model can be very confidently wrong on a new kind of sample), hence the need to also have similarity sampling techniques.
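As a rough illustration of combining the two (simplified, not our actual implementation; the `probs` and `embeddings` arrays are placeholders), one common recipe is to keep the highest-entropy samples and then pick a diverse subset of them by clustering in embedding space:

    # Rough sketch: uncertainty sampling (entropy) followed by diversity
    # sampling (k-means in embedding space). `probs` are model softmax
    # outputs and `embeddings` are sample embeddings; both are placeholders.
    import numpy as np
    from sklearn.cluster import KMeans

    def select_batch(probs: np.ndarray, embeddings: np.ndarray,
                     budget: int = 100, pool_factor: int = 5) -> np.ndarray:
        # 1) Uncertainty: keep the pool_factor * budget highest-entropy samples.
        entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
        candidates = np.argsort(entropy)[-budget * pool_factor:]

        # 2) Diversity: cluster the candidates and take the sample closest to
        #    each cluster center so the batch covers distinct regions.
        km = KMeans(n_clusters=budget, n_init=10).fit(embeddings[candidates])
        chosen = []
        for center in km.cluster_centers_:
            dists = np.linalg.norm(embeddings[candidates] - center, axis=1)
            chosen.append(candidates[int(np.argmin(dists))])
        return np.unique(np.array(chosen))  # indices to send for labeling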

Overall, we were able to show that we can intentionally drive specific model performance metrics, either globally or locally, by picking one technique vs another. Happy to share more if you want.

[1] https://www.youtube.com/watch?v=Ucp0TTmvqOE&t=7714s


>Technically, it means that we integrate with serving and labeling platforms. We ingest data both in streaming and batch. We can mine on production streams, including on device (for IoT use cases where accessing data is a challenge).

Hold on, are you saying you look at production serving data... and are able to determine what the problem in the training data was that caused it? That is pretty cool.


Yes, that's correct. We integrate with the major ML frameworks to monitor serving data and compare it to the training data to identify potential error patterns and mine the live stream for data to fix them. I'd love to show you the product in more detail and get your feedback if you're open to it!


Hi. I don't use this kind of modeling today (I used to in my previous product).

But the reason I asked is that this is a fantastic feature and differentiator. I wonder why you don't put that claim up as the hero on your website? I don't see a reason why anyone would NOT use something like this.


We'd been trying to implement an active-learning retraining loop for Koko's critical NLP models but had never found the time to prioritize the work, as it was a multi-sprint level of effort. We've been working with them for a few weeks and we are seeing meaningful performance improvements with our models. I highly recommend trying them out.


For many domains, active learning is actually not that efficient. The promise is that you label a subset of the data and train the model on it with the same accuracy. The reality is that in order to estimate the long tail properly, you need all the data points in the training set, not just a subset.

Consider a simple language model case. In order to learn some specific phrases, you need to see them in training, and the phrases of interest are rare (usually 1-2 cases per terabyte of data). You simply cannot select just half.

Semi-supervised learning and self-supervised learning are more reasonable and widely used. You still consider all the data for training; you just don't annotate it manually.
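To put a rough number on the long-tail point (assuming uniform random subsampling): a phrase that appears only k times in the corpus survives a subsample of rate p with probability 1 - (1 - p)^k, so rare phrases are easily lost entirely.

    # Back-of-the-envelope for the long-tail argument: probability that a
    # phrase appearing k times survives uniform random subsampling at rate p.
    def survival_probability(k: int, p: float) -> float:
        return 1.0 - (1.0 - p) ** k

    print(survival_probability(k=2, p=0.5))  # 0.75 -> 25% chance it vanishes
    print(survival_probability(k=2, p=0.1))  # 0.19 -> almost surely dropped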


You are right. Being able to learn good feature representations through SSL is very powerful. We leverage such representations to perform tasks like semantic search to tackle problems like long-tail sampling. We have seen pretty good results mining for edge cases. Let me know if you'd like to chat about it.


This is an interesting problem to solve. For the sake of better understanding, can OP or someone else here suggest research papers or code that describe similar approaches for detecting and removing outlier data by analyzing the embedding space?


In the example shown in the product tour, we use an approach based on diversity sampling. Basically, we look for the data points that are most representative of the drifted domain (and therefore outliers relative to the training domain). Here is a blog post that summarizes some of those techniques [1]. I also found a paper that describes a similar approach [2]. Happy to chat more about it.

[1]: https://towardsdatascience.com/https-towardsdatascience-com-... [2]: https://arxiv.org/abs/1904.03122
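As a minimal illustration of the outlier idea (not our production code; the training and production embedding arrays are placeholders), a simple baseline is to score each production sample by its mean distance to its nearest neighbors among the training embeddings and flag the largest scores:

    # Simple embedding-space outlier baseline: score production samples by
    # their mean distance to the k nearest training embeddings; large scores
    # suggest samples outside the training domain. Inputs are placeholders.
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def knn_outlier_scores(train_emb: np.ndarray, prod_emb: np.ndarray,
                           k: int = 10) -> np.ndarray:
        nn = NearestNeighbors(n_neighbors=k).fit(train_emb)
        distances, _ = nn.kneighbors(prod_emb)
        return distances.mean(axis=1)

    # Example: flag the 1% most out-of-domain production samples.
    # scores = knn_outlier_scores(train_emb, prod_emb)
    # outliers = np.argsort(scores)[-int(0.01 * len(scores)):]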


Thank you. Will take a closer look and reach out.



