Launch HN: Dioptra (YC W22) – Improve ML models by improving their training data
55 points by farahg on June 21, 2022 | 19 comments
Hi HN! We're Pierre, Jacques, and Farah from Dioptra (https://dioptra.ai). Dioptra tracks ML metrics to identify model error patterns and suggest the best data curation strategy to fix them.

We've seen a paradigm shift in ML in recent years: the "code" has become a commodity, since many powerful ML models are open source today. The real challenge is to grow and curate quality data. This raises the need for new data-centric tools: IDEs, debuggers, monitoring. Dioptra is a data-centric tool that helps debug models and fix them by systematically curating and growing the best data, at scale.

We experienced this problem first-hand while deploying and retraining models. Once a model was in production, maintenance was a huge pain. First, it was hard to assess model performance. Accessing the right production data for diagnosis was complicated. We had to build custom scripts to connect to DBs, download production data (Compliance, look the other way!) and analyze it.

Second, it was hard to translate the diagnosis into concrete next steps: finding the best data to fix and retrain the model. It required another set of scripts to sample new data, label it and retrain. With a large enough labeling budget, we were able to improve our models, but it wasn't optimal: labeling is expensive, and random data sampling doesn't yield the best results. And since the process relied on our individual domain expertise (aka gut feelings), it was inconsistent from one data scientist to the next and not scalable.

We talked to a couple hundred ML practitioners who helped us validate and refine our thinking (we thank every single one of them!). For example, one NLP team had to read more than 10 long legal contracts per week per person. The goal was to track any model errors. Once a month, they synthesized an Excel sheet to detect patterns of errors. Once detected, they had to read more contracts to build their retraining dataset! There were multiple issues with that process. First, the assessment of errors was subjective, since it depended on individual interpretations of the legal language. Second, the sourcing of retraining data was time-consuming and anecdotal. Finally, they had to spend a lot of time coaching new members to minimize subjectivity.

Processes like this highlight how model improvement needs to be less anecdotal and more systematic. A related problem is lack of tooling, which puts a huge strain on ML teams that are constantly asked to innovate and take on new projects.

Dioptra computes a comprehensive set of metrics to give ML teams a full view of their model and detect failure modes. Teams can objectively prioritize their efforts based on the impact of each error pattern. They can also slice and dice to root-cause errors, zero in on faulty data, and visualize it. What used to take days of reading can now be done in a couple hours. Teams can then quality check and curate the best data for retraining using our embedding similarity search or active learning techniques. They can easily understand, customize and systematically engineer their data curation strategy with our automation APIs in order to get the best model at each iteration and stay on top of the latest production patterns. Additionally, Dioptra fits within any ML stack. We have native integrations with major deep learning frameworks.
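To give a flavor of what embedding similarity search for curation can look like, here is a simplified sketch (not our actual API; the embedding arrays, the failure example and the labeling budget are placeholders), using scikit-learn to mine the unlabeled production samples closest to a known failure case:

    # Simplified sketch of embedding similarity search for data curation.
    # Placeholders: `pool_embeddings` are embeddings of unlabeled production
    # samples, `failure_embedding` is the embedding of a known failure case.
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def mine_similar(pool_embeddings: np.ndarray,
                     failure_embedding: np.ndarray,
                     budget: int = 100) -> np.ndarray:
        """Return indices of the `budget` pool samples closest to the failure case."""
        nn = NearestNeighbors(n_neighbors=budget, metric="cosine")
        nn.fit(pool_embeddings)
        _, indices = nn.kneighbors(failure_embedding.reshape(1, -1))
        return indices[0]  # candidates to send for labeling / retraining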

Some of our customers reduced their data ops costs by 30%. Others improved their model accuracy by 20% in one retraining cycle thanks to Dioptra.

Active learning, which has been around for a while but was relatively obscure until recently, makes intentional retraining possible. This approach has been validated by ML organizations like Tesla, Cruise and Waymo. Recently, other companies like Pinterest have started building similar infrastructure. However, it is costly to build and requires specialized skills. We want to make it accessible to everybody.

We created an interactive demo for HN: https://capture.navattic.com/cl4hciffr2881909mv2qrlsc9g

Please share any feedback and thoughts. Thanks for reading!



This looks like a dashboard version of my own Python scripts that I use to evaluate training data.

*reads more into the blurb*

Use our APIs to improve your models! Blech, not interested anymore. I can't hand a 3rd-party server access to our data for legal reasons. Your startup can't afford to jump through the hoops to sell to my employer either, not that I would even recommend the app right now without trialing it myself.

I need something I can afford to purchase individually through my manager's discretionary budget (or myself) and run on our own servers. Most ML startups are SaaS and fail this test for me.


I don't know how tight your legal restrictions are, but we work from metadata only. We don't need your text / images / audio. We just need their embeddings and a few other things. And we are working on a self-hosted version as well. Out of curiosity, what would you expect in terms of pricing?


I don't know how much I would pay because I have not researched your product, sorry for being unhelpful. Just to throw a useless number out, if there was a standalone version licensed for individual use I would impulse pay $100 to run it on my workstation and give it a try. Because while I do have scripts to automate a lot of my data analysis it's always nice to have a turnkey solution that automates it even more, and there is the promise that next year I could buy a new version that stays up to date with the integrations or adds more features. I can't imagine that is enough money to fund your development, but... there is no chance anyone in our org will jump through the hoops to adopt and purchase this as part of your enterprise sales when we already have tooling that does what you do, just not as polished. This is code we own and can extend ourselves, which I'm guessing is not something we can do with a SaaS offering.

I'm afraid no information at all about our data or models can go to a 3rd-party server. Practically speaking, everything has to be able to run on firewalled servers that are only accessible on the company network. Exceptions would have to go through the legal department and at least 6 months of meetings where you would be repeatedly told no, unless it is vital to the product, which a dashboard isn't, even if it could theoretically be allowed.


Thanks for sharing


(Not OP) Out of curiosity, how much is your manager's discretionary budget?

I feel like any own-infrastructure per-seat license is going to be way beyond that. Maybe as a marketplace app. [1]

1. https://aws.amazon.com/marketplace/


I agree. I'm not sure how it is at other companies. At both my current company and my previous one, line managers had a misc budget of $2000 before needing oversight, with bigger numbers as you went up the leadership chain, but I never interacted with those types frequently enough to ask.

This was for one-time purchases only. Recurring charges, i.e. SaaS offerings, would need to go up to the division head for approval, which effectively means no unless it was essential stuff like Jira/Bitbucket, Office, etc.


I feel like your starting assumption already diverges from my world.

> “code” has become a commodity: many powerful ML models are open source today. The real challenge is to grow and curate quality data.

The main recent improvements in translation and speech recognition were all new mathematical methods that enable us to use uncurated, random data and still get good results. CTC loss allows using unaligned text as ground truth. wav2vec allows using raw audio without transcripts for pre-training. OPUS is basically a processing pipeline for generating bilingual corpora. Word embeddings allow using random monolingual text for pre-training. We've also seen a lot of one-shot and zero-shot methods. Plus, XLS-R was all about transfer learning to reuse knowledge from cheap, abundant data for resource-constrained environments.
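To make the CTC point concrete, here is a tiny PyTorch sketch (shapes and sizes are made up): the loss only takes the unaligned target sequence, and the frame-level alignment is marginalized out inside the loss rather than provided as ground truth.

    # CTC loss needs only the target token sequence, not a frame-by-frame
    # alignment between audio and text. Shapes below are made up.
    import torch
    import torch.nn as nn

    T, N, C = 50, 4, 28  # time steps, batch size, classes (class 0 = blank)
    log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=2)
    targets = torch.randint(1, C, (N, 10), dtype=torch.long)  # unaligned transcripts
    input_lengths = torch.full((N,), T, dtype=torch.long)
    target_lengths = torch.full((N,), 10, dtype=torch.long)

    loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
    loss.backward()  # alignment is never specified by the annotator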

My prediction for the future of ML would be that we'll soon need so little training data that a single person can label it on a weekend.

On the other hand, I know first-hand that almost nobody can use the "open source" ML model and deploy it effectively in production. Doing so requires sparsity, quantization, conversions, and in many cases l33t C++ skillz to implement optimized fused SSE ops so that a model will run with decent speed on cheap CPU hardware rather than mandating expensive GPU servers.


I don't think our assumptions are so far apart. The methods you mentioned made it from research to the open source community fairly quickly. In fact, most companies rely on this kind of open research to develop their models. In a lot of use cases, it has become more about finding the right data than improving the model code. (I like Andrew Ng's thoughts on this: https://datacentricai.org/) At the same time, there are still a lot of unsolved engineering challenges with the code when it comes to productionizing models, especially for real-time speech transcription.

And we agree with your prediction. That's why we started Dioptra: to come up with a systematic way to curate high quality data so you can annotate just the data that matters.


How do you compare to Aquarium, HumanLoop, Cord.Tech, Lightly, https://prodi.gy/, and probably SnorkelAI (who is probably considered the OG with their paper https://arxiv.org/abs/1711.10160)?

P.S. Not trying to compare the product or the company/team, but I was hoping for a more technical understanding. The only reason I even mentioned those names is that they were all launched here on HN.

I generally term this space "algorithmic labeling", and there are many approaches here: https://paperswithcode.com/paper/machine-learning-algorithms... . Are there any kinds of algorithms or domains that work well for you... or don't work well?


Thanks for the question and the papers. Like some of those companies, we are believers in the data-centric approach to ML. But labeling is not our focus (unlike Snorkel, HumanLoop, Prodigy or Cord.Tech). We focus on diagnosing models and mining the best data to improve them. So there are more similarities with Aquarium or Lightly.

There is a great talk from Tesla [1] on what they call the "Data Engine" (which probably inspired some of us :)). One of the things we took from it was that in order to truly close the loop on the ML data flywheel, we needed to turn production into a reliable data source. It had to become accessible, understandable and minable. To achieve this we took the approach of combining ML observability with active learning mining frameworks. Combining both is important in our view because observability tells you how the model behaves in the real world and active learning finds the right samples to fix / improve the model on real-world data. They go hand in hand.

Technically, it means that we integrate with serving and labeling platforms. We ingest data both in streaming and batch. We can mine on production streams, including on device (for IoT use cases where accessing data is a challenge). We have an extensive set of metrics to understand model behavior in the wild and solve use cases like data drift (detecting it, triggering mining and sending the data for labeling/retraining). And we are geared toward automation.

Regarding which data sampling works well and which doesn't, we found that it's not one-size-fits-all. Combining uncertainty sampling and diversity sampling is very powerful in a lot of use cases and can sometimes match random sampling with 10x less data. But model-based sampling strategies can also underperform on drifted datasets (essentially, a model can be very confidently wrong on a new kind of sample), hence the need to also have similarity sampling techniques.
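As a rough illustration of combining the two (simplified, not our actual implementation; the `probs` and `embeddings` arrays are placeholders), one common recipe is to keep the highest-entropy samples and then pick a diverse subset of them by clustering in embedding space:

    # Rough sketch: uncertainty sampling (entropy) followed by diversity
    # sampling (k-means in embedding space). `probs` are model softmax
    # outputs and `embeddings` are sample embeddings; both are placeholders.
    import numpy as np
    from sklearn.cluster import KMeans

    def select_batch(probs: np.ndarray, embeddings: np.ndarray,
                     budget: int = 100, pool_factor: int = 5) -> np.ndarray:
        # 1) Uncertainty: keep the pool_factor * budget highest-entropy samples.
        entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
        candidates = np.argsort(entropy)[-budget * pool_factor:]

        # 2) Diversity: cluster the candidates and take the sample closest to
        #    each cluster center so the batch covers distinct regions.
        km = KMeans(n_clusters=budget, n_init=10).fit(embeddings[candidates])
        chosen = []
        for center in km.cluster_centers_:
            dists = np.linalg.norm(embeddings[candidates] - center, axis=1)
            chosen.append(candidates[int(np.argmin(dists))])
        return np.unique(np.array(chosen))  # indices to send for labeling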

Overall, we were able to show that we can intentionally drive specific model performance metrics, either globally or locally, by picking one technique vs another. Happy to share more if you want.

[1] https://www.youtube.com/watch?v=Ucp0TTmvqOE&t=7714s


>Technically, it means that we integrate with serving and labeling platforms. We ingest data both in streaming and batch. We can mine on production streams, including on device (for IoT use cases where accessing data is a challenge).

Hold on, are you saying you look at production serving data... and are able to determine what the problem in the training data was that caused it? That is pretty cool.


Yes, that's correct. We integrate with the major ML frameworks to monitor serving data and compare it to the training data to identify potential error patterns and mine the live stream for data to fix them. I'd love to show you the product in more detail and get your feedback if you're open to it!


Hi. I don't use this kind of modeling today (I used to in my previous product).

But the reason I asked is that this is a fantastic feature and differentiator. I wonder why you don't put that claim up as the hero on your website? I don't see a reason why anyone would NOT use something like this.


We'd been trying to implement an active-learning retraining loop for Koko's critical NLP models but had never found the time to prioritize the work, as it was a multi-sprint level of effort. We've been working with them for a few weeks and we are seeing meaningful performance improvements with our models. I highly recommend trying them out.


For many domains, active learning is actually not that efficient. The promise is that you label a subset of the data and train the model on it with the same accuracy. The reality is that in order to estimate the long tail properly, you need all the data points in the training set, not just a subset.

Consider a simple language model case. In order to learn some specific phrases, you need to see them in training, and the phrases of interest are rare (usually 1-2 cases per terabyte of data). You simply cannot select just half.

Semi-supervised learning and self-supervised learning are more reasonable and widely used. You still consider all the data for training; you just don't annotate it manually.
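To put a rough number on the long-tail point (assuming uniform random subsampling): a phrase that appears only k times in the corpus survives a subsample of rate p with probability 1 - (1 - p)^k, so rare phrases are easily lost entirely.

    # Back-of-the-envelope for the long-tail argument: probability that a
    # phrase appearing k times survives uniform random subsampling at rate p.
    def survival_probability(k: int, p: float) -> float:
        return 1.0 - (1.0 - p) ** k

    print(survival_probability(k=2, p=0.5))  # 0.75 -> 25% chance it vanishes
    print(survival_probability(k=2, p=0.1))  # 0.19 -> almost surely dropped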


You are right. Being able to learn good feature representations through SSL is very powerful. We leverage such representations to perform tasks like semantic search to tackle problems like long-tail sampling. We have seen pretty good results mining for edge cases. Let me know if you'd like to chat about it.


This is an interesting problem to solve. For the sake of better understanding, can OP or someone else here suggest research papers or code that describe similar approaches for detecting and removing outlier data by analyzing the embedding space?


In the example shown in the product tour, we use an approach based on diversity sampling. Basically, we look for the data points that are most representative of the drifted domain (and therefore outliers relative to the training domain). Here is a blog post that summarizes some of those techniques [1]. I also found a paper that describes a similar approach [2]. Happy to chat more about it.

[1]: https://towardsdatascience.com/https-towardsdatascience-com-... [2]: https://arxiv.org/abs/1904.03122
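As a minimal illustration of the outlier idea (not our production code; the training and production embedding arrays are placeholders), a simple baseline is to score each production sample by its mean distance to its nearest neighbors among the training embeddings and flag the largest scores:

    # Simple embedding-space outlier baseline: score production samples by
    # their mean distance to the k nearest training embeddings; large scores
    # suggest samples outside the training domain. Inputs are placeholders.
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def knn_outlier_scores(train_emb: np.ndarray, prod_emb: np.ndarray,
                           k: int = 10) -> np.ndarray:
        nn = NearestNeighbors(n_neighbors=k).fit(train_emb)
        distances, _ = nn.kneighbors(prod_emb)
        return distances.mean(axis=1)

    # Example: flag the 1% most out-of-domain production samples.
    # scores = knn_outlier_scores(train_emb, prod_emb)
    # outliers = np.argsort(scores)[-int(0.01 * len(scores)):]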


Thank you. Will take a closer look and reach out.



