alexwatson405's comments

Hi all, I’m a co-founder of Gretel; our team and tech are now part of NVIDIA.

NeMo Data Designer started as our core product at Gretel and is now the internal framework we use heavily for both pre- and post-training data in Nemotron, across a variety of use cases.

The OSS version is fully general-purpose: Python-first, modular, and designed so you can mix statistical samplers, LLM columns, and seed datasets in a single pipeline.
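
If it helps make "mix statistical samplers, LLM columns, and seed datasets" concrete, here's a rough sketch of the idea in plain Python. This is not the actual Data Designer API -- every name below is illustrative, so check the repo for the real interface:

    # Concept sketch only; not the Data Designer API. All names illustrative.
    import random
    import pandas as pd

    seed = pd.DataFrame({"name": ["Ada", "Grace"]})      # seed dataset

    def age_sampler(n):                                  # statistical sampler column
        return [max(18, int(random.gauss(40, 12))) for _ in range(n)]

    seed["age"] = age_sampler(len(seed))

    def llm_column(template, rows, call_llm):            # LLM-generated column
        return [call_llm(template.format(**row)) for row in rows]

    seed["support_ticket"] = llm_column(
        "Write a short support ticket from {name}, age {age}.",
        seed.to_dict("records"),
        call_llm=lambda prompt: "(model output here)",   # stub; plug in any LLM client
    )

The actual library obviously does much more; this is just the basic shape of a column-wise pipeline.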

Happy to answer questions or hear feedback on missing features.


Hey there! Co-founder of Gretel.ai here, and I think I can provide some insights on this topic.

Firstly, the concept you're hinting at is not purely traditional ML. In traditional machine learning, we often prioritize feature extraction and engineering specific to a given problem space before training.

What you're describing and what we've been working on at Gretel.ai, is leveraging the power of models like Large Language Models (LLMs) to understand and extrapolate from vast amounts of diverse data without the need for time-consuming feature engineering. Here's a link to our open-source library https://github.com/gretelai/gretel-synthetics for synthetic data generation (currently supporting GAN and RNN-based language models), and also our recent announcement around a Tabular LLM we're training to help people build with data https://gretel.ai/tabular-llm

A few areas where we’ve found tabular or Large Data Models to be really useful:

* Creating privacy-preserving versions of sensitive data
* Creating additional labeled examples for ML training (much less expensive than traditional data collection/ML techniques)
* Augmenting existing datasets with new fields, cleaning data, and filling in missing values (quick sketch of this one below)
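
As a rough analogue of that last use case (filling in missing values) with plain scikit-learn -- a tabular model conditions on all columns, including text, rather than just numerics, but the shape of the problem is the same:

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    X = np.array([[25.0, 52000.0],
                  [31.0, np.nan],       # missing salary
                  [np.nan, 61000.0]])   # missing age

    # Each missing cell is modeled as a function of the other columns.
    print(IterativeImputer(random_state=0).fit_transform(X))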

Lots of mentions of RLHF here in the threads. One area where I think RLHF will be super helpful is ensuring that LLM data models return diverse and ethically fair results (hopefully better than the data they were trained on). Cheers!


“Pics and it didn’t happen.” Love it


Here is the call for papers (CFP) section in the FAQ; it links to a Google Form: https://gretel.ai/synthesize2023#faqs


Yep, versioning is definitely important, and it's not what Gretel focuses on. You could connect a Gretel project stream up to a Dat backend for versioning/lineage.

So you could use Gretel to anonymize or build a synthetic version of a dataset for sharing, and then use Dat for versioning.
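
A minimal sketch of that two-step flow, with the Gretel step stubbed out and the Dat CLI call from memory of the old dat tool (verify both against their docs):

    import pathlib
    import subprocess

    share_dir = pathlib.Path("shared")
    share_dir.mkdir(exist_ok=True)

    # Step 1 (stub): write the Gretel-anonymized/synthetic dataset here.
    (share_dir / "records_synth.csv").write_text("id,age\n1,34\n2,27\n")

    # Step 2: let Dat handle versioning + p2p sharing of the directory.
    subprocess.run(["dat", "share"], cwd=share_dir, check=True)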


Good point in the article that barring apps from aggressively tracking users (which is a good thing, IMO) creates more power for companies like Facebook/Apple/Google/Amazon that already have access to the data.


+ Neat example of using synthetic data to balance limited ML datasets: https://towardsdatascience.com/improving-massively-imbalance...
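
The general idea, sketched with imbalanced-learn's SMOTE as a stand-in (the linked approach uses a learned generative model for the minority class instead, which handles mixed-type columns better than interpolation):

    import numpy as np
    from collections import Counter
    from imblearn.over_sampling import SMOTE

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 4))
    y = np.r_[np.zeros(950, dtype=int), np.ones(50, dtype=int)]  # 95/5 split

    X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)
    print(Counter(y), "->", Counter(y_bal))  # minority class upsampled to parity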


Nice write-up! Getting servos and gears strong enough to jump will be a challenge, but I’m looking forward to seeing what you come up with! I had a project last year to create a SpotMini from a Mekamon; the biggest problem I ran into was that the servos could not support the weight of an iPhone: https://medium.com/@zredlined/making-my-own-spot-mini-2-2f12...
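
For anyone sizing servos for something like this, here's a back-of-the-envelope static torque check -- every number below is an assumption (guessed masses and geometry), so substitute your own:

    # Rough static holding torque per hip servo; all values are guesses.
    g = 9.81           # m/s^2
    body_kg = 1.0      # assumed robot mass
    payload_kg = 0.19  # roughly an iPhone
    legs = 4
    lever_m = 0.06     # assumed horizontal distance, hip joint to foot

    load_n = (body_kg + payload_kg) / legs * g
    torque_nm = load_n * lever_m
    print(f"{torque_nm / 0.0981:.1f} kg*cm per hip servo, static")

Jumping needs several multiples of the static figure, so budget a lot of margin on top of whatever this gives you.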


@ofalko Our code is open source; you can always check it out for yourself. =)

https://github.com/gretelai/gretel-synthetics


To quote the great Jean-Luc Picard, pick "one impossible thing at a time".

I also find myself going in a lot of directions, and I’ve found that picking an idea and sticking with it until it fails or works is an achievement in itself.

