
I don't know why the Eleuther project riles me up so much. Their work on the Pile gets to me because they're so cavalier about copyright (even though I train on similarly pirated text datasets myself; I tell myself it's different because I don't redistribute them and am honest that they're pirated. To be clear, I'm rolling my eyes at my own rationalization right here). Their work on GPT-Neo riles me up because they do such a weak job comparing it to the models whose hype they're riding. It also riles me up that so many people just eat it up uncritically.

But it's all out of proportion. I think it's that last part (the uncritical reaction) that makes me blow this up so much.



> Their work on GPT-Neo riles me up because they do such a weak job comparing it to the models whose hype they're riding.

Building open source infrastructure is hard. There does not currently exist a comprehensive open source framework for evaluating language models. We are currently working on building one (https://github.com/EleutherAI/lm-evaluation-harness) and are excited to share results when we have the harness built.
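
To give a feel for what the harness is aiming at, here's a rough sketch of running a single benchmark through a Python API of the kind we're building (the module layout, checkpoint name, and task id below are assumptions about the eventual interface rather than anything finalized; the repo README is the source of truth):

    # Sketch only: evaluate one checkpoint on one task and print the metrics.
    # Names here (lm_eval.simple_evaluate, the "hf" backend, the task id) are
    # assumed -- check https://github.com/EleutherAI/lm-evaluation-harness.
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",                                       # Hugging Face model backend
        model_args="pretrained=EleutherAI/gpt-neo-1.3B",  # assumed checkpoint name
        tasks=["lambada_openai"],                         # assumed task identifier
        batch_size=8,
    )
    print(results["results"])  # per-task metrics (accuracy, perplexity, ...)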

If you don’t think the model works, you are welcome to not use it and you are welcome to produce evaluations showing that it doesn’t work. We would happily advertise your eval results side by side with our own.

I am curious where you think we are riding the hype /to/, so to speak. The attention we've gotten in the last two weeks has actually been a net negative from a productivity POV, as it's diverted energy away from our larger modeling work towards bug fixes and usability improvements. We are a dozen or so people hanging out in a Discord channel and coding stuff in our free time, so it's not like we're making money or anything off of this either.


Hi! I’m the EAI person your criticism of the Pile is most directed at. I’m curious whether you read Sections 6.5 and 7 of the Pile working paper and, if so, what your response to them is. As you note, virtually everyone trains on copyrighted data and just ignores any implications of that fact. I feel that our paper is very upfront about this, though, going as far as to have a table that explicitly lists which subsets contain copyrighted text.

Also, I realize that you don’t have any way of knowing this, but we have also separated out the subset of the Pile that we can confirm is licensed CC-BY-SA or more leniently. This wasn’t done in time for the preprint, but it is in the (currently under review) peer-reviewed publication. Unfortunately the conference rules forbid us from posting materials or updating preprints between Jan 1st 2021 and the final decision announcement. But we will be making the license-compliant subset of the Pile public when we are able to, and will give it equal prominence on our website to the “full” Pile.

Also, we will be releasing a datasheet for the dataset but again conference limitations prevent us from doing so yet.

If you’re interested in talking about this in depth, feel free to send me an email.


Hi again! We had a back-and-forth about the paper a while back, and I think we didn't end up on the same page regarding its "public data" definition (found it! [0]). I love that you're upfront in the paper, because it's silly how most people just don't acknowledge the issue (though they usually don't redistribute the data publicly the way the Pile does).

I think the gist was us disagreeing about the relevance of

> Public data is data which is freely and readily available on the internet. This primarily excludes ... and data which cannot be easily obtained but can be obtained, e.g. through a torrent or on the dark web.

That last phrase is what got to me. It puts things in the same category that feel too different: e.g. the Harry Potter books vs. this comment I'm writing. They're both available within a few clicks from the search bar (one because I put it there, the other because it was put up against the wishes of the author and owners), but that commonality doesn't feel relevant.

Excluding torrents especially seems like a cop-out, there explicitly to get around "X is the top result when I Google it" being so common for torrents. I think you're trying to exclude that content from "public" because otherwise the definition would sweep in too much? But torrent vs. FTP doesn't feel at all relevant when it's just Google plus a click or three, or a Pirate Bay search plus a single click.

I imagine a judge looking at the copyright status of someone's pirate site and saying they can't redistribute the content, and the pirate responding "okay, we'll take down the FTP server and put up a torrent instead, so that it's not public. If you Google us (or search on Pirate Bay), the top result will stop saying 'X download' and will now say 'X download torrent'" and expecting the law to be on their side.

I didn't really buy the arguments in Section 7 either. The usage points seem legitimate, but they don't cover redistribution.

> But we will be making the license-compliant subset of the Pile public when we are able to and will give it equal prominence on our website to the “full” Pile.

This is fantastic and I want to sincerely thank you for that.

I'm trying not to be combative, but I feel like publicly redistributing other people's work is a much bigger step than just training on it.

[0] https://news.ycombinator.com/item?id=25616218


I don't have a dog in this fight, but I think you should re-read this: "data which cannot be easily obtained but can be obtained, e.g. through a torrent or on the dark web."

It's an extra piece of engineering to reliably scrape torrents and the dark web and to exclude spam traps. "Easily obtained" is probably as much about this as about the copyright aspects.

The person you are replying to is correct in saying that most people train on the "public web" (e.g., Common Crawl data). The copyright implications of this haven't been tested in court yet.

It is worth noting that Common Crawl data is widely distributed and would seem to raise the same issues you are identifying here.


Why would it matter, even legally? Once you have the pirated dataset you're merely letting the program analyze it, not copying it. The resulting network isn't a transformation of the copyrighted work, by the pigeon-hole principle: the weights are far smaller than the corpus, so they can't contain it verbatim. It's like reporting on the spelling of the corpus; the results aren't tainted by the legality of the access.
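
To put a rough number on that pigeon-hole point (the sizes are the publicly quoted ballpark figures; the fp16 assumption and the arithmetic are mine):

    # Back-of-envelope: the trained weights are orders of magnitude smaller
    # than the training corpus, so they can't store it verbatim.
    params = 2.7e9                          # GPT-Neo's largest release has ~2.7B parameters
    bytes_per_param = 2                     # assuming fp16 weights
    model_bytes = params * bytes_per_param  # ~5.4 GB of weights
    corpus_bytes = 825 * 1024**3            # the Pile is ~825 GiB of text
    print(corpus_bytes / model_bytes)       # ~160: the corpus is ~160x larger than the model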

Also, what kind of joke would it be if we could only train AIs on text we were allowed to use? That much bias would make the result worthless at predicting the real world.



