
FWIW, just as Amazon and Salesforce are already doing their own versions of Copilot, there is nothing to prevent GitHub from training its models on open source hosted on sr.ht or GitLab or anywhere else. If it is open source, then the source is available to be used for the models.


> there is nothing to prevent GitHub from training its models off open source that is hosted on sr.ht or GitLab or anywhere else

I disagree. I think that GitHub is training Copilot only on repos currently hosted on GitHub, because they can easily do so under the Terms of Service [1], which allow them to "parse [user content] into a search index or otherwise analyze [user content] on our servers". They can reasonably argue that training an ML model fits the definition of "otherwise analyze content", regardless of the license.

If code that was never hosted on GitHub starts showing up verbatim in Copilot suggestions, they might be in a more legally challenging position. It is an unfortunate grey area if people upload copies of free software repos to GitHub anyway, but perhaps future versions of the GPL or other free/libre licenses could be explicit about usage in training ML models.

I have no idea what Amazon or Salesforce are doing, but it would be interesting to hear what they are using for training data and how they justify compliance with the software licenses.

----------------------------------------

[1] https://docs.github.com/en/site-policy/github-terms/github-t...


That's an interesting take, but there's not really any protection in current open source licenses to stop someone else from agreeing to GitHub's (or any other host's) terms for their clone of your repo.


Not a lawyer, but there’s something incredibly subtle here. By only using content on GitHub, GitHub can ensure that every user has agreed to this term specifically:

> If you are uploading Content you did not create or own, you are responsible for ensuring that the Content you upload is licensed under terms that grant these permissions to other GitHub Users.

https://docs.github.com/en/site-policy/github-terms/github-t...

And those users indemnify GitHub, too.

So code that has been uploaded to GitHub by non-GitHub employees does have a “color” to it (to quote a great blog post on IP law that I can’t find right now) that other instances of the repository under identical licenses on other hosting services do not, because in theory the uploader assumes some responsibility for any dispute!

Whether this “color” has significant legal merit is beyond my understanding, but I have no doubt it is a factor in their approach.


Not clear to me how it’s supposed to immunize Copilot from copyright claims, though I can see how it would justify the training process.

There is a widely believed but vague supposition that NN weights no longer contain training data, and thus that the trainer holds full copyright over them, but that won’t stand when the NN returns excerpts from said training data. At that point it just becomes an unattributed copy, and should be subject to takedowns.


But "these permissions" are listed there explicitly:

> you grant each User of GitHub a nonexclusive, worldwide license to use, display, and perform Your Content through the GitHub Service and to reproduce Your Content solely on GitHub as permitted through GitHub's functionality (for example, through forking)

There is no mention of making derivative works or anything else that looks like it could cover Copilot.


If M$ can parse and index code, they could just as well parse the software licenses and skip the projects that forbid this behaviour, robots.txt-fashion.
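A robots.txt-style opt-out like the one suggested above could be sketched as a simple pre-filter over the training corpus. This is purely hypothetical: the `.no-ml-training` marker file and the opt-out phrases below are illustrative assumptions, not any real standard.

```python
# Hypothetical sketch: exclude repos from a training corpus when they
# opt out, robots.txt-style, via a marker file or license wording.
# The marker filename and phrases are assumptions, not a real standard.
from pathlib import Path

OPT_OUT_MARKER = ".no-ml-training"  # hypothetical marker file
OPT_OUT_PHRASES = (
    "may not be used to train",     # hypothetical license wording
    "no machine learning",
)

def allows_training(repo_root: str) -> bool:
    """Return False if the repo opts out of ML training."""
    root = Path(repo_root)
    # A marker file at the repo root is the clearest signal.
    if (root / OPT_OUT_MARKER).exists():
        return False
    # Otherwise, scan common license filenames for opt-out phrases.
    for name in ("LICENSE", "LICENSE.txt", "COPYING"):
        lic = root / name
        if lic.exists():
            text = lic.read_text(errors="ignore").lower()
            if any(phrase in text for phrase in OPT_OUT_PHRASES):
                return False
    return True
```

Of course, this only works if license authors converge on machine-readable wording, which is exactly the kind of change to free/libre licenses discussed elsewhere in this thread.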


A lot of open source licenses require numerous things, attribution at the very least. So when Copilot barfs out someone's code verbatim, or nearly verbatim, after having memorized it, and doesn't say whose code it's reproducing, suddenly there's a problem.


IANAL, but still no.

The user who uploaded the code did provide attribution. They could have uploaded that code as part of vendoring a dependency. There is no flag that says they are not the copyright holders.

Copilot trains on that code just the same.

More importantly, the fair use they are claiming seems to bypass attribution and licensing requirements. It only needs access to the code.

That is to say, it can 100% train on code hosted elsewhere. Until this fair use claim gets challenged in a lawsuit, it will hold true.

The fact that we have not heard of such lawsuits yet suggests, at least to me, that lawyers agree it is legal.


Just saying what you want to be true doesn't make it so. Attribution is required by every license that requires it. Microsoft is laundering intellectual property.


A license describes how you want to reassign rights given to you by copyright. AFAICT Microsoft believes it is within the fair use granted by copyright law itself.

If they are correct, they don't have to follow individual license terms; if they are incorrect, it makes no difference whether they scrape or receive code, because any complex project is a mix of many past authors, not all of whom must be on GitHub.


Unfortunately, they're not, since it sometimes barfs out code verbatim and gives no attribution. So again, they can say whatever they want; it doesn't make it so.


Just because you say it's violating copyright and licensing doesn't make it so.

If the courts don't agree then there is no penalty. It would legally be fair use. It does not matter what you or I think.

You are arguing that it is wrong, but very clearly many lawyers disagree with you.


Just because it hasn't been tested in court yet doesn't mean it's legal, my friend. It's an easy fix: ask each dev for permission, or provide attribution per each license when the same code comes out of the model. Easy peasy.


>That is to say that it can 100% train on code hosted elsewhere.

I'm not sure that's true. The US isn't the only jurisdiction in which GitHub/Microsoft operates.


Anyone could just push the repos to GitHub, including GitHub employees.

Here are links to Amazon's and Salesforce's offerings:

https://aws.amazon.com/codewhisperer/

https://blog.salesforceairesearch.com/codegen/


Agreed, that's an unfortunate loophole, which I already acknowledged; and as I said, it will probably be resolved in the future by software licenses that explicitly prohibit or accept the possibility of a hosting platform automatically incorporating that code into ML model training.

> including GitHub employees

It would probably be different if GitHub's employees did it at GitHub's direction, since in that case they'd be explicitly acting on behalf of their employer to pull in that code.


Showing a reaction is very important in my book. I’m also moving to Source Hut.

Not hosting my code on the servers of a company that is openly hostile to open source and Linux is a good move.

Lastly, you can try to prevent scraping. It’s a cat-and-mouse game, but at least your servers can’t be directly accessed by the training code itself.
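As a first hurdle in that cat-and-mouse game, a self-hosted forge can reject requests from known crawler user agents. The substrings below are illustrative assumptions (and a scraper can trivially spoof or omit the header), so this complements rather than replaces robots.txt and rate limiting.

```python
# Minimal sketch of user-agent-based crawler blocking for a self-hosted
# code forge. The listed substrings are illustrative assumptions; real
# scrapers can spoof the User-Agent header, so treat this as one layer
# among several (robots.txt, rate limiting, auth walls).
BLOCKED_AGENT_SUBSTRINGS = ("ccbot", "gptbot", "scrapy")  # assumed names

def is_blocked(user_agent: str) -> bool:
    """Return True if the request's User-Agent matches a blocked crawler."""
    ua = user_agent.lower()
    return any(bot in ua for bot in BLOCKED_AGENT_SUBSTRINGS)
```

A server would call this on each request's `User-Agent` header and return 403 on a match; well-behaved crawlers identify themselves, which is exactly why this only catches the polite ones.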


Perhaps the licenses just need to be amended to explicitly say “you may not train AI models with this source code” and then get ready for a court battle.

I think Microsoft gives back to open source almost as much as they take, so I’m fine with them.


The loss of attribution, for example, is going to become an issue, especially since Copilot barfs out code chunks verbatim at times.

And either way, it's time yet again to step around Microsoft's interference in the beautiful world of open source.


Wait, Amazon? What are they training it on? Code hosted on AWS? What kind of mess are we in?

Honestly, all this is fine, but train it on your own code; you have the money and the scale for that. Why bother us small fish? What's the worst that can happen? You might have to pay people to write code, or start an open source sponsorship program to get the rights to do such things: based on a project's needs, you could give owners free hosting, free services, and so on, and give smaller projects small grants. If every company did that for such models, we would have much better OSS.

Also, IMO there really doesn't seem to be much we can do about it, in the OSS space at least: all open-source code is, by definition, openly readable, and if a model is assumed to be a system that doesn't do 1:1 copying, the grounds for legal action seem shaky. Though honestly I very much want to be wrong; can we file a lawsuit against such AIs in general? I see a similar problem with DALL-E looking at copyrighted art. What's next, copyrighted music? I can't believe I'm saying this, but I feel bad for copyright holders. Music labels, entertainment empires, and big services companies (like Accenture) will probably survive somehow, but the smaller people will get crushed.

And I say that as someone who primarily pursued Machine Learning in academia until graduation.

Maybe getting some "justified" compensation would be achievable...!?


It's called signalling. The chimp troupe has a wide variety of chimps with different needs. How do you find the ones where needs align? You send out such signals.




