
> The anonymous programmers have repeatedly insisted Copilot could, and would, generate code identical to what they had written themselves, which is a key pillar of their lawsuit since there is an identicality requirement for their DMCA claim. However, Judge Tigar earlier ruled the plaintiffs hadn't actually demonstrated instances of this happening, which prompted a dismissal of the claim with a chance to amend it.

It sounds fair, based on how the article describes it.



Huh. There have definitely been well-publicized examples of this happening, like the Quake fast inverse square root.


You can't copyright a mathematical operation. Only a particular implementation of it, and even then it may not be copyrightable if it's a straightforward and obvious implementation.

That said, the implementation doesn't appear to be totally trivial, and Copilot apparently even copies the comments, which are almost certainly copyrightable in themselves.
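
For reference, here's roughly what the technique at issue looks like. This is a paraphrased sketch for discussion, not the verbatim Quake III source or its comments; the function name and comments are mine, and only the magic constant is the famous one:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Paraphrased sketch of the fast inverse square root trick:
       reinterpret the float's bits as an integer, subtract from a
       magic constant, then refine with one Newton-Raphson step. */
    static float fast_rsqrt(float x)
    {
        float half = 0.5f * x;
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);   /* read the float's bit pattern */
        bits = 0x5f3759df - (bits >> 1);  /* crude initial approximation */
        memcpy(&x, &bits, sizeof x);
        return x * (1.5f - half * x * x); /* one Newton-Raphson iteration */
    }

    int main(void)
    {
        printf("%f\n", fast_rsqrt(4.0f)); /* ~0.499, vs exact 0.5 */
        return 0;
    }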

https://x.com/StefanKarpinski/status/1410971061181681674 https://github.com/id-Software/Quake-III-Arena/blob/dbe4ddb1...

However, a Twitter post on its own isn't evidence a court will accept. You would need the original poster to testify that what is seen in the post is actually what he got from Copilot and not just a meme or joke that he made.

Also, the plaintiffs in this case don't include id Software, and there is some evidence that id Software actually stole the fast inverse sqrt code from 3dfx, so they might not want to bring a claim here anyway.


Not sure where you thought I said you could copyright a mathematical operation; I was clearly referring to the implementation, given the mention of “quake”.

When it was reported, I was able to reproduce it myself.


Weren't people getting it to spit out valid Windows keys also?


GPT-4 regurgitated nearly complete NYT articles verbatim. It's strange that this lawsuit seems so amateurish that they failed to properly demonstrate the reproduction. Though of course, demonstrating it might involve a lot of legal technicalities that we naively think are trivial but that might not be.


I read that case.

Absolutely, there were a few outliers where a judge might want to look more closely. I'd be surprised if, under scrutiny, there weren't any issues whatsoever that OpenAI overlooked.

However, it seemed to me that over half of the NYT complaints were examples of using the (then rather new) ChatGPT web browsing feature to browse their own website. In the complaint, they then claimed surprise when it did just what you'd expect a web browsing feature to do.


> You can't copyright a mathematical operation.

I agree from a philosophical POV, but this is clearly not the case in law.

https://en.wikipedia.org/wiki/Illegal_number


The second step is to remove from consideration aspects of the program which are not legally protectable by copyright. The analysis is done at each level of abstraction identified in the previous step. The court identifies three factors to consider during this step: elements dictated by efficiency, elements dictated by external factors, and elements taken from the public domain.

https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari...


It's even simpler: id is owned by ZeniMax. ZeniMax is owned by Microsoft... who would they even sue?


That's not how that works.

All the plaintiffs would need to do is provide evidence that copyrighted code was produced verbatim. This includes showing the copyrighted code on GitHub, showing Copilot reproducing the code (including how you manipulated Copilot to do it), showing that they match, and showing that the setting to turn off reproduction of public code is set.

It makes no difference who owns the copyrighted code; it need only be shown that Copilot is violating copyright. Microsoft can't say "uhh that doesn't count" or whatever simply because they own a company that owns a company that owns the copyright on the code.


"Trust no one... even yourself"


Algorithms can be, and definitely are, patented in utility patents in the US.



It reads like the judge required them to show it happened to their code, not to any code in general. That's a much higher bar. There are thousands of instances of fast inverse square root in the training data, but only one copy of your random GitHub repositories. Getting the model to reproduce your code verbatim might be possible for all we know, but it isn't trivial.


>It reads like the judge required them to show it happened to their code, not to any code in general.

Rightly so; you have to show some sort of damage to sue someone, not just theoretical damages.


Of course, for standing. But it seems like, with the right plaintiffs, this could have gone forward.


But that’s like saying my lawsuit alleging Taylor Swift copied my song could have gone forward with a plaintiff who had, years ago, written a song similar to what Ms. Swift recorded recently. That’s true, but perhaps the lesson here is that damages that hinge on statistically rare victims should not be extrapolated out to provide windfalls for people who have not been harmed.


I think that is a weak analogy, and also unnecessary because it is already clear what I am saying.


If it only copies code that has been widely stolen already, then that's a much weaker case, and something they can do a lot to prevent on a technical level.


Code that has been copied widely != code that has been widely stolen.

Open source licenses allow sharing under certain conditions.


It could be forced, of course. I can republish my copyrighted code millions of times all over the internet. Next time they retrain, there is a good chance my code will end up in their corpus, maybe many, many times, reinforcing it statistically.


The article mentions that GitHub Copilot has been trained to avoid directly copying specific cases it knows, and that although you can get it to spit out copyrighted code by prefixing the copyrighted code as a starting point, in normal use cases it's quite rare.


yes, but you need to show that it happened _in your case_, not that it can happen in general.


Fast inverse square root is now part of the public domain.

Also, even if this weren’t the case, you can’t sue for damages to other people (they’d need to bring their own suit).


Is the particular implementation that the model spits out 70+ years old?


[deleted]


But Copilot distributed it (allegedly) without complying with the GPL (which requires any distribution to be accompanied by the license), so it would still be an instance of copyright infringement. https://x.com/StefanKarpinski/status/1410971061181681674


Has it really already been 70 years since John Carmack died?


Ah, you're right. I was wrong to say "public domain".

It would be more correct to say Quake III Arena was released to the public as free software under the GPLv2 license.


There is a large gap between public domain and GPL. For starters, if Copilot is emitting GPL code for closed-source projects... that's copyright infringement.


That would be license infringement, not copyright infringement.


Copyright infringement is emitting the code. The license gives you permission to emit the code, under certain conditions. If you don't meet the conditions, it's still copyright infringement like before.


No.

Copyright infringement could be emitting the code in a manner that exceeds fair use.

The license gives you permission to utilize the code in a certain way. If Copilot gives you GPLed code that you then put into your closed source project, you have infringed the license, not Copilot.

> If you don't meet the conditions, it's still copyright infringement like before.

Licensing and copyright are two separate things. Neither has anything to do with the other. You can be in compliance with copyright, but out of license compliance, you can be the reverse. But nothing about copyright infringement here is tied to licensing.

To be clear: I am a person who trashed his Reddit account when they said they were going to license that text for training (trashed in the sense of "ran a script that scrubbed each of my comments first with nonsense edits, then deleted them"). I am a photographer who has significant concerns with training other models on people's creative output. I have similar concerns about Copilot.

But confusing licensing and copyright here only muddies waters.


Without adhering to the conditions of the GPL, you have no license to redistribute the code and are therefore infringing the copyright of the author.


Apparently, the court disagrees with you, and doesn't find "emitting" the code a copyright infringement.

It'd be a long bow to draw to say that what is akin to a search result of a snippet of code is "redistributing a software package".


Where it gets ethically dubious is that:

1. The copilot team rushed to slap a copyright filter on top to keep these verbatim examples from showing up, and now claims they never happen.

2. LLMs are prone to paraphrasing. Just because you filter out verbatim copies doesn't mean there isn't still copyright infringement/plagiarism/whatever you want to call it. The copyright filter is only a legal protection, not a practical protection against the issue of copyright infringement.
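
To make the second point concrete, here is a toy sketch (everything in it is invented for illustration) of what a purely verbatim filter amounts to, and why a single renamed identifier defeats it:

    #include <stdio.h>
    #include <string.h>

    /* Toy stand-in for a verbatim-copy filter: flag the output only if
       it appears character-for-character in the training corpus. */
    static int is_verbatim_copy(const char *output, const char *corpus)
    {
        return strstr(corpus, output) != NULL;
    }

    int main(void)
    {
        const char *corpus = "float Q_rsqrt(float number) { /* ... */ }";

        /* Exact copy: caught. Prints 1. */
        printf("%d\n", is_verbatim_copy("float Q_rsqrt(float number)", corpus));

        /* One identifier renamed: sails through, even though it may
           still be a derivative work. Prints 0. */
        printf("%d\n", is_verbatim_copy("float my_rsqrt(float number)", corpus));

        return 0;
    }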

Everyone who knows how these systems work understands this. The Copilot FAQ to this day claims that you should run copyright scanning tools on your codebase because your developers might "copy code from an online source or library".

GitHub has its own research from 2021 showing that these tools do indeed copy their training data occasionally: https://github.blog/2021-06-30-github-copilot-research-recit...

They clearly know the problem is real. Their own research agreed; their FAQs and legal documents are carefully phrased to avoid admitting it. But rather than owning up to the problem, it's "Ner ner ner ner ner, you can't prove it to a boomer judge".


> The copilot team rushed to slap a copyright filter on top to keep these verbatim examples from showing up, and now claims they never happen.

More than that: they claimed it wasn't possible before adding the filter, a filter whose whole purpose is to catch the thing they said wasn't possible. This doesn't help me trust anything else they might say or have already said.

My take on that was always: if it isn't possible, then why is MS not training the AIs on their internal code (like that for Office, in the case of MS with their Copilot product) as well as public code? There must be good examples for it to learn from in there, unless of course they think public code is massively better than their internal works.


How do you know they aren’t training it on their internal code?

Since you really need to work hard to make the AI spit out anything verbatim, and you have no knowledge of their internal code, how could you ever prove or disprove it?


> How do you know they aren’t training it on their internal code?

Because if they were, they would have said so.

It would be an excellent answer to the concerns being discussed here: “we are so sure that there is nothing to worry about in this regard that we are using our own code as well as the stuff we've schlepped from GitHub and other public sources”.


> Just because you filter out verbatim copies doesn't mean there isn't still copyright infringement/plagiarism/whatever you want to call it.

Actually, it does. The production of the output is what matters here.


If you copy someone else's copyrighted work and then rearrange a few lines and rename a few things, you're probably still infringing.


For a book or a song, for sure, although that isn't really punished; search the drama surrounding a popular YA author in the '10s, Cassandra Claire. For code, since only the form and not the function is protected, that might actually be enough.

People do clean room implementations because of paranoia, not because it's actually a necessary requirement.


Moving a few things around means your internal process already involved copyright infringement.


Probably not. Copyright infringement in the manner we're talking about presumes you already have license to access the code (like GitHub does). What you don't have license to do is distribute the code -- entirely or not -- without meeting certain conditions. You're perfectly free to do whatever naughty things you want with the code, sans running it, in private.

The literal act of making modifications isn't infringement until you distribute those modifications -- and we're talking about a situation where you've changed the code enough that it isn't considered a derivative work anymore (apparently), so that's kosher.


First, the case would be dismissed if Copilot had permission to make copies. Clearly they didn't. Copyright cares about copies; for-profit distribution just makes this worse.

> you already have license to access the code

This isn’t access; that occurs before the AI is trained. It’s: access > make a copy for training > AI does lossy compression > a request unzips that compression, making a new copy > the process fuzzes the copy so it’s not so obvious > a derivative work is sent to users.


Clearly Copilot had permission to make (unmodified) copies, the same way GitHub's webserver had permission to make (unmodified) copies. The lawsuit is about making partial copies without attribution.


GitHub's terms of service (TOS), in my non-lawyerly opinion, clearly state that the license granted to them by users for uploaded works doesn't cover using the data to train an LLM or any kind of model beyond those used to improve the hosting service:

>You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time

>This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program.

https://docs.github.com/en/site-policy/github-terms/github-t...

I think the important questions are (1) whether "the Service" includes Copilot, and (2) whether GitHub is selling users' content with Copilot.

For (1), I'm unhappy to admit Copilot probably does fall under "the Service," which is nebulously defined as "applications, software, products, and services provided by GitHub." But I'll still say that users could not agree to this use while GitHub was training the Copilot model but hadn't yet announced it. At that time, a reasonable user would've believed GitHub's services only covered repository hosting, user accounts, and the extra features attached to those (issue trackers, organizations, etc.).

GitHub could defend themselves on point (2) by saying they aren't selling the code, instead selling a product that used the code as input. But does that differ much from selling an online service that relies on running user code? The code is input for their servers, and it doesn't need to be distributed as part of that questionable service. But it's a clear break from the TOS.


GitHub’s web server is not the same thing as Copilot and needs separate permission.

GitHub didn’t just copy open source code; they copied everything without respect to license. As such, attribution, which may have allowed some copying, isn’t generally relevant.

Really, a public repo on GitHub doesn’t even mean the person uploading it owns the code; if they needed to verify ownership before training, they couldn’t have started. Thus, by necessity, they must take the stance that copyright is irrelevant.


If you’ve copied three lines and rearranged and reworded them, there’s little infringement left.

If you copy a whole book and do the same, there’s still (lines - 3) worth of infringement left.


> 1.

Isn't that akin to destruction of evidence?


Legally? No.

In spirit? ... Probably?

Unlike most LLMs, GitHub Copilot can trivially solve their copyright problem by using only code they have the right to reproduce.

They have a giant corpus of code tagged with its license; SELECT BY license MIT/Equivalent and you're done, problem solved, because those licenses explicitly grant permission for this kind of reuse (a naive sketch of such a filter follows at the end of this comment).

(It's still not very cash money to take open source work for commercial gain without paying the original authors, and there's a humorous question of whether MIT-Copilot would need to come with a multi-gigabyte attribution file, but everyone widely agrees it's legal and permitted.)

The only reason you'd hack a filter on top rather than doing the above is if you'd want to hide the copyright problem. It's an objectively worse solution.
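
The sketch promised above: a naive filter over a corpus index, assuming (and these are assumptions purely for illustration) one "license<TAB>path" line per repo with SPDX-style tags, and assuming the tags can be trusted at all, which the reply below disputes:

    #include <stdio.h>
    #include <string.h>

    /* Naive corpus filter: read "license\tpath" lines on stdin and
       print only the paths of permissively licensed repos. The tag
       names and index format are assumptions for illustration. */
    int main(void)
    {
        const char *allowed[] = { "MIT", "BSD-2-Clause", "BSD-3-Clause", "Apache-2.0" };
        char line[4096];

        while (fgets(line, sizeof line, stdin)) {
            char *tab = strchr(line, '\t');
            if (!tab)
                continue;            /* malformed line: skip it */
            *tab = '\0';             /* split the tag from the path */
            for (size_t i = 0; i < sizeof allowed / sizeof *allowed; i++) {
                if (strcmp(line, allowed[i]) == 0) {
                    fputs(tab + 1, stdout); /* keep this repo in the training set */
                    break;
                }
            }
        }
        return 0;
    }

Even granting the premise, note that MIT code still requires carrying the copyright notice along, as a reply further down points out.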


> Unlike most LLMs, GitHub Copilot can trivially solve their copyright problem by using only code they have the right to reproduce.

Absolutely not trivial; in fact, completely impossible by computer alone. You can't determine whether you have the right to reproduce a piece of code just by looking at the code and tags themselves. *Taps the color-of-your-bits sign.*

* I can fork a GPL project on GitHub and replace the license file with MIT. Okay to reproduce?

* If I license my project as MIT but it includes code I copied inappropriately and don't have the right to reproduce myself, can GitHub? (No.) This one is why indemnity clauses exist on contracted works.

* I create a git repo for work and select the MIT license, but I don't actually own the copyright on that code, so that license is worthless.


There is no difference when it comes to MIT and GPL here. If your model outputs my MIT-licensed code, you still need to provide attribution in the form of a copyright notice, as required by the MIT license.


Have the copyleft people, or anyone else, produced some boilerplate licenses that explicitly deny use in training models?


I would think it is pretty obviously not.

Is taking away a drunk driver's keys (before they get in the car) destruction of the evidence of their drunk driving?


This is not what I meant. By placing a copyright filter and claiming it never happened (please read the line I was replying to) before the system can be audited, they're indeed taking away the drunk driver's keys, which is a good thing, but also removing the offending car before the police arrive.


In this metaphor, removing the car of someone who was going to drink and drive but didn't is certainly not a crime. Presumably, though, you mean removing the car after drunk driving actually took place - which might be, but probably depends a lot on whether the person knew, and what the intent of the action was.

In the current case, it's unclear if any crime took place at all; it seems clear that the primary intent was to prevent future crime, not hide evidence of past ones. Most importantly, the past version of the app is (presumably) not destroyed. GitHub still has the version of the software without the copyright filter. If relevant and appropriate, the court could order them to produce the original version. It can't be destroying evidence if the evidence was not destroyed.


Yes, sorta. We're talking about software; a piece of code that does something programmatically isn't like the drunk driver in a car, who may or may not cause more accidents but whom we prevent from driving anyway just to be safe. The software would most certainly repeat its routine, because it has been written to do so; that's why I wondered about destruction of evidence. By removing/modifying it, or placing filters on it, they would prevent it from repeating the wrongdoing, but also take away any means of auditing the software to find out what happened and why.


Not in any way I'm aware of - and it would be required if they were served a DMCA notification/cease-and-desist against a specific prompt.

The people who think Copilot is infringing their copyright would be happy with that, I would think? Unless they take a much stricter definition of fair use than current courts do.


No more so than scanner/printer manufacturers adding tech to prevent you from scanning and printing currency is destruction of evidence that they are in fact producing illegal machines for counterfeiting.


> The copilot team rushed to slap a copyright filter on top to keep these verbatim examples from showing up, and now claims they never happen.

Well, if the copyright filter is working, they indeed aren't happening. Putting in safeguards to prevent something from happening doesn't mean you're guilty of it. Putting a railing on a balcony doesn't imply the balcony with the railing is unsafe.

> LLMs are prone to paraphrasing. Just because you filter out verbatim copies doesn't mean there isn't still copyright infringement/plagiarism/whatever you want to call it

Copyright infringement and plagiarism are different things. Something can be copyright infringement without being plagiarized, and can be plagiarized without being copyright infringement. The two concepts are similar but should not be conflated, especially in a legal context.

Courts decide based on laws, not on gut feeling about what is "fair".

> They clearly know the problem is real

They know the risk is real. That is not the same thing as saying that they actually committed copyright infringement.

A risk of something happening is not the same as actually doing the thing.

> "Ner ner ner ner ner, you can't prove it to a boomer judge".

It's always a cop-out to assume that they lost the argument because the judge didn't understand. I suspect the judge understood just fine, but the law and the evidence simply weren't on their side.


> Well, if the copyright filter is working, they indeed aren't happening. Putting in safeguards to prevent something from happening doesn't mean you're guilty of it. Putting a railing on a balcony doesn't imply the balcony with the railing is unsafe.

Doesn't mean you weren't, at some point, guilty of it, either. It doesn't retcon things.


Sure, which is why we require evidence of wrongdoing. Otherwise it's just a witch hunt.

After all, you yourself probably cannot prove that you didn't commit the same offense at some point in the past. Like Russell's teapot, it's almost always impossible to disprove something like that.


Yeah, but I think the main concern in this situation is Copilot moving forward, not their past mistakes.


This is so stupid. Going after likeness is doomed to fail against constantly mutating enemies like booming tech companies with infinite resources. And likeness itself isn’t even that big of a deal; even if you win, it’s such a minor case-by-case event that it puts an enormous burden of proof on the victims to even get started. If the narrative centers on likeness, they’ve already won.

The main issue, as I see it, is that they took copyrighted material and made new commercial products without compensating (let alone acquiring permission from) the rights holders, i.e., their suppliers. Specifically, they sneaked a fair-use sticker onto mass AI training, with neither precedent nor a ruling anywhere. Fair use originates in times before there were even computers. (IMO, it’s as outrageous as applying a free-mushroom-picking-on-non-cultivated-land law to justify industrial-scale farming on private land.) That’s what should be challenged.



