That's why copyright holders for reference works have been using copyright traps for ages. That's where you include a fictional town in a map, a nonsense word in a dictionary, or a fake person in your phone book. If your competitors reproduce the trap, then that's clear evidence you can use in court.
We don't need the copyright traps here though as Github openly admits to using the public code for training. They just don't care that they're essentially license laundering code since they can make money doing it.
That said we used copyright traps at Malwarebytes, which is how we found out that iobit was stealing our database.
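For anyone wondering what checking for a trap looks like mechanically, here is a minimal sketch. The entry names and the `find_copied_traps` helper are made up for illustration; this is not a description of the actual Malwarebytes process, just the general idea of planting entries that exist nowhere else and then looking for them in someone else's data.

```python
def find_copied_traps(trap_entries, competitor_entries):
    """Return the planted trap entries that also appear in the competitor's data."""
    return sorted(set(trap_entries) & set(competitor_entries))


# Hypothetical planted entries that exist nowhere except in our own database.
our_traps = {"trap-signature-001", "trap-signature-002"}

# Hypothetical competitor database that happens to contain one planted entry.
competitor_db = {"real-signature-a", "real-signature-b", "trap-signature-002"}

copied = find_copied_traps(our_traps, competitor_db)
print(copied)
```

Any non-empty result is hard to explain by independent work, since the trap entries were invented rather than observed.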
We know that isn't the case because we can see code being reproduced even with comments, and Github has been open about the fact that they used everything they had in training.
That said, let's say there's a new model that explicitly excluded closed-source and copyleft-licensed code. Well, the MIT, MPL, Apache, and BSD licenses all say you can't strip their licensing off.
Okay, so to get to the spirit of your question, let's say Github managed to program a model that worked using only their own code or code that was explicitly put in the public domain. If Github managed to reproduce code that wasn't in the training set, then it can't be accused of copying it. At that point the argument could be made that it independently created it.
At the same time algorithms can't be copyrighted, but implementations of an algorithm can be, so if Github was basically just spitting out an algorithm that just happened to be implemented similarly to how some other code it wasn't trained on implemented it, then I would say there was no copyright violation.
>We know that isn't the case because we can see code being reproduced even with comments
If the comment is something like
//check fromIndex is greater than toIndex
then that is not any more individualistic or different than the actual function. Sadly, many people comment like this. On the other hand, if it reproduced a comment with typos or something more complicated like
/* this hack is because Firefox's implementation of SVG z-indexing does not match how Chrome or Safari does it - please read this article ...url...*/
In almost any other scenario this would be evidence. But Fast Inverse Square Root isn’t some tightly held secret. That exact code, with those specific comments included, is found in the Wikipedia page for that algo:
How about rewording a code snippet so it doesn't exactly replicate the source, but is functionally identical? Could be applied before training. Can we say the LLM only learned the ideas not the expression? Copyright should protect expression and not restrict reusing ideas.
Except that's not how LLMs work. An LLM has no idea about "ideas", only probabilities of how certain words string together.
So you literally can't make it produce functionally identical but not verbatim identical code. It doesn't understand that the two are equivalent.
Also, such "functionally identical but not violating copyright" transformation is not possible to do, both given the complexity of the problem and the sheer volume of the data.
And training it on some simplistically obfuscated code wouldn't help - all it would learn would be production of obfuscated code. Not useful for the intended use.
> It doesn't understand that the two are equivalent.
It doesn't need to understand it the way a human would.
The pattern that the LLM managed to extract could include the structure, rather than the pure text. And in reproducing the structure, the LLM can replace the variable names but keep the structure intact.
I am not sure if Copilot is able to do this, but ChatGPT was somewhat able to (if imperfectly at the moment).
Copying a piece of code and changing the variable names is still a copy. It is similar to how copying a piece of music and changing the pitch/volume/any other attribute would still be a copy of the original music.
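To make the "renaming is still a copy" point concrete, here is a rough sketch of how one might show that two snippets share the same structure once names are ignored. It uses Python's standard `ast` module; the `structural_fingerprint` helper and the example functions are my own illustration, and a real similarity analysis (or a court) would look at far more than this.

```python
import ast


class _Normalizer(ast.NodeTransformer):
    """Replace every function, argument and variable name with a placeholder."""

    def __init__(self):
        self.names = {}

    def _placeholder(self, name):
        # Names are numbered in order of first appearance: v0, v1, ...
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_FunctionDef(self, node):
        node.name = self._placeholder(node.name)
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self._placeholder(node.arg)
        self.generic_visit(node)
        return node

    def visit_Name(self, node):
        node.id = self._placeholder(node.id)
        return node


def structural_fingerprint(source):
    """Dump the syntax tree of `source` with all names normalized away."""
    tree = _Normalizer().visit(ast.parse(source))
    return ast.dump(tree)


original = "def clamp(value, low, high):\n    return max(low, min(value, high))\n"
renamed = "def bound(x, a, b):\n    return max(a, min(x, b))\n"
rewritten = "def clamp(value, low, high):\n    return min(high, max(value, low))\n"

print(structural_fingerprint(original) == structural_fingerprint(renamed))
print(structural_fingerprint(original) == structural_fingerprint(rewritten))
```

The renamed version fingerprints identically to the original, while the version that is actually restructured does not, even though it computes the same result for sensible inputs.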
The thing that the LLM needs to do is to convince a judge/jury that it has not created a copy, and that it operates differently from a transformation.
Consider a junior dev who writes a range check function while working for a company (so they own the copyright) then goes to a different company and writes the same range function because that's just how he writes code.
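For reference, this is roughly the kind of function being talked about. It is a generic sketch written for this comment, not taken from any particular codebase, and the name `range_check` and the choice of exceptions are just assumptions. The point is how little room there is to write it differently.

```python
def range_check(length, from_index, to_index):
    """Raise if the half-open range [from_index, to_index) does not fit in `length`."""
    # The start of the range must not come after its end.
    if from_index > to_index:
        raise ValueError(f"from_index({from_index}) > to_index({to_index})")
    # The start must not be negative.
    if from_index < 0:
        raise IndexError(from_index)
    # The end must not run past the container.
    if to_index > length:
        raise IndexError(to_index)


range_check(10, 2, 5)  # a valid range passes silently
```

Two people asked to write this from scratch will probably end up with the same three comparisons in nearly the same order.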
> Then the legalities can be argued, but an individual is in any case not remotely comparable to a service like copilot.
Why is this? Copilot in some ways is an automated way to search code & stack overflow. There is a very annoying website that does nothing more than show relevant code samples of various google search terms.
If the manual version of something is okay (eg: googling for code, finding it, fitting for a new and specific purpose that is similar), why would an automated version of that be any different?
I'm confused by the forbidden surveillance example. Generally, surveillance cameras are legal in any place where there is no expectation of privacy. The expectation of privacy largely applies only to one's home; outside of that you can be videotaped all day long by anyone. I'm not sure how this is analogous.
The million messages example is interesting. Though, what examples are there? In what cases is something legal to do it once, but there is some threshold where you cannot do it many times?
The "sending millions of messages" is only perhaps illegal because it breaks terms of service. Or, the one message is perhaps also illegal but nobody cares to pursue litigation for one instance of an infraction. The point remains though, if an individual does something once that is legal - it makes that activity legal, period and full stop. No?
I was thinking about things like spam and also social media.
Note that my main objection is to equating a person doing something with an automated process. Sometimes it may be legal or other times illegal but it just clearly isn’t the same.
For the last point, I think the answer to that is a definite no in most jurisdictions. Laws and judicial conventions often allow differing circumstances to affect the legality of things.
> For the last point, I think the answer to that is a definite no in most jurisdictions. Laws and judicial conventions often allow differing circumstances to affect the legality of things.
Curious, any concrete examples? I can't really think of any where one instance is okay but many is not. I can think of examples where one instance is ignored and many instances are harder to ignore (and so is prosecuted), but overall - I can't really think of anything that is okay to do once but not many times.
Playing a movie for a few friends who visit is fine, but start to demand tickets and suddenly it will look like a cinema, which is not fine.
The reason is always the same. Courts and judges will look at the situation and make a decision about what seems fair and what does not. It is them that need to be convinced that a specific use of a copyrighted work is permitted either through fair use or by a license.
> Playing a movie for a few friends who visit is fine, but start to demand tickets and suddenly it will look like a cinema, which is not fine.
Interesting analogy. "Ripping" something off and only using it for your personal project sounds like "playing a movie for a few friends". Doing so for the benefit of a corporation that then has thousands of daily visitors sounds like the "movie cinema" example. Though, in both cases it was an individual googling and finding how to implement a specific function.
"fair use" in copyright is pretty specific in that it refers to things like "you can play portions of a clip in order to comment on it." Or as another example, you can use clips/portions for the purposes of a review commentary.
"Form and function" is perhaps a very important crux here. Some things you can only do a certain way. For example, there is really only one way to implement quick sort (otherwise it is not quick sort at all!).
Personally I feel the copyright line is higher than a function; the copyright is on the collection of functions that together create a specific piece of software. The individual functions IMHO are as copyrightable as a cog on a bike cassette, or the chain on a motorcycle.
I think there are quite few things in programming that can only be implemented one way. I see it as similar to music in that almost every song has notes going up or down the scale. Obviously there can't be that many variations, but then the important distinction is often in the details. Applying copyright to a single function is like applying copyright to a single riff. Sometimes the legal system will accept it, but it should be the exception rather than the norm.
Fair use seems to have changed in scope. Historically it seems to have been mostly about things like "play a clip in order to comment on it", but now we have things like Google making a copy of all books ever written in order for people to search through them. Similar arguments have been made over copying news articles from news sites in order to put a portion of them in search results. A Stack Overflow-like search engine that trawled proprietary code bases would likely be sued, but in theory they could argue fair use just like Google.
I am pretty sure both cases would break copyright. But in the first case the copyright holders would never go after you and the second they would. But in both cases they could. The damages that a company could recover from you for watching a movie with a few friends is much lower than the damages they could recover if you made money selling tickets. Not to mention the negative PR a company would get for going after someone buying a DVD and watching it with friends.
IMO it’s the same thing because I fundamentally see LLMs in the same role as calculators that help reduce cognitive load by offloading repetitive work.
Practically with an LLM the programmer can focus on the creative part (handler function, react component, etc) while the LLM generates the necessary boilerplate for the ever changing frameworks and infra configurations. The programmer (and QA) would still review and test everything but would save time writing boilerplate and ship features faster.
It literally means reproduced in some capacity. Just because it's called "training" doesn't mean it has any reasonable analogy to how humans learn or how expert humans train in a skill.
GPT-style models literally aim to reproduce the input character by character (token by token).
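A toy illustration of that objective, with the big caveat that this is a simple count-based model and nothing like the neural networks behind GPT or Copilot. The `train` and `generate` helpers and the snippet are invented for the example. It only shows that "predict the next token given the previous ones" can amount to regurgitating the training text.

```python
from collections import Counter, defaultdict


def train(tokens, context_size=2):
    """Count which token follows each run of `context_size` tokens."""
    counts = defaultdict(Counter)
    for i in range(context_size, len(tokens)):
        context = tuple(tokens[i - context_size:i])
        counts[context][tokens[i]] += 1
    return counts


def generate(counts, prompt, max_tokens=50):
    """Greedily append the most frequent next token until none is known.

    Assumes the prompt is exactly as long as the context size used in training.
    """
    out = list(prompt)
    context_size = len(prompt)
    for _ in range(max_tokens):
        followers = counts.get(tuple(out[-context_size:]))
        if not followers:
            break
        out.append(followers.most_common(1)[0][0])
    return out


snippet = "if from_index > to_index : raise ValueError ( from_index )".split()
model = train(snippet)
regenerated = generate(model, snippet[:2])
print(" ".join(regenerated))
```

Given only the first two tokens, the model walks straight back through the one snippet it was trained on. A large model trained on a huge corpus behaves very differently most of the time, but the objective it optimizes has the same shape.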
Now, if he had written a specification of what the function should do, then passed it to someone else who had never seen the function and who worked only from the spec, then he'd be OK.
It's not nearly that simple. No real copyright case is going to hinge on what a single range check function looks like.
This is human law, it's not a programming situation where you can just apply some simple rule and get a deterministic answer. Context plays a huge part, among other things.
I should have said that no successful copyright case is going to hinge on that.
Oracle's position on that was legally incorrect, for the reason I was alluding to: the relevant standard requires that illegal copying involve the core of the creative expression of the original work, which a generic range check function clearly doesn't do.
As the copyright holder of "throw new", the Junior dev infringed my copyright! Let alone them infringing copyright of the company they crafted that code for.
On a more serious note, there is a question whether algorithms and code blocks can be copyrighted, or if it is the _software_ that is copyrighted. Let's say I use websockets and you crib my usage of websockets for your own application. My opinion is that unless you rebuild the same thing I did, then "cribbing" is the long held art of "let me google how to do that". The artistic creation is the end software product, not really some measly embedded function that is boiler plate (form and function) for anything to work.
The 'form and function' clause of copyright almost certainly makes a range check function not a copyright infringement.
Easy money idea: when you know an employee will be leaving the company, have them spend their last weeks writing basic, foundational functions in multiple languages!
Also, re: maps, fake streets and cul-de-sacs that don't exist.
I set a "trap" myself years ago in code, in what was at the time a novel solution for uploading photos from iOS non-interactively after the fact. It was to support disconnected field workers taking photos on iPhones/iPads, with the payloads uploaded at a later date.
Chunked form data constructed in userland JS was the solution. Chunk separator was 17 dashes in a row (completely arbitrary), company name in 1337 speak, plus 17 more dashes.
Found a competitor that had copied the code, changing only the 1337 speak part. 17 dashes remained on each side. Helped me realize that they had unminified and indeed ripped off our R&D work.
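Roughly what that looks like, sketched in Python rather than the original JS. The marker `4cm3`, the field names and the competitor string are all made up; only the "17 dashes, marker, 17 dashes" shape comes from the story above.

```python
import re

DASHES = "-" * 17


def make_boundary(marker):
    """Build the separator: 17 dashes, a marker, 17 more dashes."""
    return f"{DASHES}{marker}{DASHES}"


def build_multipart_body(boundary, field_name, filename, payload):
    """Assemble a single-file multipart/form-data body by hand."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("ascii")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return head + payload + tail


# Fingerprint to look for in someone else's source: exactly 17 dashes on each
# side of an alphanumeric marker, whatever that marker has been changed to.
TRAP_PATTERN = re.compile(r"(?<!-)-{17}([A-Za-z0-9]+)-{17}(?!-)")

boundary = make_boundary("4cm3")  # hypothetical 1337-speak company name
body = build_multipart_body(boundary, "photo", "site.jpg", b"\xff\xd8fake-jpeg")

# Hypothetical line from a competitor's unminified code.
competitor_source = 'var sep = "' + make_boundary("0th3r") + '";'
match = TRAP_PATTERN.search(competitor_source)
print(match.group(1) if match else None)
```

Nothing about multipart uploads requires 17 dashes, which is exactly why finding them on both sides of a swapped-out marker is so telling.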
Yeah. The feature set offered by the competitor was similar to ours, and we went through the wringer building that solution, so I unminified their code and sure enough... it wasn't exactly theirs.
Oh yeah and they ripped off our website too. That was the first clue haha.
If you look at the Legal Action section of your link you'll see the line "However, the case was dismissed" quite a few times. That's because data isn't copyrightable.
Edit: As sroussey points out s/isn't copyrightable/isn't copyrightable in the USA
The other problem with these "copyright traps" is that they do nothing to prove someone copied the legitimate parts of the data.
Suppose you recreate the entire dataset from scratch. Then someone notices (e.g. using an automated comparison) that the "trap" is in the other dataset but missing from yours, and submits it to you to add.
This is arguably too small an addition to be copyrighted on its own, but regardless of that, it would then be all you have to remove to get back to a clean version. And since it's erroneous data, you would want to remove it anyway.
Which country's laws apply and what remedies you can get if they were violated is far more complicated than geolocation of data.
But very broadly speaking you would need to sue in an EU court to enforce EU law. And you could sue a US company in a specific EU country's court if the company had more than some minimum level of connection to that country. The country the data is hosted in isn't key, though it can be evidence of connection to that country.
Where the data is stored does not matter much. Laws deal with people and companies, so it matters where you live or where your company operates. So if you live in the US you don't have to worry about EU laws unless you do business in the EU.
It's occasionally explained—but still not widely understood, I'd wager—that this is the reason why so much GNU code is hard to follow.
In the US legal system the merger doctrine is a concept whereby a given expression cannot be granted protection if it's not sufficiently creative, and there are only so many ways to express something when it is stripped down to its fundamentals. In response to this, RMS and Moglen encouraged contributors from very early on to try to express the inner workings of GNU utilities in creative and non-obvious ways, out of caution against the possibility that the copyleft obligations of the GPL with respect to a given package could be nullified by a court finding that it did not pass the threshold for creativity.
GNU code is partially hard to follow because of RMS's paranoia, but that mostly manifests itself in the code being weirdly structured. The far bigger reason is that GNU code tends to run with really strange optimizations and project decisions, since they want their tools to be able to run on ancient mainframes that practically nobody uses anymore, so everything is overoptimized for that.
I first saw this in action on StackOverflow when, during an interview, a candidate copy-pasted a solution verbatim including the attribution. Didn't even give it a second thought, like they didn't even read the code or what it was doing.
It wasn't the right solution to the problem in question, for what it's worth.
https://en.wikipedia.org/wiki/Copyright_trap