That's why copyright holders for reference works have been using copyright traps...

tedivm · on April 21, 2023

We don't need the copyright traps here, though, as GitHub openly admits to using the public code for training. They just don't care that they're essentially license laundering code, since they can make money doing it.

That said we used copyright traps at Malwarebytes, which is how we found out that iobit was stealing our database.
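A minimal sketch of how a trap in a signature database could work. This is not how Malwarebytes actually did it; the entry names, the secret, and the helpers `make_trap_entries` and `find_copied_traps` are hypothetical and only illustrate the idea of seeding fake entries and later looking for them elsewhere.

```python
import hashlib


def make_trap_entries(secret: str, count: int) -> set[str]:
    """Derive fake entry names from a secret so they can be regenerated later."""
    return {
        "Trap.Fake." + hashlib.sha256(f"{secret}:{i}".encode()).hexdigest()[:12]
        for i in range(count)
    }


def find_copied_traps(traps: set[str], other_database: set[str]) -> set[str]:
    """Return the trap entries that also appear in someone else's database."""
    return traps & other_database


# Hypothetical data: our real entries plus three fake ones that match nothing real.
traps = make_trap_entries("hypothetical-secret", 3)
our_database = {"Trojan.Real.A", "Adware.Real.B"} | traps

# A database copied wholesale from ours carries the traps along with it.
suspect_database = set(our_database) | {"Worm.Other.C"}
# A database built independently has no reason to contain them.
independent_database = {"Trojan.Real.A", "Worm.Other.C"}

copied = find_copied_traps(traps, suspect_database)
clean = find_copied_traps(traps, independent_database)
```

Because the fake entries correspond to nothing in the real world, their presence in another database is hard to explain by independent work, while overlap in genuine entries proves nothing on its own.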

peytoncasper · on April 21, 2023

What happens if GitHub didn't use GPL licensed code, but still generated code that was identical to GPL licensed code?

tedivm · on April 21, 2023

We know that isn't the case because we can see code being reproduced even with comments, and GitHub has been open about the fact that they used everything they had in training.

That said, let's say there's a new model that explicitly excluded closed source and copyleft licenses. Well, MIT, MPL, Apache, BSD: they all say you can't strip their license notices off.

Okay, so to get to the spirit of your question, let's say GitHub managed to program a model that worked using only their own code or code that was explicitly put in the public domain. If GitHub managed to reproduce code that wasn't in the training set, then it can't be accused of copying it. At that point the argument could be made that it independently created it.

At the same time, algorithms can't be copyrighted, but implementations of an algorithm can be, so if GitHub was basically just spitting out an algorithm that just happened to be implemented similarly to how some other code it wasn't trained on implemented it, then I would say there was no copyright violation.

bryanrasmussen · on April 21, 2023

>We know that isn't the case because we can see code being reproduced even with comments

If the comment is something like

//check fromIndex is greater than toIndex

then that is not any more individualistic or different than the actual function. Sadly, many people comment like this. On the other hand, if it reproduced a comment with typos or something more complicated like

/* this hack is because Firefox's implementation of SVG z-indexing does not match how Chrome or Safari does it - please read this article ...url...*/

then yeah, you would have something.

marginalia_nu · on April 21, 2023

Well yeah, we've already seen exactly this:

https://twitter.com/StefanKarpinski/status/14109710611816816...

theRealMe · on April 21, 2023

In almost any other scenario this would be evidence. But Fast Inverse Square Root isn’t some tightly held secret. That exact code, with those specific comments included, is found in the Wikipedia page for that algo:

https://en.m.wikipedia.org/wiki/Fast_inverse_square_root

nextaccountic · on April 22, 2023

The usage in Wikipedia is probably fair use, but it's still copyrighted, even without being a secret

marginalia_nu · on April 21, 2023

Yeah, it's still GPL though.

theRealMe · on April 21, 2023

True.

bryanrasmussen · on April 21, 2023

OK, that tracks as more than just lazy comment lookalikes.

visarga · on April 21, 2023

How about rewording a code snippet so it doesn't exactly replicate the source, but is functionally identical? Could be applied before training. Can we say the LLM only learned the ideas not the expression? Copyright should protect expression and not restrict reusing ideas.

janoc · on April 21, 2023

Except that's not how an LLM works. An LLM has no idea about "ideas", only probabilities of how certain words string together.

So you literally can't make it produce functionally identical but not verbatim identical code. It doesn't understand that the two are equivalent.

Also, such "functionally identical but not violating copyright" transformation is not possible to do, both given the complexity of the problem and the sheer volume of the data.

And training it on some simplistically obfuscated code wouldn't help - all it would learn would be production of obfuscated code. Not useful for the intended use.

chii · on April 22, 2023

> It doesn't understand that the two are equivalent.

it doesn't need to understand the way a human might do the understanding.

The pattern that the LLM managed to extract could include the structure, rather than the pure text. And in reproducing the structure, the LLM can replace the variable names but keep the structure intact.

I am not sure if copilot is able to do this, but chatGPT was somewhat able to (if imperfectly at the moment).

belorn · on April 22, 2023

Copying a piece of code and changing the variable names is still a copy. It is similar to how copying a piece of music and changing the pitch/volume/any other attribute would still be a copy of the original music.

What the LLM needs to do is convince a judge/jury that it has not created a copy, and that it operates differently from a transformation.
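To illustrate why renaming alone changes so little, here is a rough sketch that replaces every identifier with a positional placeholder and then compares token streams. It uses Python's standard `tokenize` module; the function name `normalized_tokens` and the sample snippets are made up for illustration, and this is in no way a legal test for copying.

```python
import io
import keyword
import tokenize


def normalized_tokens(source: str) -> list[str]:
    """Tokenize Python source, mapping each distinct identifier to ID0, ID1, ..."""
    names: dict[str, str] = {}
    out: list[str] = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Comments, blank-line tokens and the end marker carry no structure.
        if tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER):
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            # The same identifier always maps to the same placeholder.
            out.append(names.setdefault(tok.string, f"ID{len(names)}"))
        elif tok.type in (tokenize.INDENT, tokenize.DEDENT, tokenize.NEWLINE):
            out.append(tokenize.tok_name[tok.type])
        else:
            out.append(tok.string)
    return out


original = "def clamp(value, low, high):\n    return max(low, min(value, high))\n"
renamed = "def bound(x, a, b):\n    return max(a, min(x, b))\n"
restructured = "def clamp(value, low, high):\n    return min(high, max(value, low))\n"

same_after_renaming = normalized_tokens(original) == normalized_tokens(renamed)
same_after_restructuring = normalized_tokens(original) == normalized_tokens(restructured)
```

Under this normalization the renamed version is indistinguishable from the original, while the version that actually rearranges the logic is not.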

nextaccountic · on April 22, 2023

> So you literally can't make it produce functionally identical but not verbatim identical code. It doesn't understand that the two are equivalent.

But it does - pieces of code that are similar but not identical are closer together in the embedding space

NoZebra120vClip · on April 22, 2023

> Copyright should protect expression and not restrict reusing ideas.

That's what patents are for.

layer8 · on April 21, 2023

They’d have to prove to the court that the former is true despite the latter happening, which I imagine would be difficult to do in practice.

concordDance · on April 21, 2023

Using it for training doesn't mean it's reproduced.

Consider a junior dev who writes a range check function while working for a company (so they own the copyright) then goes to a different company and writes the same range function because that's just how he writes code.

Has copyright been infringed?
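For concreteness, a range check of the kind being discussed might look like the sketch below. It is a Python analogue written for illustration, not the code from any company or court case; the name `range_check` and its parameters are assumptions. It mostly shows how little room for variation such a function leaves.

```python
def range_check(length: int, from_index: int, to_index: int) -> None:
    """Validate that [from_index, to_index) is a valid slice of `length` items."""
    if from_index > to_index:
        raise ValueError(f"from_index({from_index}) > to_index({to_index})")
    if from_index < 0:
        raise IndexError(from_index)
    if to_index > length:
        raise IndexError(to_index)


# A valid range passes silently; invalid ones raise.
range_check(10, 2, 5)
```

Two developers asked to write this independently would plausibly arrive at nearly the same three checks in nearly the same order.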

Swenrekcah · on April 21, 2023

That programmer definitely reproduced the code, so if copilot does the same that’s definitely reproduction.

Then the legalities can be argued, but an individual is in any case not remotely comparable to a service like copilot.

seadan83 · on April 21, 2023

> Then the legalities can be argued, but an individual is in any case not remotely comparable to a service like copilot.

Why is this? Copilot in some ways is an automated way to search code & stack overflow. There is a very annoying website that does nothing more than show relevant code samples of various google search terms.

If the manual version of something is okay (eg: googling for code, finding it, fitting for a new and specific purpose that is similar), why would an automated version of that be any different?

Swenrekcah · on April 21, 2023

For a similar reason that camera surveillance can be forbidden in places where people are still allowed to keep their eyes open.

Or sending millions of messages in an automated way can be illegal but millions of people sending a message is not.

seadan83 · on April 21, 2023

I'm confused by the forbidden surveillance example. Generally, surveillance cameras are legal in any place where there is no expectation of privacy. The expectation of privacy is largely limited to one's home; outside of that you can be videotaped all day long by anyone. I'm not sure how this is analogous.

The million messages example is interesting. Though, what examples are there? In what cases is something legal to do it once, but there is some threshold where you cannot do it many times?

The "sending millions of messages" is only perhaps illegal because it breaks terms of service. Or, the one message is perhaps also illegal but nobody cares to pursue litigation for one instance of an infraction. The point remains though, if an individual does something once that is legal - it makes that activity legal, period and full stop. No?

Swenrekcah · on April 21, 2023

I was thinking about things like spam and also social media.

Note that my main objection is to equating a person doing something with an automated process. Sometimes it may be legal or other times illegal but it just clearly isn’t the same.

For the last point, I think the answer to that is a definite no in most jurisdictions. Laws and judicial conventions often allow differing circumstances to affect the legality of things.

seadan83 · on April 22, 2023

> For the last point, I think the answer to that is a definite no in most jurisdictions. Laws and judicial conventions often allow differing circumstances to affect the legality of things.

Curious, any concrete examples? I can't really think of any where one instance is okay but many is not. I can think of examples where one instance is ignored and many instances are harder to ignore (and so is prosecuted), but overall - I can't really think of anything that is okay to do once but not many times.

Swenrekcah · on April 22, 2023

Don’t forget the automation part, that’s the key issue.

Unsolicited robocalls are illegal in many places where human callers may not be.

belorn · on April 21, 2023

Playing a movie for a few friends who visit is fine, but start to demand tickets and suddenly it will look like a cinema, which is not fine.

The reason is always the same. Courts and judges will look at the situation and make a decision about what seems fair and what does not. They are the ones who need to be convinced that a specific use of a copyrighted work is permitted either through fair use or by a license.

seadan83 · on April 21, 2023

> Playing a movie for a few friends who visits is fine, but start to demand tickets and suddenly it will look like a cinema which is not fine.

Interesting analogy. "Ripping" something off and only using it for your personal project sounds like the "playing a movie for a few friends" case. Doing so for the benefit of a corporation that then has thousands of daily visitors sounds like the "movie cinema" example. Though, in both cases it was an individual googling and finding how to implement a specific function.

"fair use" in copyright is pretty specific in that it refers to things like "you can play portions of a clip in order to comment on it." Or as another example, you can use clips/portions for the purposes of a review commentary.

"Form and function" is perhaps a very important crux here. Some things you can only do a certain way. For example, there is only really one way to implement quicksort (otherwise it is not quicksort at all!).

Personally I feel the copyright line is higher than a function: the copyright is on the collection of functions that together create a specific piece of software. The individual functions IMHO are about as copyrightable as a cog on a bike cassette, or the chain on a motorcycle.

belorn · on April 22, 2023

I think there are quite few things in programming that can only be implemented one way. I see it as similar to music in that almost every song has notes going up or down the scale. Obviously there can't be that many variations, but then the important distinctions are often in the details. Applying copyright to a single function is like applying copyright to a single riff. Sometimes the legal system will accept it, but it should be the exception rather than the norm.

Fair use seems to have changed in scope. Historically it seems to have been mostly about things like "play a clip in order to comment on it", but now we have things like Google making a copy of all books ever written in order for people to search through them. Similar arguments have been made over copying news articles from news sites in order to put a portion of them in search results. A Stack Overflow-like search engine that trawled proprietary code bases would likely be sued, but in theory they could argue fair use just like Google.

sinrtb · on April 25, 2023

I am pretty sure both cases would break copyright. But in the first case the copyright holders would never go after you and the second they would. But in both cases they could. The damages that a company could recover from you for watching a movie with a few friends is much lower than the damages they could recover if you made money selling tickets. Not to mention the negative PR a company would get for going after someone buying a DVD and watching it with friends.

okamiueru · on April 21, 2023

Because of the license the code is under?

jitix · on April 21, 2023

IMO it’s the same thing because I fundamentally see LLMs in the same role as calculators that helps reduce cognitive load by offloading repetitive work.

Practically with an LLM the programmer can focus on the creative part (handler function, react component, etc) while the LLM generates the necessary boilerplate for the ever changing frameworks and infra configurations. The programmer (and QA) would still review and test everything but would save time writing boilerplate and ship features faster.

ml-anon · on April 24, 2023

It literally means reproduced in some capacity. Just because it's called "training" doesn't mean it has any reasonable analogy to how humans learn or how expert humans train in a skill.

GPT-style models literally aim to reproduce the input character by character (token by token).

cyanydeez · on April 21, 2023

They have clean room implementation for just this problem.

The _only_ escape clause is some random function that says how arbitrary a code block is. Or nontrivial.

A person or AI can absolutely be violating copyright via your example.

lallysingh · on April 21, 2023

Yup. They just copied manually.

blibble · on April 21, 2023

> Has copyright been infringed?

yes

now if he had written a specification as to what the function should be, then passed it to someone else that had never seen the function and worked from the spec then he'd be ok

see: IBM BIOS

antonvs · on April 21, 2023

> yes

It's not nearly that simple. No real copyright case is going to hinge on what a single range check function looks like.

This is human law, it's not a programming situation where you can just apply some simple rule and get a deterministic answer. Context plays a huge part, among other things.

blibble · on April 21, 2023

> No real copyright case is going to hinge on what a single range check function looks like.

you realise this exact extremely famous function was the focus of a billion dollar supreme court copyright battle that went on for years?

https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_...

(the entire basis of GP's joke)

antonvs · on April 22, 2023

I should have said that no successful copyright case is going to hinge on that.

Oracle's position on that was legally incorrect, for the reason I was alluding to: the relevant standard requires that illegal copying involve the core of the creative expression of the original work, which a generic range check function clearly doesn't do.

seadan83 · on April 21, 2023

As the copyright holder of "throw new", the Junior dev infringed my copyright! Let alone them infringing copyright of the company they crafted that code for.

On a more serious note, there is a question whether algorithms and code blocks can be copyrighted, or if it is the _software_ that is copyrighted. Let's say I use websockets and you crib my usage of websockets for your own application. My opinion is that unless you rebuild the same thing I did, then "cribbing" is the long held art of "let me google how to do that". The artistic creation is the end software product, not really some measly embedded function that is boiler plate (form and function) for anything to work.

The 'form and function' clause of copyright almost certainly makes a range check function not a copyright infringement.

veec_cas_tant · on April 21, 2023

Easy money idea: when you know an employee will be leaving the company, have them spend their last weeks writing basic, foundational functions in multiple languages!

jschrf · on April 21, 2023

Also, re: maps, fake streets and cul-de-sacs that don't exist.

I've set a "trap" myself years ago in code in a novel solution at the time for uploading photos from iOS non-interactively after the fact. It was to support disconnected field workers taking photos from iPhones/iPads, with the payloads uploaded at a later date.

Chunked form data constructed in userland JS was the solution. Chunk separator was 17 dashes in a row (completely arbitrary), company name in 1337 speak, plus 17 more dashes.

Found a competitor that had copied the code, changing only the 1337 speak part. 17 dashes remained on each side. Helped me realize that they had unminified and indeed ripped off our R&D work.

Wonder if Copilot could be gamed the same way.
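The separator trap described above can be sketched roughly as follows. The original was userland JavaScript; this is a Python illustration only, and `make_boundary`, `find_fingerprints`, and the marker strings are hypothetical. The idea is that the arbitrary "17 dashes on each side" shape survives even when the marker in the middle is swapped out.

```python
import re


def make_boundary(marker: str, dashes: int = 17) -> str:
    """Build a chunk separator: N dashes, a marker, then N dashes again."""
    return "-" * dashes + marker + "-" * dashes


# Exactly 17 dashes, a dash-free marker, exactly 17 dashes.
# The lookarounds reject runs of 18 or more dashes on either side.
FINGERPRINT = re.compile(r"(?<!-)-{17}([^-\s]+)-{17}(?!-)")


def find_fingerprints(source: str) -> list[str]:
    """Return the markers of every separator in `source` with the 17/17 shape."""
    return FINGERPRINT.findall(source)


# Hypothetical unminified competitor code: the marker changed, the shape did not.
suspect_js = 'var sep = "' + make_boundary("0th3rc0") + '";'
# A separator with a different arbitrary shape does not match.
unrelated_js = 'var sep = "' + make_boundary("x", dashes=10) + '";'

hits = find_fingerprints(suspect_js)
misses = find_fingerprints(unrelated_js)
```

A match is only a hint, not proof; it is the kind of clue that justifies a closer look at the rest of the code.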

peteradio · on April 21, 2023

How did you manage to find that your competitor copied your code?

aetch · on April 21, 2023

Javascript

jschrf · on April 22, 2023

Yeah. The feature set offered by the competitor was similar to ours, and we went through the wringer building that solution, so I unminified their code and sure enough... it wasn't exactly theirs.

Oh yeah and they ripped off our website too. That was the first clue haha.

iudqnolq · on April 21, 2023

If you look at the Legal Action section of your link you'll see the line "However, the case was dismissed" quite a few times. That's because data isn't copyrightable.

Edit: As sroussey points out s/isn't copyrightable/isn't copyrightable in the USA

AnthonyMouse · on April 21, 2023

The other problem with these "copyright traps" is that they do nothing to prove someone copied the legitimate parts of the data.

Suppose you recreate the entire dataset from scratch. Then someone notices (e.g. using an automated comparison) that the "trap" is in the other dataset but missing from yours, and submits it to you to add.

This is arguably too small an addition to be copyrighted on its own, but regardless of that, it would then be all you have to remove to get back to a clean version. And since it's erroneous data, you would want to remove it anyway.

plasticchris · on April 21, 2023

My favorite of these was a town founded to match the map. Pretty sure I heard an npr story on it.

sroussey · on April 21, 2023

Not in the USA, but it is in the EU and elsewhere.

whiplash451 · on April 21, 2023

How do you define the geolocation of data?

If my website is hosted in EU but a company scans it from the internet in the US, how could they possibly know it is hosted in EU?

iudqnolq · on April 21, 2023

Which country's laws apply and what remedies you can get if they were violated is far more complicated than geolocation of data.

But very broadly speaking you would need to sue in an EU court to enforce EU law. And you could sue a US company in a specific EU country's court if the company had more than some minimum level of connection to that country. The country the data is hosted in isn't key, though it can be evidence of connection to that country.

z3t4 · on April 21, 2023

Where the data is stored does not matter much. Laws deal with people and companies, so it matters where you live or where your company operates. So if you live in the US you don't have to worry about EU laws unless you do business in the EU.

concordDance · on April 21, 2023

Hence why you should live on that unclaimed bit of land in Africa. :D

jprete · on April 21, 2023

The relevant line is “information alone without a minimum of original creativity cannot be protected by copyright”.

There is definitely creativity in writing code; it’s not a completely deterministic translation of even a complete specification.

iudqnolq · on April 21, 2023

Oh absolutely. I was speaking only about the comment I replied to.

cxr · on April 21, 2023

It's occasionally explained—but still not widely understood, I'd wager—that this is the reason why so much GNU code is hard to follow.

In the US legal system the merger doctrine is a concept whereby a given expression cannot be granted protection if it's not sufficiently creative, and there are only so many ways to express something when stripped down to its fundamentals. In response to this, RMS and Moglen encouraged contributors from very early on to try to express the inner workings of GNU utilities in creative and non-obvious ways out of caution against the possibility that the copyleft obligations of the GPL wrt a given package could be nullified by a finding in court that it did not pass the threshold for creativity.

noirscape · on April 21, 2023

GNU code is partially hard to follow because of RMS paranoia, but that mostly manifests itself in the code being weirdly structured. The far bigger reason is that GNU code tends to run with really strange optimizations and project decisions since they want their tools to be able to run on ancient mainframes that practically nobody uses anymore, so everything is overoptimized for that.

ljm · on April 21, 2023

I first saw this in action on StackOverflow when, during an interview, a candidate copy-pasted a solution verbatim including the attribution. Didn't even give it a second thought, like they didn't even read the code or what it was doing.

It wasn't the right solution to the problem in question, for what it's worth.

Just manually did what GPT does now.

netfortius · on April 21, 2023

I think I mentioned this before, in another context: the solution is known as "honeytoken", and it is equally applicable in computer security.