As a Substack author who puts "permission is not granted to use any portion of this to train an AI" at the bottom of most of my posts: it's bullshit that you have to do this sort of thing at all, and that it will almost certainly not work anyway.
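For anyone unfamiliar, "this sort of thing" mostly boils down to a robots.txt block, roughly like the sketch below (assuming GPTBot and Google-Extended are the crawler tokens you care about; it only does anything if the crawlers choose to honor it):

    User-agent: GPTBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /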
This must be illegal, but how are all the little bloggers going to oppose it?
Why do you think it would be illegal? You can state "permission is not granted to X" on anything you want, but that doesn't mean the law is on your side. Regular rules of copyright still apply.
P.S. Permission is not granted to downvote my comment!
Vitriol aside, you need to chill for a bit and touch grass.
"Training" doesn't really have a well-defined meaning, I could use your website to train something as simple as a histogram of word counts for an AI for example. Nothing about that constitutes copyright infringement under even the loosest definition of their legal concept.
Additionally, the weights produced by training and the AI's output are two completely different matters from a legal perspective.
I agree it is wrong and should be illegal. That being said, I do find the argument that it's no different from a human learning from, and occasionally reconstructing, copyrighted things compelling.
Most normal humans do not spend their time profitably selling their "occasionally reconstructing copyrighted things" at a rate of millions of users per second, which is a pretty important difference in practice.
That said, the law was not written with super-humans in mind: entities that can reproduce, slightly transformed, just about everything they read (all the world's knowledge). A clarifying law should be created.
> Is it legal to transcribe a book from memory for money?
If it's an accurate transcription and you don't have permission, then it's not legal. It doesn't matter if it's for money or not (or if it's from memory or not).
> Does it matter how faithful your transcription is?
Yes, it matters. Copyright covers the specific expression of an idea, not the idea itself.
What's right or wrong, and what's legal or illegal, are two different things. There are plenty of right things that are illegal and wrong things that are legal.
"Illegal" is too strong. But if you specifically disallow the use of your website contents from being used to train AI, then anyone doing so is violating the terms of service.
Which doesn't really mean anything.
At this point, the only defense I can think of is to not make the content publicly available. Which is what I've done.
It's a bit different because the AI is reading it with the intent of reproducing (certain aspects of) it for other people to later consume without visiting the original site. Fair use doctrine has long allowed small pieces of copyrighted material to be reproduced, but the line is very blurry and generally has to be litigated if there's any ambiguity whatsoever. I'd bet many of the models we're using today will be pulled from serving the public over copyright lawsuits in the coming years.
I don't think training on copyrighted stuff will ever be banned, but we need to figure out how much models should be allowed to generate from it. Eventually new models will just pop up with more carefully curated data anyway.
> I don't think training on copyrighted stuff will ever be banned, but we need to figure out how much models should be allowed to generate from it.
From a US copyright law point of view, this is most likely correct. Copyright law doesn't prevent you from ingesting copyrighted works; it prevents you from distributing them.
There is also a great deal of existing case law about how different a work has to be before it no longer infringes on another work. There are rules of thumb judges go by when trying to determine whether infringement occurred, including the amount of difference in expression, the quantity copied, whether or not the copying is incidental, etc.
And that's not even getting into the question of fair use -- which is a whole other kettle of fish.
I suspect that the courts will deal with these issues the way that they've always dealt with these issues: on a case-by-case basis.
Sure, but that would be illegal too. I'm saying it doesn't matter who reads your website, but everyone knows exactly what GPT and Bard are going to do with the information they're "learning" from it, which is why people are trying to block them from reading it in the first place.
Many LLMs will happily recite large segments of copyrighted material word-for-word, despite the fact that it can be difficult to tell what's happening "under the hood".
> It's a bit different because the AI is reading it with the intent of reproducing (certain aspects of) it for other people to later consume without visiting the original site
It could be illegal if the AI reproduces vast portions of it. If you could get the LLM, over the course of a few prompts, to generate a significant portion of the content (as copyright law defines it), then yes.
As long as the AI isn't reproducing it, I'm not sure it would count.
Scale and position matter. Google is the conduit that connects most people to most websites, so in the EU they are considered a "gatekeeper" and need to be careful about conflicts of interest with the people and websites using their "gate". I hope American competition law catches up to the point we can recognize that market makers simply should not be participating in the markets they make (and Google search is a market maker; it's connecting "buyers" [viewers or advertisers, depending on your perspective] to "sellers" [websites or viewers, respectively]), but I digress.
The point is that Google has a certain market position that makes it very different when they "recite the vague plot of a novel or a fact they learned". The point of competition law is to "distort" free market capitalism for the betterment of society. This is one of those cases where practical considerations trump information idealism. The quality of information on the internet will go down if we stop rewarding original publishers.
Yep, there’s a big difference in practice. If an AI could attribute sources and provide royalties, it might not be so different, but that’s never going to happen. A big reason Bard exists is that Google is trying to ensure it stays profitable and relevant. They don’t care where the knowledge really comes from.
Does that include book readers for the blind? They typically have some sort of optical character recognition and benefit a user, just like an ML training dataset benefits users.
My point being: it's exceptionally hard to write laws that forbid precisely what you don't want and allow precisely what you do want, without quickly getting into details that call the entire law's assumptions into question. Here that assumption would be "because an ML training pipeline is not a person, it has no right to scan the web".
The main difference here is that these AI bots are operating with an entirely different agenda. The ethics are still unsettled, and the jury is out as to whether they will benefit users the way their makers promise they will.
They also operate on a whole different scale, and instead of supplementing the web’s content they devalue it to a degree.
The "ai bots" aren't operating with an agenda- at least as far as we can tell now, training algorithms and their scrapers do not have agency.
Basically you're assuming the agenda of the operator, saying "that's bad an shouldn't be allowed". But I see the web- except for things specifically labelled with standard copyright disclaimers- as effectively a large corpus of publicly available data, "in the market square for all to see".