A long form response to the concerns and comments and general principles many people had in the post about authors suing companies creating LLMs.
Most of this stems from a misunderstanding of how LLMs work.
The original work is not stored anywhere. No copy of it has been made. Just tons and tons of statistics used to inform models.
Since there is no copy there is no violation of copyright. Again, no copy of the book is getting made. The content of the books is not stored “verbatim”. The book is not copied. I don’t know how many other ways to put this.
Summarizing a book also does not require one to have “read” it, contrary to the complaint. I never read “The Da Vinci Code”, but I can give a summary of it.
With assertions in the complaint being clearly false it’s hard to take it seriously and it’ll get chucked the first time a judge has to deal with it.
Maybe Silverman would have a point if it were standard practice to pay royalties to people you get inspiration from. But she doesn’t pay everyone who wrote anything she read, said anything she heard, or other comedians who influenced her. So why should someone influenced by her pay?
If I read 100,000 books how do you determine “which one” I got inspiration from? Same situation here.
Copyright doesn’t apply just to stuff copied verbatim though, it applies to a lot more. It really doesn’t matter if it is or isn’t stored verbatim. Translations and derivative works are not exact copies and still fall under copyright. Copyright even applies to broad things such as “a concept of a character” and this can result in some pretty strange arguments some copyright holders might use, such as “Sherlock Holmes that doesn’t smile is public domain, but Sherlock Holmes who shows emotion is copyright infringement” as described here.
It doesn’t matter if an exact copy of the book was made. It matters if the core information that book carried was taken as a whole and used elsewhere. And even though the data was transformed into statistical information, the information is still there in that model. The model itself is basically just an “unauthorized translation” of hundreds of thousands of works into a very esoteric format.
The whole argument of “inspiration” is also misleading. Inspiration is purely a human trait. We’re not talking about humans being inspired. We’re talking about humans using copyrighted material to create a model, and about computers using that model to create content. Unless you’d argue that humans should be considered the same thing as machines in the eyes of the law, this argument simply doesn’t work.
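For what it’s worth, here is a toy sketch of the “statistics versus copy” question argued above. It is a word-pair (bigram) counter in Python, nothing like a real LLM, and every name in it (train_bigram_counts, generate, the sample sentence) is made up for illustration. The table only holds counts of which word follows which, yet on a tiny input with no repeated words, greedy generation walks those counts straight back to the original sentence.

```python
from collections import Counter, defaultdict


def train_bigram_counts(text):
    """Count how often each word follows another.

    Only the counts are returned; the input string itself is not stored.
    """
    words = text.split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def generate(counts, start, max_words=20):
    """Greedy decoding: always append the most frequent next word."""
    output = [start]
    while len(output) < max_words and output[-1] in counts:
        next_word, _ = counts[output[-1]].most_common(1)[0]
        output.append(next_word)
    return " ".join(output)


# Hypothetical one-sentence "training set" in which no word repeats.
sentence = "the quick brown fox jumps over a lazy dog"
model = train_bigram_counts(sentence)
regenerated = generate(model, "the")
print(regenerated)
```

On a large, mixed corpus the same kind of table blends many sources and would not be expected to reproduce any single one, so this only shows that “it’s just statistics” and “the information is still in there” can both be true at once. It says nothing about how a court should treat either.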
Look up RAM copy doctrine. It is pretty easy to argue they are making a copy.
Aptly put 👏
If the rumor is true that OpenAI is using libgen to obtain books, then this will be a very interesting fight.
Authors profiteering from arcane copyright laws vs. a sleazy company that hypes up an LLM as if it were HAL from 2001. Who is worse? Who should lose?! I’m on the edge of my seat already!
Authors profiteering from arcane copyright laws
I get this argument from the film, movie, television, videogame industry, and other more modern ones out there. But outside a handful of actual big name authors the average writer isn’t exactly raking it in.
Also thanks to being a relic of the past we do still have libraries which offer books for free to read with a subscription and not only is this common, but it’s a celebrated thing among most authors and the reading community.
I’ll bet ChatGPT could write an epic rap battle about it.
If I read her book, someone asked me to summarize it and I did - would she sue me for copyright infringement too? Do I need her permission to read her book?
It seems to me like a cheap attempt to advertise her book.
US Courts have already ruled in the past that human authorship is required for copyright. It’d be a logical conclusion as such that human authorship would also be required to justify a fair use defence. You providing a summary without any quotations would likely justify fair use - which is still copyright infringement, but a mere defence of said infringement. A machine or algorithm that cannot perform the act of creative authorship would thus not be exempted by the fair use defence.
US Courts have already ruled in the past that human authorship is required for copyright
Irrelevant to the issue at hand. Here, Silverman is the only one making a copyright claim. ChatGPT is not claiming a copyright on its output.
It’d be a logical conclusion as such that human authorship would also be required to justify a fair use defence.
I disagree. Nothing about “fair use” requires that the work be copyrighted on its own, or even copyrightable. It simply can’t be subject to the original copyright.
A summary is a “transformative derivation”. Even if that summary cannot be copyrighted on its own for some reason, it is not subject to the copyright of the original work.
To read it in the first place, before you summarize it, you need to obtain it legally by either buying it, or checking it out from the library (which has bought it).
That is not actually true.
If I create unauthorized copies of Silverman’s book, and hand one to you, I have violated her copyright; you have not. You are free to read that unauthorized copy. You are free to discuss what you have read.
Copyright law prohibits me from creating and distributing her book. It does not prohibit you from receiving an unauthorized copy. Hell, it doesn’t even prohibit you from soliciting an unauthorized copy.
Or you sit in the library and “read” it. Now how do you define where the library is? Many libraries loan out digital copies. You can sit in a book store (they exist!) and read a book without purchasing it too.
It’s going to be difficult to use the “they couldn’t possibly have had legit access to all these books” argument in court.
I think this has to do with intent. If I read a book to use it for the basis of a play, that would be illegal. If I read for enjoyment, that is legal. Since AI does not read for enjoyment, but only to use it for the basis of creating something else, that would be illegal.
Is my logic flawed?
This isn’t how it works at all. I can, and should, and do, read and consume all sorts of media with the intention of stealing from it for my own works. If you ask for writing advice, this is actually probably one of the first things you’ll hear: read how other people do it.
So “the intent of the reading” does not work as an argument, because if it did, humans could never generate any new media either.
This is the thing I kept shouting when diffusion models took off. People are effectively saying “make it illegal for neural nets to learn from anything creative or productive anywhere in any way”
Because despite the differences in architecture, I think it is parallel.
If the intent and purpose of the tool was to make copies of the work in a way we would consider theft if done by a human, I would understand.
The same way there isn’t any legal protection on neural nets learning from personal and abstract information to manipulate and predict or control the public, the intended function of the tool should make it illegal.
But people are too self-focused and ignorant to riot en masse about that one.
The dialogue should also be in creating a safety net as more and more people lose value in the face of new technology.
But fuck any of that, what if an a.i. learned from a painting I made ten years ago, like every other artist who may have learned from it? Unforgivable.
I don’t believe it’s reproducing my art, even if asked to do so, and I don’t think I’m entitled to anything.
Also copyright has been fucked for decades. It hasn’t served the people since long before the Mickey Mouse Protection Act.
Regardless of intent, let’s not pretend that the scale at which LLMs “process” information to generate new content is comparable to humans. That is obviously what was intended for copyright laws (so far).
We don’t need to pretend though. People with speed reading skills are faster than most humans as well and could read a lot more books.
It’s very probable that you have read at least one writer’s whole library, even if it’s as many stories as Terry Pratchett got published. That will always be possible for human-written books, as writing them takes longer than reading them.
Obviously the acquisition of those stories has to be made in a legal way and no actual passages should be stored in the model, but the amount of data processed should have no bearing on whether it can be used.
And as others here have written, making copyright law more strict puts big corps at an advantage, because they have big legal teams and the money to just pay the copyright fee while your regular user would not be able to.
It’s comparing a bird to a plane, but I still think the process constitutes “learning,” which may sound anthropomorphic to some, but I don’t think we have a more accurate synonym. I think the plane is flying even if the wings aren’t flapping and the plane doesn’t do anything else birds do. I think LLMs, while different, reflect the subconscious aspect of human speech, and reflect the concept of learning from the data more than “copying” the data. It’s not copying and selling content unless you count being prompted into repeating something it was trained on heavily enough for accurate verbatim reconstruction. To me, that’s no more worrying than Disney being able to buy writers that have memorized some of their favorite material, and can reconstruct it on demand. Even if you ask your intern to reproduce something verbatim with the intent of selling it, I still don’t think the training or “learning” was the issue.
To accurately address the differences, we probably need new language and ideals for the specific situations that arise in the building of neural nets, but I still consider much of the backlash completely removed from any understanding of what has been done with the “copyrighted material.”
I tend to view it thinking about naturally training these machines in the future with real world content. Should a neural net built to act in the real world be sued if an image of a coca-cola can was in the training data somewhere, and some of the machines end up being used to make cans for a competitor?
How many layers of abstraction, or how much mixture with other training data do you need to not consider that bit of information to be comparable to the crime of someone intentionally and directly creating an identical logo and product to sell?
Copyright laws already need an overhaul prior to a.i.
It’s no coincidence that Warner and Disney are so giant right now, and own so much of other people’s ideas, or that they have the money to control which ideas get funded. How long has Disney been dead? More than half a century. So why does his business own the rights of so many artists who came after?
I don’t think the copyright system is ready to handle the complexity of artificial minds at any stage, whether it is the pareidolic aspect of retrieving visual concepts of images in diffusion models, or the complex abilities that arise from current-scale LLMs, which again, I believe are able to resemble the subconscious aspect of word prediction that exists in our minds.
We can’t even get people to confidently legislate a simple ethical issue like letting people have consensual relationships with the gender of their own choice. I don’t have hope we can accurately adjust at each stage of development of a technology so complex we don’t even have the language to properly describe the functioning. I just believe that limiting our future and important technology for such grotesquely misdirected egoism would be far more harmful than good.
The greater focus should be in guaranteeing that technological or creative developments benefit the common people, not just the rich. This should have been the focus for the past half century. People refuse this conceptually because they’ve been convinced that any economic re-balancing is evil when it benefits the poor. Those with the ability to change anything are only incentivized to help themselves.
But everyone is just mad at the machine because “what if it learned from my property?”
I think the article even promotes Adobe as the ethical alternative. Congrats, you’ve limited the environment so that only the existing owners of everything can advance. I don’t want to pay Adobe a subscription for the rest of my life for the right to create on par with more wealthy individuals. How is this helping the world or creation of art?
Your logic is flawed in that derivative works are not a violation of copyright. Generally, copyright protects a text or piece of art from being reproduced. Specific characters and settings can be protected by copyright, concepts and themes cannot. People take inspiration from the work of others all the time. Lots of TV shows or whatever are heavily informed by previous works, and that’s totally fine.
Copyright protects against the reproduction of other people’s work and the reuse of their specific characters. It doesn’t protect style, themes, concepts, etc., i.e. the things that an AI is trying to derive. So if you trained your LLM only on Tolkien such that it always told stories about Gandalf and the hobbits, then that would be a problem.
“Reading with intent?” that sounds ridiculous. The only thing of concern is the work produced.
Open up! It’s the thought-police! We have reason to believe you are reading with intent to commit a criminal act! You are under arrest! Anything you say or think can and will be used against you in the court of law!
Are you producing a play of the book, or are you reading it as inspiration for a similar story? Those are two different things.
I think the whole thing about megacorps being the problem here is a bit short sighted, I don’t think it will be too much longer before anyone can spin up their own LLM. It doesn’t exactly take Google levels of resources. I’m as happy to shit on megacorps as the next person here but IP law as it is is BS.
More likely than not any changes made will be to benefit large corporations at the expense of individuals and competition. I’m imagining a world where copyright law has made it so that only big corporations can afford to pay for LLM training data. As if individuals had to pay library book prices for a personal book to train their personal LLM. This desire to “cash in” may just play right into the megacorporation’s hand.
I agree that cashing in is at least an important part of this. As I understand it, however, past a certain point creating and using LLMs is in fact extremely expensive. That’s why GPT4 limits user interactions, for example. I also think that the more restricted these tools are in general, the better for everyone. It’s absolutely possible to use them in positive ways, but as it stands they are mostly just flooding the internet with garbage and killing low-level content jobs.
We’re already heading in a direction that mainly benefits those who are already in power. The real impact of these lawsuits appears to be favoring corporations and copyright holders, without sufficient thought to how they might limit individuals like us. People are already anxious about AI taking their jobs, right? But if we keep creating laws that continuously favor the same powerful few, it shouldn’t shock us when the average person can’t keep up. Just to give you an idea, instead of being able to use Large Language Models (LLMs) to make my work easier, I may be forced to completely abandon this tech due to this kind of shortsightedness. LLMs should be a tool available to ALL of us, not just those at the top.
I don’t know what the authors are complaining about. All the AI is doing is trawling through a lexicon of words and rearranging them into an order that will sell books. It’s exactly what authors do. This is about money.
Hi, it’s me the author!
First of all, thanks for reading.
In the article I explain that it is not exactly what authors do: reading and writing are inherently human activities, and the consumption and processing of massive amounts of data (far more than a human with a photographic memory could process in a hundred million lifetimes) is a completely different process.
I also point out that I don’t have a problem with LLMs as a concept, and I’m actually excited about what they can do, but that they are inherently different from humans and should be treated as such by the law.
My main point is that authors should have the ability to decree that they don’t want their work used as training data for megacorporations to profit from without their consent.
So, yes in a way it is about money, but the money in question being the money OpenAI and Meta are making off the backs of millions of unpaid and often unsuspecting people.
I think it’s an interesting topic, thanks for the article.
It does start to raise some interesting questions. If an author doesn’t want their book to be ingested by an LLM, then what is acceptable? Should all LLMs now be ignorant of that work? What about summaries or reviews of that work?
What if, from a summary of a book, an LLM could extrapolate what’s in the book? Or write a similar book to the original: does that become a new work, or does it still fall under copyright?
I do fear that copyright laws will muddy the waters and slow down the development of LLMs, and have a greater impact than any government standards ever will!
I’m all for muddy waters and slow development of LLMs at this juncture. The world is enough of a capitalist horrorshow and so far all this tech provides is a faster way to accelerate the already ridiculously wide class divide. Just my cynical luddite take of the day…
One of the things they are alleging is that the books were acquired illegally
That is a separate crime. And it’s the distribution part that’s illegal, so the lawsuit should be aimed at Library Genesis for that part.
They’re “complaining” about unique qualities of their art being used, without consent, to create new things which ultimately de-value their original art.
It’s a debate to be had, I’m not clearly in favour of either argument here, but it’s quite obvious what they’re upset with.
If it’s a debate to be had then it’s something that should have been debated hundreds of years ago when copyright was first invented, because every author or artist re-uses the “unique qualities” of other people’s art when making their own new stuff.
There’s the famous “good authors copy, great authors steal” quote, but I rather like the related one by C. E. M. Joad mentioned in that article: “the height of originality is skill in concealing origins.”
I think the main difference between derivative/inspired works created by humans and those created by AI is the presence of “creative effort.” This is something that humans can do, but narrow AI cannot.
Even bland statements humans make about nonfiction facts have some creativity in them, even if the ideas are non-copyrightable (e.g., I cannot copyright the fact that the Declaration of Independence was signed in 1776. However, the exact way I present this fact can be copyrightable: a timeline, chart, table, passage of text, etc. could all be copyrightable).
“Creative effort” is a hard thing to pin down, since “effort” alone does not qualify (e.g., I can’t copyright a phone directory even if I spent a lot of effort collecting names/numbers, since simply putting names and numbers alongside each other in alphabetical order isn’t particularly creative or original). I don’t think there’s really a bright line test for what counts as “creative,” but it doesn’t take a lot. Randomness doesn’t qualify either (e.g., I can’t just pick a random stone out of a stream and declare copyright on it, even if it’s a very unique-looking rock).
Narrow AI is ultimately just a very complex algorithm created based on training data. This is oversimplifying a lot of steps involved, but there isn’t anything “creative” or “subjective” involved in how an LLM creates passages of text. At most, I think you could say that the developers of the AI have copyright over the initial code used to make that AI. I think that the outputs of some functional AI could be copyrightable by its developers, but I don’t think any machine-learning AI would really qualify if it’s the sole source of the work.
Personally, I think that the results of what an AI like Midjourney or ChatGPT creates would fall under public domain. Most of the time, it’s removed enough from the source material that it’s not really derivative anymore. However, I think if someone were to prompt one of these AI to create a work that explicitly mimics that of an author or artist, that could be infringement.
IANAL, this is just one random internet user’s opinion.