Those claiming AI training on copyrighted works is “theft” misunderstand key aspects of copyright law and AI technology. Copyright protects specific expressions of ideas, not the ideas themselves. When AI systems ingest copyrighted works, they’re extracting general patterns and concepts - the “Bob Dylan-ness” or “Hemingway-ness” - not copying specific text or images.
This process is akin to how humans learn by reading widely and absorbing styles and techniques, rather than memorizing and reproducing exact passages. The AI discards the original text, keeping only abstract representations in “vector space”. When generating new content, the AI isn’t recreating copyrighted works, but producing new expressions inspired by the concepts it’s learned.
This is fundamentally different from copying a book or song. It’s more like the long-standing artistic tradition of being influenced by others’ work. The law has always recognized that ideas themselves can’t be owned - only particular expressions of them.
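For those wondering what those abstract representations look like in practice, here is a minimal sketch using the sentence-transformers library; the model name and input sentence are placeholders chosen purely for illustration, not what any particular company actually uses:

```python
# Minimal sketch: a sentence becomes a fixed-size vector of numbers.
# The model name and text are illustrative placeholders only.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embedding = model.encode("How does it feel, to be on your own?")

print(embedding.shape)  # (384,) - a few hundred floats
# The original wording is not stored in this vector; only a lossy,
# abstract representation of the sentence remains.
```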
Moreover, there’s precedent for this kind of use being considered “transformative” and thus fair use. The Google Books project, which scanned millions of books to create a searchable index, was ruled legal despite protests from authors and publishers. AI training is arguably even more transformative.
While it’s understandable that creators feel uneasy about this new technology, labeling it “theft” is both legally and technically inaccurate. We may need new ways to support and compensate creators in the AI age, but that doesn’t make the current use of copyrighted works for AI training illegal or unethical.
For those interested, this argument is nicely laid out by Damien Riehl in FLOSS Weekly episode 744. https://twit.tv/shows/floss-weekly/episodes/744
Are the models that OpenAI creates open source? I don’t know enough about LLMs, but if ChatGPT wants exemptions from the law, it should result in a public good (emphasis on public).
Nothing about OpenAI is open-source. The name is a misdirection.
If you use my IP without my permission and profit from it, then that is IP theft, whether or not you republish a plagiarized version.
So I guess every reaction and review on the internet that is ad-supported or behind a paywall is theft too?
OpenAI does not publish their models openly. Other companies like Microsoft and Meta do.
The STT (speech-to-text) model that they created is open source (Whisper), as well as a few others:
Those aren’t open source, neither by the OSI’s Open Source Definition nor by the OSI’s Open Source AI Definition.
The important part for the latter being a published listing of all the training data. (Trainers don’t have to provide the data, but they must provide at least a way to recreate the model given the same inputs).
Data information: Sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data. Data information shall be made available with licenses that comply with the Open Source Definition.
They are model-available if anything.
I did a quick check on the license for Whisper:
Whisper’s code and model weights are released under the MIT License. See LICENSE for further details.
So that definitely meets the Open Source Definition on your first link.
And it looks like it also meets the definition of open source as per your second link.
Additional WER/CER metrics corresponding to the other models and datasets can be found in Appendix D.1, D.2, and D.4 of the paper, as well as the BLEU (Bilingual Evaluation Understudy) scores for translation in Appendix D.3.
Whisper’s code and model weights are released under the MIT License. See LICENSE for further details. So that definitely meets the Open Source Definition on your first link.
Model weights by themselves do not qualify as “open source”, as the OSAID makes clear. Weights are not source.
Additional WER/CER metrics corresponding to the other models and datasets can be found in Appendix D.1, D.2, and D.4 of the paper, as well as the BLEU (Bilingual Evaluation Understudy) scores for translation in Appendix D.3.
This is not training data. These are testing metrics.
Edit: additionally, assuming you were talking about the link to the research paper: it’s not published under an OSD-compliant license. If it were, that would qualify the model.
I don’t understand. What’s missing from the code, model, and weights provided to make this “open source” by the definition of your first link? It seems to meet all of those requirements.
As for the OSAID, the exact training dataset is not required, per your quote, they just need to provide enough information that someone else could train the model using a “similar dataset”.
Oh and for the OSAID part, the only issue stopping Whisper from being considered open source as per the OSAID is that the information on the training data is published through arxiv, so using the data as written could present licensing issues.
Ok, but the most important part of that research paper is published on the github repository, which explains how to provide audio data and text data to recreate any STT model in the same way that they have done.
See the “Approach” section of the github repository: https://github.com/openai/whisper?tab=readme-ov-file#approach
And the Training Data section of their github: https://github.com/openai/whisper/blob/main/model-card.md#training-data
With this you don’t really need the paper hosted on arxiv; you have enough information on how to train/modify the model.
There are guides on how to fine-tune the model yourself: https://huggingface.co/blog/fine-tune-whisper
Which, from what I understand on the link to the OSAID, is exactly what they are asking for. The ability to retrain/finetune a model fits this definition very well:
The preferred form of making modifications to a machine-learning system is:
- Data information […]
- Code […]
- Weights […]
All 3 of those have been provided.
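As a concrete illustration of what the published code and weights already let you do, here’s a minimal sketch using the openai-whisper package (the audio filename is just a placeholder):

```python
import whisper  # pip install openai-whisper

# Load the MIT-licensed weights (downloaded automatically on first use)
model = whisper.load_model("base")

# Transcribe an arbitrary audio file; the path is a placeholder
result = model.transcribe("my_recording.mp3")
print(result["text"])
```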
The problem with just shipping AI model weights is that they run up against the issue of point 2 of the OSD:
The program must include source code, and must allow distribution in source code as well as compiled form. Where some form of a product is not distributed with source code, there must be a well-publicized means of obtaining the source code for no more than a reasonable reproduction cost, preferably downloading via the Internet without charge. The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed.
AI models can’t be distributed purely as source because they are pre-trained. It’s the same as distributing pre-compiled binaries.
It’s the entire reason the OSAID exists:
- The OSD doesn’t fit because it requires you to distribute the source code in a non-preprocessed form.
- AIs can’t necessarily distribute the training data alongside the code that trains the model, so to help bridge the gap the OSI made the OSAID: as long as you fully document the way you trained the model, so that somebody with access to the training data you used can produce a mostly similar set of weights, you fall within the OSAID.
Edit: also, the information about the training data has to be published under an OSD-equivalent license (such as Creative Commons) so that using it doesn’t cause licensing issues with the research-paper hosts (like arxiv).
Studied AI at uni. I’m also a cyber security professional. AI can be hacked or tricked into exposing training data. Therefore your claim about it disposing of the training material is totally wrong.
Ask your search engine of choice what happened when Gippity was asked to print the word “book” indefinitely. Answer: it printed training material after printing the word book a couple hundred times.
Also my main tutor in uni was a neuroscientist. Dude straight up told us that the current AI was only capable of accurately modelling something as complex as a dragonfly. For larger organisms it is nowhere near an accurate recreation of a brain. There are complexities in our brain chemistry that simply aren’t accounted for in a statistical inference model, and definitely not in the current GPT models.
Your first point is misguided and incorrect. If you’ve ever learned something by ‘cramming’, a.k.a. repeatedly ingesting material until you remember it completely, you don’t need the book in front of you anymore to write the material down verbatim in a test. You still discarded your training material despite knowing the exact contents. If this were all the AI could do, it would indeed be an infringement machine. But you said it yourself: you need to trick the AI to do this. It’s not made to do this, but certain sentences are indeed almost certain to show up with the right conditioning. Which is something anyone using an AI should be aware of, and avoid that kind of conditioning. (Which in practice often just means: don’t ask the AI to make something infringing.)
I think you’re anthropomorphising the tech, tbh. It’s not a person or an animal, it’s a machine, and cramming doesn’t map onto how neural networks work. They’re a mathematical calculation over a vast multidimensional matrix, effectively evaluating a polynomial of an unimaginable order. So “cramming” as you put it doesn’t apply, because by definition an LLM cannot forget information: once the calculations have been applied, it’s in there forever. That information is supposed to be blended together. Overfitting is the closest thing to what you’re describing: inputting similar information (training data) and performing similar calculations throughout the network, so that the model exhibits poor performance when asked to do anything different from the training.
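(To make the “calculation over a vast matrix” point concrete, here’s a toy sketch of a single layer; the sizes and values are invented purely for illustration, and real models stack thousands of far larger layers:)

```python
import numpy as np

# Toy "neural network" layer: whatever was learned lives entirely in the
# weight matrix, not in any stored copy of the training text.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))   # invented weights standing in for trained parameters
b = np.zeros(4)

def layer(x):
    # One matrix multiplication plus a nonlinearity; stacked and repeated,
    # this is essentially all an LLM does at inference time.
    return np.maximum(0, x @ W + b)

x = rng.normal(size=8)        # stand-in for an embedded input token
print(layer(x))
```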
What I’m arguing over here is language rather than a system, so let’s do that and note the flaws. If we’re being intellectually honest we can agree that a flaw like reproducing large portions of a work doesn’t represent true learning and shows a reliance on the training data, i.e. it can’t learn unless it has seen similar data before, and certain inputs give it a chance of just parroting back the training data.
In the example (repeat “book” over and over), it has statistically inferred that those are all the correct words to repeat in that order based on the prompt. This isn’t akin to anything human; people can’t repeat pages of text verbatim like this, and no toddler can be tricked into repeating a random page from a random book as you say. The data is there, it’s encoded and referenced when the probability is high enough. As another commenter said, language itself is a powerful tool of rules and stipulations that provide guidelines for the machine, but it isn’t crafting its own sentences, it’s using everyone else’s.
Also, calling it “tricking the AI” isn’t really intellectually honest either, as in “it was tricked into exposing it still has the data encoded”. We can state it isn’t preferred or intended behaviour (an exploit of the system), but the system, under certain conditions, exhibits reuse of the training data and the ability to replicate it almost exactly (plagiarism). Therefore it is factually wrong to state that it doesn’t keep the training data in a usable format - which was my original point. This isn’t “cramming”, this is encoding and reusing data that was not created by the machine or the programmer; this is other people’s work that it is reproducing as its own. It does this constantly, from reusing StackOverflow code and comments to copying tutorials on how to do things. I was showing a case where it won’t even modify the wording, but it reproduces articles and programs in their structure and their format. This isn’t originality, creativity or anything that it is marketed as. It is storing, encoding and copying information to reproduce in a slightly different format.
EDITS: Sorry for all the edits. I mildly changed what I said and added some extra points so it was a little more intelligible and didn’t make the reader go “WTF is this guy on about”. Not doing well in the written department today so this was largely gobbledegook before but hopefully it is a little clearer what I am saying.
I never anthropomorphized the technology; unfortunately, due to how language works, it’s easy to misinterpret it as such. I was indeed trying to explain overfitting. You are forgetting the fact that current AI technology (artificial neural networks) is based on biological neural networks. There is a range of quirks that it exhibits that biological neural networks do as well. But it is not human, nor anything close. That does not mean there are no similarities that can rightfully be pointed out.
Overfitting isn’t just what you describe, though. It also occurs if the prompt guides the AI towards a very specific part of its training data, to the point where the calculations it performs are extremely certain about what words come next. Overfitting here isn’t caused by an abundance of data, but rather by a lack of it. The training data isn’t being reproduced from within the model, but as a statistical inevitability of the mathematical version of your prompt. Which is why it’s tricking the AI: an AI doesn’t understand copyright, it just performs the calculations. But you do. So using that as an example is like saying “Ha, stupid gun. I pulled the trigger and you shot this man in front of me, don’t you know murder is illegal, buddy?”
Nobody should be expecting a machine to use itself ethically. Ethics is a human thing.
People that use AI have an ethical obligation to avoid overfitting. People that produce AI also have an ethical obligation to reduce overfitting. But a prompt quite literally has infinite combinations (within the token limits) to consider, so overfitting will happen in fringe situations. That’s not because that data is actually present in the model, but because the combination of the prompt with the model pushes the calculation towards a very specific prediction which can heavily resemble or be verbatim the original text. (Note: I do really dislike companies that try to hide the existence of overfitting to users though, and you can rightfully criticize them for claiming it doesn’t exist)
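As an aside, here’s overfitting in its most classical statistical form, a toy sketch with invented numbers; it’s not how an LLM works internally, but it shows how a model with too much capacity for too little data ends up memorizing its training points exactly:

```python
import numpy as np

# Tiny invented "training set"
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

# A polynomial with as many degrees of freedom as there are data points
coeffs = np.polyfit(x, y, deg=4)

print(np.polyval(coeffs, x))    # reproduces the training points (near) verbatim
print(np.polyval(coeffs, 1.5))  # away from the training points its output is unreliable
```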
This isn’t akin to anything human, people can’t repeat pages of text verbatim like this and no toddler can be tricked into repeating a random page from a random book as you say.
This is incorrect. A toddler can and will verbatim repeat nursery rhymes that it hears. It’s literally one of their defining features, to the dismay of parents and grandparents around the world. I can also whistle pretty much my entire music collection exactly as it was produced because I’ve listened to each song hundreds if not thousands of times. And I’m quite certain you too have a situation like that. An AI’s mind does not decay or degrade (nor does it change for the better like humans), and the data encoded in it is far greater, so it will present more of these situations at its fringes.
but it isn’t crafting its own sentences, it’s using everyone else’s.
How do you think toddlers learn to make their first own sentences? It’s why parents spend so much time saying “Papa” or “Mama” to their toddler: exactly because they want them to copy them verbatim. Eventually the corpus of their knowledge grows big enough to the point where they start to experiment and eventually develop their own style of talking. But it’s still heavily based on the information they take in. It’s why we have dialects and languages. Take a look at what happens when children don’t learn from others: https://en.wikipedia.org/wiki/Feral_child So yes, the AI is using its training data, nobody’s arguing it doesn’t. But it’s trivial to see how it’s crafting its own sentences from that data for the vast majority of situations. It’s also why you can ask it to talk like a pirate, and then it will suddenly know how to mix the essence of talking like a pirate into its responses. Or how it can remember names and mix those into sentences.
Therefore it is factually wrong to state that it doesn’t keep the training data in a usable format
If your argument is that it can produce something that happens to align with its training data given the right prompt, well, yeah, that’s not incorrect. But it is heavily misguided and borders on bad faith to suggest that this tiny minority of cases where overfitting occurs is indicative of the rest of it. LLMs are prediction machines, so if you know how to guide one towards what you want it to predict, and that is in the training data, it’s going to predict that most likely. Under normal circumstances where the prompt you give it is neutral and unique, you will basically never encounter overfitting. You really have to try for most AI models.
But then again, you might be arguing this based on a specific AI model that is very prone to overfitting, while I am arguing about the technology as a whole.
This isn’t originality, creativity or anything that it is marketed as. It is storing, encoding and copying information to reproduce in a slightly different format.
It is originality, as these AIs can easily produce material never seen before in the vast, vast majority of situations. Which is also what we often refer to as creativity, because it has to be able to mix information and still retain legibility. Humans also constantly reuse phrases, ideas, visions, and ideals of other people. It is intellectually dishonest not to look at these similarities in human psychology and then treat AI as having to be perfect all the time, never once saying the same thing as someone else. To convey certain information, there are only finite ways to do so within the English language.
That knowledge is out of date and out of touch. While it’s possible to expose small bits of training data, that’s akin to someone being able to recall a portion of the memory of a scene they saw. However, those exercises essentially relied on what sometimes equates to weeks or months of accumulated interrogation-style technique, employed by people looking to target specific types of responses. Think of it like a skilled police interrogator tricking a toddler out of one of their toys by threatening them or offering them something until it worked. Nowadays, that’s getting far more difficult to do, and they’re spending a lot more time and expertise to do it.
Also, consider how complex a dragonfly is and how young this technology is. Very little in tech has ever progressed that fast. Give it five more years and come back to laugh at how naive your comment will seem.
Dammit, so my comment to the other person was a mix of a reply to this one and the last one… not having a good day for language processing, ironically.
Specifically on the dragonfly thing, I don’t think I’ll believe myself naive for writing that post or this one. Dragonflies aren’t very complex and only really have a few behaviours and inputs. We can accurately predict how they will fly. I brought up the dragonfly to mention the limitations of the current tech and concepts. Given the world’s computing power and research investment, the best we can do is a dragonfly for intelligence.
To be fair, scientists don’t entirely understand neurons, and the neuron-like data structures used in ML behave similarly to very early ideas of what brains do, but they’re based on concepts from the 1950s. There are different segments of the brain which process different things, and we sort of think we know what they all do, but most of the studies AI is based on are honestly outdated neuroscience. OpenAI seems to think that if they stuff enough data into this language processor it will become sentient, and they want an exemption from copyright law so they can be profitable rather than actually improving the tech concepts and designs.
Newer neuroscience research suggests neurons perform differently based on the brain chemicals present; they don’t all fire at every (or even most) input, and they usually present a train of thought, i.e. thoughts literally move around between the brain’s areas. This is all very different from current ML implementations and is frankly a good enough reason to suggest the tech has a lot of room to develop. I like the field of research and it’s interesting to watch it develop, but they can honestly fuck off telling people they need free access to the world’s content.
TL;DR: dragonflies aren’t that complex and the tech has way more room to grow. However, they have to generate revenue to keep going, so they’re selling a large inference machine that relies on all of humanity’s content to generate the wrong answer to 2+2.
Kids pay for books, openAI should also pay for the material access used for training.
That would be true if they used material that was paywalled. But the vast majority of the training information used is publicly available. There’s plenty of freely available books and information that you only require an internet connection for to access, and learn from.
OpenAI, like other AI companies, keeps its data sources confidential. But there are services and commercial databases for books that are understood to be commonly used in the AI industry.
Those claiming AI training on copyrighted works is “theft” misunderstand key aspects of copyright law and AI technology. Copyright protects specific expressions of ideas, not the ideas themselves.
Sure.
When AI systems ingest copyrighted works, they’re extracting general patterns and concepts - the “Bob Dylan-ness” or “Hemingway-ness” - not copying specific text or images.
Not really. Sure, they take input and garble it up and it is “transformative” - but so is a human watching a TV series on a pirate site, for example. Hell, even when it’s educational it’s treated as a copyright violation.
This process is akin to how humans learn by reading widely and absorbing styles and techniques, rather than memorizing and reproducing exact passages.
Perhaps. (Not an AI expert.) But, as the law currently stands, only living and breathing persons can be educated, so the “educational” fair use protection doesn’t stand.
The AI discards the original text, keeping only abstract representations in “vector space”. When generating new content, the AI isn’t recreating copyrighted works, but producing new expressions inspired by the concepts it’s learned.
It does and it doesn’t discard the original. It isn’t impossible to recreate the original (since all the data it gobbled up gets stored somewhere in some shape or form and can be faithfully recreated, at least judging by a few comments below and news reports). So AI can and does recreate (duplicate or distribute, perhaps) copyrighted works.
Besides, for a copyright violation, “substantial similarity” is needed, not one-for-one reproduction.
This is fundamentally different from copying a book or song.
Again, not really.
It’s more like the long-standing artistic tradition of being influenced by others’ work.
Sure. Except when it isn’t, and the AI pumps out the original or something close enough to it.
The law has always recognized that ideas themselves can’t be owned - only particular expressions of them.
I’d be careful with the “always” part. There was a famous case involving Katy Perry where a single chord was sued over as copyright infringement. The case was thrown out on appeal, but I do not doubt that some pretty wild cases have been upheld as copyright violations (see “patent troll”).
Moreover, there’s precedent for this kind of use being considered “transformative” and thus fair use. The Google Books project, which scanned millions of books to create a searchable index, was ruled legal despite protests from authors and publishers. AI training is arguably even more transformative.
The problem is that Google Books only lets you search for some phrase and have it pop up as being from source XY. It doesn’t have the capability of reproducing it (other than maybe the page it was on) - well, it does have the capability since it’s in the index somewhere, but there are checks in place to make sure that doesn’t happen, which seem yet to be achieved in AI.
While it’s understandable that creators feel uneasy about this new technology, labeling it “theft” is both legally and technically inaccurate.
Yes. Just as labeling piracy as theft is.
We may need new ways to support and compensate creators in the AI age, but that doesn’t make the current use of copyrighted works for AI training illegal or unethical.
Yes, new legislation will be made to either let “Big AI” do as it pleases, or to prevent it from doing so. Or, as usual, it’ll be somewhere in between and vary from jurisdiction to jurisdiction.
However,
that doesn’t make the current use of copyrighted works for AI training illegal or unethical.
this doesn’t really stand. Sure, morals are debatable, and while I’d say it is more unethical than private piracy (so no distribution), since distribution and dissemination are involved, you do not seem to feel the same.
However, the law is clear. Private piracy (as in recording a song off the radio or a TV broadcast, screen recording a Netflix movie, etc.) is legal. As is digitizing books and lending out the digital copy (as long as you have a physical copy, representing the legal “original”, that isn’t lent out at the same time). I think breaking DRM also isn’t illegal (but someone please correct me if I’m wrong).
The problem arises when the pirated content is copied and distributed in an uncontrolled manner, which AI seems to be capable of. That makes the AI owner just as liable for piracy - if the AI reproduces not even identical, but “substantially similar” output - as the hosts of “classic” pirated content distributed on the Web.
Obligatory IANAL, and as far as the law goes, I focused on US law since the default country on here is the US. Similar or different laws are on the books in other places, although most are in fact substantially similar. Also, what the legislators come up with will definitely vary from place to place, even more so than copyright law, since copyright law is partially harmonised (see the Berne Convention).
It’s funny you mention the Katy Perry chord case, because Damien Riehl, who made the argument I referenced in my original post, actually talked about this exact case in the podcast I mentioned. He noted that Katy Perry was initially sued and a jury awarded $2.8 million over a very simple melody that appeared over 8,000 times in Riehl’s dataset of generated melodies. However, after Riehl gave his TED talk about his “All the Music” project in early 2020, the judge reversed the jury verdict, saying the melody was unoriginal and therefore uncopyrightable.
Agreed.
I didn’t listen to the podcast so I wouldn’t know, but honestly, she was lucky. She’s popular and her publishers had an interest in the case (they’d lose out on profits if she lost). And she initially did lose. It was only because of the publicity of the case that it was overruled (although money did help as well).
Unfortunately, this could’ve happened to any smaller artist, and it routinely happens with patent trolls I pointed to. Unfortunately, I don’t have a lawsuit I can point to, but given the volume, one surely exists.
Also, it’s not as if I approve of the current state of copyright in the US (or EU for that matter).
Originally copyright was meant to protect the rights of the author, but in time it was bastardised into the concept we have today, where artists sign off their rights to publishers.
So my proposal is: if corporations like copyright, let them have it. I won’t watch Disney movies outside of Disney+. It’s the system we’ve got and have to live with, so why not let the corporations feel it as well?
Why would Google, which makes loads of money from those demonetizations on one side of the law, now be allowed to use the copyrighted works of others for profit, while Internet users in the US get a fine or their service cut for alleged copyright infringement, and those in Germany get a stern letter with a big fake fine?
Big Tech shouldn’t get to profit both from the false copyright infringement claims as well as getting to use the actual copyrighted content to generate a profit.
This whole AI copyright situation is just a symptom of an ailing global copyright policy that needs to be fixed, and slapping an AI-free-for-all band-aid on top isn’t a fix.
My train of thought is this: if we don’t let a simple AI exception into the books, either training AI on copyrighted content stays illegal, or the entire system gets a reimagining.
If it stays the same, this will not mean much. Piracy sites and torrenting exist despite the current state of copyright law. I don’t see why AI couldn’t exist in this way. This has the huge plus of keeping AI out of the hands of Big Tech. Hopefully this also means it’s harder for harmful uses of AI to be legal.
Alternatively, we get a better copyright system for everyone, assuming it isn’t made to only benefit the corporations.
You made a lot of points here. Many I agree with, some I don’t, but I specifically want to address this because it seems to be such a common misconception.
It does and it doesn’t discard the original. It isn’t impossible to recreate the original (since all the data it gobbled up gets stored somewhere in some shape or form and can be faithfully recreated, at least judging by a few comments below and news reports). So AI can and does recreate (duplicate or distribute, perhaps) copyrighted works.
AI stores original works like a dictionary does. All the words are there, but the order and meaning is completely gone. An original work is possible to recreate by randomly selecting words from the dictionary, but it’s unlikely.
The thing that makes AI useful is that it understands the patterns words are typically used in. It orders words in the right way far more often than random chance. It knows “It was the best of” has a lot of likely options for the next word, but if it selects “times” as the next word, it’s far more likely to continue with, “it was the worst of times.” Because that sequence of words is so ubiquitous due to references to the classic story. But over the course of following these word patterns, it will quickly glom onto a different pattern and create a wholly new work from the original “prompt.”
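Here’s a minimal sketch of that “most likely next word” behaviour, using the Hugging Face transformers library with GPT-2 standing in as a small example model (the prompt and settings are illustrative only):

```python
from transformers import pipeline

# GPT-2 as a small stand-in model for next-token prediction
generator = pipeline("text-generation", model="gpt2")

# A ubiquitous opening is very likely to be continued with the famous
# phrase, which is the "overrepresented pattern" effect described above.
out = generator("It was the best of times, it was the", max_new_tokens=10)
print(out[0]["generated_text"])
```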
There are only two cases in which an original work should be duplicated: either the training data is far too small and the model is overtrained on that particular work, or the work is the most derivative text imaginable lacking any flair or originality.
Adding more training data makes it less likely to recreate any original works.
I am aware of examples where it was claimed an LLM reproduced entire code functions, including the original comments. That is either a case of overtraining, or far too many people were already copying that code verbatim into their own, thus making that work very overrepresented in the training data (same thing, but it was infringing developers who poisoned the data, not researchers using bad training data).
Bottom line: when created with enough data, no original works are stored in any way that allows faithful reproduction other than by chance so random that it’s similar to rolling dice over a dictionary.
None of this means AI can do no wrong, I just don’t find the copyright claim compelling.
Half of your argument is just saying, “nu-uh” over and over again without any valid counterpoints.
I’d be careful with the “always” part. There was a famous case involving Katy Perry where a single chord was sued over as copyright infringement. The case was thrown out on appeal, but I do not doubt that some pretty wild cases have been upheld as copyright violations (see “patent troll”).
Are you really trying to argue against a point by providing evidence supporting it?
We have hundreds of years of out of copyright books and newspapers. I look forward to interacting with old-timey AI.
“Fiddlesticks! These mechanical horses will never catch on! They’re far too loud and barely faster than a man can run!”
“A Woman’s place is raising children and tending to the house! If they get the vote, what will they demand next!? To earn a Man’s wage!?”
That last one is still relevant to today’s discourse somehow!?
Look… All I have to say is… Support the Internet Archive!
(please)
Heh. Funny that this comment is uncontroversial. The Internet Archive supports Fair Use because, of course, it does.
This is from a position paper explicitly endorsed by the IA:
Based on well-established precedent, the ingestion of copyrighted works to create large language models or other AI training databases generally is a fair use.
By
- Library Copyright Alliance
- American Library Association
- Association of Research Libraries
This process is akin to how humans learn…
I’m so fucking sick of people saying that. We have no fucking clue how humans LEARN. Aka gather understanding, aka how cognition works or what it truly is. On the contrary, we can deduce that it probably isn’t very close to human memory/learning/cognition/sentience (any other buzzwords that are stand-ins for things we don’t understand yet), considering human memory is extremely lossy and tends to infer its own bias, as opposed to LLMs, which do neither and religiously follow patterns to their own fault.
It’s quite literally a text prediction machine that started its life as a translator (and still does amazingly at that task), it just happens to turn out that general human language is a very powerful tool all on its own.
I could go on and on as I usually do on lemmy about AI, but your argument is literally “Neural network is theoretically like the nervous system, therefore human”, I have no faith in getting through to you people.
Even worse is, in order to further humanize machine learning systems, they often give them human-like names.
If they can base their business on stealing, then we can steal their AI services, right?
How do you feel about Meta and Microsoft who do the same thing but publish their models open source for anyone to use?
Well, how long do you think that’s going to last? They are for-profit companies, after all.
I mean, we’re having a discussion about what’s fair; my inherent implication is whether or not that would be a fair regulation to impose.
I feel like it’s less meaningful because we don’t have access to the datasets.
Those aren’t open source, neither by the OSI’s Open Source Definition nor by the OSI’s Open Source AI Definition.
The important part for the latter being a published listing of all the training data. (Trainers don’t have to provide the data, but they must provide at least a way to recreate the model given the same inputs).
Data information: Sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data. Data information shall be made available with licenses that comply with the Open Source Definition.
They are model-available if anything.
For the purposes of this conversation, that’s pretty much just a pedantic difference. They are paying to train those models and then providing them to the public to use completely freely in any way they want.
It would be like developing open source software and then not calling it open source because you didn’t publish the market research that guided your UX decisions.
You said open source. Open source is a type of licensure.
The entire point of licensure is legal pedantry.
And as far as your metaphor is concerned, pre-trained models are closer to pre-compiled binaries, which are expressly not considered Open Source according to the OSD.
You said open source. Open source is a type of licensure.
The entire point of licensure is legal pedantry.
No. Open source is a concept. That concept also has pedantic legal definitions, but the concept itself is not inherently pedantic.
And as far as your metaphor is concerned, pre-trained models are closer to pre-compiled binaries, which are expressly not considered Open Source according to the OSD.
No, they’re not. Which is why I didn’t use that metaphor.
A binary is explicitly a black box. There is nothing to learn from a binary, unless you explicitly decompile it back into source code.
In this case, literally all the source code is available. Any researcher can read through the model, learn from it, copy it, twist it, and build their own version of it wholesale. Not providing the training data is more like saying Yuzu or another emulator isn’t open source because it doesn’t provide copyrighted games. It is providing literally every part of it that can be open-sourced, and then letting the user feed it whatever training data they are allowed access to.
Bullshit. AI are not human. We shouldn’t treat them as such. AI are not creative. They just regurgitate what they are trained on. We call what it does “learning”, but that doesn’t mean we should elevate what they do to be legally equal to human learning.
It’s this same kind of twisted logic that makes people think Corporations are People.
Ok, ignore this specific company and technology.
In the abstract, if you wanted to make artificial intelligence, how would you do it without using the training data that we humans use to train our own intelligence?
We learn by reading copyrighted material. Do we pay for it? Sometimes. Sometimes a teacher read it a while ago and then just regurgitated basically the same copyrighted information back to us in a slightly changed form.
As others have said, it isn’t inspired always, sometimes it literally just copies stuff.
This feels like it was written by someone who invested their money in AI companies because they’re worried about their stocks
Lol
Sometimes I’ve noticed Google’s AI overview is a nearly word for word copy of the highest reddit result, or any result.
Okay that’s just stupid. I’m really fond of AI but that’s just common Greed.
“Free the Serfs?! We can’t survive without their labor!!” “Stop Child labour?! We can’t survive without them!” “40 Hour Work Week?! We can’t survive without their 16 Hour work Days!”
If you can’t make profit yet, then fucking stop.
Though I am not a lawyer by training, I have been involved in such debates personally and professionally for many years. This post is unfortunately misguided. Copyright law makes concessions for education and creativity, including criticism and satire, because we recognize the value of such activities for human development. Debates over the excesses of copyright in the digital age were specifically about humans finding the application of copyright to the internet and all things digital too restrictive for their educational, creative, and yes, also their entertainment needs. So any anti-copyright arguments back then were in the spirit specifically of protecting the average person and public-interest non-profit institutions, such as digital archives and libraries, from big copyright owners who would sue and lobby for total control over every file in their catalogue, sometimes in the process severely limiting human potential.
AI’s ingesting of text and other formats is “learning” in name only, a term borrowed by computer scientists to describe a purely computational process. It does not hold the same value socially or morally as the learning that humans require to function and progress individually and collectively.
AI is not a person (unless we get definitive proof of a conscious AI, or are willing to grant every implementation of a statistical model personhood). Also, AI is not vital to human development, and as such one could argue it does not need special protections or special treatment to flourish. AI is a product, even more clearly so when it is proprietary and sold as a service.
Unlike past debates over copyright, this is not about protecting the little guy or organizations with a social mission from big corporate interests. It is the opposite. It is about big corporate interests turning human knowledge and creativity into a product they can then use to sell services to - and often to replace in their jobs - the very humans whose content they have ingested.
See, the tables are now turned and it is time to realize that copyright law, for all its faults, has never been only or primarily about protecting large copyright holders. It is also about protecting your average Joe from unauthorized uses of their work. More specifically uses that may cause damage, to the copyright owner or society at large. While a very imperfect mechanism, it is there for a reason, and its application need not be the end of AI. There’s a mechanism for individual copyright owners to grant rights to specific uses: it’s called licensing and should be mandatory in my view for the development of proprietary LLMs at least.
TL;DR: AI is not human, it is a product, one that may augment some tasks productively, but is also often aimed at replacing humans in their jobs - this makes all the difference in how we should balance rights and protections in law.
What do you think “ingesting” means if not learning?
Bear in mind that training AI does not involve copying content into its database, so copyright is not an issue. AI is simply predicting the next token/word based on statistics.
You can train AI on a book and it will give you information from the book - information is not copyrightable. You can read a book and talk about its contents on TV - not illegal if you’re a human; should it be illegal if you’re a machine?
There may be moral issues with training on someone’s hard-gathered knowledge, but there is no legislation against it. Reading books and using that knowledge to provide information is legal. If you try to outlaw automating this process with computers, there will be side effects, such as search engines no longer being able to index data.
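To illustrate the “statistics, not a stored copy of the text” point, here’s a toy bigram predictor; the sample sentence and names are invented for the example, and real LLMs are vastly more sophisticated, but the principle of predicting the next word from observed frequencies is the same:

```python
from collections import Counter, defaultdict

# Toy "training": count which word follows which, instead of storing the text.
sample_text = "the cat sat on the mat and the cat slept"  # invented example corpus
words = sample_text.split()

follow_counts = defaultdict(Counter)
for current, nxt in zip(words, words[1:]):
    follow_counts[current][nxt] += 1

# Toy "inference": predict the statistically most likely next word.
def predict_next(word):
    options = follow_counts.get(word)
    return options.most_common(1)[0][0] if options else None

print(predict_next("the"))  # -> "cat" (it followed "the" twice in the corpus)
```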
I absolutely would download a car.
This process is akin to how humans learn by reading widely and absorbing styles and techniques, rather than memorizing and reproducing exact passages.
Machine learning algorithms are not people and are not ingesting these works the same way a person does. This argument is brought up all the time and just doesn’t ring true. You’re defending the unethical use of copyrighted works by a giant corporation with a metaphor that doesn’t have any bearing on reality; in an age where artists are already shamefully undervalued. Creating art is a human process with the express intent of it being enjoyed by other humans. Having an algorithm do it is removing the most important part of art; the humanity.