Those claiming AI training on copyrighted works is “theft” misunderstand key aspects of copyright law and AI technology. Copyright protects specific expressions of ideas, not the ideas themselves. When AI systems ingest copyrighted works, they’re extracting general patterns and concepts - the “Bob Dylan-ness” or “Hemingway-ness” - not copying specific text or images.

This process is akin to how humans learn by reading widely and absorbing styles and techniques, rather than memorizing and reproducing exact passages. The AI discards the original text, keeping only abstract representations in “vector space”. When generating new content, the AI isn’t recreating copyrighted works, but producing new expressions inspired by the concepts it’s learned.
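A loose illustration of the “abstract representations” idea (a bag-of-words toy, far simpler than the learned embeddings real systems use): reducing texts to count vectors discards the exact expression while keeping comparable abstract features.

```python
from collections import Counter
from math import sqrt

def vectorize(text):
    """Bag-of-words counts: the exact word order (the 'expression') is gone."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    return dot / (sqrt(sum(n * n for n in a.values())) *
                  sqrt(sum(n * n for n in b.values())))

doc1 = vectorize("the times were the best the times were the worst")
doc2 = vectorize("the best of times the worst of times")
doc3 = vectorize("colorless green ideas sleep furiously")

print(round(cosine(doc1, doc2), 2))  # high: shared vocabulary, different wording
print(cosine(doc1, doc3))            # 0.0: no shared vocabulary at all
```

The only point of the sketch is that the vector form keeps similarity information while the original ordering of the words is no longer present in it.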

This is fundamentally different from copying a book or song. It’s more like the long-standing artistic tradition of being influenced by others’ work. The law has always recognized that ideas themselves can’t be owned - only particular expressions of them.

Moreover, there’s precedent for this kind of use being considered “transformative” and thus fair use. The Google Books project, which scanned millions of books to create a searchable index, was ruled legal despite protests from authors and publishers. AI training is arguably even more transformative.

While it’s understandable that creators feel uneasy about this new technology, labeling it “theft” is both legally and technically inaccurate. We may need new ways to support and compensate creators in the AI age, but that doesn’t make the current use of copyrighted works for AI training illegal or unethical.

For those interested, this argument is nicely laid out by Damien Riehl in FLOSS Weekly episode 744. https://twit.tv/shows/floss-weekly/episodes/744

  • @arin@lemmy.world (5 months ago)

    Kids pay for books, openAI should also pay for the material access used for training.

    • @ClamDrinker@lemmy.world (5 months ago)

      That would be true if they used material that was paywalled, but the vast majority of the training material used is publicly available. There are plenty of freely available books and other sources that anyone with just an internet connection can access and learn from.

    • @FatCat@lemmy.world (OP, 5 months ago)

      OpenAI, like other AI companies, keeps its data sources confidential. But there are book services and commercial databases that are widely understood to be used in the AI industry.

  • JoshCodes (5 months ago)

    Studied AI at uni. I’m also a cyber security professional. AI can be hacked or tricked into exposing training data. Therefore your claim about it disposing of the training material is totally wrong.

    Ask your search engine of choice what happened when Gippity was asked to print the word “book” indefinitely. Answer: it printed training material after printing the word book a couple hundred times.

    Also, my main tutor in uni was a neuroscientist. Dude straight up told us that the current AI is only capable of accurately modelling something as complex as a dragonfly. For larger organisms it is nowhere near an accurate recreation of a brain. There are complexities in our brain chemistry that simply aren’t accounted for in a statistical inference model, and definitely not in the current GPT models.

    • @ClamDrinker@lemmy.world (5 months ago)

      Your first point is misguided and incorrect. If you’ve ever learned something by “cramming”, i.e. repeatedly ingesting material until you remember it completely, you don’t need the book in front of you anymore to write the material down verbatim in a test. You still discarded your training material despite knowing the exact contents. If this was all the AI could do, it would indeed be an infringement machine. But you said it yourself: you need to trick the AI into doing this. It’s not made to do this, but certain sentences are indeed almost certain to show up with the right conditioning. That is something anyone using an AI should be aware of and avoid, which in practice often just means: don’t ask the AI to make something infringing.

      • JoshCodes (5 months ago)

        I think you’re anthropomorphising the tech, tbh. It’s not a person or an animal; it’s a machine, and “cramming” doesn’t map onto how neural networks work. They’re a mathematical calculation over a vast multidimensional matrix, effectively solving a polynomial of an unimaginable order. So “cramming” as you put it doesn’t apply, because by definition an LLM cannot forget information: once the calculations have been applied, it is in there forever. That information is supposed to be blended together. Overfitting is the closest thing to what you’re describing: inputting similar information (training data) and performing similar calculations throughout the network, so that the model exhibits poor performance when asked to do anything different from the training.
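
        The memorisation point is easy to sketch with a toy character-level Markov model (purely illustrative, nothing like an actual neural network): trained on a single sentence, every context has exactly one continuation, so greedy generation can only replay the training text verbatim.

```python
from collections import Counter, defaultdict

def train(text, order=3):
    """Map each `order`-character context to counts of the next character."""
    model = defaultdict(Counter)
    for i in range(len(text) - order):
        model[text[i:i + order]][text[i + order]] += 1
    return model

def generate(model, seed, order=3, limit=100):
    """Greedily emit the most likely next character for each context."""
    out = seed
    for _ in range(limit):
        nxt = model.get(out[-order:])
        if not nxt:
            break
        out += nxt.most_common(1)[0][0]
    return out

# "Overtrained" on one sentence: every context has a single continuation,
# so the model can do nothing but replay its training data verbatim.
tiny = train("sphinx of black quartz judge my vow")
print(generate(tiny, "sph"))  # reproduces the whole sentence
```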

        What I’m arguing about here is language rather than the system itself, so let’s do that and note the flaws. If we’re being intellectually honest, we can agree that a flaw like reproducing large portions of a work doesn’t represent true learning and shows a reliance on the training data; i.e. it can’t learn unless it has seen similar data before, and certain inputs give a chance that it just parrots back the training data.

        In the example (repeat book over and over), it has statistically inferred that those are all the correct words to repeat in that order based on the prompt. This isn’t akin to anything human, people can’t repeat pages of text verbatim like this and no toddler can be tricked into repeating a random page from a random book as you say. The data is there, it’s encoded and referenced when the probability is high enough. As another commenter said, language itself is a powerful tool of rules and stipulations that provide guidelines for the machine, but it isn’t crafting its own sentences, it’s using everyone else’s.

        Also, calling it “tricking the AI” isn’t really intellectually honest either, as in “it was tricked into exposing that it still has the data encoded”. We can state that it isn’t preferred or intended behaviour (an exploit of the system), but the system, under certain conditions, exhibits reuse of the training data and the ability to replicate it almost exactly (plagiarism). Therefore it is factually wrong to state that it doesn’t keep the training data in a usable format, which was my original point. This isn’t “cramming”; this is encoding and reusing data that was not created by the machine or the programmer. This is other people’s work that it is reproducing as its own. It does this constantly, from reusing StackOverflow code and comments to copying tutorials on how to do things. I showed a case where it won’t even modify the wording, but it also reproduces articles and programs in their structure and format. This isn’t originality, creativity, or anything it is marketed as. It is storing, encoding, and copying information to reproduce in a slightly different format.

        EDITS: Sorry for all the edits. I mildly changed what I said and added some extra points so it was a little more intelligible and didn’t make the reader go “WTF is this guy on about”. Not doing well in the written department today so this was largely gobbledegook before but hopefully it is a little clearer what I am saying.

        • @ClamDrinker@lemmy.world (5 months ago)

          I never anthropomorphized the technology; unfortunately, due to how language works, it’s easy to misinterpret it as such. I was indeed trying to explain overfitting. You are forgetting that current AI technology (artificial neural networks) is based on biological neural networks, and it exhibits a range of quirks that biological neural networks do as well. It is not human, nor anything close. But that does not mean there are no similarities that can be rightfully pointed out.

          Overfitting isn’t just what you describe, though. It also occurs if the prompt guides the AI towards a very specific part of its training data, to the point where the calculations it performs are extremely certain about what words come next. Overfitting here isn’t caused by an abundance of data, but rather by a lack of it. The training data isn’t being produced from within the model, but as a statistical inevitability of the mathematical version of your prompt. Which is why it’s “tricking” the AI: an AI doesn’t understand copyright, it just performs the calculations. But you do. And so using that as an example is like saying “Ha, stupid gun. I pulled the trigger and you shot this man in front of me, don’t you know murder is illegal, buddy?”

          Nobody should be expecting a machine to use itself ethically. Ethics is a human thing.

          People that use AI have an ethical obligation to avoid overfitting. People that produce AI also have an ethical obligation to reduce overfitting. But a prompt has a practically unbounded number of combinations (within the token limits) to consider, so overfitting will happen in fringe situations. That’s not because the data is actually present in the model, but because the combination of the prompt with the model pushes the calculation towards a very specific prediction, which can heavily resemble, or even be verbatim, the original text. (Note: I do really dislike companies that try to hide the existence of overfitting from users, and you can rightfully criticize them for claiming it doesn’t exist.)
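
          As a toy sketch of this (a made-up word-bigram model over invented sentences, nothing like an actual LLM): a prompt lifted from a distinctive passage in the training text steers the model straight back onto that passage, while a neutral prompt yields a remix instead.

```python
from collections import Counter, defaultdict

# Invented training corpus: one distinctive passage plus repeated generic text.
passage = "colorless green ideas sleep furiously in the archive"
generic = "the cat sat in the sun and the dog sat in the shade"
words = (passage + " " + generic + " " + generic).split()

# Word-level bigram model: each word maps to counts of the word following it.
model = defaultdict(Counter)
for a, b in zip(words, words[1:]):
    model[a][b] += 1

def continue_from(prompt, n=4):
    """Greedily append the most likely next word, n times."""
    out = prompt.split()
    for _ in range(n):
        nxt = model.get(out[-1])
        if not nxt:
            break
        out.append(nxt.most_common(1)[0][0])
    return " ".join(out)

# Prompt taken from the distinctive passage: completed near-verbatim.
print(continue_from("colorless green"))
# Neutral prompt: the model remixes the generic text instead.
print(continue_from("the cat"))
```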

          This isn’t akin to anything human, people can’t repeat pages of text verbatim like this and no toddler can be tricked into repeating a random page from a random book as you say.

          This is incorrect. A toddler can and will verbatim repeat nursery rhymes that it hears. It’s literally one of their defining features, to the dismay of parents and grandparents around the world. I can also whistle pretty much my entire music collection exactly as it was produced, because I’ve listened to each song hundreds if not thousands of times. And I’m quite certain you too have a situation like that. An AI’s mind does not decay or degrade (nor does it change for the better like humans do), and the data encoded in it is far greater, so it will present more of these situations in its fringes.

          but it isn’t crafting its own sentences, it’s using everyone else’s.

          How do you think toddlers learn to make their own first sentences? It’s why parents spend so much time saying “Papa” or “Mama” to their toddler: exactly because they want them to copy them verbatim. Eventually the corpus of their knowledge grows big enough that they start to experiment and develop their own style of talking. But it’s still heavily based on the information they take in. It’s why we have dialects and languages. Take a look at what happens when children don’t learn from others: https://en.wikipedia.org/wiki/Feral_child So yes, the AI is using its training data, nobody’s arguing it doesn’t. But it’s trivial to see how it’s crafting its own sentences from that data in the vast majority of situations. It’s also why you can ask it to talk like a pirate, and it will suddenly know how to mix the essence of pirate-speak into its responses. Or how it can remember names and mix those into sentences.

          Therefore it is factually wrong to state that it doesn’t keep the training data in a usable format

          If your argument is that it can produce something that happens to align with its training data given the right prompt, well, that’s not incorrect. But it is heavily misguided, and borders on bad faith, to suggest that this tiny minority of cases where overfitting occurs is indicative of the rest. LLMs are prediction machines, so if you know how to guide one towards what you want it to predict, and that is in the training data, it’s going to predict that most likely. Under normal circumstances, where the prompt you give it is neutral and unique, you will basically never encounter overfitting. You really have to try for most AI models.

          But then again, you might be arguing this based on a specific AI model that is very prone to overfitting, while I am arguing about the technology as a whole.

          This isn’t originality, creativity or anything that it is marketed as. It is storing, encoding and copying information to reproduce in a slightly different format.

          It is originality, as these AIs can easily produce material never seen before in the vast, vast majority of situations. That is also what we often refer to as creativity, because the model has to be able to mix information and still retain legibility. Humans also constantly reuse phrases, ideas, visions, and ideals of other people. It is intellectually dishonest to ignore these similarities in human psychology and then treat AI as having to be perfect all the time, never once saying the same thing as someone else. To convey certain information, there are only finitely many ways to do so within the English language.

    • @soul@lemmy.world (5 months ago)

      That knowledge is out of date and out of touch. While it’s possible to expose small bits of training data, that’s akin to someone recalling a fragment of a scene they once saw. Those exercises also took what sometimes amounted to weeks or months of interrogation technique, refined over time by people targeting specific types of responses. Think of it like a skilled police interrogator tricking a toddler out of one of their toys by threatening or bribing them until it worked. Nowadays, that’s getting far more difficult to do, and it takes a lot more time and expertise.

      Also, consider how complex a dragonfly is and how young this technology is. Very little in tech has ever progressed that fast. Give it five more years and come back to laugh at how naive your comment will seem.

      • JoshCodes (5 months ago)

        Dammit, so my comment to the other person was a mix of a reply to this one and the last one… not having a good day for language processing, ironically.

        Specifically on the dragonfly thing, I don’t think I’ll consider myself naive for writing that post or this one. Dragonflies aren’t very complex; they only really have a few behaviours and inputs, and we can accurately predict how they will fly. I brought up the dragonfly to point out the limitations of the current tech and concepts: given the world’s computing power and research investment, the best we can do is a dragonfly’s worth of intelligence.

        To be fair, scientists don’t entirely understand neurons, and ML’s neuron-inspired data structures behave like very early ideas of what brains do, based on concepts from the 1950s. There are different segments of the brain which process different things, and we sort of think we know what they all do, but most of the studies AI is based on are honestly outdated neuroscience. OpenAI seem to think that if they stuff enough data into this language processor it will become sentient, and they want an exemption from copyright law so they can be profitable, rather than actually improving the tech concepts and designs.

        Newer neuroscience research suggests neurons perform differently based on the brain chemicals present; they don’t all fire at every (or even most) input, and they usually present a train of thought, i.e. thoughts literally move around between the brain’s areas. This is all very different from current ML implementations, and is frankly a good enough reason to suggest the tech has a lot of room to develop. I like the field of research and it’s interesting to watch it develop, but they can honestly fuck off telling people they need free access to the world’s content.

        TL;DR: dragonflies aren’t that complex and the tech has way more room to grow. However, they have to generate revenue to keep going, so they’re selling a large inference machine that relies on all of humanity’s content to generate the wrong answer to 2+2.

  • @Veneroso@lemmy.world (5 months ago)

    We have hundreds of years of out of copyright books and newspapers. I look forward to interacting with old-timey AI.

    “Fiddle sticks! These mechanical horses will never catch on! They’re far too loud and barely faster than a man can run!”

    “A Woman’s place is raising children and tending to the house! If they get the vote, what will they demand next!? To earn a Man’s wage!?”

    That last one is still relevant to today’s discourse somehow!?

  • @Capricorn_Geriatric@lemmy.world (5 months ago)

    Those claiming AI training on copyrighted works is “theft” misunderstand key aspects of copyright law and AI technology. Copyright protects specific expressions of ideas, not the ideas themselves.

    Sure.

    When AI systems ingest copyrighted works, they’re extracting general patterns and concepts - the “Bob Dylan-ness” or “Hemingway-ness” - not copying specific text or images.

    Not really. Sure, they take input and garble it up, and it is “transformative”; but so is a human watching a TV series on a pirate site, for example. Hell, even when it’s educational, that’s treated as a copyright violation.

    This process is akin to how humans learn by reading widely and absorbing styles and techniques, rather than memorizing and reproducing exact passages.

    Perhaps. (Not an AI expert). But, as the law currently stands, only living and breathing persons can be educated, so the “educational” fair use protection doesn’t stand.

    The AI discards the original text, keeping only abstract representations in “vector space”. When generating new content, the AI isn’t recreating copyrighted works, but producing new expressions inspired by the concepts it’s learned.

    It does and it doesn’t discard the original. It isn’t impossible to recreate the original (since all the data it gobbled up gets stored somewhere in some shape or form and can be truthfully recreated, at least judging by a few comments below and news reports). So AI can and does recreate (duplicate or distribute, perhaps) copyrighted works.

    Besides, for a copyright violation, “substantial similarity” is needed, not one-for-one reproduction.

    This is fundamentally different from copying a book or song.

    Again, not really.

    It’s more like the long-standing artistic tradition of being influenced by others’ work.

    Sure. Except when it isn’t, and the AI pumps out the original or something close enough to it.

    The law has always recognized that ideas themselves can’t be owned - only particular expressions of them.

    I’d be careful with the “always” part. There was a famous case involving Katy Perry where a single chord was sued over as copyright infringement. The case was thrown out on appeal, but I do not doubt that some pretty wild cases have been upheld as copyright violations (see “patent troll”).

    Moreover, there’s precedent for this kind of use being considered “transformative” and thus fair use. The Google Books project, which scanned millions of books to create a searchable index, was ruled legal despite protests from authors and publishers. AI training is arguably even more transformative.

    The problem is that Google Books only lets you search for some phrase and have it pop up as being from source XY. It doesn’t have the capability of reproducing the work (other than maybe the page the phrase was on). Well, it does have the capability, since it’s in the index somewhere, but there are checks in place to make sure that doesn’t happen, checks which seem yet to be achieved in AI.

    While it’s understandable that creators feel uneasy about this new technology, labeling it “theft” is both legally and technically inaccurate.

    Yes. Just as labeling piracy as theft is.

    We may need new ways to support and compensate creators in the AI age, but that doesn’t make the current use of copyrighted works for AI training illegal or unethical.

    Yes, new legislation will be made either to let “Big AI” do as it pleases, or to prevent it from doing so. Or, as usual, it’ll be somewhere in between and vary from jurisdiction to jurisdiction.

    However,

    that doesn’t make the current use of copyrighted works for AI training illegal or unethical.

    this doesn’t really stand. Sure, morals are debatable, and while I’d say it is more unethical than private piracy (which involves no distribution), since distribution and dissemination are involved here, you do not seem to feel the same.

    However, the law is clear. Private piracy (as in recording a song off the radio or a TV broadcast, screen-recording a Netflix movie, etc.) is legal. As is digitizing books and lending out the digital copy (as long as you have a physical copy that isn’t lent out at the same time, representing the legal “original”). I think breaking DRM also isn’t illegal (but someone please correct me if I’m wrong).

    The problem arises when the pirated content is copied and distributed in an uncontrolled manner, which AI seems to be capable of. That makes the AI owner just as liable for piracy, if the AI reproduces not even identical but merely “substantially similar” output, as the hosts of “classic” pirated content distributed on the web.

    Obligatory IANAL; as far as the law goes, I focused on US law, since the default country on here is the US. Similar or different laws are on the books in other places, although most are in fact substantially similar. Also, what the legislators come up with will definitely vary from place to place, even more so than copyright law, since copyright law is partially harmonised (see the Berne Convention).

    • @FatCat@lemmy.world (OP, 5 months ago)

      It’s funny you mention the Katy Perry chord case, because Damien Riehl, who made the argument I referenced in my original post, actually talked about this exact case in the podcast I mentioned. He noted that Katy Perry was initially sued and a jury awarded $2.8 million over a very simple melody that appeared over 8,000 times in Riehl’s dataset of generated melodies. However, after Riehl gave his TED talk about his “All the Music” project in early 2020, the judge reversed the jury verdict, saying the melody was unoriginal and therefore uncopyrightable.

      • @Capricorn_Geriatric@lemmy.world (5 months ago)

        Agreed.

        I didn’t listen to the podcast so I wouldn’t know, but honestly, she was lucky. She’s popular and her publishers had an interest in the case (they’d lose out on profits if she lost). And she initially did lose. It was only because of the publicity of the case that it was overruled (although money did help as well).

        Unfortunately, this could’ve happened to any smaller artist, and it routinely happens with patent trolls I pointed to. Unfortunately, I don’t have a lawsuit I can point to, but given the volume, one surely exists.

        Also, it’s not as if I approve of the current state of copyright in the US (or EU for that matter).

        Originally copyright was meant to protect the rights of the author, but in time it was bastardised into the concept we have today, where artists sign away their rights to publishers.

        So my proposal is: if corporations like copyright, let them have it. I won’t watch Disney movies outside of Disney+; it’s the system we’ve got and have to live with, so why not let the corporations feel it as well?

        Why would Google, which makes loads of money from demonetizations on one side of the law, now be allowed to use others’ copyrighted works for profit, while internet users in the US get a fine or their service cut for alleged copyright infringement, and those in Germany get a stern letter with a big fake fine?

        Big Tech shouldn’t get to profit both from the false copyright infringement claims as well as getting to use the actual copyrighted content to generate a profit.

        This whole AI copyright situation is just a symptom of an ailing global copyright policy that needs to be fixed, and slapping an AI-free-for-all band-aid on top isn’t a fix.

        My train of thought is this: if we don’t let a simple AI exception into the books, either training AI on copyrighted content stays illegal, or the entire system gets a reimagining.

        If it stays the same, this will not mean much. Piracy sites and torrenting exist despite the current state of copyright law; I don’t see why AI couldn’t exist in the same way. This has the huge plus of keeping AI out of the hands of Big Tech. Hopefully it also means it’s harder for harmful uses of AI to be legal.

        Alternatively, we get a better copyright system for everyone, assuming it isn’t made to only benefit the corporations.

    • @MagicShel@programming.dev (5 months ago)

      You made a lot of points here. Many I agree with, some I don’t, but I specifically want to address this because it seems to be such a common misconception.

      It does and it doesn’t discard the original. It isn’t impossible to recreate the original (since all the data it gobbled up gets stored somewhere in some shape or form and can be truthfully recreated, at least judging by a few comments below and news reports). So AI can and does recreate (duplicate or distribute, perhaps) copyrighted works.

      AI stores original works like a dictionary does. All the words are there, but the order and meaning are completely gone. It’s possible to recreate an original work by randomly selecting words from the dictionary, but it’s vanishingly unlikely.

      The thing that makes AI useful is that it understands the patterns words are typically used in. It orders words in the right way far more often than random chance. It knows “It was the best of” has a lot of likely options for the next word, but if it selects “times” as the next word, it’s far more likely to continue with, “it was the worst of times.” Because that sequence of words is so ubiquitous due to references to the classic story. But over the course of following these word patterns, it will quickly glom onto a different pattern and create a wholly new work from the original “prompt.”
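
      That “likely next word” behaviour can be sketched with a toy bigram counter (a made-up miniature corpus, nothing like a real model’s scale or mechanism):

```python
from collections import Counter, defaultdict

# Tiny invented corpus in which the famous opening is over-represented,
# standing in for how often that phrase appears in real training data.
corpus = ("it was the best of times it was the worst of times "
          "it was a quiet morning and it was a quiet night").split()

# For each word, count which words follow it.
follows = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    follows[a][b] += 1

def predict(word):
    """Most likely next word given the previous word."""
    return follows[word].most_common(1)[0][0]

print(predict("best"))  # "of": the only continuation ever seen
print(predict("of"))    # "times": dominated by the ubiquitous phrase
```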

      There are only two cases in which an original work is likely to be duplicated: either the training data is far too small and the model is overtrained on that particular work, or the work is the most derivative text imaginable, lacking any flair or originality.

      Adding more training data makes it less likely to recreate any original works.

      I am aware of examples where it was claimed an LLM reproduced entirely code functions including original comments. That is either a case of overtraining, or far too many people were already copying that code verbatim into their own, thus making that work very over represented in the training data (same thing, but it was infringing developers who poisoned the data, not researchers using bad training data).

      Bottom line: when created with enough data, no original works are stored in any way that allows faithful reproduction other than by chance so random that it’s similar to rolling dice over a dictionary.

      None of this means AI can do no wrong, I just don’t find the copyright claim compelling.

    • @soul@lemmy.world (5 months ago)

      Half of your argument is just saying, “nu-uh” over and over again without any valid counterpoints.

    • @Michal@programming.dev (5 months ago)

      I’d be careful with the “always” part. There was a famous case involving Katy Perry where a single chord was sued over as copyright infringement. The case was thrown out on appeal, but I do not doubt that some pretty wild cases have been upheld as copyright violations (see “patent troll”).

      Are you really trying to argue against a point by providing evidence supporting it?

    • @General_Effort@lemmy.world (5 months ago)

      Heh. Funny that this comment is uncontroversial. The Internet Archive supports Fair Use because, of course, it does.

      This is from a position paper explicitly endorsed by the IA:

      Based on well-established precedent, the ingestion of copyrighted works to create large language models or other AI training databases generally is a fair use.

      By

      • Library Copyright Alliance
      • American Library Association
      • Association of Research Libraries

  • @LANIK2000@lemmy.world (5 months ago)

    This process is akin to how humans learn…

    I’m so fucking sick of people saying that. We have no fucking clue how humans LEARN, a.k.a. gather understanding, a.k.a. how cognition works or what it truly is. On the contrary, we can deduce that it probably isn’t very close to human memory/learning/cognition/sentience (or any other buzzword that stands in for things we don’t understand yet), considering human memory is extremely lossy and tends to introduce its own bias, as opposed to LLMs, which do neither and religiously follow patterns to their own fault.

    It’s quite literally a text prediction machine that started its life as a translator (and still does amazingly at that task), it just happens to turn out that general human language is a very powerful tool all on its own.

    I could go on and on as I usually do on lemmy about AI, but your argument is literally “Neural network is theoretically like the nervous system, therefore human”, I have no faith in getting through to you people.

    • @ZILtoid1991@lemmy.world (5 months ago)

      Even worse, in order to further humanize machine learning systems, they often give them human-like names.

  • lettruthout (5 months ago)

    If they can base their business on stealing, then we can steal their AI services, right?

    • @masterspace@lemmy.ca (5 months ago)

      How do you feel about Meta and Microsoft who do the same thing but publish their models open source for anyone to use?

      • lettruthout (5 months ago)

        Well, how long do you think that’s going to last? They are for-profit companies, after all.

        • @masterspace@lemmy.ca (5 months ago)

          I mean we’re having a discussion about what’s fair, my inherent implication is whether or not that would be a fair regulation to impose.

      • @WalnutLum@lemmy.ml (5 months ago)

        Those aren’t open source, neither by the OSI’s Open Source Definition nor by the OSI’s Open Source AI Definition.

        The important part for the latter being a published listing of all the training data. (Trainers don’t have to provide the data, but they must provide at least a way to recreate the model given the same inputs).

        Data information: Sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data. Data information shall be made available with licenses that comply with the Open Source Definition.

        They are model-available if anything.

        • @masterspace@lemmy.ca

          For the purposes of this conversation, that’s pretty much just a pedantic difference. They are paying to train those models and then providing them to the public to use completely freely in any way they want.

          It would be like developing open source software and then not calling it open source because you didn’t publish the market research that guided your UX decisions.

          • @WalnutLum@lemmy.ml

            You said open source. Open source is a type of licensure.

            The entire point of licensure is legal pedantry.

            And as far as your metaphor is concerned, pre-trained models are closer to pre-compiled binaries, which are expressly not considered Open Source according to the OSD.

            • @masterspace@lemmy.ca

              You said open source. Open source is a type of licensure.

              The entire point of licensure is legal pedantry.

              No. Open source is a concept. That concept also has pedantic legal definitions, but the concept itself is not inherently pedantic.

              And as far as your metaphor is concerned, pre-trained models are closer to pre-compiled binaries, which are expressly not considered Open Source according to the OSD.

              No, they’re not. Which is why I didn’t use that metaphor.

              A binary is explicitly a black box. There is nothing to learn from a binary, unless you explicitly decompile it back into source code.

              In this case, literally all the source code is available. Any researcher can read through their model, learn from it, copy it, twist it, and build their own version of it wholesale. Not providing the training data is more like saying that Yuzu or another emulator isn’t open source because it doesn’t provide copyrighted games. It provides every part of itself that it can open source, and then lets the user feed it whatever training data they are allowed access to.
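              (To illustrate that last point with a sketch: an “open weights” release is, concretely, just named arrays of numbers that anyone can load and inspect. The layer names and shapes below are made up for illustration, not any real model’s format.)

```python
# Sketch: a published model is just tensors anyone can reload, inspect,
# modify, and fine-tune. Hypothetical two-layer net; names/shapes are
# illustrative only.
import io
import numpy as np

rng = np.random.default_rng(0)
weights = {
    "layer0.weight": rng.normal(size=(4, 4)),
    "layer1.weight": rng.normal(size=(4, 2)),
}

# "Publish" the model to a buffer (stands in for a file on disk)...
buf = io.BytesIO()
np.savez(buf, **weights)
buf.seek(0)

# ...and any researcher can reload it and examine every parameter.
released = np.load(buf)
for name in released.files:
    print(name, released[name].shape)
```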

  • @dhork@lemmy.world

    Bullshit. AI are not human. We shouldn’t treat them as such. AI are not creative. They just regurgitate what they are trained on. We call what it does “learning”, but that doesn’t mean we should elevate what they do to be legally equal to human learning.

    It’s this same kind of twisted logic that makes people think Corporations are People.

    • @masterspace@lemmy.ca

      Ok, ignore this specific company and technology.

      In the abstract, if you wanted to make artificial intelligence, how would you do it without using the training data that we humans use to train our own intelligence?

      We learn by reading copyrighted material. Do we pay for it? Sometimes. Sometimes a teacher read it a while ago and then just regurgitated basically the same copyrighted information back to us in a slightly changed form.

  • Roflmasterbigpimp

    Okay that’s just stupid. I’m really fond of AI but that’s just common Greed.

    “Free the Serfs?! We can’t survive without their labor!!” “Stop Child labour?! We can’t survive without them!” “40 Hour Work Week?! We can’t survive without their 16 Hour work Days!”

    If you can’t make profit yet, then fucking stop.

  • @auzy@lemmy.world

    As others have said, it isn’t always inspiration; sometimes it literally just copies stuff.

    This feels like it was written by someone who invested their money in AI companies because they’re worried about their stocks

  • @MeaanBeaan@lemmy.world

    This process is akin to how humans learn by reading widely and absorbing styles and techniques, rather than memorizing and reproducing exact passages.

    Machine learning algorithms are not people and do not ingest these works the way a person does. This argument is brought up all the time and just doesn’t ring true. You’re defending a giant corporation’s unethical use of copyrighted works with a metaphor that has no bearing on reality, in an age where artists are already shamefully undervalued. Creating art is a human process with the express intent of it being enjoyed by other humans. Having an algorithm do it removes the most important part of art: the humanity.

  • @PixelProf@lemmy.ca

    As someone who researched AI pre-GPT to enhance human creativity and aid creative workflows, it’s sad for me to see the direction in which it’s been marketed, but I’m not surprised. I’m personally excited by the tech because I see a really positive place for it where the data usage is arguably justified, but we need to break through the current applications of it, which seem more aimed at stock prices and wow-factoring the public than at using these models for what they’re best at.

    The whole exciting part of these was that they could convert unstructured inputs into natural language and structured outputs: translation tasks (broadly defined), extracting key data points from unstructured data, language tasks in general. It’s outstanding for the NLP tasks we struggled with previously, and these tasks are highly transformative of any input, relying purely on structural patterns. I think few people would argue NLP tasks infringe on the copyright owner.
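    A sketch of the input/output shape of such a task — unstructured text in, structured record out. (The regexes are a stand-in for illustration; an LLM handles far messier input. The invoice text is made up.)

```python
# Sketch of an extraction task: turn free-form text into a structured
# record. A regex version only works on this exact phrasing; the point
# of LLMs is doing the same transformation robustly on messy input.
import re

note = "Invoice #4821 from Acme Corp, due 2024-03-15, total $1,250.00"

record = {
    "invoice_id": re.search(r"#(\d+)", note).group(1),
    "due_date": re.search(r"due (\d{4}-\d{2}-\d{2})", note).group(1),
    "total": re.search(r"\$([\d,.]+)", note).group(1),
}
print(record)  # {'invoice_id': '4821', 'due_date': '2024-03-15', 'total': '1,250.00'}
```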

    But I can at least see how moving toward using Q&A data to support generating Q&A outputs, media data to support generating media outputs, and code data to support generating code (particularly with MoE approaches) enters the territory of affecting sales and using someone’s IP to compete against them. From a technical perspective, I understand that LLMs are not really copying, but the way they are marketed and tuned seems more and more intended to use people’s data to compete against them, which is dubious at best.