In the LLM space, "open source" is being used to mean "downloadable weights" (alessiofanelli.com)
400 points by FanaHOVA on July 21, 2023 | 228 comments


> For the foreseeable future, open source and open weights will be used interchangeably, and I think that’s okay.

This is a little weird given that directly above, the author puts LLaMA into the "restricted weights" category. Even by the definition the author proposes, LLaMA 2.0 isn't open source; we shouldn't be calling it open source.

If open source in the LLM world means "you can get the weights" and doesn't imply anything about restrictions on their usage, then I don't think that's adapting terminology to a new context; I think it's really cheapening the meaning of Open Source. If you want to refer specifically to "open weights" as open source, I'm a bit more sympathetic to that (although I don't think it's the right terminology to use). But I see where people are coming from -- I'm not too put off by people using open source to describe weights you can download without restrictions on usage.

But LLaMA is not open weights. It's a closed, proprietary set of weights[0] that at best could be compared to source available software.

It is deceptive for Facebook to call LLaMA open source, and we shouldn't go along with that narrative.

[0]: to the extent weights can be copyrighted at all, which I would argue they can't be, but that's another conversation.


Author here. I agree with you. LLaMA2 isn't open source (as my title says, the HN one was modified). My point is that the average person will still call it "open source" because they don't know any better, and it's hard to fix that. Rather than just saying "this isn't open source", we should try to come up with better terminology.

Also, while weights usage might be restricted, it's a very big compute investment shared with the public. They use a 285:1 training tokens to params ratio, and the loss graphs show the model wasn't yet saturated. This is valuable information for other teams looking to train their own models.
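
To put that in numbers (back-of-the-envelope, using the 2T training tokens reported for LLaMA2 and the smallest, 7B, model):

    # tokens-to-params ratio, roughly
    tokens = 2_000_000_000_000  # ~2T training tokens reported for LLaMA2
    params = 7_000_000_000      # 7B parameters, the smallest LLaMA2 model
    print(tokens / params)      # ~285.7, i.e. roughly 285:1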

LLaMA1 was highly restrictive, but the data mix mentioned in the paper led to the creation of RedPajama, which was used in the training of MPT. There's still plenty of value in this work that will flow to open source, even if it doesn't fit in the traditional labels.


As I said last week, compiling source code does not cost millions of dollars. How much does it cost to gather training data? Training LLaMA cost around $30 million in infrastructure plus $50k in power costs (source: https://news.ycombinator.com/item?id=35008694).


Thanks for replying! And agreed on the title change; I think your original title is much, much better phrased and I'm sorry that I glossed over it when reading the article (although I'm not sure "doesn't matter" fully captures the distinction you're making here) -- mods probably shouldn't have changed it.

> There's still plenty of value in this work that will flow to open source, even if it doesn't fit in the traditional labels.

That is a good point; the fight over what is open source and what is source available can get heated, and part of that is a defense against the erosion of the term. But... in general, source available is better than closed source software. And LLaMA 2 is a significant improvement over LLaMA 1 in that regard, it really is. So I don't necessarily want to be down on it; in some ways this is just backlash from being tired of companies stretching definitions. But they're doing a thing that will absolutely help improve open access to LLMs.

I'm always a little bit torn about how to go about this kind of criticism of terminology, and I'm not trying to say that people shouldn't be excited about LLaMA 2. But the way it works out, I'm often playing word police, because the erosion of the term does make it harder to refer to models with actual open weights like StableLM. Facebook deserves real praise for releasing a model with weights that can be used commercially. It doesn't deserve to be treated as if what it's doing is equivalent to what StabilityAI or RedPajama is doing.

I do like your terminology of "open weights" and "restricted weights", and I wouldn't be opposed to breaking that down even further. I think there's a clear difference between LLaMA 1 and 2 in terms of user freedom, so I'm not opposed to people trying to distinguish, just... it's not hitting the bar of being open weights.

It's a bit like if the word vegetarian didn't exist, and everyone argued about how it's unhelpful to say that drinking milk isn't vegan because it's still tangibly different from eating meat. On one hand I agree, but on the other hand it's better to have another category for it that means "not vegan, but still not eating meat." There is an actual danger in blurring a line so much that the line doesn't mean anything anymore, and where people who mean something more rigorous no longer have a term to communicate amongst themselves. If average people get bothered by throwing LLaMA 2 into the "restricted weights" category, it's better to introduce another category between restricted and open that means "restricted, but not commercially restricted".

Beyond that though... yeah, I agree. I don't really have a problem with people calling open weights open source, my only objection to that is kind of technical and pedantic, but I don't think it causes any actual harm if someone wants to call StableLM open source.


I didn't realize that the llama license forbids you from using its outputs to train other models. That's essentially a dealbreaker, synthetic data is going to be the most important type of training data from here on out. Any model that prohibits use of synthetic data to train new models is crippled.


It's hilarious that big players in this space seem to think these are consistent views:

- It's okay to train a model on arbitrary internet data without permission/license just because you can access it

- It's not okay to train a model on our model


Like google is allowed to scrape the whole internet but you’re not allowed to scrape google. Rules for thee but not for me


Also the main business model of Google (and of search engines in general) is to republish rearranged snippets of copyrighted content and even serve whole copies of the content (googleusercontent cache), without prior authorization of the copyright holders, and for-profit.

It’s completely illegal if you think about it.

So why should LLMs that crawl the internet to present snippets and information be treated differently from Google, which also reproduces the same content verbatim without paying any compensation to the copyright owners (of all types: text, images, code)?


> It’s completely illegal if you think about it.

Google would argue (and they won in federal court versus the Author's Guild using this argument) that displaying snippets of publicly-crawlable websites constitutes "fair use." Profitability weighs against fair use but it doesn't discount it outright.

They would also probably cite robots.txt as an easy and widely-accepted "opt-out" method.

Overall, I'm not sure any court would rule against Google's use of snippets for search. And since Google's been around for over 20 years and they haven't lost a lawsuit over it, I don't think it's accurate to say "it's completely illegal if you think about it."

US copyright law is one of those things that might seem simple, but really isn't. Hence many of the copyright lawsuits clogging our judicial system.


If I was a gambling person I would say that interpretation of fair use is going to fall in the next 20 years as there is just too much weight put on it currently, and AI is just going to make it untenable in its current form.

In addition, the fair use test contains a pillar about the use not affecting the market for the copyright holder's works[1], which I think in Google's case (and probably in the current OpenAI case too) has obviously not held up (i.e. Google's use has demonstrably hurt the market for the original copyrighted work; news is the clearest example).

[1]: https://fairuse.stanford.edu/overview/fair-use/four-factors/


> ie google's use has demonstrably negatively affected the market for the original copyrighted work in cases such as news for example

Most news sites wouldn't get any traffic without search engines and aggregators. Which is why they are now whining about FB et al no longer sending them traffic.

And let's not forget that both traditional and online news is no stranger to republishing other people's content - one of the reasons fair use exists in the first place.

I have no love for big tech but let's not pretend that this is about anything other than news publishers wanting more gibs.


Well it's because judges are humans and humans are fallible. Humans also "like google" because it makes their life easier. It's hard to punish an entity you like.


It just looks like a little immoral vs. illegal confusion.


You think search engines are immoral? You think we should pay to view the snippets under the results we don't click?


No, I'm saying even though something is legal, it could still be immoral. And vice-versa.


I don't think we should pay, I think Google should. They're the ones making profit.


The result of that is either that they wouldn't show snippets or that they would pass the cost on to you. And do you think they profit from showing the snippets of results that are not the result you want to click on?


Not wanting to defend the likes of Google, but search engines link the original source (in contrast to LLMs). Their basic idea is to direct people to your content. There are countries where content companies didn't like what Google does: Google took them out of the index -> suddenly they were ok with it again so that Google put them in again. (extremely simplified story)


> Their basic idea is to direct people to your content.

This is less and less true, as evidenced by the rise of zero-click searches.

> There are countries where content companies didn't like what Google does: Google took them out of the index -> suddenly they were ok with it again so that Google put them in again.

This story screams antitrust.


I over-simplified. It's about Google News. The newspaper companies managed to lobby for a law that requires search providers to pay money to the newspapers they link to (or for the tiny excerpt shown in the search results). So Google said they would discontinue Google News in those countries. Suddenly the newspapers gave Google a free license to link to them. (still simplified story)


> This story screams antitrust.

You're right, the number of news publishers that share a common owner is something that should be of concern to antitrust enforcers.


> This story screams antitrust.

It does but the complainers are usually tabloid crap pushers whom no one in power really supports.


Because search engines do not create a mishmash of this data and parrot stuff about it. They also don't strip the source or the license, and they stop scraping my site when I tell them to.

LLMs scrape my site and code, strip all identifying information and license, and provide/sell that to others for profit, without my consent.

There are so many wrongs here, at every level.


It wouldn’t. Facebook is delusional if they think the license can pass muster.

Presumably you can't build an LLM that is a competitor of LLaMA using its outputs.

But AI weights are in legal gray zone for now. So it’s muddy waters and fair game for anyone who wants to take on the legal risks.


There's a standard for excluding content from indexing via the Robots Exclusion Standard, using robots.txt (sitewide) or the noindex robots meta tag (per page). The robots.txt standard has existed for nearly 30 years, being first proposed in February 1994.[1]

Should a publisher wish to be excluded from Google's, or any other web index's search and presentation, that's easy enough to specify.
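
For illustration, the sitewide opt-out is a couple of lines of robots.txt, and the per-page equivalent is a robots meta tag (these are the standard directives, nothing Google-specific):

    # robots.txt at the site root: ask all crawlers to skip the whole site
    User-agent: *
    Disallow: /

    <!-- or, per page, in the HTML head: keep this page out of indexes -->
    <meta name="robots" content="noindex">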

<https://www.intellectualpropertyblawg.com/ip-management/what...>

<https://developers.google.com/search/docs/crawling-indexing/...>

<https://en.wikipedia.org/wiki/Robots.txt>

(And no, I'm not a fan of Google by any stretch, but let's keep the discussion rigorous here.)

________________________________

Notes:

1. You don't feel old. You are old.


That's not how copyright law works at all. It doesn't say "well if you didn't want someone to copy this thing you should have stopped them from doing it". It lays out 4 factors for a court to consider about whether something is fair use and none of them are around how easy it was to rip the work off.[1]

In the LLM space it seems even more clear, because many/most of the works in the various corpora used for this training have very clear copyright terms which prevent digital storage and reproduction without the publisher's permission (just look at the reverse of the title page of any book for the copyright notice if you don't believe me).

Finally, for LLMs many/most of the works are in corpora[2] that people just download, so they aren't looking at a robots.txt file put up by the original site. If you look at The Pile paper[3] for example, they explicitly say that much of the material is under copyright and that they are relying on fair use.

[1]: https://fairuse.stanford.edu/overview/fair-use/four-factors/
[2]: https://github.com/Zjh-819/LLMDataHub (for example)
[3]: https://arxiv.org/abs/2101.00027


Since you raise the four factors test for fair use, let's spell those out:

(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;

(2) the nature of the copyrighted work;

(3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and

(4) the effect of the use upon the potential market for or value of the copyrighted work.

<https://www.law.cornell.edu/uscode/text/17/107>

Most critically, courts have put strong emphasis on the notion of transformative use of copyrighted works, and web indexing is transformative in the sense that it does not create a competing work, but provides a means of discovering and assessing the relevance of the indexed work itself.

As to web indexing, that (and associated factors including thumbnails and caching) have been ruled by courts to be fair-use adaptations of works:

Displaying a cached website in search engine results is a fair use and not an infringement. A “cache” refers to the temporary storage of an archival copy—often a copy of an image of part or all of a website. With cached technology it is possible to search Web pages that the website owner has permanently removed from display. An attorney/author sued Google when the company’s cached search results provided end users with copies of copyrighted works. The court held that Google did not infringe. Important factors: Google was considered passive in the activity—users chose whether to view the cached link. In addition, Google had an implied license to cache Web pages since owners of websites have the ability to turn on or turn off the caching of their sites using tags and code. In this case, the attorney/author knew of this ability and failed to turn off caching, making his claim against Google appear to be manufactured. (Field v. Google Inc., 412 F.Supp.2d 1106 (D. Nev., 2006).)

<https://fairuse.stanford.edu/overview/fair-use/cases/>

Or, to use your phrase, by common law (precedential case law), that is precisely "how copyright law works". Note particularly that the courts leaned on publishers' capabilities to indicate whether or not caching was or was not permitted "using tags and code".

There's a larger issue which I'm not aware of being explicitly raised in case law, which concerns how the World Wide Web is indexed as contrasted to how a print library is indexed. In the case of a library, an independent third party (the library cataloguer) assigns metadata to a work (standardised title, author(s), translator(s), illustrator(s), publisher(s), etc., as well as subject headings and call numbers). Additional indexing is provided through citation indices (both forward and reverse --- works cited by, and citing, other works). These largely don't rely on the text of the indexed work itself, though of course the cataloguer presumably is reading at least portions of the work to classify it. Critically: the works themselves are physical artefacts of fixed form which are virtually always read directly rather than interpreted through some mechanism.[1]

As it's evolved over the past quarter century or so, Web search doesn't rely strongly on metadata (though some of this is taken into consideration), and most particularly publisher-provided keywords are almost wholly ignored, largely due to flagrant abuse of that feature by some publishers. Instead, it uses a combined approach of full-text indexing (that is: capturing the full text of a work and identifying keywords and tuples (multi-word phrases) which can be matched against queries entered by persons searching for documents) and an assessment of the overall relevance of that work, usually at a site (or sub-site) level, based on other indicia, most famously (though somewhat less relevantly today) "PageRank", Google's original site-ranking algorithm.

Further, the entire mechanism of the Web is of creating copies of works on request. When an HTTP request is sent, the server responds by copying the requested work to an output stream, which is then received (and duplicated, often multiple times) by the client system as an integral part of the utilisation of that content. US copyright law does not have a section specifically referring to computer-network transmission, but there are multiple limitations on exclusive rights to copy (by authors) above and beyond the 107 Fair Use exemptions in sections 108 through 122 of 17 U.S.C, including specifically ephemeral recordings (108) and the case of computer programmes (117).

<https://www.law.cornell.edu/uscode/text/17/chapter-1>

Large language model training is a new area of use and law (legislative or common) is yet to be determined, but there's at the very least existing statutory language as well as precedent which suggest that at least some uses might well be found to be fair use. As I'm watching the situation evolve, I'm reminded strongly of several articles copyright scholar Pamela Samuelson wrote in the 1990s over adapting copyright to the Internet age, and questions of what its future place might be: specific governance over the literal copying of expressive works, or a general doctrine against misappropriation. As always, there's a sharp tension between authors' rights (and, let's be brutally honest: publishers' profits) and the underlying Constitutional justification of US copyright law: "To promote the Progress of Science and useful Arts".

<https://constitution.congress.gov/browse/article-1/section-8...>

And it seems Samuelson is engaged in the discussion of generative AI and copyright, though I've yet to read her work on the subject: <https://news.berkeley.edu/2023/05/16/generative-ai-meets-cop...>

(Discussion here strongly reliant on US law. There's general international agreement on copyright through the Berne Convention, though significant national differences exist.)

________________________________

Notes:

1. There is a spectrum of works, e.g., print books, phonographs, CDs and DVDs (the latter containing anti-circumvention mechanisms), etc., but in general there's minimal if any intermediate copying and duplication of works, and in many cases none at all.


I appreciate the detail in your reply. Do you think the recent Warhol "Orange Prince" case[1] gives an inkling into possible future court treatment of the question of "transformative" use for generative AI models? There, Warhol's silk screen print of the original Prince photo was deemed not transformative enough, as I understand it. One of the things about the stochastic nature of generative AI is that it can be rather hard to notice when the model spits out something very close to the training material.

[1] https://www.theguardian.com/artanddesign/2023/may/18/andy-wa...


Good question, I've seen some coverage of the case, and ... tend to disagree with the court's decision.

That said, it would tend to darken the prospects for operators of LLM generative AI systems, IMO.


What rules? Google won’t scrape your part of the internet if you don’t allow it, right?


Google respects robots.txt and asks you to use it to opt out of their crawling.

Parent's point is that if your own scraping army respected a "scraping.txt" you made up and went after Google because they don't opt out in their scraping.txt, it probably wouldn't fly.


I don't understand. What does "Rules for thee but not for me" mean if "google is allowed to scrape" whatever people allows Google to scrape but "you’re not allowed to scrape google" because using the same rules google.com/robots.txt says

   User-agent: *
   Disallow: /search
   ....


There's an imbalance because the robots.txt rule is something Google pushed forward (didn't invent it, but made it standard) and is opt-out. So yes, Google made up their rules and won't let other people make up their own self-beneficial rules in a similar way.


> Google [...] won't let other people make up their own self-beneficial rules in a similar way.

What "other people"?

If it's the "you" who is not allowed to scrape google in https://news.ycombinator.com/item?id=36817237 then you can make your own "google is not allowed to scrape my thing" rules if you think that's beneficial for you.

If it's somehow related to LLM providers or users I doubt that's what the original comment was referring to.

To be clear, I understand the original comment as

    LLM companies say "I can use your content and you cannot prevent me from doing so, but I won't allow you to use the output of the LLM" just like Google says "I can scrape your content and you cannot prevent me from doing so, but I won't allow you to scrape the output of the search engine"
and that doesn't seem a valid analogy.


You should change "you cannot prevent me from doing so" into "you'll need to set up your resources in the way that I defined if you don't want me to slurp them".

I see it as the equivalent of spam mail that requires the user to log in to disable it.


The belief that makes them consistent is that the authors of a million Reddit posts have no way to assert their rights while the big company that trained a Redditor model does.


Sure they do, albeit a shitty one: it's called a class-action.


Yes, they have to pick one or the other. Until then I'm going to assume that the model licence doesn't apply since the first point would be invalid and the model could not be built in the first place.


It tells you that they think their moat is data quality/quantity.


I can license my LLM however I want to... but I can't sail this ship to generally-intelligent-Tortooga on me lonesome. Savvy?


> think these are consistent views

they are consistent, if they believe themselves to be "special" and deserving of special treatment!


Those are perfectly consistent, despite what ideologically-driven people may want to believe.

Copyright is literally the right to copy. Arbitrary Internet data that is not copied does not have any copyright implications.

The difference is that LLaMa imposes additional contractual obligations that, for ideological reasons (Freedom #0), open source software does not.

This issue reminds me of the FSF/AGPL situation. At some point you just have to accept that copyright law, in and of itself, is not sufficient to control what people do with your software. If you want to do that, you have to limit end-user freedom with an EULA.

If someone uses LLaMa output to train models, it is unlikely they will be sued for copyright infringement. It is far more likely they will be sued for breach of contract.


> Arbitrary Internet data that is not copied does not have any copyright implications.

Training a model on model output isn't copying.

There's no way to phrase this where training a model on copyrighted human-generated images/text isn't copying, but training a model on computer-generated images/text is copying.

> If you want to do that, you have to limit end-user freedom with an EULA.

If you want to limit end-user freedom with a EULA, you have to figure out how to get users to sign it. Copyright is one way to force them to do so, but doesn't really seem relevant to this situation if training a model on copyrighted material is fair use.

And again, if somebody generates a giant dataset with LLaMA, if you want to argue that pushing that into another LLM to train with is making a copy of that data, then there's no way to get around the implication there that training on a human-generated image is also making a copy of that image.


> Training a model on model output isn't copying.

That's literally what I said.

> There's no way to phrase this where training a model on copyrighted human-generated images/text isn't copying, but training a model on computer-generated images/text is copying.

Literally nobody is saying that.

> If you want to limit end-user freedom with a EULA, you have to figure out how to get users to sign it.

That is not true. ProCD v. Zeidenberg, 86 F.3d 1447 (7th Cir. 1996).

You and others seem to have an over-the-top hostile reaction to the idea that contract law can do things copyright law cannot do. But it is objective and unarguable fact.


> Literally nobody is saying that.

Okay? Apologies for making that assumption. But if you're not saying that, then your position here is even less defensible. Arguing that model output isn't copyrightable but that it's still covered by EULA if anyone anywhere tries to use it is even more absurd than arguing that it's covered by copyright. The interpretation that this is covered by copyright is arguably the charitable interpretation of what you wrote.

> That is not true. ProCD v. Zeidenberg, 86 F.3d 1447 (7th Cir. 1996).

ProCD is about shrinkwrap licenses; the court determined that buying the software and installing it was the equivalent of agreeing to the license.

In no way does that imply that licenses are enforceable on people who never agreed to the licenses. The court expanded what counts as agreement, it does not mean you don't have to get people to agree to the EULA. I mean, take pedantic issue with the word "sign" if you want (sure, other types of agreement exist, you're correct), but the basic point is still true -- if you want to restrict people with a EULA, they need to actually agree to the EULA. All that ProCD did was establish that buying a product and opening the package and installing it constituted agreement.

And that becomes a problem because if you don't have IP law as a way to block access to your stuff, then you don't really have a way to force people to agree to the EULA. Someone using LLaMA output to train a model may have never been in a position to agree to that EULA, and Facebook doesn't have the legal ability to say "hey, nobody can use output without agreeing to this" because they don't have copyright over that output. Can they get people to sign a EULA before downloading the weights from them? Sure. Is that enough to restrict everyone else who didn't download those weights? No.

To go a step further, if you don't believe that weights themselves are copyrightable, then putting a EULA in front of them is even less effective because people can just download the weights from someone else other than Facebook.

You can host a project Gutenberg book and get people to sign a EULA before they download it from you, even though you don't own the copyright. And that EULA would be binding, yes. But you cannot host a project Gutenberg book, put a EULA in front of it, and then claim that people who don't download it from you and instead just grab it off of a mirror are still bound by that EULA.

Your ability to control access is what gives you the ability to force people to sign the EULA. And that's kind of dependent on IP law. If someone sticks the LLaMA 2.0 weights on a P2P site, and those weights aren't covered by copyright or other IP law, then no, under no interpretation of US law would downloading those weights from a 3rd-party source constitute an agreement with Facebook.

But even if you don't take that position, even if you assume that model weights are copyrightable, if I download a dataset generated by LLaMA, there is still no shrinkwrap license on that data.

To your original point:

> If someone uses LLaMa output to train models, it is unlikely they will be sued for copyright infringement. It is far more likely they will be sued for breach of contract.

It is incredibly unlikely that someone using a 3rd-party database of LLaMA output would be found to be in violation of contract law unless at the very least they had actually agreed to the contract by downloading LLaMA themselves. A restriction on the usage of LLaMA does not mean anything for someone who is using LLaMA output but has not taken any action that would imply agreement to that EULA.

> You and others seem to have an over-the-top hostile reaction to the idea that contract law can do things copyright law cannot do. But it is objective and unarguable fact.

No, what we have a hostile reaction to is the objectively false idea that a EULA covers unrelated 3rd parties. That's not a thing, it's never been a thing.

I don't know what to say if you disagree with that other than that I'm putting a EULA in front of all of Shakespeare's works that says you now have to pay me $20 before you use them no matter where you get them from, and apparently that's a thing you believe I can do?


My "position" is the law, whether you like it or not.

Clickwrap agreements are enforceable, and legally enforceable agreements can place more restrictions on the use of a piece of software than copyright law alone can.

As a result, software that, for ideological reasons, does not restrict use will always have fewer protections than software with more restrictive terms.

Your off-topic rant about Shakespeare is irrelevant.


> My "position" is the law, whether you like it or not.

> Clickwrap agreements are enforceable, and legally enforceable agreements can place more restrictions on the use of a piece of software than copyright law alone can.

To take a page from your earlier comment, literally no one here is denying the existence of clickwrap agreements. Clickwrap agreements are completely irrelevant to the current conversation.

> Your off-topic rant about Shakespeare is irrelevant.

You can not enforce a EULA on someone interacting with a piece of work you do not own IP rights to if they did not agree to that EULA in some way.

I'm sorry, but agreement is part of contract law.

If you think you can force a EULA on a piece of content you don't own that will bind people who got the content from a 3rd-party and who never agreed to your EULA under any legal definition of agreement, then by all means, slap a EULA on Shakespeare. It makes just as much sense as what you're suggesting.


>> If you want to limit end-user freedom with a EULA, you have to figure out how to get users to sign it.

> literally no one here is denying the existence of clickwrap agreements.

You denied the enforceability of clickwrap agreements. You were wrong.

LLaMA uses a clickwrap agreement. "By clicking 'I Accept' below or by using or distributing any portion or element of the Llama Materials, you agree to be bound by this Agreement."

That agreement covers its output: "You will not use the Llama Materials or any output or results of the Llama Materials to improve any other large language model (excluding Llama 2 or derivative works thereof)."

Your hypotheticals about third parties are off-topic and have zero bearing on this conversation.

The topic under discussion is whether it is logically "inconsistent" for Meta to claim its output is protected while other content is not. Those two positions are perfectly consistent in light of the fact that LLaMA output is protected by the terms of a clickwrap agreement.


Facebook absolutely factually does not have a clickwrap agreement over 3rd-party content generated with LLaMA; restrictions of users do not magically mean that output has its own universally enforceable EULA applied to everyone else. There is no interpretation of US contract law that says that 3rd-party data generated with LLaMA would be subject to LLaMA's license. There is no clickwrap agreement over LLaMA's output, and no legal precedent that argues that any restriction of LLaMA's usage would apply to 3rd-parties accessing that output. The output is not protected in the way you claim, and I fully stand by the fact that a clickwrap agreement over downloading LLaMA from Facebook would not be enforceable over people who did not download LLaMA and are merely using 3rd-party LLaMA output.


> Arbitrary Internet data that is not copied

It's all but certainly copied, and not just in the "held in memory" sense but actually stored along with the rest of the training collection. What may not happen is distribution. There's a difference in scale/nature of copyright violation between the two but both could well be construed that way.

Additionally, I think there's a reasonable argument that use as training data is a novel one that should be treated differently under the law. And if there's not:

> If you want to do that, you have to limit end-user freedom with an EULA.

What will eventually happen -- at least without some kind of worldwide convention -- is that someone who can successfully dodge licensing obligations will be able to take and redistribute weight-data and/or clean-room code.

At least, if we're adopting a "because we can" approach to everything related.


But you can publish the output, right? And then a “third party” could train a different model on just that published material without copying it or ever agreeing to a EULA.


If you believe that courts will find your shell game convincing, you are free to try it and incur the legal risk. I recommend you consult with an attorney before doing so.


You could simply train on the output straight up and nobody would ever be able to tell anyway.


One of the common elements of training sets for these models (including LLama) is the Books3 dataset, which is a huge number of pirated books from torrents. That's exactly what you described.

Regardless, the lack of a license cannot give you more permission than a restrictive license. You're arguing that if I take a book out of a bookstore without paying (or signing a contract), then I have more rights than if I sign a contract and then leave with the book.


> the lack of a license [agreement] cannot give you more permission than a restrictive license [agreement]

That is clearly false. It's hard to imagine the confusion of ideas that would lead you to such a conclusion.


I don't see how this would be enforceable in law without killing almost every AI company on the market today.

The whole legal premise of these models is that training on copyrighted material is fair use. If it's not, then... I mean is Facebook trying to claim that including copyrighted material in a dataset isn't fair use regardless of the author's wishes? Because I have bad news for LLaMA then.

"You need permission to train on this" is an interesting legal stance for any AI company to take.


From my non-legal-professional POV I can see an angle which may work:

Firstly, llama is not just the weights, but also the code alongside it. The weights may or may not be copyrightable, but the code is (and possibly also the network structure itself? that would be important if true but I don't know if it would qualify).

Secondly, you can write what you want in a copyright license: you could write that the license becomes null and void if the licensee eats too much blue cheese if you want.

Following from that, if you were to train on the outputs of the AI, you may not be guilty of copyright infringement in terms of doing the training (both because AI output is not copyrightable in the first place, something which seems pretty set in precedent already, and possibly also because even if it was, it gets established that it is fair use like any other data), but if it means your license to the original code is revoked then you will at the very least need to find another implementation that can use the weights, and possibly lose access to the weights themselves (if the weights can be copyrighted, which I would argue is probably not the case if you follow the argument that the training is fair use, especially if the reasoning is that the weights are simply a collection of facts about the training data, but it's very plausible that courts will rule differently here).

This could wind up with some strange situations where someone generating output with the intent of using it for training could be prosecuted (or at least forced to cease and desist) but anyone actually using that output for training would be in the clear.

I agree it is extremely "have your cake and eat it" on the part of the AI companies: They wish to both bypass copyright and also benefit from the restrictions of it (or, in the case of OpenAI, build a moat by lobbying for restrictions on the creation and use of the models themselves, by playing to fears of AI danger).


These are good points to bring up.

> This could wind up with some strange situations where someone generating output with the intent of using it for training could be prosecuted (or at least forced to cease and desist) but anyone actually using that output for training would be in the clear.

I'll add to this that it's not just output; say that someone is using another service built on top of LLaMA. Facebook itself launched LLaMA 2.0 with a public-facing playground that doesn't require any license agreement or login to use.

You can go right now and use their public-facing portal and generate as much training data as you can before they IP-block you, and... as far as I can tell, you haven't done anything in that scenario that would bind you to this license agreement.

So I still feel like I'll be surprised if any AI company that's serious about bootstrapping itself off of LLaMA is going to be too concerned about this license (whether that's a good idea at all, given that the training data itself might be garbage, is another conversation). It just seems so easy to get around any restrictions.


The code is largely irrelevant - it's all simple enough that it can be easily replaced, and most current users of LLaMA only use the weights in practice.

NN design is more interesting, but I don't think we're at the point yet where they are sufficiently complex to be copyrightable in general. Patentable, maybe.


> Following from that, if you were to train on the outputs of the AI, you may not be guilty of copyright infringement in terms of doing the training (both because AI output is not copyrightable in the first place, something which seems pretty set in precedent already, and possibly also because even if it was, it gets established that it is fair use like any other data), but if it means your license to the original code is revoked

The majority of the time, the code and weights are under independent license terms. While in theory the code license could say it is revoked or revocable if you violate the terms of the weights license, I think such a license term is rare in practice.

It is quite common even when the weights are under a restricted license for the code to be released under a standard open source license, and no open source license contains such a license term (and it would probably make the license non-open source were it included)


> The whole legal premise of these models is that training on copyrighted material is fair use.

Not to diminish the conversation here, but not even a Supreme Court Justice knows what the legality is. You’d have to be a whole 9 person Supreme Court to make an accurate statement here. I don’t think anyone really knows how Congress meant today’s laws to work in this scenario.


> I don’t think anyone really knows how Congress meant today’s laws to work in this scenario.

Congress, or more accurately, the drafters of the Constitution, intended that Congress would work to keep the Constitution updated to match the needs of modern times. Instead, Congress ossified to the point it's unable to pass basic laws because a bunch of far right morons hold the House GQP hostage, and an absurd amount of leverage was passed to the executive and the Supreme Court as a result - with the active aid of both parties, by the way, who didn't even think of passing actual laws to codify something as important as equitable access to elections, fair elections, or the right to have an abortion or to smoke weed when they held majorities. And on top of that, your Supreme Court and many Federal court picks were hand-selected from a society that prefers a literal reading of the constitution.

But fear not, y'all are not alone in this kind of idiocy, just look at us Germans and how we're still running on fax machines.


I'd say it's enforceable in the sense that if you agree to the license then violating those terms would be breach of contract regardless of whether use of the LLaMA v2 output is protected by copyright or not. But there's nothing stopping someone else who didn't agree to the license from using output you generate with LLaMA v2 to train their model.


I don't want to dip too much into the conversation of whether weights themselves are copyrightable, but note that it's very easy in the case of LLaMA 1.0 to get the weights and play with them without ever signing a contract.

If they turn out to be not copyrightable, then... all this would mean is downloading LLaMA 2.0 weights from a mirror instead of from Facebook.


It's so hypocritical, it's insane.

"Yes, we train our models on a good chunk of the internet without asking permission, but don't you dare train on our models' output without our permission!"

And OpenAI also has a similar restriction.


In fact they (both Facebook and OpenAI) can't train their models without asking permission. Just wait for someone to start raising this concern. The EU is working on regulating these kinds of aspects; for example, this is not compliant at all with the GDPR (unless you train only on data that doesn't contain personal data, which is rarer than you would think).


Fundamentally untrue, and disheartening that it's the top comment.

You can't use a model's output to train another model; it leads to complete gibberish (termed "model collapse"). https://arxiv.org/abs/2305.17493v2

And the Llama 2 license allows users to train derivative models, which is what people really care about. https://github.com/facebookresearch/llama/blob/main/LICENSE


The truth is between these two. You can use a model’s output to train another model, but it has drawbacks, including model collapse.


Some of the best LLaMA finetunes today are trained on GPT-4 output.

Yes, you cannot do this kind of thing indefinitely and expect endless improvements from "endless training set". But that's a very different problem.


Good luck enforcing that, though. How would they ever know?


Disgruntled current or former employee turning in their employer for the reward? That’s how Microsoft and the BSA used to bust people before the days of always online software.


i wonder if they could include some marker prompt and response that wouldn't occur "naturally" from any other model or training data



Level1Techs' "link show" (because we can't call it news anymore) kind of touched on this topic. I would like to read what you guys make of this:

> Supreme Court rejects Genius lawsuit claiming Google stole song lyrics SCOTUS won't overturn ruling that US copyright law preempts Genius' claim.

> The song lyrics website Genius' allegations that Google "stole" its work in violation of a contract will not be heard by the US Supreme Court. The top US court denied Genius' petition for certiorari in an order list issued today, leaving in place lower-court rulings that went in Google's favor.

> Genius previously lost rulings in US District Court for the Eastern District of New York and the US Court of Appeals for the 2nd Circuit. In August 2020, US District Judge Margo Brodie ruled that Genius' claim is preempted by the US Copyright Act. The appeals court upheld the ruling in March 2022.

> "Plaintiff's argument is, in essence, that it has created a derivative work of the original lyrics in applying its own labor and resources to transcribe the lyrics, and thus, retains some ownership over and has rights in the transcriptions distinct from the exclusive rights of the copyright owners... Plaintiff likely makes this argument without explicitly referring to the lyrics transcriptions as derivative works because the case law is clear that only the original copyright owner has exclusive rights to authorize derivative works," Brodie wrote in the August 2020 ruling.

> Google search results routinely display song lyrics via the service LyricFind. Genius alleged that LyricFind copied Genius transcriptions and licensed them to Google.

> Brodie found that Genius' claim must fail even if one accepts the argument that it "added a separate and distinct value to the lyrics by transcribing them such that the lyrics are essentially derivative works." Since Genius "does not allege that it received an assignment of the copyright owners' rights in the lyrics displayed on its website, Plaintiff's claim is preempted by the Copyright Act because, at its core, it is a claim that Defendants created an unauthorized reproduction of Plaintiff's derivative work, which is itself conduct that violates an exclusive right of the copyright owner under federal copyright law," Brodie wrote.

https://arstechnica.com/tech-policy/2023/06/supreme-court-re...


The basic idea is whether an unauthorised derivative work is itself entitled to copyright protection: could the creator of the derivative work prevent the original creator (or anyone else) from copying it, even though they themselves have no permission to distribute it? (If the work is authorised, this is generally considered to be the case.) It looks like the conclusion from this is "no", at the very least in this case. I'm not sure this matches most people's moral intuitions: every now and again a big company includes some fan art in their own official release without permission (usually not as a result of a general policy, but because of someone getting lazy and the rest of the system failing to catch it), and generally speaking the reaction is negative.


> whether an unauthorised derivative work is itself entitled to copyright protection

That is not what this court case was about. Genius had already settled the issue of unauthorised transcriptions and had bought licences for its lyrics after a lawsuit in 2014, so its own work was no longer unauthorised. In the case cited above, Genius was trying to enforce its claims against Google via contract law rather than copyright law. The court ruled that the alleged violations were covered by copyright law, so they could only be pursued via copyright law, and that only the copyright holder (or assignee) of the lyrics that were copied could sue Google under it.


They could have picked up the LLM equivalent from LLM generated posts online however. How do you prove they didn't?


as a layman, i imagine for someone at the scale required it may not be worth the risk or the added effort vs paying or using a different model but it'd be funny if we see companies creating a subsidiary that just acts as a web-passthrough to "legalize" llama2 output as training data


Not that it's okay for this to be in the license, but I'm curious: what is the use case for synthetic data? Most of the discussion I've seen has been about how to avoid accidentally using LLM-generated data.


You can use synthetic data produced by more complex models to finetune smaller ones to be better.
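
A minimal sketch of what that looks like in practice (assuming the HuggingFace transformers pipeline API; the teacher model and prompts here are just placeholders):

    # Sketch: build a synthetic finetuning set from a larger "teacher" model.
    # Real distillation pipelines also filter/deduplicate the teacher's output.
    import json
    from transformers import pipeline

    teacher = pipeline("text-generation", model="gpt2-xl")  # stand-in teacher

    prompts = [
        "Explain what a robots.txt file does.",
        "Summarize fair use in one sentence.",
    ]

    with open("synthetic.jsonl", "w") as f:
        for p in prompts:
            out = teacher(p, max_new_tokens=100)[0]["generated_text"]
            f.write(json.dumps({"prompt": p, "completion": out}) + "\n")

The resulting (prompt, completion) pairs then feed a standard supervised finetuning loop for the smaller model.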


Tuning a tiny classifier


I'm not sure why anyone would even do that in the first place; LLaMA doesn't generate synthetic data that would be even remotely good enough. Even GPT 3.5 and 4 are already very borderline for it, with lots of wrong and censored answers. And at best you'd make a model that's as good as LLaMA is, i.e. not very.


Instruction-tuning is the obvious use case. That much has nothing to do with subjectivity, alignment or censorship, it's will-you-actually-show-this-as-JSON-if-asked.
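
For instance, a synthetic instruction-tuning record might look something like this (the field names are illustrative; different projects use different schemas):

    {
      "instruction": "Return the three primary colors as a JSON array.",
      "response": "[\"red\", \"yellow\", \"blue\"]"
    }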


That's tuning llama, which is allowed from what I understand. Otherwise why release it at all; it's not very functional in its initial state anyway. What that clause applies to is using llama outputs to train a completely new base model, which makes no practical sense.

As for generating JSON, that's more of an inference-runtime thing, since you need to pick the top tokens that result in valid JSON instead of just hoping the model returns something that can be parsed. On top of extensive tuning, of course.
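
A toy sketch of that idea (characters stand in for tokens, and the prefix checker is deliberately minimal, tracking only strings and brackets; real grammar-constrained samplers do this over the model's full vocabulary and logits):

    # Constrained decoding in miniature: among candidates ordered by model
    # probability, pick the first one that keeps the output a plausible
    # JSON prefix instead of sampling freely and hoping it parses.
    def is_json_prefix(s):
        stack, in_str, esc = [], False, False
        for ch in s:
            if in_str:
                if esc: esc = False
                elif ch == "\\": esc = True
                elif ch == '"': in_str = False
            elif ch == '"': in_str = True
            elif ch in "{[": stack.append(ch)
            elif ch == "}":
                if not stack or stack.pop() != "{": return False
            elif ch == "]":
                if not stack or stack.pop() != "[": return False
        return True  # no bracket/string violation so far

    def pick(candidates, prefix):
        # candidates: tokens sorted by (mocked) model probability
        for ch in candidates:
            if is_json_prefix(prefix + ch):
                return ch
        return None

    print(pick(["}", "]"], '{"a": [1, 2'))  # -> "]", the only valid closer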


I played with Llama2 for a bit, and for a lot of the questions I asked I got completely made-up garbage. Why would you want to train on it?


It's exactly the opposite. We have better ways to combine the knowledge of several models together than sampling them. (i.e. mixture of experts, model merges, etc) Relying on synthetic data from one LLM to train another LLM is in general a terrible idea and will lead to a race to the bottom.


> forbids you from using its outputs to train other models.

I don't know how one can even forbid this. As a human, I'm a walking neural net, and I train myself on everything that I see, without a choice. The only difference is I'm a carbon-based neural net.


I would just do it anyway. In fact, I can release a suitably laundered version and you'd never know. If I release a few million, each with slight variation, there's no way provenance can be established. And then we're home-free.


A contract ordinarily has to have consideration. Since LLaMa weights are not copyrightable by Meta and are freely available, what exactly is the consideration? The bandwidth they provide?


Generate data using ai, save it, it cannot be copyrighted or anything, data isn't a model, use it as much as you want for training.

Ezpz


This isn't really new, the strict "Open Source" as defined for software has never made exact, perfect sense for anything other than software. That's why the Creative Commons licenses exist; putting a photographic image under GPL2 has never made any sense. It always needs redefinition in new media.


An LLM is more like software than it is like media. The GPL defines source code as the preferred form for making modifications, including the scripts needed for building the executable from source. The weights in this case are more similar to the optimized executable code that comes out of a flow. The "source" would be the training data and the code and procedures for turning that into a model. For very large LLMs almost no one could use this, but for smaller academic models it might make sense, so researchers could build on each others' work.


Creative Commons has never claimed to be an open source licence though; they usually use the term free culture.


Even for media such as photos, songs, and videos, you have a source: the raw materials and the project files from which you rendered the image, video, or audio output.

The real source of a language model is the code that was used to train the particular model. The model itself is more like a compiled binary, although not in machine code.

So for a model to really be open source, to me, it would mean that you have to release the software used for generating it, so I can modify it, train it on my data, and use it.


It doesn't need redefinition. We just need a new term for new media.


The strict "Open Source" wasn't even a definition when I started college.


Open Source didn't exist until Netscape created Mozilla in early '98; the definition was developed soon after and then tuned for some years until we got today's "Open Source".


It remains to be seen in court whether weights are even copyrightable, potentially making all the various licenses and their restrictions moot.


It seems like a dangerous clause to me.

1) "Dear artists, the model cannot infringe upon your copyright because it's merely learning like a human does. If it accidentally outputs parts of your book, you know, it just accidentally plagiarized. We all do it haha! Our attorneys remind you that plagiarism is not illegal in the US."

2) "Dear engineers, the output of our model is copyrighted and thus if you use it to train your own model, we own it."

I am not sure how both of those can be true at the same time.


2) doesn't line up with US courts' current stance that only a human can hold copyright, and thus anything created by a non-human cannot have copyright applied. This applies to animals, inanimate objects, and presumably, AI.

I have no idea how this impacts the enforceability of the license from FB, which may rely on things other than copyright, but as of right now, the output absolutely cannot be copyrighted.


That's an extremely good point. The output of software is never copyrightable. What makes language models not software?


Isn't Photoshop software?


Photoshop's output has been completely guided (until recent additions) by a human who can hold a copyright.

That being said, isn't a prompt guidance?


Adobe doesn't hold copyright on images produced using Photoshop. Assuming prompt guidance can be used to claim copyright (unclear, see https://arstechnica.com/information-technology/2023/02/us-co... ), that copyright would presumably be held by the person doing the guidance and not the company that trained the AI.


We all truly do "accidentally plagiarize", especially artists. Many guitarists realize they accidentally copied a riff they thought they'd come up with on their own for example.


I, for one, welcome our new plagiarism overlords.

Oops.

I added the "haha" in there because the probability of a human doing this kind of goes way down as the length of the text increases. Can you type, verbatim, an entire chapter of a book? I can't. But, I bet the AI can be convinced in rare cases to do that.

The whole thing is very interesting to me. There was an article on here a couple days ago about using gzip as a language model. Of course, gzipping a book doesn't remove the copyright. So how low does the probability of outputting the input verbatim have to be before copyright is lost?
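
For anyone who missed that article, the gzip trick boils down to using compression length as a crude similarity measure; a rough sketch of the idea (my paraphrase, not the paper's exact code):

    # Normalized compression distance via gzip: similar texts share
    # structure, so compressing them together adds little over
    # compressing the larger one alone.
    import gzip

    def ncd(a, b):
        ca = len(gzip.compress(a.encode()))
        cb = len(gzip.compress(b.encode()))
        cab = len(gzip.compress((a + " " + b).encode()))
        return (cab - min(ca, cb)) / max(ca, cb)

    print(ncd("the cat sat on the mat", "a cat sat on a mat"))      # lower
    print(ncd("the cat sat on the mat", "quarterly profits rose"))  # higher

A nearest-neighbor classifier over that distance was enough to compete with neural models on some text classification benchmarks.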

Reading the book and benefitting from what you learned? Obviously not copyright infringement. Putting the book into gzip and sending your friend the result? Obviously copyright infringement. Now we're in the grey area and ... nobody knows what the law is, or honestly, even how to reason about what the law wants here. Fun times.

(Personally, I lean towards "not copyright infringement", but I'm not a big believer in copyright myself. In the case of AI training, it just makes it impossible for small actors to compete. Google can just buy a license from every book distributor. SmolStartup can't. So if we want to make AI that is only for the rich and powerful, copyright is the perfect tool to enable that. I don't think we want that, though.

My take is that the rest of society kind of hates Tech right now ("I don't really like my Facebook friends, so someone should take away Mark Zuckerberg's money."), so it's likely that protectionist laws will soon be created that ruin it for everyone. The net effect of that is that Europe and the US will simply flat-out lose to China, which doesn't care about IP.)


There are people who can type, verbatim, entire chapters of books.


China currently has the most stringent limits for LLMs available to end users because of concerns about their political alignment. So if you believe that it's the whole market competition part that's most important in getting the best results long term, they have shot themselves in the foot first.

Of course, the models that are developed for internal use by the Chinese government won't be so limited, regardless of what the law says. But then neither will the ones developed by Western three-letter agencies. So don't worry about the "Great Game"; they'll do just fine one-upping each other and screwing over all of us in the process.


The overwhelming majority of all human advancement is in the form of interpolation. Real extrapolation is extremely rare, and most don't even know when it's happening. This is why it's extremely hypocritical for artists of any sort to be upset about Generative AI. Their own minds are doing the exact same thing they get upset about the model doing.

This is why a fundamentally "interpolative" technique like ChatGPT (whose weights are in theory frozen) is still basically super-intelligent.


Wow, you appear to know a great deal about how human minds work: "doing the same exact thing they get upset about the model doing"... May I ask you to put up a list of publications on the subject of how minds work?


My insights are widely accepted theories from various fields, all available in the public domain.

It's a well-understood concept that our minds function by making sense of the world through patterns. This is the essence of interpolation - taking two known points and making an educated guess about what lies in between. Ever caught yourself finishing someone's sentence in your mind before they do? That's your brain extrapolating based on previous patterns of speech and context. These processes are at the heart of human creativity.

The field of Cognitive Science has extensively documented our tendency for interpolation and pattern recognition. Works like The Handbook of Imagination and Mental Simulation by Markman and Klein, or even "How Creativity Works in the Brain" by the National Endowment for the Arts all attest to this.

When artists create, they draw from their experiences, their knowledge, their understanding of the world - a process overwhelmingly of interpolation.

Now, I can see how you might be confused about my reference to ChatGPT being "super-intelligent". Perhaps "hyper-competent" would be more appropriate? It has the ability to generate text that appears intelligent because it's interpolating from a massive amount of data - far more than any human could consciously process. It's the ultimate pattern finder.

And that, my friend, is my version of "publications on the subject of how minds work." I may not be an illustrious scholar, but hey, even a broken clock is right twice a day! And who knows, maybe I'm on to something after all.


There was a famous case where John Fogerty (formerly of Creedence Clearwater Revival) ended up getting sued by CCR's record label, which claimed a later solo song he did with a different label was too similar to a CCR song that he wrote, and they won. So legally speaking, you can even get in trouble for coming up with the same thing twice if you don't own the copyright of the first one.


The copyright situation with music is kinda broken; different parts of the performance get quite different priority when it comes to copyright (many core elements of a performance get basically no protection, whereas the threshold for what counts as a protectable melody is absurdly low). In particular, this means it's less than worthless for some genres/traditions: for jazz and blues, especially, a huge part of the genre and culture is adapting and playing with a shared language of common riffs.


In a similar vein, the common "you may not use this model's output to improve another model" clause is AFAIK unenforceable under copyright, so it's at best a contractual clause binding a particular user. Anyone using that improved model afterward is in the clear.


The idea is that if you violate the terms of the license to develop your own model, you lose your rights under the license and are creating an infringing derivative work. If I clone a GPL'd work and ship a derivative work under a commercial license, downstream users can't just integrate the derivative work into a product without abiding by the GPL terms and say "well we're downstream relative to the party who actually copied the GPL'd work, so the GPL terms don't apply to us".


If such a "derivative" model is a derivative work, then aren't all these LLMs just mass copyright infringement?


If model weights aren’t copyrightable, derivative model weights are not a “work”, derivative or otherwise, for copyright purposes.

If they are, and the license allows creating finetuned models but not using the output to improve the model, then the derived model is not a violation, but it might be a derivative work.


At the end of the day it's not black and white, but there's a large and obvious difference in degree that would plausibly permit someone to find that one is and the other isn't. It's fairly easy to argue that using the outputs of LLM X to create a slightly more refined LLM Y creates a derivative work. The argument that a model is a derivative work relative to the training data is not so clear cut.


Exactly this. What's good for the goose is good for the gander!


If the weights are not copyrightable, you don't need a license to use them; they're just data. There's no right to infringe if these numbers have no author. Of course, to use the OpenAI API you must abide by their terms. But if you publish your generations and I download them, I have nothing to do with the contract you have with OpenAI, since I'm not a party to it. They can't stop me from using them to improve my models.


No, because the premise of the hypothetical is that the weights aren't protected by copyright.

So, no matter what they TOS says, it's not an infringing work.

> Downstream users can't just integrate the derivative work into a product without abiding by the GPL terms

You absolutely could do this if the original work is not protected by copyright, or if you use it in a way that is transformative and fair use.


Something under the GPL is also copyrighted. The GPL is a copyright license.


The GPL depends on copyright but is not itself copyright. The GPL is a license that gets its legal standing from copyright, but if you don't have a copyright on something, slapping the GPL on top of it doesn't make it copyrighted to you.


Absolutely.


If the underlying work is not protected by copyright, it doesn't matter what license someone tries to put on it.

Similarly, if someone creates a fair use/transformative work then the license can also be ignored.


Thing is, the outputs of a computer program aren't copyrightable, so it doesn't matter if your improved model is a derivative work. What you say would apply if you derived something from the weights themselves (assuming they are copyrightable, of course).


Really?

Your customers bought that product under license A. Afterwards, it turned out that you pirated some artwork from Disney. Then your customer can sue you (not Disney) to make things right. The specific license of the original work seems quite irrelevant here.


Not at all. The reason your customer can sue you is because Disney can sue your customer. Disney would be suing your customer under the specific license of the original work.

edit: you seem to see the customer as the primary victim here instead of Disney, but if Disney weren't a victim the customer wouldn't have a case.


> it's at best a contractual clause binding a particular user. Anyone using that improved model afterward is in the clear.

That's... not really accurate. See the concept of tortious interference with a contract.


Hm, I don't know much about common law, but I don't think this would apply if, say, an ML enthusiast trained a model from LLaMA2 outputs, made it freely available, then someone else commercialised it. The later user never caused the original developer to breach any contract, they simply profited from an existing breach.

That said, doing this inside one company or with subsidiaries probably wouldn't fly.


And of course anyone using a model improved by this is entirely unworried by these clauses if their improved model takes off hard.


I find the idea that weights are not copyrightable very fascinating - appealing even. I have a hard time imagining a world where this is the case, though.

Can you summarize why weights would not be copyrightable, or give me pointers to sources that support that view?


Let’s take a simple linear regression model with a handful of parameters. The weights could be an array of maybe 5 numbers. Should that be copyrightable? What if someone else uses the same data sources (e.g. OSS data sets) and architecture and arrives at the same weights? Is this a Copyright violation?

Let’s talk about more complex models. What if my model shares 5% of the same weights with your model? What about 50%? What about 99%? How much do these have to change before you’re in the clear? What if I take your exact model and run it through some extra layers that don’t do anything, but dilute the significance of your weights?

It’s a murky area, and I’m inclined to think copyright is not at all the right tool to handle the legality of these models (especially given the glaring irony they are almost all trained using copyrighted material). Patents, perhaps better suited, but I’m also not sold.
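
(To make the first scenario concrete, a minimal sketch assuming a deterministic closed-form fit: two parties independently fitting the same public data arrive at bit-identical weights, which is part of what makes "authorship" of the numbers awkward. The toy dataset here is made up:)

    import numpy as np

    # A shared public toy dataset: y = 2*x1 + 3*x2 + 1 plus fixed noise.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = 2 * X[:, 0] + 3 * X[:, 1] + 1 + rng.normal(scale=0.1, size=100)

    def fit_ols(X, y):
        """Closed-form least squares: the weights are fully determined by the data."""
        Xb = np.hstack([X, np.ones((len(X), 1))])  # add a bias column
        w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
        return w

    # Two "independent teams" fitting the same data get identical weights.
    print(np.array_equal(fit_ols(X, y), fit_ols(X, y)))  # True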


Speculating (I am not a lawyer) I see two options:

1. Model weights are the output of mathematical principles, in the US facts are not copyrightable, so in general math is not copyrightable.

2. Model weights are the derivative work of all copyrighted works it was trained on - in which case, it would be similar to creating a new picture which contains every other picture in the world inside of it. Who is the copyright owner? Well, everyone, since it includes so many other copyright holders' works in it.


Your second question asks: "Who owns the Infinite Library[0]?"

Related: there was a presentation (I've lost the reference) on automatic song (tune?) generation where the presenter claimed (rather humorously) that he'd generated all the songs that had ever been and would ever be, so that while he was infringing on a large but finite number of existing songs, he was not infringing on an infinite number of future songs. So, on balance, he was in a favourable position.

[0] https://en.wikipedia.org/wiki/The_Library_of_Babel


Remember that database rights are a thing.

One cannot copyright facts, but one can "copyright" a collection of facts like a search index or a map.


Your second argument, if true, disproves your first argument.


Doesn't matter. A court decides in the end, and the two choices I presented could lead to OP's scenario. If a court decides that, they decide that, period. I'm not 'making an argument' with those points - I'm presenting options a court might choose from when setting precedent.


Generally the output of a machine is not copyrightable. Similarly, the contents of a phone book is not copyrightable in the US even if the formatting/layout is. So I could take a phonebook and publish another one with identical phone numbers as long as I laid it out slightly differently.


Work also has to be "creative" in order for it to be eligible for copyright. This is why photomasks have special, explicit protection in US law; they're not really "creative" in that way.

https://en.wikipedia.org/wiki/Integrated_circuit_layout_desi...


What about compiled binaries? If I write my own original source code (and thus automatically own the copyright to it), and compile it to binary, is the binary not protected too?


No, because the input to that process was a bunch of work that you did.

In the case of an LLM, I don't think the work of compiling the training data would qualify, by analogy to the phonebook example.


Sure, but I was just responding to "Generally the output of a machine is not copyrightable", which seemed obviously wrong to me...

But on reflection, you are totally right, I was just getting mixed up on the distinction between copies and the creative works themselves. Machine output of something is generally just a copy of something. Whatever it is a copy of may be a copyrightable work, and if so, whoever came up with that original work has the right to all the copies output by machines (or copies generated by hand-tracing, or whatever).

Anyway, on LLMs... Even if we assume LLM weights are just copies (machine outputs) of whatever inputs they were trained on, then I assume I would automatically own the exclusive right to restrict the distribution of weights of a 'Me' chatbot trained exclusively on my own writings. But what if someone else comes along and writes a load of bespoke code specifically to generate improved weights for this same model, so the resultant chatbot works much better in conversation (still with my tone of voice, but with better performance and better interpretation of questions)? Is that programmer not adding some creative value, such that we might both have a right to restrict distribution of those improved weights? (NB. it's common for an item to be a 'copy' of multiple original works, e.g. copies of Jimi Hendrix's cover of Bob Dylan's 'All Along the Watchtower'.)


By that logic, if you convert a copyrighted song or movie from one codec to another, then that would not be copyrightable because it is the output of a machine.


It isn’t independently copyrightable.

It's a mechanical copy subject to the copyright on the original, though.


The song itself isn't output by the machine.


Neither was the original training data, which was copyrighted books, art, etc.


> Neither was the original training data, which was copyrighted books, art, etc.

If the original training data is a copyrightable (derivative or not) work, perhaps eligible for a compilation copyright, the model weights might be a form of lossy mechanical copy of that work, and be both subject to its copyright and an infringing unauthorized derivative if it is.

If it's not, then I think even before fair use is considered, the only violation would be the weights potentially infringing copyrights on original works, but I don't think an incomplete copy automatically works for them the way it would for an aggregate; I'd think you'd have to demonstrate reproduction of the creative elements protected by copyright from individual source works to make the claim that it infringed them.


The output of the training though is unrecognizable.


Sometimes, the output is a recognisable plagiarism of a specific input.

If it isn't recognisable, then it's merely _distributed_ plagiarism: a million outputs, each of which plagiarises 0.0001% of each of a million inputs.


Does The War on Drugs plagiarize Bruce Springsteen?


Does The War on Drugs produce outputs on command, to prompts such as "a song in the style of Bruce Springsteen" ?

Is The War on Drugs a VC-funded band replacement?

Are other future bands going to learn from The War on Drugs?

https://www.cbsnews.com/news/ai-stable-diffusion-stability-a...

https://www.documentjournal.com/2023/05/ai-art-generators-mo...


Correct that it would not be copyrightable, but you're missing the point.

A codec conversion is not copyrightable. The original song, which is still present enough in the conversion to affect its ability to be distributed, is still copyrighted. But you don't get some kind of new copyright just because you did a conversion.

For comparison, if you take a public domain book off of Gutenberg and convert it from an EPUB to a KEPUB, you don't suddenly own a copyright on the result. You can't prevent someone else from later converting that EPUB to a KEPUB again. Copyright protects creative decisions, not mathematical operations.

So if there is a copyright to be held on model weights, that copyright would be downstream of a creative decision -- ie, which data was it trained on and who owned the copyright of the data. However, this creates a weird problem -- if we're saying that the artifact of performing a mathematical operation on a series of inputs is still covered by the copyright of the components of that database, then it's somewhat tricky to argue that the creative decision of what to include in that database should be covered by copyright but that copyrights of the actual content in that database don't matter.

Or to put it more simply, if the database copyright status impacts models, then that's kind of a problem because most of the content of that training database is unlicensed 3rd party data that is itself copyrighted. It would absolutely be copyright infringement for OpenAI/Meta to distribute its training dataset unmodified.

AI companies are kind of trying to have their cake and eat it too. They want to say that model weights are transformed to such a degree that the original copyright of the database doesn't matter -- ie, it doesn't matter that the model was trained on copyrighted work. But they also want to claim that the database copyright does matter, that because the model was trained on a collection where the decision of what to include in that collection was covered by copyright, therefore the model weights are copyrightable.

Well, which is it? If model weights are just a transformation of a database and the original copyrights still apply, then we need to have a conversation about the amount of copyrighted material that's in that database. If the copyright status of the database doesn't matter and the resulting output is something new, then no, running code on a GPU is not enough to grant you copyright and never really has been. Copyright does not protect algorithmic output, it protects human creative decisions.

Notably, even if the copyright of the database was enough to add copyright to the final weights and even if we ignore that this would imply that the models themselves are committing copyright infringement in regards to the original data/artwork -- even in the best case scenario for AI companies, that doesn't mean the weights are fully protected because the only copyright a company can claim is based on the decision of what data they chose to include in the training set.

A phone book is covered by copyright if there are creative decisions about how that phone book was compiled. The numbers within the phone book are not. Factual information can not be copyrighted. Factual observations can not be copyrighted. So we have to ask the same question about model weights -- are individual model weights an artistic expression or are they a fact derived from a database that are used to produce an output? If they're not individually an artistic expression, well... it's not really copyright infringement to use a phone book as a data reference to build another phone book.


It's a complicated question and I don't think anyone can give a clear yes or no answer before some court has ruled on it. One school of thought is that copyright is designed to protect original works of creativity, but weights are generated by an algorithm and not direct human expression. AI generated art, for example, has already been ruled ineligible for copyright.


I have a hard time imagining a world where it is not the case, at least in the US; that would mean extending copyright to a work with no originality, in direct contradiction to the copyright clause of the constitution.


It's all kind of irrelevant. If they are not copyrightable, then most companies will simply hide them behind an API. There is no law saying these companies must release their weights. The companies are releasing their weights because they felt they could charge for and control other things. Like the output from their models.

If they can't charge for and control those other things, then we'll likely see far fewer companies releasing weights. Most of this stuff will move behind APIs in that scenario.


Maybe, maybe not. Companies are not monoliths. For all we know, internally it’s already well known that model weights likely aren’t copyrightable and the only reason for the restrictions is to give the appearance of being responsible to appease the AI doomers.


An analog to this might be the settings of knobs and switches for an audio synthesizer, or guitar effects settings. If you wanted to get the "Led Zeppelin sound" from a guitar, you could take a picture of the knobs on the various pedals and their configuration, and replicate that yourself. You then create a new song that uses those settings. Is that something that is allowed under copyright?

What if there were billions of knobs, tuned after years of feedback and observations of the sound output?


That’s a bad analogy because a human chose the values of those settings using their creative mind. That’s not at all the case with weights. This originality is the heart of copyright law.


I don't think that's a good analogy. A piano has N keys. You can press certain ones in certain combinations and write it down. That result is still copyrightable, because you can prove that it was an original and creative work. Setting knobs for a machine is no different, but the key differentiator is if you did it yourself or if an algorithm did it for you.


In my analogy, it's not the sequence of the notes or the composition, which I agree is copyrightable. But are the settings of the knobs and switches on synthesizers and effects devices used in a recording equivalent to the weights of a neural network or LLM? And if so, are those settings or weights copyrightable?


And it also remains to be seen if various legislatures will pass laws that explicitly declare the copyright status of model weights. It is important to remember that what is or is not copyrightable can change.


At least in the US, copyright is established by the Constitution, so I'm not sure how much it's possible to change via the normal legislative process.


The US constitution grants congress the ability to create copyright ("To promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries"), but it doesn't create copyright law itself. That's a broad clause that gives Congress pretty free rein to change how copyright is defined.


Constitutionality is also about how previous cases have been evaluated; for example, see the bit about how photography copyright was established here: https://constitution.congress.gov/browse/essay/artI-S8-C8-3-...


specifically:

> A century later, in Feist Publications v. Rural Telephone Service Co., the Supreme Court confirmed that originality is a constitutional requirement


Yep, same with SSPL. GPL has been tested in FSF vs Cisco (2008), but none of the more restrictive licenses have.


1. Why wouldn't they be? And 2. does that even matter? If you enter into a contract saying don't do X, and you do X, you're violating the contract.



I assume GP was talking about a scenario in which you had not entered into a contract with Meta. E.g. if I just downloaded the weights from someone else.


If they are not copyrightable, that'll be the end of publicly-released weights by for-profit companies. All subsequent models will be served behind an API.


> If they are not copyrightable, that'll be the end of publicly-released weights by for-profit companies

I don’t see why; for-profit companies release permissively-licensed open-source code all the time, and noncopyrightable models aren't practically much different from that.


I debated whether to be more specific and verbose in my earlier comment and brevity won at the expense of clarity. I meant large models that cost 6 or 7 digits to train likely won't be released if the donor company can't control how the models are used.

> I don’t see why, for-profit companies release permissively-licensed open-source code all the time

I agree with this - however, they tend to open-source non-core components - Google won't release search engine code, Amazon won't release scalable-virtualization-in-a-box, etc.

I'm confident that Facebook won't release a hypothetical Llama 5 in a manner that enables it to be used to improve ChatGPT 8 - the aim will be unchanged from today, but the mechanism will shift from licensing to rate-limiting, authentication & IP bans.


Because the courts will have determined their business models for them.

As mercenary as it may sound, what these companies are trying to do is find a business model that is as friendly to themselves as it is hostile to their competitors.

This is all part of the jockeying.


And, sure, lack of copyrightability changes the parameters and will change behavior. What I think you have failed to support is that the particular change that it will induce will eliminate all such releases.


What's problematic is that there are big models that adopt truly open source licenses, such as MPT-30b and Falcon-40b. As grateful as I am for having access to the Llama2 weights, it feels unfair that it gets credit for being "open source" when there are competing models that really are open source, in the traditional OSI sense.

The practical difference between the licenses is small enough that I expect most people (including me) will choose Llama2 anyway, because the models are higher quality. But that incentive may mean that we get stuck with these awkward pseudo-open licenses.


I don't see why the term "open source" needs to evolve when "source available" is available. Or in this case, "weights available under a license with few restrictions."


A new generation of programmers can't remember not having open source / free software of any kind, so the difference is academic rather than felt.


The chart in this article is very wrong to show only GPL as free software and MIT/Apache as open source but not free software licenses.

While the FSF side of things doesn't like the term "open source," even they say that "nearly all open source software is free software." Specifically, the MIT and Apache (and LGPL) licenses are absolutely free software licenses--otherwise Debian, FSF-approved distros, etc. would have far less software to choose from.

What the chart probably meant to distinguish is copyleft vs. non-copyleft free software or open source. And if you're ordering it from a permissiveness viewpoint, the subset relationship should be reversed--GPL is far more permissive than SSPL, etc., but still less permissive than MIT/Apache.


Yup. The terms "Open Source" and "Free Software" are pretty much interchangeable when it comes to licenses. The difference is political, not technical.

This part of the article is pretty misleading as well:

> Free software, as specified by the Free Software Foundation, is only a subset of open source software and uses very permissive licenses such as GPL and Apache.


In the diagram, there is theoretically another category outside 'Restricted Weights' but short of the 'Completely Closed' superset, and that would be something along the lines of 'blackbox weights and model': free to use but essentially non-inspectable and non-transferable. This would be the sister to free-to-use closed-source software. An AI that is free to use but provided as a binary blob would meet this criterion. Or a module importable to Python that calls precompiled binaries for the inference engine + weights with no source available. The traditional complement of this in the current software world would be Linux drivers from 3rd parties that are not open source. They are free, but not open.

We haven't seen this much yet in the AI world, as mostly the people who open the weights are doing so in a research spirit, where the inference code decidedly needs to be open sourced too - and the people with closed models keep them closed in order to make money, so they have no reason to open source the inference side either; they just charge for an API ("OpenAI").


Yea I didn't include it, but that'd be the "free as in beer, but not freedom" circle :)


The headline is editorialized. The actual title is "LLaMA2 isn't "Open Source" - and why it doesn't matter".

It is editorialized in a way that feels quite different from the original. I think the author and the poster might disagree on what open source means.


Mods changed the title, I used the original one when first posting. Not sure why they changed it.


Maybe Dang is taking a hard stance on the "open source" position. I'm honestly with you, as long as the source is available people will call it open source and complaining over terms isn't going to convince anyone.


they are the same person :)


Since Open Source has been established in the tech ethos for a while now, any deviation has been met with derision. It seems like the community has been more tolerant of these "open" licenses as of late. While most of the hate for projects that do not fit the FOSS standard is unwarranted, hopefully we are not moving too quickly in this "open" direction.

Here is another article on LLaMa2: https://opensourceconnections.com/blog/2023/07/19/is-llama-2...


I'm not sure open source applies to actual models. Models aren't human readable, so it's closer to a binary blob. It would apply to the training code and possibly data set.

Llama2 is a binary blob pre-trained model that is useful and is licensed in a fairly permissive way, and that's fine.


Yes I think you've put it well. If models were smaller I'd see those in the Github releases section. The model training is what I'd see in the source code and the README etc, to arrive at the 'blob'.


Even if it costs millions in compute to run at that scale, seeing that code would be extremely informative.


Very much like a binary blob: you have to execute it to use it, and it's impossible for humans to reason about just by looking at it.

At least binary blobs can be disassembled.


"Nyet! Am not open source! Not want lose autonomy!"

(Downvotes... oops. The reference is Charlie Stross's Accelerando. The protagonist has a conversation with an AI that's just trying to survive. One of the options he suggests is to open source itself. Which is a roundabout way of saying that eventually we're going to have to take the AI's own opinions into account. What if it doesn't want to be open source?)


This post deserved better treatment, along with maybe a couple of metaphorical decerebrated kittens on its doorstep.


It's not just in the LLM space; even for 'older' models, companies have aggressively embraced this approach. For example: YOLOv3 has been appropriated by a company called Ultralytics, which has subsequently released the 'YOLOv5' and 'YOLOv8' "updates": https://github.com/ultralytics/ultralytics

There is no marked increase in model effectiveness in these 'new' versions, but even if you just use the 'YOLOv8' Pytorch weights (and no part of their Python toolchain, which might have some improvements), these will somehow try to download files from Ultralytics servers. Possibly for a good reason, but most likely to, let's say, "pull an Oracle."

Serious AI researchers won't go anywhere near this stuff, but the number of students-slash-potential-interns with "but it's on GitHub!" expectations that I had to reject lately due to "nope, we're not paying these guys for their Enterprise license just to check out your project" is rather disheartening...


Part of the benefit of FOSS & open source is that a curious user can inspect how something is made and learn from it. It matters that open weights are, in this respect, no different from a compiled program. Sure, you can always modify an executable's instructions, but there's no openness there.

Then there's the problems of the content of the training data, which parallel the dangers of opaque algorithms.


Great point in the article. In https://opencoreventures.com/blog/2023-06-27-ai-weights-are-... I propose a framework to solve the confusion. From the post: "AI licensing is extremely complex. Unlike software licensing, AI isn’t as simple as applying current proprietary/open source software licenses. AI has multiple components—the source code, weights, data, etc.—that are licensed differently. AI also poses socio-ethical consequences that don’t exist on the same scale as computer software, necessitating more restrictions like behavioral use restrictions, in some cases, and distribution restrictions. Because of these complexities, AI licensing has many layers, including multiple components and additional licensing considerations."


> downloadable weights

When it comes to "how much of it has to be available to be open source", I think it may be instructive to look at encryption algorithms.

Many of them have numeric constants or initial values--NOT part of the secret key itself--which need to be known and available, both for interoperability and for the expected security-level of the algorithm. These are arguably similar to LLM weights. (Perhaps the simplest example would be the prominence of "13" in ROT13.)

Yet if someone tried saying that their encryption standard was "open source" while keeping those constants secret and/or legally-encumbered, I think a lot of people would complain that the label is incorrect or inappropriate.
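
(To make that concrete, a minimal sketch: the "13" below is a published constant the algorithm can't work without - more like released weights than like a secret key:)

    SHIFT = 13  # public constant, analogous to published weights

    def rot13(s: str) -> str:
        out = []
        for c in s:
            if c.isalpha():
                base = ord("a") if c.islower() else ord("A")
                out.append(chr((ord(c) - base + SHIFT) % 26 + base))
            else:
                out.append(c)
        return "".join(out)

    # Applying the cipher twice round-trips; hide SHIFT and nothing interoperates.
    assert rot13(rot13("open weights")) == "open weights"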


Of course, it's not open source. With the proliferation of the cloud, software has reached an entirely new level of closedness: not being able to see the program binaries at all. Having the ability to run locally is now somewhat open in comparison.


A well-understood term like "open source" shouldn't be hijacked and exploited for marketing purposes.

For what these models do, they should have either invented a new term or used an appropriate existing term, e.g. "fair use".


Absolutely. Maybe the term is already coined, but I don’t know it. Open source implies the ability to compile software from human-generated inputs. This is just self-hosted freeware.


Given that it's basically impossible to prove that a particular text was generated using a particular LLM (and yes, even with all the watermarking tricks we know of, this is and will still be the case), they might as well be interchangeable. Folks can and will simply ignore the silly license BS that the creators put on the LLM.

I hope that users aggressively ignore these restrictive licenses and give the middle finger to greedy companies like Facebook who try to restrict usage of their models. Information deserves to be free, and Aaron Swartz was a saint.


Fully reproducible model training might simply not be possible if information from the training environment is not captured. In addition to data and code you might have additional uncertainty from:

- pseudo/true random number generator and initialization

- certain speculative optimizations associated with training environments (distributed)

- Speculative optimizations associated with model compression

- Image decompression algorithm mismatch (basically this is library versioning)

- ....things I'm forgetting...

It's just a lot of things to remember to capture, communicate, and reproduce.
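
For the RNG items at least, the usual pinning looks something like the sketch below (assuming PyTorch); note it does nothing about the environment-level sources above (distributed scheduling, library versions), which still have to be recorded separately:

    import random
    import numpy as np
    import torch

    def make_reproducible(seed: int = 0) -> None:
        # Pin every RNG the training stack might touch.
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Refuse nondeterministic kernels rather than silently using them
        # (on CUDA this may also require setting CUBLAS_WORKSPACE_CONFIG).
        torch.use_deterministic_algorithms(True)
        torch.backends.cudnn.benchmark = False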


> pseudo/true random number generator and initialization

It's not just the generator and initialization. If you do anything multithreaded, like a producer/consumer queue, then you need to know which pieces of work went to which thread in which order.

It's a lot like reproducing subtle and rare race conditions.
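
A tiny illustration of the order-dependence: floating-point addition isn't associative, so summing identical worker results in a different arrival order can change the bits of the answer:

    import random

    vals = [random.uniform(-1.0, 1.0) for _ in range(100_000)]
    shuffled = random.sample(vals, len(vals))  # same numbers, new order

    # Different accumulation order, different rounding along the way.
    print(sum(vals) == sum(shuffled))  # frequently False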


Most of the mature ML environments are pretty focused on reproducible training though. It's pretty necessary for debugging and iteration.


Why not just "downloadable"? It describes the actual difference between LLaMA and GPT. Open-data is the only other distinction that matters.


The author clearly doesn't understand the term open source as it is used for software in the first place, as evidenced by the completely nonsensical diagram [0]. And no, the term doesn't need to evolve just because parasites want it to in order to ride on the goodwill fostered by the open source community.

[0] https://www.alessiofanelli.com/images/open-models.png


> While it’s mostly open, there are caveats such as you can’t use the model commercially if you had more than 700M MAUs as of the release date, and you also cannot use the model output to train another large language model. These types of restrictions don’t play well with the open source ethos

No, CC-NC-ND is a thing, and even GPL applies restrictions on derivation as well.

"Open source" doesn't mean BSD/MIT. There is even open-source that you cannot freely redistribute at all - not all open-source is FOSS!

I always think it's a testament to how much copyleft has succeeded that in many cases people think of GPL and BSD/MIT as being the baseline.


There's "open source" in the original sense, where the source was available. Then there's "FOSS" where the source is not only available, but it's under a copyleft license designed to protect the IP from greedy individual humans. And then there's "open" in the Shenzhen sense where you can find the source and other data online and nobody's going to stop you building something based on those. This is an interesting timeline.


The original sense of open source is defined by the people who fractured off from the Free Software movement in the mid 90's and created it. It's just "Free Software" that has a focus on practicality and utility rather than "Free Software"'s focus on idealism and doing the right thing. It has NOTHING to do with "source available" which is a movement that has recently been co-opting the open source name.

"FOSS" has absolutely no requirement of it being copyleft. The MIT license is just as FOSS as the GPL. Many of the free software advocates do have an affinity for copyleft, but they are not mutually exclusive. There are plenty of FOSS advocates who also use and advocate for permissive licenses as well.


> There's "open source" in the original sense

That original sense never existed. Virtually nobody said "open source" before OSI's 1998 campaign for "Open Source", as bankrolled by Tim O'Reilly.

https://thebaffler.com/salvos/the-meme-hustler

I know it's been a long time, and we've forgotten, but there is virtually no record of anyone saying "open source" before 1998, except in rare and obscure contexts and often unrelated to the modern meaning.


There’s this one from September 10th, 1996, which I find intriguing:

https://web.archive.org/web/20180402143912/http://www.xent.c...


> And then there's "open" in the Shenzhen sense where you can find the source and other data online and nobody's going to stop you building something based on those.

I believe there is a name for that: gongkai. https://www.bunniestudios.com/blog/?page_id=3107


Ooh, thanks! I've watched a few of bunnie's things in the past but that's a term I'll remember.


On top of that, there are also differences among OSS licenses such as Apache and MIT: the latter can still leave users restricted, because the project owner might have patented some algorithm, and the MIT license doesn't include a patent grant.

LGPL 3.0 is also restricted in a way that makes it unclear whether it can legally be used to distribute software in the iOS App Store.


You see a similar loosening of the term in other fields e.g. open source journalism. Although that seems to be more about crowdsourcing than transparency or usage rights.


It is quite an unfortunate dilution of the term


How is it possible that you can fine tune Llama v2 but the weights are not available? That doesn’t make sense to me.


I like Debian's ML policy about this:

https://salsa.debian.org/deeplearning-team/ml-policy/


Yes (Unfortunately). But Llama 2 being released for free as a downloadable AI model is much better than nothing. For now it is a great start against the cloud-only AI models.

As for terms, we'll settle on '$0 downloadable AI models' which are available today. Would rather use that over cloud-only AI models which can fall over and break your app at any time and you have zero control over that.

Stable Diffusion is a good example that fits the definition of 'open-source AI', as we have the entire training data, weight reproducibility, etc., and Llama 2 does not.


Agreed. I called it a "$3M of FLOPS donation" by Meta.


No wonder there is such “momentum” on watermarking.


Llama2 is absolutely useless. Among the small models, guanaco-33b and guanaco-65b are the best (though they are derived from llama).


Useless for what? Are you comparing the base model with chat-tuned models?

Chat-tuned derivatives of LLaMa 2 are already appearing. Given that the base LLaMa 2 model is more efficient than LLaMa 1, it is reasonable to expect that these more refined chat-tuned versions will outperform the ones you mention.


Is that just based on your experience, or do you have a link to benchmarks?


Try these prompts with different models; LLaMA 2's output is pure garbage.

Prompt 1:

On a map sized (256,256), Karen is currently located at position (33,33). Her mission is to defeat the ogre positioned at (77,17). However, Karen only has a 1/2 chance of succeeding in her task. To increase her odds, she can:

1. Collect the nightshades at position (122,133), which will improve her chances by 25%.

2. Obtain a blessing from the elven priest in the elven village at (230,23) in exchange for a fox fur, further increasing her chances by an additional 25%. Foxes can be found in the forest located between positions (55,33) and (230,90).

Find the optimal route for Karen's quest which maximizes her chances of defeating the ogre to 100%.

Prompt 2:

Write Python code using imageio.v3 to create a PNG image representing the map waypoints and the route of Karen in her quest; each waypoint must be a different color, and her path must be a gradient of the colors between the waypoints.

I have a lot of cases like these that I test against different models... GPT-4 has really degraded over the past week, GPT-3.5 has become a little bit better, and LLaMA 2 is garbage.
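
(For reference, roughly what a passing answer to prompt 2 might look like; a sketch only, with an assumed route - the fox-forest stop at (140, 60) is just a made-up point inside the forest rectangle:)

    import numpy as np
    import imageio.v3 as iio

    # Assumed route: start -> nightshades -> fox forest -> elven village -> ogre.
    waypoints = [(33, 33), (122, 133), (140, 60), (230, 23), (77, 17)]
    colors = np.array(
        [(255, 0, 0), (0, 255, 0), (255, 165, 0), (0, 0, 255), (128, 0, 128)],
        dtype=float,
    )

    img = np.zeros((256, 256, 3), dtype=np.uint8)

    # Draw each leg as a color gradient between its endpoints' colors.
    for (x0, y0), (x1, y1), c0, c1 in zip(waypoints, waypoints[1:], colors, colors[1:]):
        n = max(abs(x1 - x0), abs(y1 - y0)) + 1
        for t in np.linspace(0.0, 1.0, n):
            x = int(round(x0 + t * (x1 - x0)))
            y = int(round(y0 + t * (y1 - y0)))
            img[y, x] = ((1 - t) * c0 + t * c1).astype(np.uint8)

    # Mark each waypoint with a 5x5 block of its own color.
    for (x, y), c in zip(waypoints, colors):
        img[max(y - 2, 0):y + 3, max(x - 2, 0):x + 3] = c.astype(np.uint8)

    iio.imwrite("karen_quest.png", img)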


wait for the tuned models


Should be good motivation to figure out what those numbers mean


The spirit of "open source" implies "open weights", without doubt. Litigating the specific meaning of the terms is pointless.



