
That's not how copyright law works at all. It doesn't say "well if you didn't want someone to copy this thing you should have stopped them from doing it". It lays out 4 factors for a court to consider about whether something is fair use and none of them are around how easy it was to rip the work off.[1]

In the LLM space it seems even more clear because many/most of the works in the various corpora used for this training have very clear copyright terms which prevent digital storage and reproduction without the publisher's permission (just look at the reverse of the title page of any book for the copyright notice if you don't believe me).

Finally, for LLMs many/most of the works are in corpora[2] that people just download, so they aren't looking at a robots.txt file put up by the original site. If you look at The Pile paper[3], for example, they explicitly say that much of the material is under copyright and that they are relying on fair use.

[1] https://fairuse.stanford.edu/overview/fair-use/four-factors/
[2] https://github.com/Zjh-819/LLMDataHub for example
[3] https://arxiv.org/abs/2101.00027
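For readers unfamiliar with the robots.txt check mentioned above, here is a minimal sketch of what a crawler that does consult the file would do, using Python's standard urllib.robotparser. The robots.txt contents, the bot names, and the example.com URLs are all hypothetical, made up for illustration; a real crawler would download the file from the site it is about to visit. Someone who downloads a pre-built corpus never runs a check like this at all.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents (an assumption for illustration only).
ROBOTS_TXT = """\
User-agent: ExampleCorpusBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleCorpusBot", "https://example.com/essay.html"),
        ("SomeOtherBot", "https://example.com/essay.html"),
        ("SomeOtherBot", "https://example.com/private/draft.html"),
    ]:
        print(agent, url, may_fetch(agent, url))
```

Note that robots.txt is a voluntary convention: nothing in this check prevents a client from fetching the page anyway.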



Since you raise the four factors test for fair use, let's spell those out:

(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;

(2) the nature of the copyrighted work;

(3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and

(4) the effect of the use upon the potential market for or value of the copyrighted work.

<https://www.law.cornell.edu/uscode/text/17/107>

Most critically, courts have put strong emphasis on the notion of transformative use of copyrighted works, and web indexing is transformative in the sense that it does not create a competing work, but provides a means of discovering and assessing the relevance of the indexed work itself.

As to web indexing, it (along with associated features, including thumbnails and caching) has been ruled by courts to be a fair-use adaptation of works:

Displaying a cached website in search engine results is a fair use and not an infringement. A “cache” refers to the temporary storage of an archival copy—often a copy of an image of part or all of a website. With cached technology it is possible to search Web pages that the website owner has permanently removed from display. An attorney/author sued Google when the company’s cached search results provided end users with copies of copyrighted works. The court held that Google did not infringe. Important factors: Google was considered passive in the activity—users chose whether to view the cached link. In addition, Google had an implied license to cache Web pages since owners of websites have the ability to turn on or turn off the caching of their sites using tags and code. In this case, the attorney/author knew of this ability and failed to turn off caching, making his claim against Google appear to be manufactured. (Field v. Google Inc., 412 F.Supp.2d 1106 (D. Nev., 2006).)

<https://fairuse.stanford.edu/overview/fair-use/cases/>

Or, to use your phrase, by common law (precedential case law), that is precisely "how copyright law works". Note particularly that the courts leaned on publishers' capabilities to indicate whether or not caching was or was not permitted "using tags and code".
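To make "using tags and code" concrete: one widely used opt-out is a robots meta tag with a "noarchive" directive in the page's HTML. The sketch below is an assumption about how an indexer might honour such a tag, not a description of Google's implementation; the class and function names are hypothetical, and only Python's standard html.parser is used.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. a bare attribute), so normalise.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for directive in attr.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def caching_permitted(html: str) -> bool:
    """Return False if the page opts out of caching via 'noarchive'."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noarchive" not in parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noarchive"></head></html>'
    print(caching_permitted(page))
```

The default here is "permitted unless the publisher says otherwise", which mirrors the implied-licence reasoning in the case summary quoted above.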

There's a larger issue which I'm not aware of being explicitly raised in case law, which concerns how the World Wide Web is indexed as contrasted to how a print library is indexed. In the case of a library, an independent third party (the library cataloguer) assigns metadata to a work (standardised title, author(s), translator(s), illustrator(s), publisher(s), etc.), as well as subject headings and call numbers. Additional indexing is provided through citation indices (both forward and reverse: works cited by, and citing, other works). These largely don't rely on the text of the indexed work itself, though of course the cataloguer presumably is reading at least portions of the work to classify it. Critically: the works themselves are physical artefacts of fixed form which are virtually always read directly rather than interpreted through some mechanism.[1]

As it's evolved over the past quarter century or so, Web search doesn't rely strongly on metadata (though some of this is taken into consideration), and most particularly publisher-provided keywords are almost wholly ignored, largely due to flagrant abuse of that feature by some publishers. Instead, it uses a combined approach: full-text indexing (that is, capturing the full text of a work and identifying keywords and tuples (multi-word phrases) which can be matched against queries entered by persons searching for documents), and an assessment of the overall relevance of that work, usually at a site (or sub-site) level, based on other indicia, most famously (though somewhat less relevantly today) "PageRank", Google's original site-ranking algorithm.
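A toy sketch of the full-text half of that approach may help. It builds an inverted index over single words and adjacent two-word tuples, then matches a query against it. Everything here (the function names, the tokenising rule, the example.com documents) is an assumption for illustration; real search engines are vastly more elaborate, and the relevance-ranking half (PageRank and its successors) is not shown.

```python
import re
from collections import defaultdict


def terms(text: str) -> list[str]:
    """Return single words plus adjacent two-word phrases, lowercased."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + pairs


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map every term to the set of document URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in docs.items():
        for term in terms(text):
            index[term].add(url)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return URLs that contain every word and word pair of the query."""
    query_terms = terms(query)
    if not query_terms:
        return set()
    results = None
    for term in query_terms:
        hits = index.get(term, set())
        results = set(hits) if results is None else results & hits
    return results


if __name__ == "__main__":
    docs = {  # hypothetical documents
        "https://example.com/a": "Fair use and web indexing",
        "https://example.com/b": "Web search uses full-text indexing",
    }
    index = build_index(docs)
    print(search(index, "web indexing"))
    print(search(index, "indexing"))
```

The point relevant to the copyright question is that building the index requires reading the full text of every document, while the index itself stores only terms and pointers back to the original.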

Further, the entire mechanism of the Web is of creating copies of works on request. When an HTTP request is sent, the server responds by copying the requested work to an output stream, which is then received (and duplicated, often multiple times) by the client system as an integral part of the utilisation of that content. US copyright law does not have a section specifically referring to computer-network transmission, but there are multiple limitations on exclusive rights to copy (by authors) above and beyond the 107 Fair Use exemptions in sections 108 through 122 of 17 U.S.C., including specifically ephemeral recordings (112) and the case of computer programmes (117).
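The copying described above can be made visible with a small simulation. This is not a real HTTP exchange: an in-memory stream stands in for the network connection, and the function names are hypothetical. It only illustrates that serving and reading a work leaves several distinct copies behind.

```python
import io
import shutil


def serve(stored_work: bytes, output_stream) -> None:
    """Server side: copy the stored work into the response stream."""
    shutil.copyfileobj(io.BytesIO(stored_work), output_stream)


def receive(input_stream) -> tuple[bytes, str]:
    """Client side: buffer the raw bytes, then decode them for display."""
    raw = input_stream.read()        # a copy held in the client's buffer
    rendered = raw.decode("utf-8")   # a further copy, in decoded form
    return raw, rendered


if __name__ == "__main__":
    original = "An expressive work.".encode("utf-8")
    wire = io.BytesIO()              # stands in for the network connection
    serve(original, wire)
    wire.seek(0)
    raw, rendered = receive(wire)
    # The server's original, the bytes on the "wire", the client's buffer
    # and the decoded text all exist at once at this point.
    print(len(original), len(wire.getvalue()), len(raw), len(rendered))
```

A real client typically adds more copies still (kernel socket buffers, a disk cache, the rendered page), which is the sense in which duplication is integral to using Web content.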

<https://www.law.cornell.edu/uscode/text/17/chapter-1>

Large language model training is a new area of use, and the law (legislative or common) has yet to be determined, but there's at the very least existing statutory language as well as precedent which suggest that at least some uses might well be found to be fair use. As I'm watching the situation evolve, I'm reminded strongly of several articles copyright scholar Pamela Samuelson wrote in the 1990s over adapting copyright to the Internet age, and questions of what its future place might be: specific governance over the literal copying of expressive works, or a general doctrine against misappropriation. As always, there's a sharp tension between authors' rights (and, let's be brutally honest: publishers' profits) and the underlying Constitutional justification of US copyright law: "To promote the Progress of Science and useful Arts".

<https://constitution.congress.gov/browse/article-1/section-8...>

And it seems Samuelson is engaged in the discussion of generative AI and copyright, though I've yet to read her work on the subject: <https://news.berkeley.edu/2023/05/16/generative-ai-meets-cop...>

(Discussion here strongly reliant on US law. There's general international agreement on copyright through the Berne Convention, though significant national differences exist.)

________________________________

Notes:

1. There is a spectrum of works, e.g., print books, phonographs, CDs and DVDs (the latter containing anti-circumvention mechanisms), etc., but in general there's minimal if any intermediate copying and duplication of works, and in many cases none at all.


I appreciate the detail in your reply. Do you think the recent Warhol "Orange Prince" case[1] gives an inkling into possible future court treatment of the question of "transformative" use for generative AI models? There, Warhol's silk screen print of the original Prince photo was deemed not transformative enough, as I understand it. One of the things about the stochastic nature of generative AI is that it can be rather hard to notice when the model spits out something very close to the training material.

[1] https://www.theguardian.com/artanddesign/2023/may/18/andy-wa...


Good question, I've seen some coverage of the case, and ... tend to disagree with the court's decision.

That said, it would tend to darken the prospects for operators of LLM generative AI systems, IMO.



