Hacker News

Strange to see no mention of potential copyright violations found in LLM-generated code (e.g. LLMs reproducing code from Github verbatim without respecting the license). I would think that would be a pretty important consideration for any software development company, especially one that produces so much free software.




Also, since LLM-generated content is not copyrightable, what happens to code you publish under a copyleft license? The entire copyleft system is based on the idea of a human holding copyright to copyleft code. Is a big chunk of it, the LLM part, basically public domain? How do you ensure there's enough human content to make it copyrightable and hence copyleftable?

> since LLM generated content is not copyrightable

That's not how it works. If you ask an LLM to write Harry Potter and it writes something that is 99% the same as Harry Potter, it isn't magically free of copyright. That would obviously be insane.

The legal system is still figuring out exactly what the rules are here but it seems likely that it's going to be on the LLM user to know if the output is protected by copyright. I imagine AI vendors will develop secondary search thingies to warn you (if they haven't already), and there will probably be some "reasonable belief" defence in the eventual laws.

Either way it definitely isn't as simple as "LLM wrote it so we can ignore copyright".


> it seems likely that it's going to be on the LLM user to know if the output is protected by copyright.

To me, this is what seems more insane! If you've never read Harry Potter, and you ask an LLM to write you a story about a wizard boy, and it outputs 80% Harry Potter - how would you even know?

> there will probably be some "reasonable belief" defence in the eventual laws.

This is probably true, but it's irksome to shift all blame away from the LLM producers, who use copyrighted data to peddle copyrighted output. This simply turns the business into copyright infringement as a service - what incentive would they have to actually build those "secondary search thingies" and build them well?

> it definitely isn't as simple as "LLM wrote it so we can ignore copyright".

Agreed. The copyright system is getting stress tested. It will be interesting to see how our legal systems can adapt to this.


I think the poster is looking at it from the other way: purely machine-generated content is not generally copyrightable, even if it can violate copyright. So it's more a question of whether a copyleft license like the GPL can actually protect something that's original but primarily LLM-generated. Should it do so?

(From what I understand, the amount of human input that's required to make the result copyrightable can be pretty small, perhaps even as little as selecting from multiple options. But this is likely to be quite a gray area.)


Has anything like this worked its way through the courts yet?

Yes, training is considered fair use, and output is non-copyrightable / public domain. With many asterisks and footnotes, of course.

Don't see how output being public domain makes sense when they could be outputting copyrighted code.

Shouldn't the rights extend forward and simply require the LLM code to be deleted?


With many asterisks and footnotes. One of which being that if it literally output the exact code, of course that would be copyright infringement. Something that greatly resembled it but with minor changes would be a gray area.

Those kinds of cases, although they do happen, are exceptional. A typical output that does not line-for-line resemble a single training input is considered a new, but non-copyrightable, work.


First, you have to prove that it produced the copyrighted code. The question is what counts as copyrighted code in the first place. Literal copy-paste from the source is easy, but I think 99% of the time this isn't the case.

Do current-generation LLMs do this? I suppose I mean "do this any more than human developers do".


>> Here's my question: why did the files that you submitted name Mark Shinwell as the author?

> Beats me. AI decided to do so and I didn't question it. I did ask AI to look at the OxCaml implementation in the beginning.

This shows that the problem with AI is philosophical, not practical


...what a remarkable thread.

Right? If this is really true, that some random person without compiler engineering experience implemented a completely new feature in the OCaml compiler by prompting the LLM to produce the code for him, then I think it really is remarkable.

Oh wow, is that what you got from this?

It seems more like an inexperienced guy asked the LLM to implement something, and the LLM just output what an experienced guy did before, and it even gave him the credit.


Copyright notices and signatures in generative AI output are generally a result of the expectation created by the training data that such things exist, and are generally unrelated to how much the output corresponds to any particular piece of training data, and especially to who exactly produced that work.

(It is, of course, exceptionally lazy to leave such things in if you are using the LLM to assist you with a task, and it can cause problems of false attribution.)



