Michiel Van Lerbeirghe
Legal counsel
Foundation models underpin many advanced artificial intelligence systems. Simply put, a foundation model is a large-scale AI model, trained on vast amounts of data, that serves as a basis for further specialisation or application in varied domains. Prominent examples of such systems are generative AI models, which can autonomously produce content, be it text (like ChatGPT), images (like Midjourney), audio, or video.
Generative AI models, by their very nature, rely on extensive datasets for training. These datasets often contain enormous amounts of images, text snippets, and other forms of data gathered from diverse sources. The sheer volume and variety of data these models consume can obscure the origin of that data, some of which may be protected under copyright law.
The EU AI Act aims to regulate foundation models and generative AI systems. Pursuant to one of the obligations in the version of the Act adopted by the European Parliament (the text is not yet finalised), providers of foundation models used in generative AI systems must "document and make publicly available a sufficiently detailed summary of the use of training data protected under copyright law" (amendment 399, article 28b of the current text). In other words, companies such as OpenAI (as the provider of ChatGPT) would be obliged to document and disclose the copyright-protected data they used to train their models.
The goal of the obligation is clear and logical: providing transparency and ensuring that stakeholders have visibility into the workings of these influential AI systems. However, while we support more transparency, in this blogpost we will raise two reasons why the envisaged obligation could prove very difficult (not to say impossible) to fulfil for many companies developing and deploying these kinds of models: the very broad scope of copyright protection, and the subjectivity involved in assessing whether subject matter is protected.
As non-compliance could lead to severe penalties, we believe that, if the provision is actually adopted, more guidance is needed on how providers of foundation models can comply with this obligation.
Copyright protection applies to "works of art and literature", which is an autonomous concept of Union law to be interpreted uniformly throughout the European Union.
For the interpretation of this term, we need to look at the case law of the Court of Justice of the European Union ("CJEU"). According to that case law, we can speak of a protected work when two conditions are met: (1) the subject matter must be original and (2) there must be an expression of that originality (see for example CJEU 12 September 2019, Cofemel, C-683/17, para. 29):
First, the concept “work” entails that there exists an original subject matter, in the sense of being the author’s own intellectual creation.
It follows from the Court’s settled case-law that, if a subject matter is to be capable of being regarded as original, it is both necessary and sufficient that the subject matter reflects the personality of its author, as an expression of his free and creative choices (see for example CJEU 1 December 2011, Painer, C‑145/10, paragraphs 88, 89 and 94).
Secondly, classification as a work is reserved to the elements that are the expression of such intellectual creation. It must be possible to identify the subject matter clearly and precisely, without any element of subjectivity. This condition is, for example, the reason why a taste cannot be protected by copyright: a taste is subjective and cannot be objectively identified (CJEU 13 November 2018, Levola Hengelo, C‑310/17, paragraphs 33 and 35 to 37).
When subject matter meets these two European conditions, it is protected by copyright. The conditions are also sufficient, which implies that no other conditions may be imposed for protection; criteria such as 'novelty', 'inventiveness', 'aesthetic or artistic character' or 'a certain level of effort or expertise' are therefore irrelevant when determining whether subject matter is protected by copyright.
Under these European conditions, a great deal of subject matter can qualify as a work in the sense of copyright law. Case law shows that copyright protection can reach far and that the concept of a "work" is interpreted broadly, extending well beyond the obvious subject matter such as books, images, musical works and videos.
This broad scope of protection means that the envisaged obligation of the AI Act could be very extensive, leading to heavy administrative burdens. For example, a picture of a functional object may involve two layers of protection: (i) the picture as such may be protected by copyright, and (ii) so may the design of the functional object.
It goes without saying that a provider of generative AI systems would face a huge task in documenting and disclosing information on copyright-protected subject matter, knowing that a generative AI system can be trained on millions of images, text snippets, drawings, books, ...
Further, it is important to note that copyright arises upon creation, without registration, which means that there is no copyright register that providers of generative AI systems could consult to check whether certain data is copyright protected.
In practice, it will be up to a judge (usually in the context of litigation) to rule whether a certain work meets the conditions. In such proceedings, it is up to the alleged right holder to prove that the subject matter is a work of art and literature.
We notice that there is a great deal of subjectivity in the interpretation of these conditions, making the applicability of copyright protection unpredictable. The case law cited above shows that even judges' opinions may differ on whether particular subject matter is protected.
Obviously, if even judges' opinions may vary, it is very difficult for a provider of generative AI systems to assess whether certain data is copyright protected, all the more so because there is no copyright register where this can be checked.
To be clear, we are in favour of more transparency on data for foundation model providers. For example, we fully support the idea of regulating foundation models and the envisaged transparency obligations to disclose compute (model size, computing power, training time), the capabilities and limitations of the model, the results of internal and external testing, and so forth. And while the provision on transparency regarding copyrighted material is logical as well, we believe that the current provision is difficult to comply with in practice, for the reasons set out in this blogpost.
If the provision is indeed implemented, we believe that more guidance is needed on how providers can actually meet the envisaged obligation of the AI Act.
Moreover, more guidance on how the concept of a "sufficiently detailed summary" should be interpreted would be welcome: how detailed must the disclosure be, and what exactly counts as a summary?
The need for guidance is clear, as non-compliance with the new provisions may expose providers of generative AI systems to liability if insufficient summaries of training datasets are in place. A failure to comply with these disclosure obligations could lead to fines of up to €10 million or 2% of annual turnover, whichever is higher.
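To illustrate how the "whichever is higher" ceiling works in practice, here is a minimal sketch; the figures are taken from the current draft and may change in the final text, and the function name and example turnover are our own, purely hypothetical illustration:

```python
def max_fine_eur(annual_turnover_eur: float) -> float:
    """Hypothetical helper: the penalty ceiling under the draft text is the
    higher of a fixed EUR 10 million or 2% of annual turnover."""
    return max(10_000_000.0, 0.02 * annual_turnover_eur)

# Example: for a provider with EUR 2 billion in annual turnover,
# 2% amounts to EUR 40 million, which exceeds the fixed EUR 10 million.
print(max_fine_eur(2_000_000_000))  # 40000000.0
```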