October 31, 2023

EU AI Act compliance for AI models on copyright & training data

How can AI model providers comply with regulation concerning training data & copyright?

Foundation models

Foundation models serve as the basis for many advanced artificial intelligence systems. Simply put, a foundation model is a large-scale AI model trained on vast amounts of data that can subsequently be specialised or applied across varied domains. A prominent example of such systems is generative AI models, which can autonomously produce content, be it text (like ChatGPT), images (like Midjourney), audio, or video. 

Generative AI models

Generative AI models, by their very nature, rely on extensive datasets for training. These datasets often contain enormous quantities of images, text snippets, and other forms of data gathered from diverse sources. The sheer volume and variety of the data these models consume can obscure its origin, and some of it may be protected under copyright law.

The EU AI Act Regulation

The EU AI Act aims to regulate foundation models and generative AI systems. Under one of the obligations in the current version of the Act as adopted by the European Parliament (the text is not yet finalised), providers of foundation models used in generative AI systems should "document and make publicly available a sufficiently detailed summary of the use of training data protected under copyright law" (amendment 399, article 28b of the current text). In other words, companies such as OpenAI (as the provider of ChatGPT) would be obliged to document and disclose the copyright-protected data used to train their models.

The goal of the obligation is clear and logical: providing transparency and ensuring that stakeholders have visibility into the workings of these influential AI systems. However, while we support more transparency, in this blog post we set out two reasons why the envisaged obligation could prove to be a very difficult (not to say impossible) task for many companies developing and deploying these kinds of models:

  1. Copyright can reach very far, and a great deal of different content can potentially be protected by it (books, images, text snippets, design objects, the design of functional objects, ...). The obligation would therefore impose a huge administrative burden on providers of foundation models, as an incredibly large amount of content would have to be documented and disclosed.

  2. Whether the conditions for copyright protection are met is a subjective question, making the applicability of copyright protection unpredictable. Moreover, providers of foundation models are not in the best position to judge whether those conditions are met.

As non-compliance could lead to severe penalties, we believe that, if the provision is actually adopted, more guidance is needed on how providers of foundation models can comply with this obligation. 

Copyright can reach far

Copyright protection applies to "works of art and literature", which is an autonomous concept of Union law to be interpreted uniformly throughout the European Union.

For the interpretation of the term, we need to look at the case law of the Court of Justice of the European Union ("CJEU"). According to that case law, there is a protected work when two conditions are met: (1) the subject matter must be original and (2) there must be an expression (see for example CJEU 12 September 2019, Cofemel, C-683/17, para. 29):

Condition 1: Original Subject Matter

First, the concept of a “work” entails that there exists an original subject matter, in the sense of being the author’s own intellectual creation.

It follows from the Court’s settled case-law that, if a subject matter is to be capable of being regarded as original, it is both necessary and sufficient that the subject matter reflects the personality of its author, as an expression of his free and creative choices (see for example CJEU 1 December 2011, Painer, C‑145/10, paragraphs 88, 89 and 94).

Condition 2: Expression of intellectual creation

Secondly, classification as a work is reserved to the elements that are the expression of such intellectual creation. It must be possible to identify the subject matter clearly and precisely, without any element of subjectivity. This condition is, for example, the reason why a taste cannot be protected by copyright: a taste is subjective and cannot be objectively identified (CJEU 13 November 2018, Levola Hengelo, C‑310/17, paragraphs 33 and 35 to 37).

Copyright protection in Europe

When subject matter meets these two European conditions, it is protected by copyright. The conditions are also sufficient, meaning that no other conditions may be imposed for protection; terms such as 'novelty', 'inventiveness', 'aesthetic or artistic character' or 'a certain level of effort or expertise' are irrelevant when determining whether subject matter is protected by copyright. 

Under these European conditions, a great deal of subject matter can be considered a work in the sense of copyright law. Examples from case law show that copyright protection can reach far and that the concept of a “work” is interpreted broadly (beyond the obvious subject matter such as books, images, musical works, videos, and so forth):

  • For example, a Belgian court has ruled that a guide for using IT equipment was protected by copyright (Brussels Court of Appeal, 28 January 1997).
  • According to the CJEU, even eleven consecutive words can potentially be a “work” and therefore protected by copyright (CJEU 16 July 2009, Infopaq, C-5/08).
  • The design of even very functional objects can be considered a work of art and literature. For example, case law has held that the design pictured below (the support of a waffle iron) was protected by copyright (Court of Appeal Brussels, 25 October 2011, no. 2011/AR/119):

The copyright impact for AI models

The broad scope of protection means that the envisaged obligation of the AI Act could be very extensive, leading to extreme administrative burdens. For example, for a picture of a functional object, both (i) the picture as such and (ii) the design of the functional object may be protected by copyright. 

It goes without saying that a provider of generative AI systems would face a huge task in documenting and disclosing information on copyright-protected subject matter, given that a generative AI system can be trained on millions of images, text snippets, drawings, books, and so on.

Copyright is subjective

Further, it is important to note that copyright arises upon creation and without registration, which means there is no copyright register that providers of generative AI systems could consult to check whether certain data is copyright protected.

In practice, it will be up to a judge (usually in the context of litigation) to rule whether or not a certain work meets the conditions. In doing so, it is also up to the alleged right holder to prove that the subject matter is a work of art and literature.

In practice, there is considerable subjectivity in the interpretation of these conditions, making the applicability of copyright protection unpredictable.

Examples prove copyright assessments can go either way

  • In a case concerning pictures of football players and football matches (portrait pictures, pictures of players in action, pictures of the stadiums and atmosphere), the Court of Appeal of Brussels ruled that these kinds of pictures were protected by copyright because the photographer was able to make several free and creative choices: for example regarding the angle, the point of view, the lighting, the timing of the picture, the camera settings, and so forth. The photographer therefore made "free and creative choices" while taking the pictures, which means that they were original (Court of Appeal Brussels, 3 October 2017, no. 2013/AR/860).
  • In another case, however, the Court of Appeal of Ghent ruled that the pictures used on a specific real estate website, as well as the accompanying texts describing the properties, were not protected by copyright. The court held that the pictures could potentially be protected, but that originality was not proven in this specific case (Court of Appeal Ghent, 25 June 2018, no. 2016/AR/470).
  • And even regarding the same subject matter, different courts can reach different conclusions, for example:
      • Multiple courts have ruled that the design of the Le Pliage handbag from Longchamp was protected by copyright (inter alia Court of Appeal Brussels on 18 May 2006, no. 2003/AR/880 and on 20 April 2012, no. 2012/AR/2910), while there is also case law ruling that the design is not protected (Court of Appeal Ghent, 20 October 2014, no. 2013/AR/1945).
      • Or, regarding the following object, the court at first instance ruled that the design was protected by copyright (Commercial Court Ghent, 11 January 2018, no. A/16/02910), while the court of appeal ruled that it was not (Court of Appeal Ghent, 1 February 2021, no. 2018/AR/254):

The examples above show that there is a lot of subjectivity surrounding copyright protection, and that even judges' opinions on the same subject matter may differ. 

Obviously, if even judges' opinions may vary, it is very difficult for a provider of generative AI systems to assess whether certain data is copyright protected, all the more so given that there is no copyright register in which this can be checked. 

Conclusion

It goes without saying that we are in favour of more transparency on data for foundation model providers. For example, we fully support the idea of regulating foundation models and the envisaged transparency obligations to disclose compute (model size, computing power, training time), the capabilities and limitations of the model, the results of internal and external testing, and so forth. However, while the provision on transparency regarding copyrighted material is logical as well, we believe that the current provision is difficult to comply with in practice, for the reasons set out in this blog post.

If the provision is actually implemented, we believe that more guidance is needed on how providers can actually meet the envisaged obligation of the AI Act. 

Moreover, more guidance on how the concept of a “sufficiently detailed summary” should be interpreted would be welcome. The question arises how detailed the disclosure must be, and what is meant by a 'summary'.

The importance of and need for guidance is clear, as non-compliance with the new provisions may expose generative AI system providers to liability if insufficient summaries regarding training datasets are in place. A failure to comply with these disclosure obligations could lead to fines of up to €10 million or 2% of annual turnover, whichever is higher.
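As a simple illustration, the penalty ceiling just described is the higher of two amounts. The sketch below is purely hypothetical and based only on the figures as cited in this blog post, not on the final legal text:

```python
def max_fine_eur(annual_turnover_eur: float) -> float:
    """Illustrative penalty ceiling as described in this post:
    EUR 10 million or 2% of annual turnover, whichever is higher."""
    return max(10_000_000.0, 0.02 * annual_turnover_eur)
```

For a provider with an annual turnover of €1 billion, 2% amounts to €20 million, which exceeds the €10 million floor, so the turnover-based cap would apply.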