June 29, 2023
AI image generation without copyright infringement
Building a Creative Commons dataset for retraining Stable Diffusion using the open source dataset creation framework Fondant
AI image generation has taken the world by storm with services like Midjourney, DALL-E and Dream Studio (Stable Diffusion). Their underlying generative AI models have been trained on hundreds of millions of images from the public internet. Many of these images are not free from copyright and this creates legal uncertainties for the users of image generation systems as it is unclear if copyright laws apply, especially internationally. Additionally, legal action has been announced by copyright holders against providers like Midjourney and Stability AI.
These issues have made businesses cautious when it comes to incorporating generative AI models in their workflow. This is unfortunate as AI image generation can lead to significant gains in productivity, for example by assisting designers in conceptualising and prototyping to speed up design cycles and improve quality.
One possible solution is to train a model exclusively on proprietary data, as Shutterstock does for its image generator. However, only a few players possess hundreds of millions of images with meaningful descriptions. Another option is to start from permissively licensed, publicly available images.
The largest repository of free-to-use images is Openverse. It holds approximately 600 million Creative Commons (CC) licensed works, but it has two main limitations: it relies on a small selection of websites, and it restricts users to only 10,000 downloads per day.
The solution we present in this post aims to overcome the above limitations by collecting a large, permissively licensed dataset from the public internet.
Creative Commons
When it comes to permissive licences, various options are available based on your specific use case. One widely recognised and utilised category is the Creative Commons (CC) licences. These licences are generally considered to hold international legal validity and are commonly used for images. They were established by the Creative Commons organisation, an international nonprofit. It is reported that over 2 billion creative works have already been licensed under a Creative Commons licence.
With their widespread adoption by platforms such as Wikimedia and Flickr, Creative Commons licences enable us to potentially amass a substantial collection of free-to-use images, making them suitable for various image generation tools.
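The primary licence categories we focus on are as follows:
- Public Domain (CC0 or the Public Domain Mark): the work can be used without copyright restrictions.
- BY (Attribution): credit must be given to the creator.
- SA (ShareAlike): adaptations must be shared under the same terms.
- NC (NonCommercial): only non-commercial use is permitted.
- ND (NoDerivatives): no derivatives or adaptations of the work are permitted.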
If you are interested in all licence categories or a more detailed description of each licence, you can look at the official CC licence page: About CC Licences. CC licences combine several of the previously mentioned categories, although not every combination exists (SA and ND exclude each other). Example: BY-NC-SA.
For our purposes, we are primarily looking to use the Public Domain, BY and BY-SA licences. These licences grant us the ability to modify the images and use them for commercial purposes, aligning with our use case.
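As a rough illustration of this selection rule, the sketch below classifies a licence by the code that appears in its Creative Commons URL and keeps only the codes that allow both modification and commercial use. The function names and the `ALLOWED` set are our own assumptions for this example, not part of the tool described in this post.

```python
import re

# Matches URLs such as https://creativecommons.org/licenses/by-sa/4.0/
# and https://creativecommons.org/publicdomain/zero/1.0/
CC_URL = re.compile(
    r"https?://creativecommons\.org/(licenses|publicdomain)/([a-z\-]+)",
    re.IGNORECASE,
)

# Assumed allow-list: codes that permit modification and commercial use.
# "zero" and "mark" are the two public domain tools.
ALLOWED = {"by", "by-sa", "zero", "mark"}


def licence_code(url):
    """Return the licence code found in a CC URL (e.g. 'by-sa'), or None."""
    match = CC_URL.search(url)
    return match.group(2).lower() if match else None


def is_usable(url):
    """True if the licence allows both modification and commercial use."""
    return licence_code(url) in ALLOWED


print(is_usable("https://creativecommons.org/licenses/by-sa/4.0/"))
print(is_usable("https://creativecommons.org/licenses/by-nc-nd/4.0/"))
```

Any licence containing NC or ND falls outside the allow-list, as does any URL that is not a Creative Commons licence at all.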
Our Approach
Many large open source datasets have been created by scanning Common Crawl files for images. Each of those files is an archive of web pages and contains the HTML code of every page in it. By scanning a large number of archived web pages, it is possible to extract image URLs and pair them with their corresponding textual content, thus generating a comprehensive dataset of image URL-to-text pairs.
Once the list of image URLs is obtained, the dataset can be made available to the public and anyone can download the images and train their own AI models. However, these datasets contain little to no image licence information. As a result, if we want a sufficiently large dataset of images with their licence information included, we have to take additional steps.
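To make the extraction step concrete, here is a minimal sketch using only the Python standard library. It assumes the HTML of a page has already been read from an archive, and it uses the `alt` attribute as the paired text, which is just one possible choice. `ImageCollector` and `extract_images` are hypothetical names for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect (absolute image URL, alt text) pairs from one HTML page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src")
        if src:
            # Resolve relative paths against the URL of the page.
            self.images.append(
                (urljoin(self.page_url, src), attributes.get("alt") or "")
            )


def extract_images(html, page_url):
    """Return all image URL-to-text pairs found in the given HTML."""
    collector = ImageCollector(page_url)
    collector.feed(html)
    return collector.images


page = '<p><img src="/a.png" alt="A cat"><img src="https://x.org/b.jpg"></p>'
print(extract_images(page, "https://example.com/post/"))
```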
Our approach consists of selectively retrieving images that are accompanied by Creative Commons (CC) licences on the respective web pages. By filtering the dataset based on this criterion, we should end up with images that can be used for training an AI image generation model similar to Stable Diffusion without infringing upon anyone’s intellectual property.
To achieve this, we examine specific sections of the web page, namely footers, asides and sidebars. A licence located in an aside or sidebar is only collected if it sits no more than five levels of HTML nesting deep within that section. This allows us to capture the pertinent details related to the licence. We then gather all the image URLs present on the web page, which gives us a comprehensive dataset of image-licence pairs.
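The idea can be sketched with the standard library's HTML parser. This is not our production algorithm but a simplified illustration under several assumptions: a sidebar is recognised by the word "sidebar" in an element's `id` or `class` (HTML has no dedicated sidebar tag), a licence is any link to creativecommons.org, and the nesting level is counted as the number of elements between the section and the link. All names in the block are hypothetical.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they must not be pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
MAX_DEPTH = 5  # nesting limit applied to aside and sidebar sections


def section_kind(tag, attrs):
    """Classify an element as 'footer', 'aside', 'sidebar' or None."""
    if tag in ("footer", "aside"):
        return tag
    labels = " ".join(v or "" for k, v in attrs if k in ("id", "class"))
    return "sidebar" if "sidebar" in labels.lower() else None


class LicenceFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []  # (tag, section kind or None) for each open element
        self.found = []  # (licence URL, section kind)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._check_link(dict(attrs).get("href") or "")
        if tag not in VOID:
            self.stack.append((tag, section_kind(tag, attrs)))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def _check_link(self, href):
        if "creativecommons.org/" not in href:
            return
        # Walk outwards to the innermost enclosing section, counting levels.
        for depth, (_, kind) in enumerate(reversed(self.stack)):
            if kind:
                if kind == "footer" or depth <= MAX_DEPTH:
                    self.found.append((href, kind))
                return


def find_page_licences(html):
    """Return CC licence links found in footers, asides and sidebars."""
    finder = LicenceFinder()
    finder.feed(html)
    return finder.found
```

A licence link in the main body of the page is ignored by this sketch, as is a sidebar link buried more than five levels deep.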
What is a footer, aside or sidebar?
For those less familiar with HTML code, here’s a brief explanation:
- Footer tags: These are HTML blocks used to encapsulate information placed at the bottom of a web page. Typically, footers contain contact information, hyperlinks to other pages within the website, or even licences.
- Aside or sidebar tags: These HTML blocks serve a similar purpose to footers. They often contain hyperlinks to other web pages or serve as side notes. As the name suggests, sidebars and asides are usually located on the left or right side of a website’s layout.
In the images below, you can see a general structure of a web page and an example of how those sections can appear.
We noticed that websites which place their CC licence in these locations have it apply to their whole web page. We tested this hypothesis on a small test dataset of 1,032 web pages. The dataset was created by randomly selecting 5 different Common Crawl files from the past 3 years and extracting the web pages that carry a CC licence. The total number of Common Crawl files we could select from was approximately 1,920,000 and each file contained about 36,300–49,500 useful web pages.
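These figures allow a quick back-of-the-envelope estimate of the scale involved. The calculation below uses only the numbers quoted above. The extrapolation at the end assumes that the 5 sampled files are representative of all files, which such a small sample cannot guarantee, so it should be read as an order of magnitude at best.

```python
# Figures quoted in the text.
total_files = 1_920_000
pages_per_file_low, pages_per_file_high = 36_300, 49_500
sampled_files = 5
cc_pages_in_sample = 1_032

# Useful web pages across all files.
total_low = total_files * pages_per_file_low
total_high = total_files * pages_per_file_high
print(f"Useful pages overall: {total_low:,} to {total_high:,}")

# Share of sampled pages that carried a CC licence.
share_low = cc_pages_in_sample / (sampled_files * pages_per_file_high)
share_high = cc_pages_in_sample / (sampled_files * pages_per_file_low)
print(f"CC share in sample: {share_low:.2%} to {share_high:.2%}")

# Naive extrapolation (assumption: the sample is representative).
estimated_cc_pages = cc_pages_in_sample * total_files // sampled_files
print(f"Extrapolated CC pages: about {estimated_cc_pages:,}")
```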
We manually labelled the dataset based on the licence location on the web page and whether the licence could be interpreted to represent the entire web page or not.
We noticed that most of the CC licences are located in the footer of the web page. A local licence, by contrast, appears within the body (main content/middle) of a web page and usually refers to only one specific image on that page. There can also be different licences on the same web page, each referring to a different image.
Our initial approach was to establish a direct link between CC licences and individual images. However, this proved to be exceedingly challenging due to the diverse range of methods used for image referencing. Furthermore, the licences did not always refer to images.
One problematic example we discovered was “poem” blogs. People often used a copyrighted image at the top of their poem but referenced a CC licence below it that actually referred to the poem, not the image. This is just one of several cases that limited our algorithm to only 80% accuracy at detecting local licences.
Because of those limitations, we decided to focus on identifying web pages where the entire content falls under a single licence. When only footers, asides and sidebars are taken into account, the first draft of our algorithm interpreted 96.32% of the CC licences correctly.
It is important to note, however, that caution should be exercised when interpreting the accuracy of this solution. Our dataset consisted of only 1,032 web pages, which may limit the extent to which our findings can be generalised.
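To give a feel for that uncertainty, the sketch below computes a 95% Wilson score interval for the reported accuracy. It assumes that all 1,032 pages entered the accuracy figure and that they behave like an independent random sample. If only a subset of the pages was used, the interval would be wider.

```python
import math


def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a proportion p_hat observed on n samples."""
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Reported accuracy and assumed sample size.
low, high = wilson_interval(0.9632, 1032)
print(f"95% interval: {low:.2%} to {high:.2%}")
```

Under these assumptions the interval spans roughly a percentage point on either side of the reported value.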
Now that we have an algorithm that can detect CC licensed web pages with relatively high accuracy, the next step is gathering the image URLs of those web pages. Once we have gathered all the image URLs from CC licensed web pages, we store them in a Parquet file.
Scaling up our solution
As our solution is intended to scan billions of web pages, we need scalability. We chose Fondant as our framework because it allows for easy scaling and local development. We will contribute our components and pipeline to the Fondant repository in the near future (ping us on the Fondant Discord in case you are interested in preview access to this pipeline and components).
Here is a high level overview of the architecture of the Fondant pipeline and (custom) components.
Downloading the images
Once you’ve run our tool, you’ll end up with a dataset containing a list of image URLs and the metadata associated with them (image size, licence type, licence location …). You can then filter on the metadata to easily choose a subset of the data that fits your use case.
To download the images behind these URLs, Fondant offers the download_images component. This component can download, resize, and package 100+ million image URLs within 20 hours on one machine, or it can be scaled across multiple machines.
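As an illustration of such a filter, the sketch below selects records by licence type and minimum image size. The records and field names are made up for this example and do not reflect the actual schema of the dataset.

```python
# Hypothetical records; the field names are assumptions for illustration.
records = [
    {"url": "https://example.org/a.jpg", "width": 1024, "height": 768,
     "licence": "by", "licence_location": "footer"},
    {"url": "https://example.org/b.jpg", "width": 120, "height": 90,
     "licence": "by-sa", "licence_location": "aside"},
    {"url": "https://example.org/c.jpg", "width": 800, "height": 800,
     "licence": "by-nc", "licence_location": "footer"},
]


def select_images(records, allowed_licences, min_side):
    """Keep records with an allowed licence whose shorter side is large enough."""
    return [
        r for r in records
        if r["licence"] in allowed_licences
        and min(r["width"], r["height"]) >= min_side
    ]


subset = select_images(records, {"by", "by-sa", "zero"}, min_side=256)
print([r["url"] for r in subset])
```

Here the second record is dropped for being too small and the third for carrying a non-commercial licence.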
Next Steps
We are currently polishing the free-to-use dataset tool and packaging it as separate components and as a standard pipeline in Fondant, a framework that facilitates collecting and preprocessing data for fine-tuning foundation models. Fondant enables building pipelines from reusable components for data extraction, transformation, filtering, captioning and more.
The aim is to create our own Creative Commons licensed dataset which will be made publicly available. For this, we will also be looking to further increase accuracy, for example through watermark detection, as well as options for opt-out mechanisms. Furthermore, we are planning to add extra features to the dataset, including not safe for work (NSFW) image detection and aesthetics assessment.
Conclusion
The need for a large free-to-use dataset for AI image generation tools is pressing. Our solution addresses this need by collecting a large-scale, permissively licensed Creative Commons dataset while at the same time being scalable, cloud agnostic and accurate.
We hope this will significantly enhance the legal security of image-to-text datasets compared to existing solutions. By prioritising adherence to licensing regulations, we strive to create a more reliable and trustworthy environment for data utilisation.