Bert Christiaens
Machine Learning Engineer
With the world of e-commerce expanding rapidly since the pandemic, presenting your product in an attractive and original way is more important than ever. As customers cannot touch or try your product, showcasing it in stunning, professional and detailed images is the first step towards convincing them to make a purchase. Studies show that 76.5% of customers acknowledge the significant impact of high-quality product photography on their purchase decisions. However, high-quality, professional product pictures do not come for free. Apart from the cost and overhead of setting up a studio and getting your product to the location, professionals charge on average between $35 and $50 per image, and well over $400 at the high end. Many small business owners simply do not have the funds to do this for every product, and even big players that do have the funds may not have the time to set this all up as their product catalogue changes rapidly.
What if this long, expensive process were unnecessary altogether? What if we could just take an amateur picture of our product, let some AI magic do its thing, and get professional-looking images online in no time?
This blog post walks through the development of a tool that leverages several generative AI models and achieves some remarkable results, shown below! If you want to try it out for yourself, the demo and code are available on Hugging Face for free!
Those who cannot wait to try the demo can access it immediately here.
Examples:
A first step in building this demo was getting up to speed with current best practices in the field of generative AI, especially for image generation. The Hugging Face Hub is a great platform offering a wide range of open-source models, datasets and demos to test and learn about new, powerful models. This project combines three important models: SAM, Stable Diffusion (+ Inpainting) and ControlNet. Figure {1} presents the general workflow of the generation process. In what follows, we dive deeper into every aspect of this process.
The Segment Anything Model (SAM) by Meta AI Research introduced a new approach to image segmentation. The model, shown in figure {2}, addresses the promptable segmentation task, which aims to produce a valid segmentation mask given a certain prompt. The prompt specifies what to segment in the image; in our case it consists of spatial information in the form of point(s) or box(es) indicating the object we want to mask. The model also allows for text and mask inputs, but we do not use those in this setting.
The general model architecture consists of three main parts: a heavy image encoder, a lightweight prompt encoder and a fast mask decoder. This architecture allows for flexible, real-time use, which is perfectly suited to our setting. The image encoder is a MAE pre-trained Vision Transformer (ViT), available in three sizes: Huge, Large and Base. In this context we used the Base model, as it is lighter and faster while not being far behind the larger models in performance. The image encoder is run once per image, and its output can then be combined with different prompts as input for the mask decoder. The mask decoder efficiently maps both image and prompt embeddings to mask probabilities for each location in the image. As prompts are ambiguous (e.g. do you want a mask for a t-shirt, or for the full person wearing the t-shirt?), the mask decoder predicts multiple valid masks and outputs the one with the highest associated confidence score.
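To make this concrete, here is a minimal sketch of promptable segmentation with the official segment_anything package, assuming a ViT-B checkpoint has been downloaded locally (the checkpoint path, image file and point coordinates are placeholders):

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Load the ViT-B (Base) SAM model; the checkpoint path is a placeholder.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

# The heavy image encoder runs only once per image.
image = cv2.cvtColor(cv2.imread("product.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# A single foreground point as spatial prompt (label 1 = foreground).
point_coords = np.array([[450, 300]])
point_labels = np.array([1])

# The lightweight mask decoder returns multiple candidate masks with confidence scores.
masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]  # keep the highest-confidence mask
```

The same predictor can be queried again with a different point or box without re-encoding the image, which is what makes the interactive mask creation in the demo feel responsive.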
This part is where the true magic happens. Stable Diffusion is a latent text-to-image diffusion model created by CompVis, Stability AI and LAION, based on the ideas proposed in the paper by Robin Rombach et al. It is capable of generating photo-realistic images given any text input, and consists of three main building blocks: a text encoder, a U-Net and an autoencoder (VAE). A visualisation of these components can be found in figure {3}.
Stable Diffusion uses a pre-trained text encoder, CLIPTextModel. It converts the input prompt, e.g. “dog on the moon”, into a numerical representation, the text embedding, that is used as input for the U-Net. The U-Net is where the diffusion itself takes place. It is a neural network architecture with an encoder and a decoder part composed of ResNet blocks. The encoder and decoder parts are connected via skip connections so that no semantics learned in the early stages of the network are lost.
During image generation, the U-Net takes in the prompt embeddings together with an image containing pure random Gaussian noise. Over a predefined number of steps (e.g. 25), it iteratively predicts a less noisy image from its previous iteration until a clear, final image emerges.
Lastly, the autoencoder allows you to go back and forth between latent space and pixel space. It converts a normal image into a lower-dimensional space where the diffusion process can take place, which makes diffusion far less memory and time consuming. Working in latent space also allows the U-Net to focus more on general concepts of the image, like shapes, positions and colours, and less on predicting every single pixel. After the diffusion process, the autoencoder projects the predicted image from latent space back to pixel space through the VAE decoder. The autoencoder in Stable Diffusion has a spatial reduction factor of 8: an image of shape (3, 512, 512) becomes a latent of shape (4, 64, 64), so each spatial dimension shrinks by a factor of 8, meaning 8x8 = 64 times fewer spatial locations to process!
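The sketch below, assuming the diffusers library and the stabilityai/stable-diffusion-2-base checkpoint, shows how these three components are exposed in a standard pipeline and illustrates the latent reduction:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the base text-to-image pipeline; it bundles the text encoder, U-Net and VAE.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-base", torch_dtype=torch.float16
).to("cuda")

print(type(pipe.text_encoder).__name__)  # CLIPTextModel
print(type(pipe.unet).__name__)          # UNet2DConditionModel
print(type(pipe.vae).__name__)           # AutoencoderKL

# Encode a dummy 512x512 RGB image into latent space to see the 8x spatial reduction.
dummy = torch.randn(1, 3, 512, 512, dtype=torch.float16, device="cuda")
latents = pipe.vae.encode(dummy).latent_dist.sample()
print(latents.shape)  # torch.Size([1, 4, 64, 64])

# Plain text-to-image generation over 25 denoising steps.
image = pipe("dog on the moon", num_inference_steps=25).images[0]
image.save("dog_on_the_moon.png")
```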
In our case we used stable-diffusion-2-inpainting. This model starts from the normal stable-diffusion-2-base checkpoint and is then trained for another 200k steps using the mask-generation strategy presented in LAMA. This gives us additional conditioning over the diffusion process, so we can keep certain parts, such as a personal object, unchanged. While our demo also allows for inpainting in the classic sense of regenerating a small (corrupted) part of an image, we mostly use the model for outpainting. Outpainting is the opposite: it extends a picture, or in our case a product image, beyond its original borders. The model we used here allows for both in- and outpainting. Examples can be seen in figure {4}.
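As a rough sketch of how this checkpoint is used with diffusers (the file names are placeholders: a canvas containing the product and a binary mask where white marks the region to regenerate):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

# White pixels in the mask are regenerated, black pixels are kept as-is,
# which is how the product itself stays untouched while the background changes.
init_image = Image.open("canvas_with_product.png").convert("RGB").resize((512, 512))
mask_image = Image.open("background_mask.png").convert("RGB").resize((512, 512))

result = pipe(
    prompt="a coffee mug on a wooden table, professional product photo",
    image=init_image,
    mask_image=mask_image,
    num_inference_steps=25,
).images[0]
result.save("outpainted.png")
```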
ControlNet presents a solution for giving more control over the generation process of large text-to-image models like Stable Diffusion. It is based on the idea of hypernetworks, where a small network is trained to influence the weights of a larger one. ControlNet clones the weights of a large model into a “locked” and a “trainable” copy. Since the original diffusion model is left untouched, it does not lose any of the knowledge it learned during training on billions of images. In the case of Stable Diffusion, ControlNet only copies the weights of the U-Net encoder into a trainable copy. Both models are then connected through special zero-convolution layers. Lastly, the trainable copy is trained on a task-specific dataset to learn the additional control. Figure {5} shows the architecture of ControlNet with Stable Diffusion.
Many different ControlNet models are available on Hugging Face, each adding a certain extra conditioning, such as depth maps, the pose of a person, segmentation maps, and more, to steer the generation process. In our demo we only use “Canny edges” as conditioning. This model takes an extra conditioning image as input: a monochrome image with white edges on a black background. These edges are then preserved during the generation process. In our case this is valuable, as we aim to maintain the original product’s edges without introducing unrealistic additional features or visuals.
An example of Stable Diffusion with ControlNet canny edge conditioning can be found in figure {6}.
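As a hedged sketch of what canny conditioning looks like with diffusers (the lllyasviel/sd-controlnet-canny and runwayml/stable-diffusion-v1-5 checkpoints are used here purely as well-known examples, not the exact combination the demo runs, which pairs ControlNet with the inpainting model described above):

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Build the monochrome conditioning image: white Canny edges on a black background.
image = np.array(Image.open("product_on_canvas.png").convert("RGB"))
edges = cv2.Canny(image, 100, 200)
canny_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

result = pipe(
    prompt="a coffee mug on a wooden table, professional product photo",
    image=canny_image,  # the edges in this image are preserved during generation
    num_inference_steps=25,
).images[0]
result.save("controlled.png")
```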
To make this demo accessible to a broad audience, all the models above were wrapped in a user-friendly demo using Streamlit, so everyone can test it for themselves! Streamlit allows for fast, flexible front-end development by adding a few lines of code to your Python scripts, requiring no knowledge of HTML, CSS or JavaScript. The power of Streamlit comes from its ability to automatically rerun the script and update outputs each time an input changes. However, this also means that Streamlit has some drawbacks and should be used carefully. As the entire script reruns from top to bottom each time, variables are also created from scratch on every iteration. Streamlit implements a session_state concept to maintain variables between iterations. You need to carefully consider whether to use a normal variable or store it in session_state, so that the full script can be re-executed correctly whenever an input changes.
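A minimal sketch of this pattern (the widget names and values are just illustrative, not the demo's actual code):

```python
import streamlit as st

# session_state survives reruns; a plain variable would be reset on every interaction.
if "generated_images" not in st.session_state:
    st.session_state.generated_images = []

prompt = st.text_input("Prompt", "a coffee mug on a wooden table")

if st.button("Generate"):
    # In the real demo this is where the diffusion pipeline would be called.
    st.session_state.generated_images.append(f"image for: {prompt}")

# Because the list lives in session_state, earlier results persist across reruns.
for item in st.session_state.generated_images:
    st.write(item)
```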
After building the application, you can run and maintain it on Hugging Face Spaces, a platform that lets you run Streamlit, Gradio and Docker containers on a VM managed by Hugging Face, so other people can access them. However, running your app in public also raises some problems. Since multiple users can access the VM at the same time, the different threads have to be handled carefully so that the available resources are not exhausted and performance does not drop dramatically. In our case, the Stable Diffusion model is quite memory intensive, since it uses GPU memory for image generation. Allowing multiple users to generate images at the same time could crash the demo (giving everyone the infamous CUDA out-of-memory errors). To address this, we implemented a queueing system that leverages the caching techniques integrated in Streamlit, along with a mechanism for buffering parallel inference requests.
Streamlit provides a convenient way to optimise resource usage through the st.cache_resource decorator. This decorator keeps function outputs in memory, similar to the better-known lru_cache decorator: when the function is called again with the same inputs, the stored output is returned instead of rerunning the function. This significantly boosts app performance, especially for time-consuming functions such as loading the weights of a Stable Diffusion model. The queueing system, however, uses this concept to share the same instance of a custom WaitingQueue class across all users. Figure {7} visualises this system.
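The model-loading use case looks roughly like this (a minimal sketch; the function name is ours):

```python
import streamlit as st
import torch
from diffusers import StableDiffusionInpaintPipeline

@st.cache_resource
def load_pipeline():
    # Runs once per server process; every user session reuses the same pipeline object.
    return StableDiffusionInpaintPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
    ).to("cuda")

pipe = load_pipeline()
```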
The class stores a queue and a dictionary. The queue determines each user's position; the dictionary stores additional user information, which is used to check whether the predecessor in the queue is still active. Two mechanisms ensure there are no deadlocks and the process keeps running smoothly, as sketched below. First, if a user in the queue does not update their timestamp every 10 seconds while not generating images, the user behind them removes them from the queue. Second, if a user has been generating for longer than 30 seconds, their successor also assumes something went wrong and removes them from the queue.
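A simplified sketch of how such a shared queue could look (this is a loose reconstruction, not the demo's exact implementation; the timeout values follow the description above):

```python
import time
import threading
from collections import deque

import streamlit as st


class WaitingQueue:
    """One instance shared across all user sessions via st.cache_resource."""

    HEARTBEAT_TIMEOUT = 10   # seconds without a heartbeat while waiting
    GENERATION_TIMEOUT = 30  # seconds a single generation is allowed to take

    def __init__(self):
        self.queue = deque()   # user ids, in order of arrival
        self.user_info = {}    # user id -> {"last_seen": float, "generating": bool}
        self.lock = threading.Lock()

    def join(self, user_id):
        with self.lock:
            if user_id not in self.user_info:
                self.queue.append(user_id)
                self.user_info[user_id] = {"last_seen": time.time(), "generating": False}

    def heartbeat(self, user_id):
        # Each session calls this periodically while it is waiting or generating.
        with self.lock:
            if user_id in self.user_info:
                self.user_info[user_id]["last_seen"] = time.time()

    def my_turn(self, user_id):
        with self.lock:
            self._evict_stale()
            return bool(self.queue) and self.queue[0] == user_id

    def _evict_stale(self):
        # Remove users whose heartbeat or generation has timed out.
        now = time.time()
        for uid in list(self.queue):
            info = self.user_info[uid]
            timeout = self.GENERATION_TIMEOUT if info["generating"] else self.HEARTBEAT_TIMEOUT
            if now - info["last_seen"] > timeout:
                self.queue.remove(uid)
                del self.user_info[uid]


@st.cache_resource
def get_waiting_queue():
    # Cached once, so every user of the app talks to the same WaitingQueue instance.
    return WaitingQueue()
```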
Now that we have explained every component of the general workflow in figure {1}, let us walk through a simple example of how it all comes together in the demo. As this project was part of an internship at ML6, we will try to make professional-looking images featuring some of their merchandise. The UI of the demo is divided into three tabs, each covering a part of the workflow: “Mask creation”, “Place mask(s)” and “Generate Images” (see figure {8} for screenshots of the workflow).
First, we take some amateur pictures of our products and load them into the demo. Here we try some examples with a coffee mug, a water bottle and a t-shirt. Once the images are loaded, we start in tab 1 to create the masks of the different objects. This corresponds to adding one or multiple points and/or boxes to the images, which serve as input for our SAM model. Figure {8} shows this process.
After extracting all the necessary products from our images, we move on to the second tab. Here we can resize, rotate and place our objects onto a blank canvas. We can also set the size of the blank canvas, which will equal the size of the generated images. Figure {9} visualises this process.
Lastly, we go to the third tab. Here we specify how many images we want, add a positive and a negative prompt, and select a ControlNet guidance scale, as sketched below. The positive prompt describes what you want your images to look like, e.g. “A coffee mug on the table”. The negative prompt, on the other hand, contains words for things or concepts you want the model to stay away from when generating, e.g. “ugly, bad quality, deformed hands, ...”. The ControlNet guidance scale indicates how strongly the ControlNet model should influence Stable Diffusion: when set to 0, ControlNet does nothing; when set to 1, Stable Diffusion will only generate edges in places where they were present in the conditioning image. Figure {10} visualises the workflow of this last tab. The whole process only took us about 5 minutes!
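In terms of the diffusers API sketched earlier, these three settings roughly map onto the following call parameters (pipe and canny_image are assumed to come from the ControlNet sketch above; the values are just examples):

```python
images = pipe(
    prompt="a coffee mug on a wooden table, professional product photo",
    negative_prompt="ugly, bad quality, deformed hands",
    image=canny_image,                  # conditioning image built from the placed products
    controlnet_conditioning_scale=0.7,  # the ControlNet guidance scale slider
    num_images_per_prompt=4,            # how many images we want
    num_inference_steps=25,
).images
```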
Here are some of the results after playing around for a while! Notice the shadows and reflections Stable Diffusion is able to generate around the objects.
While there is still a noticeable gap between these results and real professional images, they clearly show the potential of these text-to-image models. This tool could already serve a great purpose for small businesses or basic low-cost products, or as a temporary solution until professional images are available.
As the models used in this example are all available off-the-shelf, fine-tuning towards certain products or environments can further improve performance significantly (for an example, check this interior design demo, which uses a fine-tuned ControlNet for high-quality generation).
The AI community is evolving at a blistering pace, with new models, tools and applications coming out every day. Some exciting new open-source projects in the making are BLIP-Diffusion and Uni-ControlNet. BLIP-Diffusion, like DreamBooth and Textual Inversion, allows the model to learn personalised objects from images, after which it should be able to regenerate these objects itself. A big challenge here is training the objects quickly while still generating them accurately. Uni-ControlNet is a novel approach that applies multiple ControlNet conditions at once, allowing for even more control over the generation process.
Considering the trade-off between quality and cost/time, generative AI is, and certainly will remain, a serious contender to change the world of professional product photography!
This blog post was written by ML6 intern Clement Viaene. You can also read it on Medium here.