Bert Christiaens
Machine Learning Engineer
In this blogpost, we’ll examine the applications of combining VQGAN with Discrete Absorbing Diffusion models, from the amazing paper:
If you are curious about the clever techniques that made this possible, the paper itself and our technical blogpost will definitely be something for you! In the technical blogpost we gave an overview of the previous SOTA generative models, to then arrive at a new class of models: the diffusion models. We went in depth on the architecture of the Discrete Absorbing Diffusion models. In this blogpost, we will take a step back and look at how we can leverage the characteristics of this model for creative purposes.
In the technical blogpost we discussed two models:
In the technical blogpost we saw that a discrete diffusion model doesn’t generate these latent codes from left to right, as an autoregressive model does, but can generate them in a random order.
This out-of-order and bidirectional sampling allows us to use novel techniques to edit and generate images:
As in most generative models, we can generate images from scratch. The diffusion model starts with an empty 16x16 grid, denoted by MASK tokens, and iteratively generates tokens to fill up the grid. Once every code is generated, the VQGAN decoder uses these generated codes to create globally coherent images competitive with other generative models.
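To make this concrete, here is a minimal sketch of such a sampling loop in PyTorch. All names (`denoiser`, `MASK_ID`, `vqgan`) are illustrative placeholders rather than the paper’s actual code: the denoiser is assumed to return a distribution over the codebook for every position of the flattened 16x16 grid.

```python
import torch

MASK_ID = 1024   # assumed index of the special MASK token (one past the codebook)
GRID = 16        # the latent grid is 16x16

@torch.no_grad()
def fill_masked(denoiser, codes, temperature=1.0):
    """Iteratively replace MASK tokens with sampled codebook indices, in random order."""
    codes = codes.clone()
    while (codes == MASK_ID).any():
        masked = (codes == MASK_ID).nonzero(as_tuple=True)[1]     # still-masked positions
        pos = masked[torch.randint(len(masked), (1,))].item()     # pick one at random
        logits = denoiser(codes)[0, pos] / temperature            # distribution over the codebook
        codes[0, pos] = torch.multinomial(torch.softmax(logits, -1), 1).item()
    return codes

# Unconditional generation: start from a fully masked grid, fill it in,
# then decode the codes with the VQGAN decoder (vqgan is assumed here).
codes = torch.full((1, GRID * GRID), MASK_ID)
# image = vqgan.decode(fill_masked(denoiser, codes).view(1, GRID, GRID))
```

The same helper is reused in the sketches below: everything that follows only changes which positions start out as MASK tokens.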
Now let’s talk about conditional image generation. This is a fancy way of saying that we fix a part of the image to steer the generation of the rest of the image.
With existing autoregressive models, it is possible to do conditional image generation based on a partial image. The idea here is to give the model the top part of an image as context and then complete the content by predicting the next pixels one by one, in a top-to-bottom, left-to-right fashion. Instead of predicting pixels, we could also predict discrete codes in a unidirectional manner, an idea explored in the paper Taming Transformers for High-Resolution Image Synthesis.
Impressive, right? However, it is only possible to condition on the top part of an image, due to the unidirectional nature of autoregressive models.
That’s where our diffusion model can do better! We can encode an image with the VQGAN encoder, giving us a grid of latent variables. Then we choose which parts of the image we want to keep and let the model generate everything around it.
This is a big improvement in flexibility compared to autoregressive models, since we are no longer restricted to specific parts of an image. As a demonstration, in the following figure we want to keep the tower in the middle while generating different backgrounds. We keep the codes in the middle region 🏯, replace the other codes with the MASK value and ask the model to generate multiple new images.
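A rough sketch of this masking step, reusing the illustrative `fill_masked` helper and `MASK_ID` / `GRID` constants from above; `vqgan.encode` and `vqgan.decode` are assumed interfaces that map between images and grids of code indices.

```python
import torch

def regenerate_around(vqgan, denoiser, image, keep_mask):
    """Keep the codes where keep_mask is True and let the model redo the rest."""
    codes = vqgan.encode(image).view(1, -1)                    # (1, 256) code indices
    codes = torch.where(keep_mask.view(1, -1), codes,
                        torch.full_like(codes, MASK_ID))       # mask everything else
    new_codes = fill_masked(denoiser, codes)                   # resample the MASK tokens
    return vqgan.decode(new_codes.view(1, GRID, GRID))

# e.g. keep a central block of codes (roughly "the tower") and vary the surroundings
keep_mask = torch.zeros(GRID, GRID, dtype=torch.bool)
keep_mask[5:11, 5:11] = True
```

Calling `regenerate_around` several times with the same `keep_mask` yields the kind of variations shown in the figure.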
As you can see, the content and structure of the middle region stays fixed for the most part, only adapting slightly to better fit the surrounding pixels. Feels great to tell the model what to do 😎!
Similar to conditional image generation, we can also perform image inpainting 🎨🖌.
Let’s suppose we generate an image and we like most of it, except for one region that feels just not quite right. No problem: we can simply mask out the latent codes in the grid that correspond to this unwanted region and let the diffusion model regenerate them. To show how this works, let’s generate some new mouths for this face! By regenerating these codes multiple times, we can get many variations.
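In code, inpainting is the same masking trick applied to a small region and repeated a few times. The snippet below is a sketch built on the assumed helpers from earlier; `region_mask` is a hypothetical boolean mask marking the latent positions to redo.

```python
import torch

def inpaint_variations(vqgan, denoiser, codes, region_mask, n=9):
    """Resample only the codes inside region_mask, n times, to get n variations."""
    results = []
    for _ in range(n):
        masked = torch.where(region_mask.view(1, -1),
                             torch.full_like(codes, MASK_ID), codes)   # mask the region
        new_codes = fill_masked(denoiser, masked)                      # regenerate it
        results.append(vqgan.decode(new_codes.view(1, GRID, GRID)))
    return results

# region_mask would cover the latent rows/columns that correspond to the mouth
```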
Very cool! We get 9 new images with exactly the same hair, eyes and background, but with completely different mouths 👄.
Now the real fun can start. Due to the grid structure of the VQGAN’s latent space, the codes learned by the VQGAN are highly spatially correlated with the content of the generated images. This means that the latent codes corresponding to the region of the eyes of a generated face will contain information about those eyes.
Okay, but now what? Let’s take the latent codes of the 👀 of image A and the codes of the 👄 of image B. We paste them into an otherwise empty grid, mask out all the other tokens, and have our diffusion model do what it does best.
We see that the diffusion model has nicely filled in the masked regions to create a coherent face while staying true to the original look of the mouth and eyes.
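A sketch of this copy-and-paste trick, again with the assumed helpers from above; `eyes_mask` and `mouth_mask` are hypothetical boolean masks marking which latent positions to copy from which image.

```python
import torch

def mix_regions(vqgan, denoiser, image_a, image_b, eyes_mask, mouth_mask):
    """Paste the eye codes of image_a and the mouth codes of image_b into an empty grid."""
    codes_a = vqgan.encode(image_a).view(1, -1)
    codes_b = vqgan.encode(image_b).view(1, -1)
    codes = torch.full_like(codes_a, MASK_ID)                         # start fully masked
    codes[eyes_mask.view(1, -1)] = codes_a[eyes_mask.view(1, -1)]     # 👀 from A
    codes[mouth_mask.view(1, -1)] = codes_b[mouth_mask.view(1, -1)]   # 👄 from B
    return vqgan.decode(fill_masked(denoiser, codes).view(1, GRID, GRID))
```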
Let’s get creative and apply this idea to a model trained on churches. The pope wants you to build a new church and particularly loves the base of the famous Notre Dame in Paris and the tower of the magnificent Sint-Baafs cathedral in Ghent. No problem: you can just take the codes corresponding to the desired regions and paste them onto an empty latent grid.
All that is left is to ask the diffusion model to fill in the empty regions and decode the result with the VQGAN decoder, and you can easily generate an endless number of new churches that comply with these constraints.
As you can see, the content of the towers and the base stays the same and the model realistically fills in the rest. The pope is very happy with the results and gives you a VIP ticket to skip the line at Saint Peter’s gates. Job well done! If you want some crazier results, you can adjust the sampling temperature of the diffusion model to get some more variation (while trading off some global consistency).
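In the illustrative `fill_masked` helper sketched earlier, adjusting the temperature simply means dividing the logits by a temperature before sampling:

```python
# temperature > 1 flattens the predicted code distributions: more variation,
# at the cost of some global consistency (1.0 is the default behaviour)
wild_codes = fill_masked(denoiser, codes, temperature=1.3)
```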
The last cool application of this model is that it allows us to generate images that are larger than the images the model was trained on. This is accomplished by dividing the latent space of the larger image into multiple overlapping grids that match the original 16x16 shape. At each prediction step we compute the probabilities of new tokens and aggregate them across the different grids.
This trick allows us to generate globally consistent images, even though the model was never trained for it.
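The sketch below shows one hypothetical way to implement this aggregation: for a masked position in a larger (say 32x32) latent grid, average the probabilities predicted by every 16x16 window that covers it before sampling. The stride and helper names are assumptions, not the paper’s implementation.

```python
import torch

def aggregated_probs(denoiser, big_codes, row, col, vocab_size, stride=8):
    """Average the code distributions for position (row, col) over all 16x16
    windows of the larger latent grid that contain it."""
    H, W = big_codes.shape[-2:]                                  # e.g. a 32x32 grid
    probs, count = torch.zeros(vocab_size), 0
    for top in range(0, H - GRID + 1, stride):
        for left in range(0, W - GRID + 1, stride):
            if not (top <= row < top + GRID and left <= col < left + GRID):
                continue                                         # this window misses (row, col)
            window = big_codes[:, top:top + GRID, left:left + GRID].reshape(1, -1)
            pos = (row - top) * GRID + (col - left)
            probs += torch.softmax(denoiser(window)[0, pos], -1)
            count += 1
    return probs / count                                         # aggregated distribution
```

Sampling each masked position from this aggregated distribution keeps neighbouring windows consistent with each other, which is what makes the enlarged images look globally coherent.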
In this blogpost we have discussed how the bidirectional and iterative nature of the recently emerging diffusion models, combined with the discrete representations of VQGANs and the long-range modelling capabilities of transformers allows us to have more control over the latent space. This architecture produces high-quality and consistent images while adding the ability to edit images in a conceptual discrete space.
Keep an eye out, because it’s not the last you’ll see of these diffusion models (in fact, various amazing new papers were released while writing this post, such as DALL-E 2, Imagen, Stable Diffusion…). And don’t forget to check out the technical blogpost to learn what happens behind the scenes! 🤓