At the highest level, DALL-E 2 works very simply:
First, a text prompt is input into a text encoder that is trained to map the prompt to a representation space. Next, a model called the prior maps the text encoding to a corresponding image encoding that captures the semantic information of the prompt contained in the text encoding. Finally, an image decoding model stochastically generates an image which is a visual manifestation of this semantic information.
The link between textual semantics and their visual representations in DALL-E 2 is learned by another OpenAI model called CLIP (Contrastive Language-Image Pre-training).
CLIP
CLIP is trained on hundreds of millions of images and their associated captions, learning how much a given text snippet relates to an image. That is, rather than trying to predict a caption given an image, CLIP instead just learns how related any given caption is to an image. This contrastive, rather than predictive, objective allows CLIP to learn the link between textual and visual representations of the same abstract object. The entire DALL-E 2 model hinges on CLIP’s ability to learn semantics from natural language, so let’s take a look at how CLIP is trained to understand its inner workings.
CLIP Training
The fundamental principles of training CLIP are quite simple:
- First, all images and their associated captions are passed through their respective encoders, mapping all objects into an m-dimensional space.
- Then, the cosine similarity of each (image, text) pair is computed.
- The training objective is to simultaneously maximize the cosine similarity between the \(N\) correct encoded image/caption pairs and minimize the cosine similarity between the \(N^2 - N\) incorrect encoded image/caption pairs (see the sketch below).
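To make this concrete, here is a minimal PyTorch sketch of such a contrastive objective; the batch size, embedding dimension, and temperature below are illustrative stand-ins rather than CLIP's actual training configuration.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # image_features, text_features: (N, m) outputs of the image/text encoders
    # L2-normalize so that dot products become cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (N, N) matrix of cosine similarities, scaled by a temperature
    logits = image_features @ text_features.t() / temperature

    # The N matching pairs sit on the diagonal; the N^2 - N off-diagonal
    # entries are the incorrect pairings to be pushed apart.
    targets = torch.arange(logits.size(0))
    loss_images = F.cross_entropy(logits, targets)      # image -> text direction
    loss_texts = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_images + loss_texts) / 2

# Toy usage with random stand-ins for encoder outputs (N=8 pairs, m=512)
imgs, caps = torch.randn(8, 512), torch.randn(8, 512)
print(clip_contrastive_loss(imgs, caps))
```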
CLIP’s contrastive objective allows it to understand semantic information in a way that convolutional models, which learn only feature maps, cannot. This disparity is easy to observe by comparing how CLIP, used in a zero-shot manner, performs across datasets relative to an ImageNet-trained ResNet-101. The comparison between ImageNet and ImageNet Sketch is particularly revealing.
CLIP and an ImageNet-trained ResNet-101 perform with similar accuracy on ImageNet, but CLIP outperforms the ResNet-101 significantly on ImageNet Sketch. This is true despite CLIP being used in a zero-shot manner and not using any of the 1.3 million ImageNet images for training.
This result is significant because it shows that CLIP learns the semantic link between text descriptions of objects and their corresponding visual manifestations. Rather than relying on specific details of image instances, like the yellow color of bananas, to identify them as a convolutional ResNet might, CLIP learns the semantic “Platonic ideal” of what a banana “is”, allowing it to better identify sketches of bananas. Understanding the fact that textual descriptions and visual features can map to the same “Platonic ideal” is crucial for text-conditional image generation, and this is why CLIP is so important to the DALL-E 2 paradigm.
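As a concrete illustration of this zero-shot behavior, the sketch below classifies an image by comparing its encoding against encoded captions built from class names. It uses OpenAI's open-source `clip` package (https://github.com/openai/CLIP); the image path and candidate labels are placeholders.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image path; any sketch or photo works here.
image = preprocess(Image.open("banana_sketch.png")).unsqueeze(0).to(device)

# Each class label becomes a caption; no ImageNet training images are used.
labels = ["a banana", "a lemon", "a school bus"]
text = clip.tokenize([f"a photo of {label}" for label in labels]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)   # scaled cosine similarities
    probs = logits_per_image.softmax(dim=-1)

print(labels[probs.argmax().item()])           # predicted class
```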
Step 2 - Generating Images from Visual Semantics
After training, the CLIP model is frozen and DALL-E 2 moves on to its next task - learning to reverse the image encoding mapping that CLIP just learned. CLIP learns a representation space in which it is easy to determine the relatedness of textual and visual encodings, but our interest is in image generation. We must therefore learn how to exploit the representation space to accomplish this task.
In particular, OpenAI employs a modified version of another one of its previous models, GLIDE, to perform this image generation. The GLIDE model learns to invert the image encoding process in order to stochastically decode CLIP image embeddings.
As depicted in the image above, it should be noted that the goal is not to build an autoencoder that exactly reconstructs an image given its embedding, but instead to generate an image which maintains the salient features of the original image given its embedding. In order to perform this image generation, GLIDE uses a Diffusion Model.
What is a Diffusion Model?
Diffusion Models are a thermodynamics-inspired invention that have significantly grown in popularity in recent years. Diffusion Models learn to generate data by reversing a gradual noising process. Depicted in the figure below, the noising process is viewed as a parameterized Markov chain that gradually adds noise to an image to corrupt it, eventually (asymptotically) resulting in pure Gaussian noise. The Diffusion Model learns to navigate backwards along this chain, gradually removing the noise over a series of timesteps to reverse this process.
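As a rough illustration, the snippet below samples from the noising process using its standard closed-form expression; the linear variance schedule and image shape are assumptions made for the sake of the example, not the settings used by DALL-E 2.

```python
import torch

# Illustrative linear variance schedule over T timesteps (values are assumptions).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def noise_image(x0, t):
    """Sample x_t ~ q(x_t | x_0): gradually corrupt a clean image x0 toward Gaussian noise."""
    eps = torch.randn_like(x0)
    a_bar = alphas_cumprod[t]
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    return x_t, eps   # the model is trained to predict eps from (x_t, t)

x0 = torch.randn(1, 3, 64, 64)        # stand-in for a training image
x_mid, _ = noise_image(x0, t=500)     # partially corrupted
x_end, _ = noise_image(x0, t=T - 1)   # essentially pure Gaussian noise
```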
If the Diffusion Model is then “cut in half” after training, it can be used to generate an image by randomly sampling Gaussian noise and then de-noising it to generate a photorealistic image. Some may recognize that this technique is highly reminiscent of generating data with Autoencoders, and Diffusion Models and Autoencoders are, in fact, related.
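Here is a minimal sketch of that reverse direction, following the standard DDPM ancestral-sampling rule rather than DALL-E 2's exact sampler; `model` is a placeholder for a trained noise-prediction network.

```python
import torch

@torch.no_grad()
def sample(model, shape, betas):
    """Start from Gaussian noise and denoise step by step back along the chain."""
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                      # pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps = model(x, t)                       # predicted noise at step t
        mean = (x - betas[t] / (1 - alphas_cumprod[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)  # add back a little noise
        else:
            x = mean                            # final step is deterministic
    return x

# Toy usage: an untrained stand-in "denoiser" just to show the call pattern.
dummy_model = lambda x, t: torch.zeros_like(x)
img = sample(dummy_model, shape=(1, 3, 64, 64), betas=torch.linspace(1e-4, 0.02, 1000))
```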
GLIDE Training
While GLIDE was not the first Diffusion Model, its important contribution was in modifying Diffusion Models to allow for text-conditional image generation. In particular, one will notice that Diffusion Models start from randomly sampled Gaussian noise, and it is at first unclear how to tailor this process to generate specific images. If a Diffusion Model is trained on a human face dataset, it will reliably generate photorealistic images of human faces; but what if someone wants to generate a face with a specific feature, like brown eyes or blonde hair?
GLIDE extends the core concept of Diffusion Models by augmenting the training process with additional textual information, ultimately resulting in text-conditional image generation.
DALL-E 2 uses a modified GLIDE model that incorporates projected CLIP text embeddings in two ways. The first way is by adding the CLIP text embeddings to GLIDE’s existing timestep embedding, and the second way is by creating four extra tokens of context, which are concatenated to the output sequence of the GLIDE text encoder.
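A sketch of just this conditioning pathway is shown below; the layer names and dimensions are illustrative assumptions, and the real decoder is a full diffusion network rather than this toy module.

```python
import torch
import torch.nn as nn

class CLIPConditioning(nn.Module):
    """Sketch of the two ways a CLIP embedding is injected into the decoder
    (dimensions are illustrative, not the real model's)."""
    def __init__(self, clip_dim=768, time_dim=512, ctx_dim=2048, n_extra_tokens=4):
        super().__init__()
        self.to_time = nn.Linear(clip_dim, time_dim)                    # way 1: add to timestep embedding
        self.to_tokens = nn.Linear(clip_dim, n_extra_tokens * ctx_dim)  # way 2: four extra context tokens
        self.n_extra_tokens, self.ctx_dim = n_extra_tokens, ctx_dim

    def forward(self, clip_emb, timestep_emb, text_encoder_out):
        # clip_emb: (B, clip_dim), timestep_emb: (B, time_dim),
        # text_encoder_out: (B, L, ctx_dim) from GLIDE's text encoder
        timestep_emb = timestep_emb + self.to_time(clip_emb)
        extra = self.to_tokens(clip_emb).view(-1, self.n_extra_tokens, self.ctx_dim)
        context = torch.cat([text_encoder_out, extra], dim=1)           # (B, L + 4, ctx_dim)
        return timestep_emb, context

cond = CLIPConditioning()
t_emb, ctx = cond(torch.randn(2, 768), torch.randn(2, 512), torch.randn(2, 16, 2048))
```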
Significance of GLIDE to DALL-E 2
GLIDE is important to DALL-E 2 because it allowed the authors to easily port over GLIDE’s text-conditional photorealistic image generation capabilities to DALL-E 2 by instead conditioning on image encodings in the representation space. Therefore, DALL-E 2’s modified GLIDE learns to generate semantically consistent images conditioned on CLIP image encodings. It is also important to note that the reverse-Diffusion process is stochastic, and therefore variations can easily be generated by inputting the same image encoding vectors through the modified GLIDE model multiple times.
Step 3 - Mapping from Textual Semantics to Corresponding Visual Semantics
While the modified-GLIDE model successfully generates images that reflect the semantics captured by image encodings, how do we actually go about finding these encoded representations? In other words, how do we inject the text conditioning information from our prompt into the image generation process?
Recall that, in addition to our image encoder, CLIP also learns a text encoder. DALL-E 2 uses another model, which the authors call the prior, in order to map from the text encodings of image captions to the image encodings of their corresponding images. The DALL-E 2 authors experiment with both Autoregressive Models and Diffusion Models for the prior, but ultimately find that they yield comparable performance. Given that the Diffusion Model is much more computationally efficient, it is selected as the prior for DALL-E 2.
Prior Training
The Diffusion Prior in DALL-E 2 consists of a decoder-only Transformer. It operates, with a causal attention mask, on an ordered sequence of the following (see the sketch after this list):
- The tokenized text/caption.
- The CLIP text encodings of these tokens.
- An encoding for the diffusion timestep.
- The noised CLIP image encoding.
- A final encoding whose output from the Transformer is used to predict the unnoised CLIP image encoding.
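The snippet below sketches how such an ordered sequence might be assembled before it is fed to the causal Transformer; all dimensions are illustrative, and the learned positional and projection details of the real prior are omitted.

```python
import torch

def build_prior_sequence(text_token_embs, clip_text_enc, timestep_emb,
                         noised_clip_image_enc, final_emb):
    """Concatenate the prior's inputs into one ordered sequence along the token axis.
    Shapes (illustrative): text_token_embs (B, L, d); all other arguments (B, d)."""
    singles = [clip_text_enc, timestep_emb, noised_clip_image_enc, final_emb]
    seq = torch.cat([text_token_embs] + [s.unsqueeze(1) for s in singles], dim=1)
    # (B, L + 4, d); a causal Transformer reads this left to right, and the
    # output at the final position predicts the unnoised CLIP image encoding.
    return seq

B, L, d = 2, 16, 768
seq = build_prior_sequence(torch.randn(B, L, d), torch.randn(B, d),
                           torch.randn(B, d), torch.randn(B, d), torch.randn(B, d))
print(seq.shape)  # torch.Size([2, 20, 768])
```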
Step 4 - Putting It All Together
At this point, we have all of DALL-E 2’s functional components and need only to chain them together for text-conditional image generation (sketched in code after the list):
- First, the CLIP text encoder maps the image description into the representation space.
- Then, the diffusion prior maps from the CLIP text encoding to a corresponding CLIP image encoding.
- Finally, the modified-GLIDE generation model maps from the representation space into the image space via reverse-Diffusion, generating one of many possible images that conveys the semantic information within the input caption.
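Putting the pieces together, a high-level sketch of the whole chain looks roughly like the following; every model argument is a placeholder for the corresponding trained component, and the toy lambdas only illustrate the call pattern. Because the decoder is stochastic, the loop also shows how variations arise from a single image encoding.

```python
import torch

def dalle2_generate(prompt, clip_text_encoder, diffusion_prior, glide_decoder, n_variations=4):
    """End-to-end sketch: prompt -> CLIP text encoding -> CLIP image encoding -> images.
    All three model arguments are placeholders for trained components."""
    z_text = clip_text_encoder(prompt)       # step 1: text -> representation space
    z_image = diffusion_prior(z_text)        # step 2: text encoding -> image encoding
    # step 3: reverse diffusion is stochastic, so repeated calls with the same
    # image encoding produce different plausible images (variations).
    return [glide_decoder(z_image) for _ in range(n_variations)]

# Toy stand-ins, just to show the call pattern and shapes.
d = 768
images = dalle2_generate(
    "a corgi playing a flame-throwing trumpet",
    clip_text_encoder=lambda prompt: torch.randn(1, d),
    diffusion_prior=lambda z: torch.randn(1, d),
    glide_decoder=lambda z: torch.randn(1, 3, 64, 64),
)
print(len(images), images[0].shape)
```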
Summary
In this article, we covered how the world’s premier text-conditioned image generation model works under the hood. DALL-E 2 can generate semantically plausible photorealistic images given a text prompt, can produce images with specific artistic styles, can produce variations of the same salient features represented in different ways, and can modify existing images.
While there is a lot of discussion to be had about DALL-E 2 and its importance to both Deep Learning and the world at large, we draw your attention to three key takeaways from its development.
First, DALL-E 2 demonstrates the power of Diffusion Models in Deep Learning, with both the prior and image generation sub-models in DALL-E 2 being Diffusion-based. While only rising to popular use in the past few years, Diffusion Models have already proven their worth, and those tuned-in to Deep Learning research should expect to see more of them in the future.
The second point is to highlight both the need for and the power of using natural language as a means to train State-of-the-Art Deep Learning models. This point does not originate with DALL-E 2 (in particular, CLIP demonstrated it previously), but it is nevertheless important to appreciate that the power of DALL-E 2 stems ultimately from the absolutely massive amount of paired natural language/image data that is available on the internet. Using such data not only removes the developmental bottleneck associated with the laborious and painstaking process of manually labelling datasets, but its noisy, uncurated nature also better reflects the real-world data that Deep Learning models must be robust to.
Finally, DALL-E 2 reaffirms the position of Transformers as supreme for models trained on web-scale datasets given their impressive parallelizability.