AI systems such as DALL·E or its open source counterpart Stable Diffusion are captivating the masses with their stunning, maybe even eerie, capability of translating text into images. A text prompt like “a potato wearing a wig” can be turned into a convincing image within seconds, even when it combines distinct, seemingly unrelated concepts like ‘wig’ and ‘potato’. Smooth animations between these often surreal-looking AI-generated images are also possible. But how?
Fundamental to the text-to-image implementation used in this project is a neural network called CLIP. Neural networks are computational architectures that can extract correlations, and thereby knowledge, from large datasets when exposed to them. The process of exposing a neural network to data is generally referred to as ‘training’. Training a neural network is an iterative process that gradually changes how it behaves when confronted with new data. As CLIP is a multimodal neural network, it was trained on data of two different modalities: images and text. During training, correlations between and among images and texts are extracted and stored in the form of encoded representations, along with the recipe to decode them. These representations are called ‘embeddings’.
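To make the idea of an embedding concrete, here is a minimal sketch using OpenAI’s open source ‘clip’ package (an assumption; the project’s own implementation may differ). It encodes a caption and an image into vectors of the same dimensionality; the file name ‘photo.jpg’ and the model variant ‘ViT-B/32’ are placeholders.

```python
import torch
import clip
from PIL import Image

device = "cpu"  # "cuda" would also work if a GPU is available
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a potato wearing a wig"]).to(device)

with torch.no_grad():
    image_embedding = model.encode_image(image)  # shape: (1, 512)
    text_embedding = model.encode_text(text)     # shape: (1, 512)

# Both modalities now live in the same 512-dimensional space and can be
# compared directly, for example with cosine similarity.
print(torch.cosine_similarity(image_embedding, text_embedding).item())
```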
Specifically, for the text-to-image implementation used in this project, CLIP was trained on a massive dataset derived from LAION-5B. This dataset contains roughly 5 billion pairs of images and their individual captions that were programmatically copied from the internet in a process called ‘web scraping’. This ‘snapshot of the internet’ was gradually condensed into embeddings that are structured in a hyperdimensional space where similar data are clustered closer together: semantic and visual similarities are encoded as spatial proximity. This space is also called the latent space. This learned internal structure enables CLIP to judge how well a textual description fits an image, and thus to ‘infer’ a fitting description when confronted with one. Yet CLIP by itself cannot generate an image from a text prompt.
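A small illustration of this spatial proximity, again as a hedged sketch with the ‘clip’ package and a hypothetical ‘photo.jpg’: the candidate caption whose embedding lies closest to the image embedding is treated as the best description.

```python
import torch
import clip
from PIL import Image

device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
captions = ["a potato wearing a wig", "a dog on a skateboard", "a city at night"]
tokens = clip.tokenize(captions).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, tokens)  # similarity score per caption
    probs = logits_per_image.softmax(dim=-1)    # closeness expressed as probabilities

for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.2f}  {caption}")
```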
Here we need the other part of the process: a generative model, in this case an image diffusion model. This is a neural network trained on images and on derivatives of them that were gradually dissolved by adding random noise. In this way, the neural network ‘learnt’ how to restore digital images that are dissolved, partially obscured or otherwise carry some sort of data loss: it can reconstruct missing data. Combining CLIP with a generative neural network that has such reconstructive abilities is very powerful: CLIP allows the reconstruction to be guided, or steered, by a text prompt, so that an image is gradually reconstructed and modified until CLIP deems it a good match for the prompt. This process is also called CLIP-guided diffusion.
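The sketch below illustrates only the steering signal behind CLIP-guided diffusion, again assuming the ‘clip’ package: the gradient of CLIP’s image-text similarity with respect to a noisy image (here just a random tensor as a stand-in) points in the direction that makes the image look more like the prompt. A real sampler applies such a gradient at every denoising step of a diffusion model, which is omitted here.

```python
import torch
import clip

device = "cpu"
model, _ = clip.load("ViT-B/32", device=device)

text = clip.tokenize(["a potato wearing a wig"]).to(device)
with torch.no_grad():
    text_emb = model.encode_text(text)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Stand-in for a partially denoised image coming out of a diffusion model.
noisy_image = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)

image_emb = model.encode_image(noisy_image)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
similarity = (image_emb * text_emb).sum()  # higher = closer to the prompt
similarity.backward()

# Nudge the image in the direction that increases CLIP's similarity score;
# the step size 0.1 is arbitrary for this illustration.
guided_image = (noisy_image + 0.1 * noisy_image.grad).detach()
```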