At Kano, we can’t get enough of how cool this is. DALL·E isn’t just another creative tool: it generates original images directly from written descriptions.
Created by Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, and Scott Gray, DALL·E (named using a portmanteau of the artist Salvador Dalí and Pixar’s WALL·E) is a 12-billion-parameter version of GPT-3, trained to generate images from text descriptions using a dataset of text-image pairs.
It is capable of remarkable creations, including anthropomorphized versions of animals and objects, plausible combinations of unrelated concepts, rendered text, and transformations applied to existing images.
Overview
Like GPT-3, DALL·E is a transformer language model. It receives both the text and the image as a single stream of data containing up to 1280 tokens and is trained using maximum likelihood to generate all of the tokens, one after another.
A token is any symbol from a discrete vocabulary; for humans, each English letter is a token from a 26-letter alphabet. DALL·E’s vocabulary has tokens for both text and image concepts. Specifically, each image caption is represented using a maximum of 256 BPE-encoded tokens with a vocabulary size of 16384, and the image is represented using 1024 tokens with a vocabulary size of 8192.
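Concretely, the caption and image tokens are just concatenated into one sequence, and the model learns to predict each token from the ones before it. Here’s a minimal PyTorch sketch of that idea; the padding scheme, vocabulary offset, and `model` are hypothetical stand-ins for illustration, not OpenAI’s actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB = 16384    # BPE vocabulary for captions
IMAGE_VOCAB = 8192    # discrete-VAE codebook size
TEXT_LEN = 256        # maximum caption tokens
IMAGE_LEN = 1024      # 32x32 grid of image tokens
SEQ_LEN = TEXT_LEN + IMAGE_LEN   # 1280 tokens total

def build_stream(text_tokens: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
    """Concatenate caption and image tokens into one 1280-token stream.

    Image ids are offset by TEXT_VOCAB so both modalities can share a
    single embedding table (an assumption made for this sketch).
    """
    padded = F.pad(text_tokens, (0, TEXT_LEN - text_tokens.numel()))  # pad caption with id 0
    return torch.cat([padded, image_tokens + TEXT_VOCAB])

def train_step(model: nn.Module, stream: torch.Tensor) -> torch.Tensor:
    """One maximum-likelihood step: predict every token from those before it."""
    logits = model(stream[:-1].unsqueeze(0))   # (1, 1279, TEXT_VOCAB + IMAGE_VOCAB)
    targets = stream[1:].unsqueeze(0)          # next-token targets
    return F.cross_entropy(logits.transpose(1, 2), targets)
```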
The images are preprocessed to 256x256 resolution during training. Similar to VQ-VAE, each image is compressed to a 32x32 grid of discrete latent codes using a discrete VAE that OpenAI pre-trained using a continuous relaxation. They found that training with the relaxation obviates the need for an explicit codebook, EMA loss, or tricks like dead-code revival, and can scale up to large vocabulary sizes.
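The continuous relaxation mentioned above can be implemented with the Gumbel-softmax trick, which makes the discrete code assignment differentiable. Below is a toy PyTorch sketch of that idea; the layer sizes and architecture are simplified assumptions of ours, not DALL·E’s actual dVAE:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDiscreteVAE(nn.Module):
    """Toy dVAE: compress a 256x256 RGB image to a 32x32 grid of
    8192-way discrete codes via the Gumbel-softmax relaxation."""

    def __init__(self, vocab=8192, dim=64):
        super().__init__()
        # Encoder downsamples 256x256 -> 32x32 and emits codebook logits.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, dim, 4, stride=2, padding=1),      # 128x128
            nn.ReLU(),
            nn.Conv2d(dim, dim, 4, stride=2, padding=1),    # 64x64
            nn.ReLU(),
            nn.Conv2d(dim, vocab, 4, stride=2, padding=1),  # 32x32 logits
        )
        self.codebook = nn.Embedding(vocab, dim)
        # Decoder upsamples the code grid back toward pixels.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(dim, dim, 4, stride=2, padding=1),  # 64x64
            nn.ReLU(),
            nn.ConvTranspose2d(dim, dim, 4, stride=2, padding=1),  # 128x128
            nn.ReLU(),
            nn.ConvTranspose2d(dim, 3, 4, stride=2, padding=1),    # 256x256
        )

    def forward(self, images, tau=1.0):
        logits = self.encoder(images)                    # (B, 8192, 32, 32)
        # Differentiable "soft" code assignment: no explicit codebook
        # loss, EMA updates, or dead-code revival needed.
        soft_codes = F.gumbel_softmax(logits, tau=tau, dim=1)
        # Mix codebook vectors by the soft assignment: (B, dim, 32, 32).
        z = torch.einsum('bvhw,vd->bdhw', soft_codes, self.codebook.weight)
        return self.decoder(z)
```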
This training procedure allows DALL·E to not only generate an image from scratch but also to regenerate any rectangular region of an existing image that extends to the bottom-right corner, in a way that is consistent with the text prompt.
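Because the model is purely autoregressive over the token stream, that kind of completion amounts to keeping a prefix of the image tokens and sampling the rest, one token at a time. Here’s a rough sketch; it simplifies the completed region to a suffix of the raster-ordered stream, and `model` is a hypothetical transformer returning next-token logits:

```python
import torch

@torch.no_grad()
def complete_image(model, text_tokens, known_image_tokens, total_image_len=1024):
    """Regenerate the tail of an image token grid, conditioned on the caption.

    The kept prefix corresponds to the untouched part of the image; the
    vocabulary offset between text and image tokens is elided here.
    """
    stream = torch.cat([text_tokens, known_image_tokens])
    while stream.numel() < text_tokens.numel() + total_image_len:
        logits = model(stream.unsqueeze(0))[0, -1]            # next-token logits
        next_tok = torch.multinomial(torch.softmax(logits, dim=-1), 1)
        stream = torch.cat([stream, next_tok])
    return stream[text_tokens.numel():]                       # completed 1024 image tokens
```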
What Can It Do?
DALL·E is able to create plausible images for a great variety of sentences that explore the compositional structure of language.
Don’t just take it from us: the interactive examples on OpenAI’s blog show the range of images its AI can generate.
GPT-3 showed that language can be used to instruct a large neural network to perform a variety of text generation tasks. Image GPT showed that the same type of neural network can also be used to generate images with high fidelity.
DALL·E allows us to create images using language alone.
In the future, the team behind DALL·E plans to analyze how models like it relate to societal issues such as economic impact on certain work processes and professions, the potential for bias in the model outputs, and the longer-term ethical challenges this technology implies.
Authors
Aditya Ramesh was the project lead: he developed the approach, trained the models, and wrote most of the blog copy.
Aditya Ramesh, Mikhail Pavlov, and Scott Gray worked together to scale up the model to 12 billion parameters and designed the infrastructure used to draw samples from the model.
Aditya Ramesh, Gabriel Goh, and Justin Jay Wang worked together to create the interactive visuals for the blog.
Mark Chen and Aditya Ramesh created the images for Raven’s Progressive Matrices.
Rewon Child and Vedant Misra assisted in writing the blog.
Pamela Mishkin, Gretchen Krueger, and Sandhini Agarwal advised on the broader impacts of the work and assisted in writing the blog.
Ilya Sutskever oversaw the project and assisted in writing the blog.