OpenAI's DALL·E borrows from GPT-3 and creates high-fidelity images from text

Last year, OpenAI released GPT-3, the largest transformer model to date, with 175 billion parameters. The model demonstrated great prowess in generating text from a given context, and OpenAI licensed it exclusively to Microsoft, which provides the computational backend required to host and run the model for its customers.

Building on this, OpenAI today announced a 12-billion-parameter version of GPT-3. Dubbed DALL·E, the new transformer model borrows heavily from GPT-3 but combines its abilities with those of Image GPT (a model that completes partial images provided as input). As such, DALL·E specializes in generating images from a given caption.

The name DALL·E is a portmanteau of the artist Salvador Dalí and Pixar’s famous WALL·E. The model receives the text and the image as a single stream of data containing up to 1280 tokens. After preprocessing, the model is trained using maximum likelihood to generate all of the tokens sequentially, one after another. Once trained, DALL·E creates images for a variety of sentences that explore the compositional structure of language. Some of those samples are shown below.
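To make the single-stream idea concrete, here is a minimal Python sketch of how a caption and an image might be joined into one token sequence and scored with an autoregressive maximum-likelihood objective. Only the 1280-token limit comes from the description above. The 256/1024 split between text and image positions, the vocabulary sizes, the padding scheme, and all function names are assumptions made for illustration, and the uniform "model" is a stand-in for a trained transformer, not OpenAI's implementation.

```python
import math

MAX_TOKENS = 1280                   # stream length stated in the article
TEXT_LEN = 256                      # assumed number of text positions
IMAGE_LEN = MAX_TOKENS - TEXT_LEN   # assumed number of image positions (1024)
TEXT_VOCAB = 16384                  # illustrative text vocabulary size (assumption)
IMAGE_VOCAB = 8192                  # illustrative image vocabulary size (assumption)
PAD = 0                             # hypothetical padding id for short captions


def build_stream(text_tokens, image_tokens):
    """Join caption tokens and image tokens into one flat sequence."""
    if len(text_tokens) > TEXT_LEN or len(image_tokens) > IMAGE_LEN:
        raise ValueError("input exceeds the assumed text/image budget")
    # Pad the caption so the image always starts at the same position.
    padded_text = list(text_tokens) + [PAD] * (TEXT_LEN - len(text_tokens))
    # Shift image ids so they occupy a range disjoint from the text ids.
    shifted_image = [TEXT_VOCAB + t for t in image_tokens]
    return padded_text + shifted_image


def negative_log_likelihood(stream, next_token_prob):
    """Sum of -log p(token_i | tokens before i) over the whole stream."""
    total = 0.0
    for i, token in enumerate(stream):
        total -= math.log(next_token_prob(stream[:i], token))
    return total


def uniform_model(prefix, token):
    """Placeholder model: ignores the prefix and spreads mass evenly."""
    return 1.0 / (TEXT_VOCAB + IMAGE_VOCAB)


if __name__ == "__main__":
    stream = build_stream([5, 17, 42], [3] * IMAGE_LEN)
    print(len(stream))
    print(negative_log_likelihood(stream, uniform_model))
```

Maximum-likelihood training amounts to adjusting the model so that this summed negative log-likelihood falls across the training set. At generation time the same factorization is used in reverse: the caption tokens are given, and the image tokens are sampled one position at a time.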

As demonstrated by the samples above, DALL·E extracts the essential information from a sentence and translates it into images. OpenAI also provides a handy interface in its blog post that lets readers generate images to their own liking. We tried out several captions, all with promising results. For example, when the input to the model was "a triangular yellow manhole cover in the shape of a triangle", DALL·E produced the following array of images.

The model can also extract temporal as well as geographical information from the provided text. OpenAI adds that the model offers a degree of controllability over the attributes and positions of a small number of objects. To illustrate, for the caption "a capybara made of voxels sitting in a field", the model produced the following results.

And for "a plain white cube looking at its own reflection in a mirror. a plain white cube gazing at itself in the mirror", the model generated the following images.

DALL·E also borrows the zero-shot learning capabilities of GPT-3 and extends them to the visual domain. It can perform several kinds of image-to-image translation tasks when prompted in the right way. When prompted to create "the exact same cat on the top as a sketch on the bottom", the model produced the following images.

How well the model captures the provided information depends on the phrasing of the input caption. With a greater number of objects, however, DALL·E starts confusing the associations between the objects and their colors, and the success rate decreases sharply. In such cases, even rephrasing the caption does not yield better results.

Moving forward, OpenAI plans to provide more details on DALL·E's training and architecture in an upcoming paper. Even so, the firm's work shows that transformer models can be worthy competitors to Generative Adversarial Networks (GANs) in the visual domain. "We’ve found that [DALL·E] has a diverse set of capabilities, including creating anthropomorphized versions of animals and objects, combining unrelated concepts in plausible ways, rendering text, and applying transformations to existing images," the firm wrote in its blog post.