
An image generated by OpenAI's DALL-E model, from the prompt "an illustration of a baby daikon radish in a tutu walking a dog." Credit: OpenAI
The machine learning company OpenAI is developing models that improve computer vision and can produce original images from a text prompt.
Why it matters: The new models are the latest steps in ongoing efforts to create machine learning systems that exhibit elements of general intelligence, while performing tasks that are actually useful in the real world — without breaking the bank on computing power.
What's happening: OpenAI today is announcing two new systems that attempt to do for images what its landmark GPT-3 model did last year for text generation.
- DALL-E is a neural network that can "take any text and make an image out of it," says Ilya Sutskever, OpenAI co-founder and chief scientist. That includes concepts it would never have encountered in training, like the drawing of an anthropomorphic daikon radish walking a dog shown above.
- Flashback: DALL-E operates somewhat similarly to GPT-3, the huge transformer model that can generate original passages of text based on a short prompt (a toy sketch of that shared text-and-image approach follows below).
- CLIP, the other new neural network, "can take any set of visual categories and instantly create very strong and reliable visually classifiable text descriptions," says Sutskever, improving on existing computer vision techniques while requiring less training and less expensive computational power.
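To make the GPT-3 comparison concrete: a model of this kind can treat an image as a sequence of discrete tokens appended to the text tokens, so generating a picture becomes ordinary next-token prediction. The toy PyTorch sketch below illustrates that idea only; the class name, vocabulary sizes, and grid dimensions are assumptions for illustration, not OpenAI's actual architecture.

```python
import torch
import torch.nn as nn

# Illustrative sizes, not OpenAI's: text and image tokens share one stream.
TEXT_VOCAB, IMAGE_VOCAB = 16384, 8192
TEXT_LEN, IMAGE_LEN = 32, 8 * 8          # toy prompt length and an 8x8 image grid

class ToyTextToImage(nn.Module):
    """A toy decoder-only transformer over a joint text+image token sequence."""
    def __init__(self, d_model=256, n_head=4, n_layer=2):
        super().__init__()
        self.embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, d_model)
        self.pos = nn.Embedding(TEXT_LEN + IMAGE_LEN, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_head, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layer)
        self.head = nn.Linear(d_model, TEXT_VOCAB + IMAGE_VOCAB)

    def forward(self, tokens):
        n = tokens.size(1)
        x = self.embed(tokens) + self.pos(torch.arange(n, device=tokens.device))
        # Causal mask: each position attends only to earlier positions.
        mask = torch.full((n, n), float("-inf"), device=tokens.device).triu(1)
        return self.head(self.blocks(x, mask=mask))

@torch.no_grad()
def sample_image_tokens(model, text_tokens):
    """Append IMAGE_LEN image tokens after the prompt, one prediction at a time."""
    tokens = text_tokens
    for _ in range(IMAGE_LEN):
        logits = model(tokens)[:, -1, :]
        logits[:, :TEXT_VOCAB] = float("-inf")   # only image tokens may follow
        next_tok = torch.multinomial(logits.softmax(dim=-1), 1)
        tokens = torch.cat([tokens, next_tok], dim=1)
    # A separate learned decoder would turn these discrete codes into pixels.
    return tokens[:, -IMAGE_LEN:]

prompt = torch.randint(0, TEXT_VOCAB, (1, TEXT_LEN))   # stand-in for a tokenized prompt
image_codes = sample_image_tokens(ToyTextToImage(), prompt)
print(image_codes.shape)                               # torch.Size([1, 64])
```

The model here is untrained, so the sampled codes are random; the point is the shape of the pipeline, in which text and image tokens flow through one transformer and the image emerges token by token.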
What they're saying: "Last year, we were able to make substantial progress on text with GPT-3, but the thing is that the world isn't just built on text," says Sutskever. "This is a step towards the grander goal of building a neural network that can work in both images and text."
How it works: DALL-E — a name OpenAI picked as a portmanteau of the surrealist artist Salvador Dalí and the fatally cute Pixar robot WALL-E — is the model that jumps out because it aims to fulfill the Star Trek dream of simply being able to tell a computer, using regular language, what to create.
- For example: Enter the prompt "a can of soup that has the word 'skynet' on it" and you'll get images like the one below.

- "It can take unrelated concepts that are nothing alike and put them together into a functional object," says Aditya Ramesh, the leader of the DALL-E team.
- CLIP can identify what an image depicts with comparatively little training, labeling pictures it encounters by matching them against candidate text descriptions (see the sketch below this list).
- The model's real advantage is its efficiency, which is becoming a bigger issue in the field as the computational cost of training machine learning models only grows.
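What that efficiency enables in practice is zero-shot classification: you hand the model candidate labels at inference time and it scores an image against them, with no task-specific training. Here's a brief sketch using the open-source `clip` package from github.com/openai/CLIP; the image path and the label list are placeholders, not OpenAI's examples.

```python
import torch
import clip                    # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Candidate categories chosen at inference time -- no task-specific training.
labels = ["a daikon radish", "a dog", "a can of soup"]
text = clip.tokenize([f"a photo of {label}" for label in labels]).to(device)

# Placeholder path: any local image works.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).squeeze(0)

for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.1%}")
```

Each label is wrapped in an "a photo of ..." template because CLIP learns from natural-language captions rather than bare class names, so full phrases tend to score more reliably.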
Yes, but: Like GPT-3, the new models are far from perfect. DALL-E, in particular, depends heavily on exactly how the text prompt is phrased to generate a coherent image.
The bottom line: Artificial general intelligence may be getting closer, one doodle at a time.