We tested ChatGPT's new support for images and voice search

Ina Fried

ChatGPT is moving beyond text — letting some users include images as part of a query and give spoken rather than typed instructions to the chatbot, which can also speak its answers.

Why it matters: The new features give the chatbot more utility right now and point to a future where AI tools understand the world around them, not just the online data on which they've been trained.

The big picture: OpenAI said in the spring when it released GPT-4 that such "multimodal" support was part of the new model. But it's taken until now for the company to ensure that the new input methods didn't make it easier to bypass the company's safety policies.

How it works: The new features will be available to ChatGPT Plus and enterprise customers over the next two weeks. Once the new features are enabled in the ChatGPT app, users can add an image to any text prompt, either by taking a new picture or uploading from a photo library.

To use audio in ChatGPT, you press a headphone button to speak a prompt and hear a response.

Our thought bubble: Just switching from typing to voice and hearing a response isn't super interesting in today's ChatGPT app and website, but could be a much bigger deal if, say, ChatGPT were built into a speaker or car system.

For my testing, I tried to think of a few cases where having a photo would help with a query.

Tight squeeze: A friend had posted a photo on Facebook wondering if all her storage boxes would fit in a 2001 Honda Odyssey. I know just who to ask, I told her, downloading her photo and then including it in a query to ChatGPT. The chatbot estimated the size of her contents and looked up the cargo space in the Honda minivan and reported that her boxes should all fit, though it could be a tight squeeze.
What to cook: ChatGPT offered only generic advice when I asked what I could make with the contents of our refrigerator. That's one of the touted uses for the new feature, though our fridge is admittedly packed, and if I can't readily see what's hiding in there, neither can ChatGPT, which has to work off a single image.
Math help: I took a page of fifth-grade multiplication problems and asked ChatGPT to solve them. The chatbot had no problem returning the answers. While OpenAI notes that ChatGPT can help break down complex math questions or provide strategies, my testing showed it was also willing to just offer up the answers.

What's next: These new features are part of a pathway to allowing ChatGPT and other AI models to start incorporating more robust multimodal support.

OpenAI said last week that ChatGPT will also be able to generate images soon, thanks to integration with DALL-E 3.
OpenAI is also working directly with a handful of customers, such as Spotify, which plans to use the text-to-speech capabilities to allow podcasters to translate their work into other languages in their own voice.

Yes, but: OpenAI's safeguards are about to face the challenge of millions of users trying to break them. We'll see how well they prevent misuse, such as creating explicit content or using an image to get the tool to answer a question it would refuse if asked in text.

The bottom line: The additions of voice and image search are nice-to-have features in today's chatbot, but critical to a more capable future in which AI systems understand not just the abstract world, but also our immediate surroundings.
