ChatGPT can see, hear, and speak now — here's how to make the best of it

OpenAI just opened up a bunch of new ways to interact with ChatGPT. Now, the AI chatbot can not only read what you write, but see the pictures you upload and hear the words you say—and even speak back.

In my view, this isn’t just about new features; it's a glimpse into the future of AI. We're finally getting access to what are called multimodal systems, systems that blend different types of understanding together. It's like having a conversation with a friend who doesn't just listen but also observes and interprets what they see—so if you pull out your phone to share a picture of your cat, they can see how cute Sparky is too. 

The tools aren't fully multimodal yet. For now, they're more like translators between different data forms. Even so, these capabilities are the first step toward a whole new slate of systems that will mesh various kinds of understanding. It's worth getting used to thinking multimodally and mastering this different way of interacting with AI now, so you'll be ready when even more complex multimodal systems arrive.

Screenshot of ChatGPT mobile chat window with new multimodal buttons highlighted
Red = "See," or add image; Blue = "Hear," or record prompt; Green = "Speak," or start audio conversation

Visual ChatGPT: How to make ChatGPT “see”

With ChatGPT's new image analysis features, instead of trying to find the perfect words to explain images to ChatGPT, now you can just show them. 

If you've got a picture already, upload it to the prompt textbox using the paper clip icon (desktop) or the plus sign (mobile). Alternatively, if you have it copied to the clipboard, you can paste it right in. 

Taking a live pic with your phone works too. When adding an image live, you can circle parts of it to guide ChatGPT's focus. For instance, you might circle something mysterious and ask "what's this?" You're not limited to one image; upload a bunch if you need to. 

This feature has some of the most exciting possibilities. Don't know what something is?  Ask ChatGPT to identify it. Need to describe an image in words? It can give you a pretty good starting point. Wrestling with a confusing graph?  ChatGPT can act as your personal data analyst.

Screenshot of ChatGPT describing a golden statue of Joan of Arc

But my favorite feature? Thematic analysis. Basically, it can analyze an image to see if it would fit with a theme. It’s like having a second set of eyes with a flair for design. 

Use it to choose among candidate images for a social media post, a thumbnail, or a webpage. My testing showed a pretty good understanding of both the content and the tone of the image. 

But it's even more powerful when you set it up to simulate a certain persona. Then you can ask it which image resonates the most with your target audience or ideal client.

You're not alone if you think this sounds complicated. This function does take a bit of time to get the hang of. We’ll be writing about these features in detail soon.

🤖 Tip: ChatGPT can also read text and math formulas from an image. 

Voice control for ChatGPT: How to make ChatGPT “hear”

The "Hear" feature lets you use voice input for your prompts rather than using the keyboard—that is, you can now talk to ChatGPT. You'll need your phone ready, though: the "hear" feature is only available on iOS and Android for the moment. To use this feature, hit the audio icon to the right of the prompt box. 

At the moment, the voice recognition feature works like a transcription tool: it just turns your speech into text in the prompt window. It uses Whisper, OpenAI’s speech recognition model. 

The benefits of this are familiar for anyone who uses speech-to-text for composing texts and emails on their phone: you talk a lot faster than you can type, you don’t have to fumble with keyboards, and if you're somewhere cold (like I am), you minimize the amount of time you have to take your gloves off in the middle of your walk when you just have to know what juice Snoop drinks with his gin (probably orange, if you're wondering).

My favorite way to use ChatGPT’s speech recognition is for data input. You can read it an ingredient list, have it transcribe a few lines of a book as part of your prompt, or let it listen to video or podcast audio directly (this doesn’t work as well, in my tests). 

This feature has its limitations; it’s not great at deciphering accents or picking up on musical tones, for example. It’s hardly a voice assistant. The best way to use it is for quick-hit short prompts.

Learn more: Using ChatGPT data analysis to interpret charts & diagrams

How to use ChatGPT’s text-to-speech function (Make it “speak”)

ChatGPT can "hear" you when you record a prompt, but that doesn't feel much like a back-and-forth conversation. If you're itching for a human-AI chat that does, ChatGPT's new "Speak" feature is for you. 

To have a conversation, click on the headphones icon beside the prompt textbox. From there, you can choose between five different voices. (You can always change your mind later.) This starts a voice/audio conversation with ChatGPT. 

The chat style is the same as you know and love from regular ChatGPT, and you can streamline it using custom instructions or advanced prompts. The AI voice then reads you ChatGPT's output. 

As with the "hear" feature, this one records and then transcribes your prompt. Patience is key here, as the transcription isn't instant: it takes a few minutes to process. If you want to re-record your prompt, you can tap to cancel and give it another go. Don't worry about having to retain everything on one listen; the transcribed chat is all available in your message history.

At first, I felt put on the spot to keep coming up with prompts as a back and forth conversation with ChatGPT. But I found that I didn’t need to be perfect—it easily overcomes pronunciation problems, transcription errors, and user mistakes. 

For instance, when I asked about OSFI (usually pronounced "Oss-fee"), it made a pretty good guess about what I said (OSPI).  And despite this transcription error, ChatGPT searched for the information it thought I wanted, figured out the transcription had an error, then searched for the correct thing, found it, and then gave me the right answer, complete with citations. Not bad.

Screenshot of ChatGPT responding to a voice command

A heads-up: if ChatGPT needs to browse the web, it takes a lot longer and it won't signal that it's in the midst of browsing. When it speaks the answer, it also doesn't give you an indication of whether there’s a citation or not. To check for citations, you'll need to switch back to text chat.

Remember, like regular ChatGPT chats, it tracks past conversations, but might veer off track if you don't reference previous points.

It works well as an information gatherer, but my favorite way to use it is to set ChatGPT up as a persona. Voice mode is the perfect place to have a back-and-forth conversation with that persona. 

🤖 Tip: Try to stick to short prompts, as it sometimes cuts off or cuts out pieces in the middle of your recording.

Image generation for ChatGPT: How to make ChatGPT draw

It didn’t make the cut for the “see, hear, and speak” tagline, but thanks to its DALL-E integration, ChatGPT can now generate images. You can prompt ChatGPT to create images in any interaction mode—typing, voice recording, or during a conversation. All you have to do is ask it to generate an image for you; you don't need to specify a prompt. Instead, it’ll generate based on the context of the conversation.

All this is hidden from you. When it generates an image, all it tells you is "generating image." To see the prompt, you need to explicitly ask for it. 

But here's a twist: The AI's self-generated prompts can be quite elaborate, often too complex for DALL-E to interpret accurately. And if you try giving it a specific prompt, it tends to go its own way, using something different. Despite several tries, I couldn’t get it to generate an image with the exact prompt I specified.

A quirk to note: ChatGPT can't “see” the images it creates. If you want it to analyze an image it's made, you'll have to re-upload it into the prompt window.

It's a pretty easy way to generate images, but if you plan to do a lot of image generation, my suggestion is to develop your own image prompting skills and do it where you have control over the prompts. Doing it here is handy, but it’s difficult to get what you want.

It's a bit of a wildcard. You never know what you're going to get, but sometimes the results are pretty cool, especially with obscure terms image generators don't necessarily know.

Screenshot of ChatGPT generating an image


While it's still early, these features show a lot of promise. Multimodal mastery is going to be an essential skill of any active AI user, so experiment with these new features to see how well they fit into your workflow. Who knows, you could find whole new ways of working with the tools.
