ChatGPT can see, hear, and speak now — here's how to make the best of it

ChatGPT can see, hear, and speak now. Here's how to use the new features to talk to the chatbot, analyze images, and generate graphics.
January 16, 2024
Briana Brownell

OpenAI just opened up a bunch of new ways to interact with ChatGPT. Now, the AI chatbot can not only read what you write, but see the pictures you upload and hear the words you say—and even speak back.

In my view, this isn’t just about new features; it's a glimpse into the future of AI. We're finally getting access to what are called multimodal systems, systems that blend different types of understanding together. It's like having a conversation with a friend who doesn't just listen but also observes and interprets what they see—so if you pull out your phone to share a picture of your cat, they can see how cute Sparky is too. 

The tools aren't fully multimodal yet. For now, they're more like translators between different data forms. Even so, these capabilities are the first step towards a whole new slate of systems that will mesh various kinds of understanding. It's worthwhile to get used to thinking multimodally and to master this different way of interacting with AI now, so you'll be ready when even more complex multimodal systems arrive.


Screenshot of ChatGPT mobile chat window with new multimodal buttons highlighted
Red = "See," or add image; Blue = "Hear," or record prompt; Green = "Speak," or start audio conversation


Visual ChatGPT: How to make ChatGPT “see”

With ChatGPT's new image analysis features, instead of trying to find the perfect words to explain images to ChatGPT, now you can just show them. 

If you've got a picture already, upload it to the prompt textbox using the paper clip icon (desktop) or the plus sign (mobile). Alternatively, if you have it copied to the clipboard, you can paste it right in. 

Taking a live pic with your phone works too. When adding an image live, you can circle parts of it to guide ChatGPT's focus. For instance, you might circle something mysterious and ask "what's this?" You're not limited to one image; upload a bunch if you need to. 

This feature has some of the most exciting possibilities. Don't know what something is?  Ask ChatGPT to identify it. Need to describe an image in words? It can give you a pretty good starting point. Wrestling with a confusing graph?  ChatGPT can act as your personal data analyst.

Screenshot of ChatGPT describing a golden statue of Joan of Arc

But my favorite feature? Thematic analysis. Basically, it can analyze an image to see if it would fit with a theme. It’s like having a second set of eyes with a flair for design. 

Use it to choose among candidate images for a social media post, thumbnail, or webpage. My testing showed a pretty good understanding of both the content and the tone of the image.

But it's even more powerful when you set it up to simulate a certain persona. Then you can ask it which image resonates the most with your target audience or ideal client.
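If you want a repeatable way to set this up, here is a minimal sketch of a prompt template in Python. The function name `build_persona_prompt`, its parameters, and the wording of the prompt are my own illustration, not a ChatGPT feature or an official recipe. The script only assembles text; you would paste the result into the chat alongside your uploaded images.

```python
def build_persona_prompt(persona: str, theme: str, image_labels: list[str]) -> str:
    """Assemble a persona-based image-comparison prompt as plain text.

    Hypothetical helper: the wording is an illustration, not an official template.
    """
    if not image_labels:
        raise ValueError("Provide at least one image label.")
    labels = ", ".join(image_labels)
    return (
        f"Act as {persona}. "
        f"I've uploaded {len(image_labels)} image(s), labeled {labels}. "
        f"For each one, describe its content and tone in a sentence, "
        f"then tell me which best fits the theme \"{theme}\" and why."
    )


# Example values are made up; swap in your own audience and theme.
prompt = build_persona_prompt(
    persona="a first-time podcaster shopping for editing software",
    theme="friendly and approachable",
    image_labels=["A", "B", "C"],
)
print(prompt)
```

Keeping the persona and theme as separate inputs makes it easy to rerun the same comparison for a different audience and see whether the pick changes.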

You're not alone if you think this sounds complicated. This function does take a bit of time to get the hang of. We’ll be writing about these features in detail soon.

🤖 Tip: ChatGPT can also read text and math formulas from an image. 

Voice control for ChatGPT: How to make ChatGPT “hear”

The "Hear" feature lets you speak your prompts instead of typing them. You'll need your phone ready, though: the "Hear" feature is only available on iOS and Android for the moment. To use it, tap the audio icon to the right of the prompt box.

At the moment, the voice recognition feature works like a transcription tool: it just turns your speech into text in the prompt window. It uses Whisper, OpenAI's speech recognition system.

The benefits of this are familiar for anyone who uses speech-to-text for composing texts and emails on their phone: you talk a lot faster than you can type, you don’t have to fumble with keyboards, and if you're somewhere cold (like I am), you minimize the amount of time you have to take your gloves off in the middle of your walk when you just have to know what juice Snoop drinks with his gin (probably orange, if you're wondering).

My favorite way to use ChatGPT’s speech recognition is as a data input.  You can read it an ingredient list or get it to transcribe a few lines of a book as part of your prompt, or let it listen to video or podcast audio directly (this doesn’t work as well, in my tests). 

This feature has its limitations; it’s not great at deciphering accents or picking up on musical tones, for example. It’s hardly a voice assistant. The best way to use it is for quick-hit short prompts.

Learn more: Using ChatGPT data analysis to interpret charts & diagrams

How to use ChatGPT’s text-to-speech function (Make it “speak”)

Although ChatGPT can "hear" you when you record a prompt, it doesn't feel much like a back-and-forth conversation. If you're itching for a human-AI chat that does, ChatGPT's new "Speak" feature is for you.

To have a conversation, tap the headphones icon beside the prompt textbox; this starts a voice conversation with ChatGPT. From there, you can choose among five different voices (you can always change your mind later).

The chat style is the same as you know and love from regular ChatGPT, and you can streamline it using custom instructions or advanced prompts. The AI voice then reads you ChatGPT's output. 

As with the "Hear" feature, the "Speak" feature records and then transcribes your prompt. Patience is key here, as the transcription isn't instant: it takes a few minutes to process. If you want to re-record your prompt, you can tap to cancel and give it another go. Don't worry about having to retain everything on one listen; the transcribed chat is all available in your message history.

At first, I felt put on the spot to keep coming up with prompts in a back-and-forth conversation with ChatGPT. But I found that I didn't need to be perfect: ChatGPT easily overcomes pronunciation problems, transcription errors, and user mistakes.

For instance, when I asked about OSFI (usually pronounced "Oss-fee"), it made a pretty good guess about what I said (OSPI).  And despite this transcription error, ChatGPT searched for the information it thought I wanted, figured out the transcription had an error, then searched for the correct thing, found it, and then gave me the right answer, complete with citations. Not bad.

Screenshot of ChatGPT responding to a voice command


A heads-up: if ChatGPT needs to browse the web, it takes a lot longer and it won't signal that it's in the midst of browsing. When it speaks the answer, it also doesn't give you an indication of whether there’s a citation or not. To check for citations, you'll need to switch back to text chat.

Remember, like regular ChatGPT chats, it tracks past conversations, but might veer off track if you don't reference previous points.

It works well as an information gatherer, but my favorite way to use it is to set ChatGPT up as a persona. Voice mode is the perfect place to have a back-and-forth conversation with one.

🤖 Tip: Try to stick to short prompts, as it sometimes cuts off or cuts out pieces in the middle of your recording.

Image generation for ChatGPT: How to make ChatGPT draw

It didn’t make the cut for the “see, hear, and speak” tagline, but thanks to its DALL-E integration, ChatGPT can now generate images. You can prompt ChatGPT to create images in any interaction mode—typing, voice recording, or during a conversation. All you have to do is ask it to generate an image for you; you don't need to specify a prompt. Instead, it’ll generate based on the context of the conversation.

The prompt ChatGPT writes for DALL-E is hidden from you. When it generates an image, all it tells you is "generating image." To see the prompt, you need to explicitly ask for it.

But here's a twist: The AI's self-generated prompts can be quite elaborate, often too complex for DALL-E to interpret accurately. And if you try giving it a specific prompt, it tends to go its own way, using something different. Despite several tries, I couldn’t get it to generate an image with the exact prompt I specified.

A quirk to note: ChatGPT can't “see” the images it creates. If you want it to analyze an image it's made, you'll have to re-upload it into the prompt window.

It's a pretty easy way to generate images, but if you plan to do a lot of image generation, my suggestion is to develop your own image prompting skills and do it where you have control over the prompts. Doing it here is handy, but it’s difficult to get what you want.

It's a bit of a wildcard. You never know what you're going to get, but sometimes the results are pretty cool, especially with obscure terms image generators don't necessarily know.


Screenshot of ChatGPT generating an image

Conclusion

While it's still early, these features show a lot of promise. Multimodal mastery is going to be an essential skill for any active AI user, so experiment with these new features to see how well they fit into your workflow. Who knows? You could find whole new ways of working with the tools.

Briana Brownell
Briana Brownell is a Canadian data scientist and multidisciplinary creator who writes about the intersection of technology and creativity.
