
OpenAI is introducing new voice and image capabilities in ChatGPT. These capabilities provide users with a more intuitive interface, allowing them to engage in voice conversations or share images with ChatGPT.
The voice and image functionality offers users additional ways to use ChatGPT in their daily lives. They can snap a picture of a landmark while traveling and have a live conversation about what makes it interesting. At home, they can take pictures of their fridge and pantry to decide what to cook for dinner (and ask follow-up questions for a step-by-step recipe). After dinner, they can help their child with a math problem by taking a photo, circling the problem, and having ChatGPT share hints with both of them.
OpenAI is rolling out voice and image capabilities in ChatGPT to Plus and Enterprise users over the next two weeks. Voice functionality will be available on iOS and Android devices, allowing users to engage in voice conversations with ChatGPT. Additionally, users will be able to share images with ChatGPT on all platforms.
Speak with ChatGPT and have it talk back
Users can now use voice to have a conversation with their assistant. They can use it on the go, request a bedtime story for their family, or settle a dinner table debate.
To start using voice, users should go to Settings → New Features on the mobile app and enable voice conversations. Then, they can tap the headphone button in the top-right corner of the home screen and choose their preferred voice from five options.
The new voice feature is powered by a text-to-speech model that can generate audio that sounds like a human voice, using just text and a short sample of speech. OpenAI collaborated with professional voice actors to create each of the voices. They also use Whisper, their open-source speech recognition system, to convert the users’ spoken words into text.
For example, when asked for a bedtime story, ChatGPT might narrate something like this:
Once in a tranquil woodland, there was a fluffy mama cat named Lila. One sunny day, she cuddled with her playful kitten, Milo, under the shade of an old oak tree.
“Milo,” Lila began, her voice soft and gentle, “you’re going to have a new playmate soon.”
Milo’s ears perked up, curious. “A new playmate?”
Lila purred, “Yes, a baby sister.”
Milo’s eyes widened with excitement. “A sister? Will she chase tails like I do?”
Lila chuckled. “Oh, she’ll have her own quirks. You’ll teach her, won’t you?”
Milo nodded eagerly, already dreaming of the adventures they’d share.
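Conceptually, a single voice turn chains three steps: speech-to-text, a chat completion, and text-to-speech. Below is a minimal sketch of that loop using the open-source whisper package and OpenAI's public audio API. The file names, the "base" model size, and the tts-1 voice are illustrative assumptions, not details of ChatGPT's production pipeline (which uses voices built with professional voice actors).

```python
# pip install openai-whisper openai
import whisper
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Speech to text: transcribe the user's spoken prompt with Whisper.
stt_model = whisper.load_model("base")
user_text = stt_model.transcribe("question.wav")["text"]

# 2. Reasoning: send the transcript to a chat model for a reply.
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": user_text}],
).choices[0].message.content

# 3. Text to speech: render the reply with one of the API's preset voices.
audio = client.audio.speech.create(model="tts-1", voice="alloy", input=reply)
audio.write_to_file("reply.mp3")
```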
Chat about images
Users can now show ChatGPT one or more images. They can troubleshoot why their grill won’t start, explore the contents of their fridge to plan a meal, or analyze a complex graph of work-related data. To focus on a specific part of an image, they can use the drawing tool in the mobile app.
To get started, users can tap the photo button to capture or choose an image. If they are on iOS or Android, they should tap the plus button first. They can also discuss multiple images or use the drawing tool to guide their assistant.
Image understanding is powered by multimodal GPT-3.5 and GPT-4. These models apply their language reasoning skills to a wide range of images, such as photographs, screenshots, and documents containing both text and images.
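While the photo picker and drawing tool live in the app, the same kind of multimodal capability is reachable through OpenAI's API. Here is a minimal sketch, assuming access to a vision-capable chat model; the model name, file name, and prompt are illustrative assumptions, not the exact configuration behind ChatGPT.

```python
# pip install openai
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode a local photo (hypothetical file) as a base64 data URL.
with open("fridge.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# Send text and image together in one multimodal message.
response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable chat model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "What could I cook for dinner with these ingredients?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```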
Deploying image and voice capabilities gradually
OpenAI’s goal is to build AGI that is safe and beneficial. OpenAI believes in making their tools available gradually, which allows them to make improvements and refine risk mitigations over time while also preparing everyone for more powerful systems in the future. This strategy becomes even more important with advanced models involving voice and vision.
The new voice technology—capable of crafting realistic synthetic voices from just a few seconds of real speech—opens doors to many creative and accessibility-focused applications. However, these capabilities also present new risks, such as the potential for malicious actors to impersonate public figures or commit fraud.
This is why OpenAI is using this technology to power a specific use case: voice chat. Voice chat was created with voice actors the company has worked with directly. OpenAI is also collaborating in a similar way with others. For example, Spotify is using this technology to pilot its Voice Translation feature, which helps podcasters expand the reach of their storytelling by translating podcasts into additional languages in the podcasters’ own voices.
Vision-based models also present new challenges, ranging from hallucinations about people to relying on the model’s interpretation of images in high-stakes domains. Prior to broader deployment, the model was tested with red teamers for risk in domains such as extremism and scientific proficiency, and with a diverse set of alpha testers. This work enabled the team to align on a few key details for responsible usage.
Like other ChatGPT features, vision is designed to assist users with their daily lives. It performs best when it can see what the user sees.
This approach has been informed directly by the team’s work with Be My Eyes, a free mobile app for blind and low-vision people, to understand uses and limitations. Users have expressed that they find it valuable to have general conversations about images that happen to contain people in the background, such as if someone appears on TV while the user is trying to figure out their remote control settings.
The team has also implemented technical measures to significantly limit ChatGPT’s ability to analyze and make direct statements about people, since the system is not always accurate and should respect individuals’ privacy.
Real-world usage and feedback will help the team make these safeguards even better while keeping the tool useful.
Users might depend on ChatGPT for specialized topics, for example in fields like research. The team is transparent about the model’s limitations and discourages higher-risk use cases without proper verification. Furthermore, the model is proficient at transcribing English text but performs poorly with some other languages, especially those that use a non-roman script, so non-English users are advised against using ChatGPT to transcribe text in those languages.
More about OpenAI’s approach to safety and its work with Be My Eyes can be found in the system card for image input.