AI Is Becoming Conversational

OpenAI’s new voice models push AI closer to live, useful conversation: longer calls, real-time translation, and faster transcription for business products.

OpenAI’s Brand New Voice AI Is Here. It Could Change How Companies Talk to Their Customers

Author: Ben Sherry

OpenAI just launched a new set of voice models that can have longer conversations, instantly translate between languages, and more accurately transcribe spoken words into text. The new models are available for businesses to use in their products and services. 

According to OpenAI, companies including Zillow, Priceline, Deutsche Telekom, Vimeo, and Glean are already using these new models to build advanced travel agents, multilingual customer support assistants, and more capable voice assistants “that can reason through requests and take action in real time.”

Here’s a breakdown of the new models:

GPT-Realtime-2

GPT-Realtime-2 is the latest in OpenAI’s line of speech-to-speech models. Unlike earlier voice AI models, the models in the GPT-Realtime line don’t need to transcribe speech into text in order to process it, which lets them hold more natural-sounding conversations.

OpenAI says GPT-Realtime-2 has improved reasoning and a longer context window, making it better at completing complex agentic tasks. The model could be used to handle lengthy customer service conversations that require analyzing data from multiple sources and carrying out multi-step workflows.

GPT-Realtime-2 gives developers more granular control over the voice model, such as specifying phrases that the voice agent should use often. It can also be directed to put more or less effort into a given task, call multiple tools at once (enabling the agent to run several searches simultaneously), and understand industry-specific terms.

One example offered by OpenAI came from Zillow, which is currently using the model to build an assistant that can help prospective homebuyers identify potential locations and autonomously schedule home tours. Another is Priceline, which OpenAI says is building tools that will enable people to manage their entire trip through voice conversations. 

GPT-Realtime-2 is priced at $32 per one million audio input tokens and $64 per one million output tokens. 

GPT-Realtime-Translate 

GPT-Realtime-Translate is a “live translation model” that OpenAI says can translate more than 70 input languages into 13 output languages in real time. Previously, OpenAI offered translation through its transcription model, but the new model is purpose-built for translation. It could be a big help to international call centers staffed by non-English speakers.

One of the first companies to use GPT-Realtime-Translate is Deutsche Telekom, which is building tools that will enable customers to speak in their native language while the model translates in real time. 

GPT-Realtime-Translate is priced at $0.034 per minute of translated audio. 

GPT-Realtime-Whisper 

GPT-Realtime-Whisper is OpenAI’s fastest transcription model yet. The company’s previous transcription model, simply called Whisper-1, came in several sizes and speeds: larger models were more accurate but slower to convert audio into text, while smaller models were faster but less accurate. The company says the new model is both faster and more accurate, making it a viable option for companies that need live captions or meeting-note features in their products.

GPT-Realtime-Whisper is priced at $0.017 per minute of transcribed audio. 
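Because the three models are billed differently (GPT-Realtime-2 by token, the translation and transcription models by the minute), a quick back-of-the-envelope estimate can help compare them. The Python sketch below simply applies the prices reported in this article; the constant and function names are hypothetical, the token counts for a real call would have to come from your own usage data, and current rates should be checked against OpenAI's pricing page.

```python
from decimal import Decimal

# Prices as reported in this article (USD); assumed, verify before relying on them.
REALTIME2_INPUT_PER_M = Decimal("32")   # per 1M audio input tokens
REALTIME2_OUTPUT_PER_M = Decimal("64")  # per 1M output tokens
TRANSLATE_PER_MIN = Decimal("0.034")    # per minute of translated audio
WHISPER_PER_MIN = Decimal("0.017")      # per minute of transcribed audio

MILLION = Decimal(1_000_000)


def realtime2_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Token-based GPT-Realtime-2 cost for token counts you have measured yourself."""
    return (
        Decimal(input_tokens) * REALTIME2_INPUT_PER_M
        + Decimal(output_tokens) * REALTIME2_OUTPUT_PER_M
    ) / MILLION


def per_minute_cost(minutes: float, rate: Decimal) -> Decimal:
    """Minute-based cost for the translation or transcription model."""
    # Convert via str() so a value like 0.5 doesn't carry binary float error.
    return Decimal(str(minutes)) * rate


if __name__ == "__main__":
    # Hypothetical token counts for a long support call.
    print("Realtime-2:", realtime2_cost(500_000, 250_000))
    # One hour of audio on each per-minute model.
    print("Translate, 1 hour:", per_minute_cost(60, TRANSLATE_PER_MIN))
    print("Whisper, 1 hour:", per_minute_cost(60, WHISPER_PER_MIN))
```

At the listed rates, an hour of live translation works out to about $2.04 and an hour of transcription to about $1.02, while GPT-Realtime-2 costs depend entirely on how many tokens a conversation consumes. Using `Decimal` keeps the cent values exact instead of accumulating floating-point rounding.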

These new models are distinct from ChatGPT’s advanced voice mode, which is available to paid ChatGPT subscribers. That feature runs on an older version of the Realtime model, which will likely be replaced in the near future.

Credits: TCA, LLC.
