OpenAI unveils Voice Engine, a model that generates a voice from a short sample, and it turns out the public has already been hearing it

OpenAI has presented early results from Voice Engine, a tool that synthesizes realistic speech from a 15-second audio sample and a piece of text. The tool has been in development for about two years, but there is no public access to it because of the company's security concerns.

"We hope to start a dialogue about the responsible use of synthetic voices and how society can adapt to these new possibilities. Based on these conversations and the results of these small tests, we will make a more informed decision on whether and how to deploy this technology on a larger scale," OpenAI's blog says.

The generative AI model behind Voice Engine has been hiding in plain sight for some time. It powers the voice and read-aloud features in ChatGPT, as well as the preset voices available in OpenAI's text-to-speech API. Spotify has also been using it since the beginning of September to dub podcasts into other languages.

The company sees several uses for the technology: helping people who cannot read, translating content, providing voice services to remote communities, supporting people with speech impairments, and aiding in voice restoration. The blog post also presents example applications with samples in several languages.

TechCrunch asked Jeff Harris, a company representative, what data Voice Engine was trained on. He replied that the model was trained on a mix of licensed and publicly available data. Training details can be both a competitive advantage and a source of legal trouble, so the lack of specifics is not surprising. As for user data, Voice Engine handles it very carefully:

"We take a small sample of audio and text and create realistic speech that matches the original speaker," Harris says. "The audio used is deleted after the request is completed."

According to TechCrunch, the future service will be priced "aggressively". OpenAI removed Voice Engine pricing from its marketing materials, but documents reviewed by TechCrunch indicate a cost of $15 per one million characters, or about 162,500 words in English. That is slightly more than the length of Dickens' novel "Oliver Twist". It works out to roughly 18 hours of audio, so the price is a little under $1 per hour.
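The per-hour figure above can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the article; the function names are illustrative, and the 18-hour estimate is taken as given rather than derived from a measured speaking rate.

```python
# Figures quoted in the article (as reported by TechCrunch).
PRICE_PER_MILLION_CHARS_USD = 15.0
WORDS_PER_MILLION_CHARS = 162_500
AUDIO_HOURS_PER_MILLION_CHARS = 18  # the article's approximation


def cost_per_audio_hour(price_usd: float, audio_hours: float) -> float:
    """Price of one hour of generated audio, in USD."""
    return price_usd / audio_hours


def implied_words_per_minute(words: float, audio_hours: float) -> float:
    """Speaking rate implied by the word count and audio duration."""
    return words / (audio_hours * 60)


def implied_chars_per_word(chars: float, words: float) -> float:
    """Average characters per word implied by the two totals."""
    return chars / words


hourly = cost_per_audio_hour(PRICE_PER_MILLION_CHARS_USD, AUDIO_HOURS_PER_MILLION_CHARS)
wpm = implied_words_per_minute(WORDS_PER_MILLION_CHARS, AUDIO_HOURS_PER_MILLION_CHARS)
cpw = implied_chars_per_word(1_000_000, WORDS_PER_MILLION_CHARS)

print(f"Cost per audio hour: ${hourly:.2f}")
print(f"Implied speaking rate: {wpm:.0f} words per minute")
print(f"Implied characters per word: {cpw:.2f}")
```

The result is about $0.83 per hour. The implied pace of roughly 150 words per minute is a plausible speaking rate, so the 18-hour estimate is at least internally consistent.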

That is cheaper than ElevenLabs, one of the most popular competitors, which charges $11 for 100,000 characters per month. Interestingly, the HD-quality version costs twice as much, yet the OpenAI representative told TechCrunch that there is no difference between the HD and non-HD voices, a statement that can be read in more than one way. Voice Engine also offers no controls for tone, pitch, or other voice characteristics.
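To make the comparison concrete, both prices can be normalized to a cost per one million characters. This is a rough sketch using only the figures quoted in the article; it treats the ElevenLabs monthly plan as if it were a simple per-character rate, which is an assumption, since a subscription quota is not strictly equivalent to pay-as-you-go pricing.

```python
def usd_per_million_chars(price_usd: float, chars: int) -> float:
    """Normalize a (price, character allowance) pair to USD per 1M characters."""
    return price_usd * 1_000_000 / chars


# Figures quoted in the article.
voice_engine = usd_per_million_chars(15, 1_000_000)
voice_engine_hd = voice_engine * 2  # the HD version "costs twice as much"
elevenlabs = usd_per_million_chars(11, 100_000)  # assumption: plan treated as a flat rate

print(f"Voice Engine:      ${voice_engine:.2f} per 1M characters")
print(f"Voice Engine (HD): ${voice_engine_hd:.2f} per 1M characters")
print(f"ElevenLabs:        ${elevenlabs:.2f} per 1M characters")
print(f"ElevenLabs / Voice Engine: {elevenlabs / voice_engine:.1f}x")
```

On these numbers, ElevenLabs comes to $110 per million characters, roughly seven times the standard Voice Engine rate and still well above the HD rate.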

Voice actors on ZipRecruiter earn from $12 to $79 per hour, far more than Voice Engine costs, and actors with agents are paid considerably more. There is also the issue of deepfakes. That is why the company is proceeding very cautiously, including in the usage examples it has published.