02.23.23

Voice Cloning 101: The Technology Behind Authentic-Sounding Voice Simulations

Ethan Baker, Director of Content Development, Veritone

In this installment of our deepfake voice series, we will explore voice cloning by defining what it is, how it works, and the benefits of this new technology. As we progress through this series, we’ll delve into more detail on synthetic voice content, how you can create it, and ways to address the issues of ethical use and deepfake fraud.

In this blog, we’ll be covering the following areas:

What is voice cloning?
How is voice cloning possible?
What are the benefits of voice cloning?
What is the best voice cloning application?

An introduction to voice cloning

Voice cloning is often used interchangeably with other terms, such as deepfake voice, speech synthesis, and synthetic voice, that have slightly different meanings. Voice cloning is the process of using a computer to generate the speech of a real individual, creating a clone of their specific, unique voice using artificial intelligence (AI).

Text-to-speech (TTS) systems, which take written language and transform it into spoken communication, are not to be confused with voice cloning. TTS systems are much more limited in the outputs they can produce than voice cloning technology, which is really more of a custom process.

With a TTS system, the training data, the key component of any synthetically created media, determines the voice output. In other words, the voice you hear is the one that was recorded in the data set.

Now, with the introduction of voice cloning AI technology, that changes. Methods have been put in place to provide deeper analysis and extraction of the characteristics of a target voice. These attributes can then be applied to different waveforms of speech, allowing someone to change the speech output of one voice to another.

How voice cloning works

Thanks to advancements in artificial intelligence (AI), particularly deep learning, a subset of machine learning underneath the umbrella of AI, we’ve been able to produce accurate replications of a voice. But this is only made possible by two things:

Powerful hardware with cloud computing capabilities to process and render in a timely and efficient manner
Extensive training data of the targeted voice that models can leverage to create an accurate voice clone

With the proper AI and developmental expertise and tools, it really comes down to the latter. You need a large amount of recorded speech to have enough data to train the voice model. The information about the voice is stored in an embedding, a relatively low-dimensional space into which you can translate high-dimensional data, such as audio, as compact vectors.

In other words, it makes it easier to work with large inputs with machine learning models. For the sake of not getting too technical, we’ll leave it at that, but feel free to dive deeper into the subject if that interests you.
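
For readers who do want a slightly closer look, here is a minimal Python sketch of the idea that a clip of any length can be reduced to one short, fixed-length vector that is easy to compare. Everything in it is an assumption made for illustration: the three-number "frames" are made-up values, not real audio features, and simple averaging (mean_pool, a hypothetical helper) stands in for the trained neural network that a real voice cloning system would use to compute a speaker embedding.

```python
import math


def mean_pool(frames):
    """Collapse a variable-length list of feature frames into one fixed-length vector.

    Illustrative stand-in only: real systems learn this mapping with a neural network.
    """
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Made-up feature frames for three clips of different lengths (assumed values).
speaker_a_clip1 = [[0.9, 0.1, 0.2], [1.0, 0.0, 0.3], [0.8, 0.2, 0.1]]
speaker_a_clip2 = [[0.95, 0.05, 0.25], [0.85, 0.15, 0.2]]
speaker_b_clip = [[0.1, 0.9, 0.7], [0.2, 1.0, 0.6], [0.0, 0.8, 0.8], [0.1, 0.9, 0.9]]

# Each clip becomes a single 3-number vector, regardless of how long it was.
emb_a1 = mean_pool(speaker_a_clip1)
emb_a2 = mean_pool(speaker_a_clip2)
emb_b = mean_pool(speaker_b_clip)

same = cosine_similarity(emb_a1, emb_a2)
different = cosine_similarity(emb_a1, emb_b)

print("same speaker similarity is higher:", same > different)
```

In this toy data, the two clips from the same made-up speaker land close together and the clip from the other speaker lands farther away, which is the property that lets a model treat an embedding as a compact description of a voice.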

Benefits of voice cloning

Let’s start with the good. There are plenty of potential use cases for voice cloning that are often overshadowed by the negative uses, which we will address in a second. Some of the positive applications of the technology include:

Increase advertising and sponsorship opportunities for voice personalities, celebrities, and influencers
Help companies work with talent during their busiest times of the year, such as football season for players or coaches
Revive voices from the past for use in entertainment to help tell a story in documentaries, movies, and TV shows
Diversify recurring broadcast content such as weather reports or sports updates
Localize content so that it can be heard in the host’s or narrator’s voice in another language

These are just some of the positive uses for voice cloning, and as the technology continues to evolve, more will emerge. But of course, everything hinges on the ethical use of someone’s voice. That’s why standardizing the approval process is imperative: it protects everyone’s voice and ensures they have complete control over how it’s used.

Factors to consider when choosing a voice cloning application

To narrow down your search for the best voice cloning application, you should first determine what you are looking for. Do you need something that’s more for text-to-speech output? Or do you need something more custom?

Once you’ve figured out why you need a voice cloning application, you should then home in on three key criteria:

Output quality: you’ll want to make sure that the output is authentic sounding and meets your needs. Vendors will usually provide samples of what the product can do. If not, you should consider asking for a demo, if available, to determine how human the product sounds.
Intuitive interface: how easy is it to use the application? Is it hard to find things when you’re in the app or can you navigate and use it to meet your needs? Again, this can be determined by product videos, marketing content, and a demo.
Voice protections: you’ll want to make sure that the company follows ethical uses of voices. If it’s a custom service requiring training data, then it’s important to inquire about data protections and how a voice, when created, won’t be used improperly.

The ethical implications of voice cloning are central to Veritone Voice, our voice-as-a-service application. Built into the framework of the application are the levers that give users control over their voice, enabling the proper protections so that they decide who can use it. This helps us deliver our custom voice-as-a-service solution and provide a complete white-glove experience for the talent we work with.

In the next chapter in this series, we’ll be discussing Text-to-Speech (TTS) AI and how it’s related to voice cloning.
