Deepfakes, both the technology used to create them and the ways people use them, have become a hot topic in the headlines. The technology has opened the door to a wave of new, creative solutions that will impact many industries. However, it has also raised serious questions about ethical use, driven largely by negative press and outright misuse of the technology.
One cannot discuss this topic without addressing the negatives, but we intend to surface the exciting things happening in the space. When done ethically, AI voice deepfake, also known as voice cloning or synthetic voice, can be a force for good.
While the term “deepfake” typically refers to images and video, this article series looks at synthetic voice content, or computer-generated speech. In this six-part series, we’ll explore the potential applications for good with synthetic voice artificial intelligence (AI), how to protect yourself against voice fraud, and how to leverage Veritone’s proprietary AI solution, Veritone Voice, to generate your synthetic voice.
Continue to read or skip ahead to the parts that most interest you:
- A brief history on deepfakes
- What exactly are deepfake voices?
- What we’ll cover in the deepfake voice series
A brief history on deepfakes
If you traced the term deepfake to its point of origin, you would be surprised to learn that it came from the world of Reddit. A user coined the term, using it as their username. Today, the word has evolved to encompass any content categorized as synthetic media. Using a form of AI technology called deep learning, you can create an image or a video that swaps out the original likeness of a person with that of another.
However, this technological concept originated well before Reddit was even a thing. In the late ’90s, an academic paper that explored the deepfake concept laid out a program that would be the first instance of what we would call deepfake technology today.
It drew upon earlier work done around analyzing faces, synthesizing audio from text, and then modeling the actions of the human mouth in 3D space. Combining these three focuses, the authors wrote what they called the Video Rewrite Program, which synthesized new facial animations from provided audio recordings.
After the release of that academic paper, the study of this technology went cold in the early 2000s. But at the start of the new decade in 2010, research picked up once again, focusing primarily on developing facial recognition capabilities.
That changed with the release of two more papers, one in 2016 and another in 2017. These papers showed that deepfake audio creation was possible on consumer-grade hardware. Since becoming known as deepfake, thanks to that infamous Reddit user, the technology has evolved rapidly into more professional and practical applications.
One such application is to replicate or clone a person’s voice. This specific use case has gained more visibility thanks to intermittent headlines. One of the more recent controversies surrounding the technology involved the replication of the voice of famed chef, travel documentarian, and author Anthony Bourdain, which was cloned and used in a documentary about his life. While people can use the technology to revive the voices of those who are no longer with us, it does raise ethical questions.
With the technology still emerging, standardization of the approval process is ongoing, and Veritone is leading the way in this regard. First and foremost, the estate must always be involved in these cases. If consent is not received, the project must not be approved. However, before we dive any further into its use and the implications of the technology, let’s first break down exactly what a deepfake voice is and how it’s created.
What exactly are deepfake voices?
Deepfake voice, also called voice cloning or synthetic voice, uses AI to generate a clone of a person’s voice. The technology has advanced to the point that it can replicate a human voice with great accuracy in tone and likeness.
Creating deepfakes requires high-end computers with powerful graphics cards, often leveraging cloud computing power. More powerful hardware accelerates the rendering process, which can take anywhere from hours to weeks, depending on your rig.
To clone someone’s voice, you must have training data to feed into the AI models. This data is often original recordings that provide a clear example of the target person speaking. AI can use this data to render an authentic-sounding voice, which can then speak anything that you type (text-to-speech) or say (speech-to-speech).
The technology has many worried about how it will affect a broad range of things from political discourse to the rule of law. Some of the early warning signs have already appeared in the form of phone scams and fake videos on social media of people doing things they never did. Questions of ethical use have also been raised, particularly in instances like the Anthony Bourdain documentary.
There are two ways that protections can be implemented. The first is creating a way to analyze or detect whether a video is authentic. Often likened to anti-virus software, this approach will inevitably be playing catch-up as these detectors are defeated by continuously evolving generators.
The second, and arguably the best, way forward would be to embed creation and edit information in the software or hardware. This would only work if this data were uneditable, but the idea is that it would create an inaudible watermark that would act as a source of truth. In other words, we would know whether the video is authentic by seeing where it was shot, produced, edited, and so forth.
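To make the idea of uneditable creation and edit information more concrete, here is a minimal sketch in Python. It is an illustration only, not how Veritone Voice or any particular product works, and it keeps the record in a separate signed log rather than as a watermark inside the audio or video itself. The names `SIGNING_KEY`, `append_event`, and `verify_log` are hypothetical. Each record stores a hash of the media and the signature of the previous record, so changing any earlier entry breaks the chain.

```python
import hashlib
import hmac
import json

# Hypothetical signing key. A real system would protect this in secure
# hardware or use public-key signatures instead of a shared secret.
SIGNING_KEY = b"demo-key-not-for-production"


def _sign(record: dict) -> str:
    """Sign a record with HMAC-SHA256 over its canonical JSON form."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()


def append_event(log: list, action: str, tool: str, media_bytes: bytes) -> list:
    """Return a new log with one more record linked to the previous one."""
    record = {
        "action": action,  # e.g. "recorded", "edited"
        "tool": tool,      # software or hardware that performed the step
        "media_sha256": hashlib.sha256(media_bytes).hexdigest(),
        "prev_signature": log[-1]["signature"] if log else "",
    }
    return log + [{**record, "signature": _sign(record)}]


def verify_log(log: list, media_bytes: bytes) -> bool:
    """Check every link in the chain and that the last record matches the media."""
    if not log:
        return False
    prev_signature = ""
    for entry in log:
        record = {k: v for k, v in entry.items() if k != "signature"}
        if record["prev_signature"] != prev_signature:
            return False
        if not hmac.compare_digest(entry["signature"], _sign(record)):
            return False
        prev_signature = entry["signature"]
    return log[-1]["media_sha256"] == hashlib.sha256(media_bytes).hexdigest()


# Usage: record a clip, edit it, then verify the final file against the log.
raw = b"original recording"
edited = b"edited recording"
log = append_event([], "recorded", "hypothetical-recorder", raw)
log = append_event(log, "edited", "hypothetical-editor", edited)
print(verify_log(log, edited))
```

The log is only as trustworthy as the key that signs it, which is why the approach depends on the creation and editing tools themselves writing these records.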
Positive use cases for deepfake technology
The technology is powerful and no doubt needs guardrails to defend against abuse. But it has recently shown that it can be used ethically for good. For example, it has helped people who have lost their voice to throat diseases or other medical issues get their voice back. This was recently achieved with Val Kilmer, who lost his voice to cancer.
From a business perspective, it has opened the door to a variety of opportunities. It can be used to create a brand mascot or provide variety to content such as weather and sports reports in the broadcast world. Entertainment companies can bring back past talent or incorporate the voice of a historical figure into their programming. It was recently used to help translate podcast content into different languages using the podcaster’s voice. But this must be done ethically and with the proper approvals.
What We’ll Cover in the Deepfake Voice Series
This series is for anyone who wants to learn about deepfake voice, voice cloning, and synthetic voice content. It’s also for voice personalities, talent agents, film and TV producers, and organizations looking to leverage synthetic content to uncover new opportunities.
1: Voice Cloning
- How does voice cloning work, and what are use cases for it in the future?
- For those who are new to the technology and want to learn about how it works and potential business applications.
2: Text-to-Speech (TTS) AI
- What is TTS, how is it related to voice cloning, and how does AI create realistic-sounding voices?
- Learn about how text-to-speech works within the voice cloning process and how AI technology achieves synthetically created voice outputs.
3: Synthetic Voice
- What’s synthetic voice, and how is it different from deepfake voice and voice cloning?
- How synthetic voice enables Voice-as-a-Service, and what the future will look like where you own your voice personas.
4: Deepfake Voice Fraud
- What are the latest ethical issues around deepfake voice, and how do you protect yourself?
- The rising risks of fraud and abuse with deepfakes and how a synthetic voice platform can standardize protection, compliance, and approvals.
5: How to Create Synthetic Voice
- What does the process look like, and how does someone create their synthetic voice?
- A guide on how to leverage Veritone Voice to create your synthetic voice that’s protected and gives you complete control to license and monetize.
We hope you enjoy this series and come out of it knowing more about deepfake voice, voice cloning, and synthetic content.