04.20.23

How to Make AI Speech Sound Realistic

Rupal Patel, VP of Voice & Accessibility, Veritone

As technology advances, synthetic voices have become an essential part of our lives. From virtual assistants like Siri and Alexa to GPS navigation systems, synthetic voices are everywhere. However, not all synthetic voices are created equal.

Many of us have heard AI speech that sounds robotic and artificial, which breaks the illusion of speaking with a real person. In some situations (like GPS assistance), this isn’t a huge issue, but it matters for use cases that rely on realistic, high-quality voices, such as broadcast and film.

In this blog, we will discuss how you can make synthetic voices sound more realistic with:

Organic inflections, pauses, and breaths
Emotions and expression
Natural pronunciation

With AI text to speech software built on machine learning, such as Veritone Voice, users can create high-quality synthetic voices. Its algorithms analyze and improve the sound of text to speech voices by generating more natural-sounding inflections and intonations, improving pronunciation, and reducing speech errors.

But, like any AI solution, it takes a human touch to review and finesse the output for the most organic-sounding results. Let’s dive into how users can quickly and easily adjust synthetic speech to make it sound more natural.

Use natural inflections, pauses, and breaths

One of the essential elements of a realistic-sounding synthetic voice is natural inflections and pauses. Natural-sounding voices include variations in tone and pitch, which make them more engaging and easier to understand. Additionally, pauses and breaks in speech help the listener comprehend the information better.

With Veritone Voice, users can adjust their synthetic speech through the user-friendly interface of levels and sliders or by editing the Speech Synthesis Markup Language (SSML) directly; the API and the UI offer the same functionality. Once you select which voice (Custom, Premium, or Stock) and the style of voice you want to use for your project, you can adjust the speed and pitch of the clip as a whole or within specific sections. To add a pause, simply insert a comma in the text to speech tool (even if it’s not the grammatically correct placement of that punctuation). In the UI, you can also adjust the pause between words with a slider for more control.
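For readers who prefer markup to sliders, here is a minimal sketch of what a speed, pitch, and pause adjustment can look like in SSML. It uses only elements from the W3C SSML standard (speak, prosody, and break). The function name build_ssml and the default values are hypothetical, and which SSML tags Veritone Voice accepts is an assumption that should be checked against the product documentation.

```python
import xml.etree.ElementTree as ET


def build_ssml(before: str, after: str, pause_ms: int = 400,
               rate: str = "95%", pitch: str = "+2st") -> str:
    """Wrap two text fragments in standard SSML with an explicit pause between them.

    build_ssml and its defaults are illustrative, not part of any vendor API.
    """
    speak = ET.Element("speak")
    # <prosody> sets speaking rate and pitch for everything it contains.
    prosody = ET.SubElement(speak, "prosody", {"rate": rate, "pitch": pitch})
    prosody.text = before
    # <break> inserts a pause of a fixed duration at this point in the speech.
    pause = ET.SubElement(prosody, "break", {"time": f"{pause_ms}ms"})
    pause.tail = after
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("Welcome back.", " Here is today's forecast.")
print(ssml)
```

An explicit break gives finer control than a comma because the pause length is stated in milliseconds rather than left to the voice engine.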

Another element of making an artificial voice sound more realistic is the use of breath groups and phrasing. Natural-sounding breaths can help make a synthetic voice sound more human. Breathing patterns can help signal to the listener when the speaker is about to finish a thought or take a break.

Synthetic voices with natural-sounding breaths can help convey a sense of pace and rhythm that mimics human speech. While not all AI voice applications can accommodate the sound of breath, a similar effect can be achieved by adding pauses where a breath would occur organically.
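The idea of placing pauses where a breath would fall can be roughed out in code. The sketch below assumes that punctuation is a reasonable stand-in for breath-group boundaries: it adds a short pause after commas, semicolons, and colons, and a longer one between sentences. The function name add_breath_pauses and the pause lengths are hypothetical starting points to tune by ear, not recommended values from any vendor.

```python
import re
from xml.sax.saxutils import escape


def add_breath_pauses(text: str, short_ms: int = 250, long_ms: int = 600) -> str:
    """Insert SSML <break> tags at likely breath points (illustrative heuristic)."""
    # Split on whitespace that follows clause or sentence punctuation.
    parts = re.split(r"(?<=[,;:.!?])\s+", text.strip())
    pieces = []
    for i, part in enumerate(parts):
        pieces.append(escape(part))  # escape &, <, > so the markup stays valid
        if i < len(parts) - 1:
            # Longer pause after a full sentence, shorter one after a clause.
            ms = long_ms if part[-1] in ".!?" else short_ms
            pieces.append(f'<break time="{ms}ms"/>')
    return "<speak>" + " ".join(pieces) + "</speak>"


paused = add_breath_pauses("When the storm passed, we went outside. The air was cool.")
print(paused)
```

Punctuation is only a rough guide. Long clauses without commas may still need a manually placed pause, which is where the human review described earlier comes in.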

Consider emotion and expressiveness

Adding emotion and expressiveness can be achieved by modifying prosody, which refers to the patterns of stress and intonation in speech. A natural-sounding synthetic voice should be able to convey stress, pitch, and intonation changes that are consistent with the message being delivered.

Within Veritone Voice, this can be achieved by adjusting the pitch and loudness of certain words or phrases, the word stress, the liveliness, or the tone (excited, terse, and disappointed are some of the available options).
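As a rough illustration of word-level adjustments, the sketch below raises the pitch and loudness of a single word and marks it as stressed, using the standard SSML prosody and emphasis elements. The helper name emphasize_word and its default values are hypothetical. Tone presets such as excited or terse are product features rather than part of the SSML standard, so they are not shown here.

```python
import xml.etree.ElementTree as ET


def emphasize_word(sentence: str, word: str, level: str = "strong",
                   pitch: str = "+10%", volume: str = "loud") -> str:
    """Stress the first occurrence of `word` in `sentence` (illustrative helper)."""
    before, found, after = sentence.partition(word)
    if not found:
        raise ValueError(f"{word!r} does not occur in the sentence")
    speak = ET.Element("speak")
    speak.text = before
    # <prosody> changes pitch and loudness for the wrapped span only.
    prosody = ET.SubElement(speak, "prosody", {"pitch": pitch, "volume": volume})
    # <emphasis> asks the voice to stress the word itself.
    emphasis = ET.SubElement(prosody, "emphasis", {"level": level})
    emphasis.text = word
    prosody.tail = after
    return ET.tostring(speak, encoding="unicode")


stressed = emphasize_word("We won the championship!", "won")
print(stressed)
```

Moving the emphasis to a different word ("We", "championship") changes what the sentence implies, which is a quick way to hear how much meaning prosody carries.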

Adding emotion and expressiveness to synthetic voices can make them more relatable and engaging. For example, a synthetic voice that expresses excitement, happiness, or empathy can help convey emotions in situations where facial expressions or body language are not visible.

Avoid overusing perfect pronunciation

Synthetic voices with perfect pronunciation and diction may sound impressive at first, but they can quickly become tedious, or even strange, to listen to. Overuse of “perfect” pronunciation can make the voice sound unnatural and can make it harder to understand what is being said. In conversational speech, there is a concept known as co-articulation, which refers to how phonemes sound in context versus in isolation. Because neighboring sounds naturally blend into one another, a voice that articulates every phoneme in isolation sounds stilted.

To avoid over-articulation, use more natural-sounding pronunciation and diction. This can be achieved by incorporating words or phrases that have been phonetically customized. In Veritone Voice, users can also alter the phonetics of specific words (which is important for brand names, individual names, locations, etc.) and add them to their account’s vocabulary.
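Phonetic customization can also be expressed in markup. The sketch below keeps a small pronunciation dictionary and wraps each listed word in the standard SSML phoneme element with an IPA transcription. The dictionary, the function name apply_pronunciations, and the sample transcription are illustrative assumptions; they are not how Veritone Voice stores its account vocabulary, and phoneme support varies between voice engines.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical custom vocabulary: word -> IPA transcription (illustrative entry).
CUSTOM_PRONUNCIATIONS = {"tomato": "təˈmeɪtoʊ"}


def apply_pronunciations(text: str, lexicon: dict) -> str:
    """Wrap every word found in `lexicon` in an SSML <phoneme> element."""
    if not lexicon:
        return "<speak>" + escape(text) + "</speak>"
    lookup = {word.lower(): ipa for word, ipa in lexicon.items()}
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in lexicon) + r")\b",
        re.IGNORECASE,
    )
    pieces = []
    # With one capture group, split() alternates plain text and matched words.
    for i, chunk in enumerate(pattern.split(text)):
        if i % 2 == 1:
            ipa = lookup[chunk.lower()]
            pieces.append(
                f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(chunk)}</phoneme>'
            )
        else:
            pieces.append(escape(chunk))
    return "<speak>" + "".join(pieces) + "</speak>"


custom = apply_pronunciations("I'd like a tomato salad.", CUSTOM_PRONUNCIATIONS)
print(custom)
```

Keeping the transcriptions in one dictionary means a brand or place name is corrected once and then pronounced consistently across every script.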

Conclusion

Synthetic voices are incredibly useful and can make a significant impact on our lives from both a commercial and assistive technology point of view. From creating voices for the voiceless and aiding those with visual impairments to enhancing accessibility by generating content in different languages, AI-generated speech has many practical applications.

By using AI to incorporate the qualities of natural voices, we can create text to speech that more closely resembles the human voice. Curious about how AI voice can help grow your organization? Reach out to a Veritone team member today for answers to frequently asked questions and to learn more about the benefits of AI-generated voice.
