Generative AI and artificial intelligence have come a long way. Text to speech is a relatively old concept; it’s been around for a while. There’s much to unpack and categorise here, so I’ll break it down and look at it from all angles. Whether you are a beginner or a pro, this should bring overall clarity to the Google Text to Speech API.
Okay, before we dive into any topic, it’s a must that we establish the ground rules. Let’s define a few terms and build up our foundation so we can rest firmly on it.
Let’s separate the two technologies here, text to speech and APIs, and then look at the role Google Cloud plays.
Editor’s note: Looking for the leading text to speech API? Check out Speechify’s well-documented and easy-to-use text to speech API.
Text to Speech
I’ve written extensively on this topic and you can read my What is text to speech blog and also read up on speech synthesis to get a firm grasp on this topic. These go more in depth and you can skip them for now. I’ll summarise them in a few sentences.
Text to speech relies on a technology called speech synthesis to convert words into AI generated speech. The use cases for this are abundant, from helping people with reading barriers such as dyslexia and low vision to those simply blazing the efficiency trail.
API
API stands for Application Programming Interface. It simply acts as a bridge between two applications. If you were developing an app that had audio content and required text to speech functionality then you would have to build the text to speech functionality yourself, or you could simply connect to an existing text to speech API.
You would focus on building your app and rely on a third-party API as a bridge, to import the text to speech functionality to synthesize your text.
Google Cloud API
This is where Google Cloud comes into play. Google has developed a robust text to speech API and offers it up to developers in various fee structures. Any developer looking to build custom apps or web apps that require text to speech functionality could simply bridge that gap by using Google’s TTS features. Yes, TTS is short for text to speech.
Find the quickstart at the Google Cloud Console https://cloud.google.com/. There you can find tutorials, manage your service account, access WaveNet voices and more.
Google Cloud itself is a cloud platform offered by Google and it offers a host of modular services. You can choose to use one, many, or all of its services. All you’d need to do is to create access keys for authentication of each API – the bridge. Most, if not all, services come with a cost though there might be a free threshold.
Google acquired DeepMind in 2014 for its work in neural network development. DeepMind later introduced WaveNet, the neural model behind Google’s WaveNet voices. So, if you come across DeepMind, it is now Google DeepMind and they are all one and the same.
Now that we have a solid understanding, let’s dive deep into the Google Cloud Text to Speech API.
Google Text to Speech API Features
Google is a global tech pioneer and leader, there’s no doubt about that. When it comes to the TTS API, you can expect to find world class features that continue to evolve.
High Fidelity Speech
Google’s text to speech voices are some of the best in the industry. They sound very human-like, with natural sounding intonation. Natural-sounding TTS is still evolving quickly, and those who can best synthesize audio so that it sounds like a human is speaking are going to win this race.
Selection of Voices
Google claims the widest selection of voices so your project does not have to sound the same as the other 1000 out there or worse yet, your competitors’ app.
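To get a feel for that selection, the Node.js client library exposes a listVoices call. Here is a minimal sketch, not an official sample: the helper name listVoiceNames is my own, and the client is passed in as an argument so the helper can be tried without credentials.

```javascript
// Sketch: list the voices available for one language.
// `client` is assumed to be a TextToSpeechClient instance from
// @google-cloud/text-to-speech; listVoiceNames is a hypothetical helper name.
async function listVoiceNames(client, languageCode) {
  // listVoices resolves to an array whose first item holds the `voices` list
  const [result] = await client.listVoices({languageCode});
  return result.voices.map(voice => ({
    name: voice.name,
    gender: voice.ssmlGender,
    languages: voice.languageCodes,
  }));
}

// Usage (needs Google Cloud credentials and the client library installed):
// const textToSpeech = require('@google-cloud/text-to-speech');
// listVoiceNames(new textToSpeech.TextToSpeechClient(), 'en-US').then(console.table);
```

Passing the client in keeps the helper easy to test with a stub and leaves authentication to the caller.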
Create Your Own Voice
This borders on voice cloning tech. You can create a custom voice by recording yourself or someone else, with their permission. You can then use this sample as the voice that reads all your text aloud.
Neural Voices
Neural voices offer the best quality among the vast selection of voices. You can also internationalize these voices to grow your international audience.
Studio Voices
Studio voices are the top of the line voices, and they sound very professional, as if they were recorded the traditional way in a studio.
Voice Tuning
Pick a voice and then adjust the speed, the pitch, and more so that you can customise the tone of a voice.
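In the API, tuning of this kind is expressed through the audioConfig part of the request. The sketch below shows roughly what that could look like. The helper name buildTunedRequest is hypothetical, and the allowed ranges for speakingRate and pitch should be checked against the official AudioConfig reference rather than taken from here.

```javascript
// Sketch: build a synthesis request with tuned speed and pitch.
// buildTunedRequest is a hypothetical helper, not part of the client library.
// speakingRate: 1.0 is the normal speed of the voice.
// pitch: measured in semitones, 0 is the default pitch of the voice.
function buildTunedRequest(text, {speakingRate = 1.0, pitch = 0} = {}) {
  if (!Number.isFinite(speakingRate) || !Number.isFinite(pitch)) {
    throw new TypeError('speakingRate and pitch must be finite numbers');
  }
  return {
    input: {text},
    voice: {languageCode: 'en-US', ssmlGender: 'FEMALE'},
    audioConfig: {audioEncoding: 'MP3', speakingRate, pitch},
  };
}

// A slightly faster, slightly lower voice
const tuned = buildTunedRequest('Welcome back!', {speakingRate: 1.25, pitch: -2});
console.log(JSON.stringify(tuned.audioConfig));
```

The resulting object can be passed to the client’s synthesizeSpeech call in place of a plain request.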
How much does the Google Text to Speech API Cost?
It all comes down to voice quality and the length of your text. The more natural sounding you want your voice to be, the more expensive it will be. Though, expensive is relative here. Even the high quality voices are relatively inexpensive.
Voice type | Free per month | After free usage has been reached |
Neural2 voices | 0 to 1 million bytes | $16 per one million bytes |
Polyglot voices | 0 to 1 million bytes | $16 per one million bytes |
Studio voices | 0 to 100,000 bytes | $160 per one million bytes |
Standard voices | 0 to 4 million characters | $4 per one million characters |
Wavenet voices | 0 to 1 million characters | $16 per one million characters |
What’s the Difference Between Characters & Bytes
As you can see, the pricing varies significantly based on the quality of the voice. The audio encoding and processing it takes to turn text into speech varies from tier to tier. For the lower tiers, the Standard voices for example, the pricing is lower and is counted by characters.
This means that, once the free 4 million characters are used up, converting another 4 million characters into speech with the Standard voices would cost you $16 (4 × $4).
The Studio Voices on the other hand require greater processing power and are charged based on bytes. In some languages, like Japanese for example, a single character could be composed of multiple bytes.
So for the most accurate pricing, it’s important to know which language you are working in, have a basic idea of the average number of bytes per character, and estimate accordingly.
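To make the character versus byte difference concrete, here is a rough estimator. It is a sketch under a few assumptions: the prices and free tiers are copied from the table above and may have changed, a “character” is counted as one Unicode code point, bytes are counted as UTF-8, and the names PRICING, countUnits and estimateCost are my own.

```javascript
// Prices and free tiers copied from the table above (USD per 1 million units).
const PRICING = {
  standard: {unit: 'characters', freeUnits: 4_000_000, pricePerMillion: 4},
  wavenet: {unit: 'characters', freeUnits: 1_000_000, pricePerMillion: 16},
  neural2: {unit: 'bytes', freeUnits: 1_000_000, pricePerMillion: 16},
  polyglot: {unit: 'bytes', freeUnits: 1_000_000, pricePerMillion: 16},
  studio: {unit: 'bytes', freeUnits: 100_000, pricePerMillion: 160},
};

// Assumption: characters are Unicode code points, bytes are UTF-8 encoded.
function countUnits(text, unit) {
  return unit === 'bytes' ? Buffer.byteLength(text, 'utf8') : [...text].length;
}

// Monthly cost in USD for a number of units, after subtracting the free tier.
function estimateCost(units, tier) {
  const {freeUnits, pricePerMillion} = PRICING[tier];
  const billable = Math.max(0, units - freeUnits);
  return (billable / 1_000_000) * pricePerMillion;
}

console.log(countUnits('こんにちは', 'characters')); // 5
console.log(countUnits('こんにちは', 'bytes')); // 15
console.log(estimateCost(8_000_000, 'standard')); // 16
```

The Japanese greeting is five characters but fifteen bytes, so the same text uses up a byte-based tier three times faster than a character-based one.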
How to Set Up Your Google Cloud Platform Text to Speech API Project?
- Create a Google Cloud account or log in at this page
- Create a new project and name it appropriately
- Add a billing method. You will only get charged for what you use.
- Then choose your project and associate it with a billing account.
- Activate the Text-to-Speech API. Go to the search products and resources bar located at the top of the page, and type in “speech.”
- From the displayed results, choose the Cloud Text-to-Speech API
- Set up authentication for your development environment. For instructions, see Set up authentication for Text-to-Speech.
You can also try Text-to-Speech without linking it to your project:
- Choose the TRY THIS API option.
- To enable the Text-to-Speech API for use with your project, click ENABLE.
Check out the Google Cloud Documentation for further help.
How to Disable the Text to Speech API
To deactivate the Text-to-Speech API, go to your Google Cloud Platform dashboard and click on the “Go to APIs overview” link within the APIs box. Locate the Text-to-Speech API and then click on it, followed by selecting the “DISABLE API” button at the top of the page.
Get Started with Google Text to Speech API
Now that you have your project set up, you can use the command line to get started. First, initialize the gcloud CLI:
gcloud init
Then create local authentication credentials:
gcloud auth application-default login
Now you can install a client library. In this example, we’ll look at Node.js:
npm install --save @google-cloud/text-to-speech
Google Cloud Text to Speech API Supports These Languages:
- Go
- Java
- Node.js
- C++
- C#
- PHP
- Python
- Ruby
- TypeScript
- Terraform
- YAML
How Does the Google Cloud API Work?
It all begins with a simple API call. You send your text in a synthesis request and receive an audio file of your spoken text in return. With your request, you can set specific requirements: choose a voice, a language, and more, and the text to speech API will send you back the matching audio file.
You can learn how to install and use the text to speech client libraries here. Our code samples will be for Node.js. But you can choose anything else from Python to PHP. Whatever you are comfortable with.
const textToSpeech = require('@google-cloud/text-to-speech');
const fs = require('fs');
const util = require('util');
// Creates a client that picks up your local authentication credentials
const client = new textToSpeech.TextToSpeechClient();
async function quickStart() {
  // The text to synthesize and the local path to save the audio file to
  const text = 'Hello, world!';
  const outputFile = 'output.mp3';
  // Build the request: input text, voice selection, and audio format
  const request = {
    input: {text: text},
    voice: {languageCode: 'en-US', ssmlGender: 'FEMALE'},
    audioConfig: {audioEncoding: 'MP3'},
  };
  // Send the request and write the returned binary audio to disk
  const [response] = await client.synthesizeSpeech(request);
  const writeFile = util.promisify(fs.writeFile);
  await writeFile(outputFile, response.audioContent, 'binary');
  console.log(`Audio content written to file: ${outputFile}`);
}
quickStart().catch(console.error);
And that’s it. You set up the Google Cloud Text to Speech API and sent your first request to convert text to speech. You can get the file back in various formats, from OGG to MP3.
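The output format is chosen through the audioEncoding field of the request. As a small illustration, the sketch below maps a few encoding values to file extensions. The helper outputFileFor and the EXTENSIONS table are hypothetical, and only three of the available encodings are covered.

```javascript
// Sketch: pick a file extension that matches the requested audioEncoding.
// Only a subset of the encodings is listed; the helper name is hypothetical.
const EXTENSIONS = {
  MP3: '.mp3',
  OGG_OPUS: '.ogg',
  LINEAR16: '.wav', // LINEAR16 audio is returned with a WAV header
};

function outputFileFor(baseName, audioEncoding) {
  const extension = EXTENSIONS[audioEncoding];
  if (!extension) {
    throw new Error(`No extension mapped for encoding: ${audioEncoding}`);
  }
  return baseName + extension;
}

console.log(outputFileFor('greeting', 'OGG_OPUS')); // greeting.ogg
```

Keeping the encoding and the file name in sync this way avoids saving OGG audio under an .mp3 name by accident.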
Here are a Few Ways to Use the Google Text to Speech API
The Google Text-to-Speech (TTS) API offers a versatile solution for various use cases across different industries. Some common use cases include:
- Text-to-Speech for Visually Impaired Users: Implementing TTS in applications to convert written content into spoken words, making digital information accessible for visually impaired users.
- Automated Phone Systems: Utilizing TTS to create natural-sounding prompts and responses for interactive voice response systems in customer service or information hotlines.
- Voiceovers for Media Content: Generating natural-sounding voiceovers for videos, podcasts, or other multimedia content to enhance user experience.
- Text-to-Speech for Translated Content: Converting translated text into spoken words to facilitate language learning, international communication, or content consumption in various languages.
- Reading Assistance for Dyslexic Users: Providing TTS functionality to assist individuals with dyslexia or reading difficulties in consuming written content.
- Voice Navigation in Applications: Integrating TTS into navigation applications to provide turn-by-turn directions or location-based information audibly.
- Text-to-Speech for Educational Content: Enhancing e-learning experiences by converting educational text content into spoken words, aiding comprehension and engagement.
- Speech Synthesis for Productivity Apps: Integrating TTS into productivity tools, such as note-taking or task management apps, to enable spoken feedback or information retrieval.
- Natural Voice for Virtual Assistants: Powering voice assistants with natural-sounding TTS to improve user interactions and provide information in a conversational manner.
- Auditory Alerts and Notifications: Using TTS to provide audible alerts, notifications, or status updates on Internet of Things (IoT) devices for enhanced user awareness.
Best Alternatives to Google Cloud TTS API
There are several alternatives to the Google Text-to-Speech API. Keep in mind that the popularity and capabilities of these services change over time. Here are some notable alternatives:
- Speechify Text to Speech API: We’re thrilled to unveil the development of a text-to-speech API that delivers Speechify’s most natural and beloved AI voices directly to developers worldwide. Save your seat today.
- Amazon Polly: Offered by Amazon Web Services (AWS), Polly provides natural-sounding speech synthesis in various languages and voices. It integrates well with other AWS services.
- Microsoft Azure Speech Service: Azure Speech Service includes Text-to-Speech capabilities and supports a variety of applications, including voice assistants, navigation systems, and more.
- IBM Watson Text to Speech: IBM Watson offers a Text to Speech service that allows developers to convert written text into natural-sounding speech using various voices.
- Nuance Communications: Nuance provides a range of speech and voice recognition solutions, including text-to-speech, for applications in healthcare, automotive, and customer service.
- CereProc: CereProc is a text-to-speech technology company that offers high-quality synthetic voices for applications like accessibility, entertainment, and communication.
- iSpeech: iSpeech provides cloud-based text-to-speech services with support for multiple languages and voices. It is suitable for various applications, including mobile apps and websites.
- ResponsiveVoice: ResponsiveVoice is a simple and affordable text-to-speech API that supports multiple languages and can be used in various web-based applications.
- Neospeech: Neospeech offers text-to-speech solutions with a focus on natural-sounding voices. Their technology is used in applications like e-learning and entertainment.
- ReadSpeaker: ReadSpeaker provides online and offline text-to-speech solutions for diverse applications, including websites, e-learning, and accessibility services.
- Acapelabox: Acapela Group offers a cloud-based text-to-speech API, Acapelabox, which supports multiple languages and voices for applications in various industries.
Google Text to Speech API FAQs
Google has multiple tiers of voices, and almost every tier has a free limit. For example, the Standard voices are free for the first 4 million characters each month. After that they cost $4 per million characters. So yes, it can be free within a limited number of characters or bytes.
Simply create an account at https://cloud.google.com/text-to-speech/ and follow the steps there. Also, I’ve outlined the process in detail in this blog, just above.
You can get a Google text to speech API key by logging into your Google Cloud account and creating a project. Once you have created your project, you can generate an API key.
The URL for Google text to speech API is https://cloud.google.com/text-to-speech/
There is technically no free trial period for Google Cloud. There are multiple services within Google Cloud and each service has its own terms and free tiers.
No. The Google Cloud text to speech API requires an internet connection.
Authentication to Google Cloud services, including the Text-to-Speech API, can be done using API keys, OAuth 2.0, or service accounts. The appropriate authentication method depends on the use case and the type of application.
I’d rate it 5 stars. It’s easy to use, the search feature is great and is used the most. The pricing is decent and it’s overall a great product.
Google Text-to-Speech API provides client libraries for various programming languages, including Python. It also supports RESTful API requests, making it compatible with languages that can make HTTP requests.
Android ships its own on-device TextToSpeech class, which is separate from the Cloud API. To use the Google Cloud Text-to-Speech API from an Android app, the app makes API requests to the cloud service. Detailed instructions can be found in the official documentation for Android developers.
To implement Google Text-to-Speech API in a JavaScript application, you can make HTTP requests to the API endpoint. The process involves constructing the appropriate API request and handling the response in your JavaScript code. Refer to the official documentation for details.
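As a rough sketch of that HTTP approach, the snippet below posts a request to the REST endpoint and decodes the base64 audio in the response. It assumes Node.js 18 or later (for the built-in fetch) and an API key that is allowed to call the Text-to-Speech API. The function names are my own, and an API key should never be shipped in browser code.

```javascript
const ENDPOINT = 'https://texttospeech.googleapis.com/v1/text:synthesize';

// Build the JSON body: input text, voice selection, and audio format.
function buildSynthesizeBody(text, languageCode = 'en-US') {
  return {
    input: {text},
    voice: {languageCode, ssmlGender: 'FEMALE'},
    audioConfig: {audioEncoding: 'MP3'},
  };
}

// The REST response carries the audio as a base64 string in `audioContent`.
function decodeAudioContent(responseJson) {
  return Buffer.from(responseJson.audioContent, 'base64');
}

// Hypothetical helper: send the request with an API key and return audio bytes.
async function synthesizeOverRest(text, apiKey) {
  const response = await fetch(ENDPOINT, {
    method: 'POST',
    headers: {'Content-Type': 'application/json', 'X-Goog-Api-Key': apiKey},
    body: JSON.stringify(buildSynthesizeBody(text)),
  });
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return decodeAudioContent(await response.json());
}

// Usage (needs a real API key):
// synthesizeOverRest('Hello, world!', process.env.TTS_API_KEY)
//   .then(audio => require('fs').writeFileSync('output.mp3', audio));
```

Unlike the client library, the REST response is JSON, so the audio has to be base64-decoded before it is written to disk.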