Cloud Text-to-Speech is an API that lets you convert text into human-like speech. You pass it text and it returns raw audio data as a base64-encoded string. You must decode this base64-encoded string into an audio file before an application can play it. Luckily, most platforms and operating systems have tools for decoding base64 text into playable media files. You should be pretty familiar with machines that talk to us, thanks to things like Google Assistant, Google Translate, and Google Search. Text-to-Speech supports any application or device that can send a REST or gRPC request. This includes phones, PCs, tablets, and IoT devices like TVs and speakers. Application developers have created lifelike interactions with users that transform customer service, sales, and consumer device interaction. What we're doing is showing you how to access the APIs through client URL (cURL) requests. There's an increasing trend to access Cloud APIs from programming language client libraries. All the documentation to do this is available on the Google Cloud site. You can exert a lot of control over the audio that comes back. For example, you can add pauses, number, date, and time formatting, and other pronunciation instructions using Speech Synthesis Markup Language (SSML). You can change the speaking rate and the pitch of the default voice. You can also increase or decrease the volume and optimize the audio for an output device, such as playback on a speaker versus playback on a telephone. The Text-to-Speech API creates raw audio data of natural human speech. You can access more than 180 voices across more than 30 languages and variants. New voices and languages are being added pretty frequently. These voices differ by language, gender, and, for some languages, accent. When you send a request to Text-to-Speech, you must specify a voice that actually speaks the words.
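To make the request and decode steps concrete, here is a minimal Python sketch of the JSON body you might send to the REST `text:synthesize` method, and of how the base64 `audioContent` string in the response could be turned back into audio bytes. The helper names (`build_request`, `decode_audio_content`), the voice name `en-US-Wavenet-D`, and the `telephony-class-application` profile are illustrative choices rather than anything from the talk, and authentication and the HTTP call itself are left out, so treat this as a sketch and check the field names against the current API reference.

```python
import base64
import json

# REST endpoint for synthesis. Sending the request (and authenticating it)
# is deliberately omitted from this sketch.
SYNTHESIZE_URL = "https://texttospeech.googleapis.com/v1/text:synthesize"


def build_request(text, language_code="en-US", voice_name=None,
                  speaking_rate=1.0, pitch=0.0, volume_gain_db=0.0,
                  effects_profile=None, ssml=False):
    """Return the JSON body for a synthesis request as a dict.

    Hypothetical helper: the function and its parameter names are ours;
    the dict keys follow the REST API's field names.
    """
    voice = {"languageCode": language_code}
    if voice_name:
        voice["name"] = voice_name

    audio_config = {
        "audioEncoding": "MP3",
        "speakingRate": speaking_rate,   # 1.0 is the normal speed
        "pitch": pitch,                  # semitones relative to the default
        "volumeGainDb": volume_gain_db,  # 0.0 leaves the volume unchanged
    }
    if effects_profile:
        # Optimizes the audio for a class of output device.
        audio_config["effectsProfileId"] = [effects_profile]

    return {
        # Plain text and SSML go in different input fields.
        "input": {"ssml" if ssml else "text": text},
        "voice": voice,
        "audioConfig": audio_config,
    }


def decode_audio_content(response_json):
    """Decode the base64 `audioContent` field of a response into raw bytes."""
    return base64.b64decode(json.loads(response_json)["audioContent"])


def save_audio(response_json, path):
    """Write the decoded audio to a file that a media player can open."""
    with open(path, "wb") as audio_file:
        audio_file.write(decode_audio_content(response_json))


if __name__ == "__main__":
    body = build_request(
        '<speak>Hello.<break time="500ms"/> Welcome back.</speak>',
        voice_name="en-US-Wavenet-D",
        effects_profile="telephony-class-application",
        ssml=True,
    )
    print(json.dumps(body, indent=2))
```

The printed JSON is what you would place in the body of a cURL or client-library request. Once a response comes back, `save_audio(response_text, "output.mp3")` would produce a playable file.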
Now the process of translating text input into audio data is called synthesis, and the output of synthesis is called synthetic speech. Synthesis is largely based on a very large database of short speech fragments that were recorded from a single speaker. These fragments are then broken into tiny chunks and recombined to form complete words and complete sentences. This is called concatenative text-to-speech, or concatenative TTS. Those tinny-sounding, unnatural computer voices are generated with this. With concatenative TTS, it's really difficult to modify the voice. For example, switching to a different speaker or altering the emphasis or emotion of their speech is quite a challenge unless you record an entirely new database. Another method is parametric TTS. This method uses a series of rules and parameters about grammar as well as mouth movements to guide a computer-generated voice. It's cheaper and faster than concatenative TTS, but it's even less natural sounding. Google has taken a totally different approach with WaveNet. A WaveNet model generates more natural speech sounds by actually creating the raw audio waveforms from scratch. This has a lot more human-like emphasis and inflection on syllables, phonemes, and words. Developers usually avoid modeling raw audio data because its data points change so quickly, typically 16,000 samples per second or even higher. It's very important to have voice structure at a granular scale. WaveNet was built using a neural network model similar to those used for analyzing images. During the training phase, the WaveNet model determines the underlying structure of speech, such as which tones follow each other, and which waveforms are or are not realistic. The trained network synthesizes a voice one sample at a time, with each sample taking into account the properties of the previous samples. This creates a very natural-sounding voice with emphasis, intonation, accent, and even lip smack sounds.
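To give a feel for what "one sample at a time" means, here is a toy Python sketch of a sample-by-sample generation loop. It is not WaveNet: the function `toy_next_sample` is a hypothetical stand-in that uses a simple damped resonator over the two previous samples, where a real model would use a trained neural network over a much longer history. Only the shape of the loop, where each new sample depends on the ones before it, is the point.

```python
import math
import random

SAMPLE_RATE = 16_000  # samples per second, the rate mentioned above


def toy_next_sample(history, rng, freq_hz=220.0, damping=0.9995, noise=0.001):
    """Hypothetical stand-in for a trained network.

    Predicts the next sample from the two previous samples with a damped
    resonator plus a little noise. All parameter values are arbitrary.
    """
    omega = 2.0 * math.pi * freq_hz / SAMPLE_RATE
    prev1 = history[-1] if len(history) >= 1 else 0.0
    prev2 = history[-2] if len(history) >= 2 else 0.0
    value = 2.0 * damping * math.cos(omega) * prev1 - damping ** 2 * prev2
    value += rng.gauss(0.0, noise)
    # Clamp to the usual normalized audio range.
    return max(-1.0, min(1.0, value))


def generate(num_samples, seed=0):
    """Generate audio one sample at a time, feeding each result back in."""
    rng = random.Random(seed)
    samples = [0.05]  # arbitrary starting sample for the toy predictor
    while len(samples) < num_samples:
        samples.append(toy_next_sample(samples, rng))
    return samples[:num_samples]


if __name__ == "__main__":
    audio = generate(SAMPLE_RATE)  # one second of toy audio
    print(len(audio), "samples, range", min(audio), "to", max(audio))
```

Even this toy loop runs the predictor 16,000 times for one second of audio, which hints at why sample-level modeling was long considered too expensive.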
WaveNet actually began as a research model and it was too computationally expensive to put in consumer products. Then it migrated to services like search and the Google Assistant. Now, the Text-to-Speech API offers a group of premium voices generated using WaveNet technology. Listen to the difference in these examples. The Blue Lagoon is a 1980 American romance and adventure film directed by Randal Kleiser. The Blue Lagoon is a 1980 American romance and adventure film directed by Randal Kleiser. The more natural sound of the WaveNet voice is clearly distinguishable.