Project “Jarvis”: step two (speak to me)

In my previous post, I ran a few experiments with speech recognition via Google’s Speech API and got good enough results to push project “Jarvis” a bit further.
Now it is time for Jarvis to speak!

Text-To-Speech engines

There are many “Text-To-Speech” engines already packaged for the Raspberry Pi. Namely:

espeak: eSpeak is a compact Open Source speech synthesizer (for English and other languages). It is available as a shared library and as a command line program to speak from a file or from stdin. It can be used as a front-end to mbrola diphone voices.
festival: Festival Speech Synthesis System is a multi-lingual Open Source speech synthesizer which offers Text-To-Speech capabilities through various APIs.
flite: festival-lite is a small run-time speech synthesis engine developed at Carnegie Mellon University, derived from Festival.

Let’s install and try these three engines:

apt-get install espeak
apt-get install festival
apt-get install flite
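
Once the packages are installed, it can be handy to confirm that the command-line front ends are actually on the PATH before benchmarking. Here is a minimal sketch in Python. The helper name `find_engines` is hypothetical, and it assumes that festival's `text2wave` script is installed by the festival package.

```python
import shutil

# Command-line front ends used in the benchmarks, keyed by engine name.
# Assumption: text2wave is the front end shipped with the festival package.
ENGINE_BINARIES = {"espeak": "espeak", "festival": "text2wave", "flite": "flite"}


def find_engines(binaries=ENGINE_BINARIES):
    """Return {engine: path or None} for each binary looked up on PATH."""
    return {engine: shutil.which(binary) for engine, binary in binaries.items()}


if __name__ == "__main__":
    for engine, path in find_engines().items():
        print(f"{engine}: {path or 'not found'}")
```

An engine reported as "not found" most likely means the corresponding `apt-get install` did not complete.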

Unfortunately, I ran into a set of broken packages when I tried to install mbrola voices for espeak and festival:

root@applepie ~ # apt-get install mbrola-en1 mbrola-fr1 mbrola-fr4  mbrola-us1 mbrola-us2 mbrola-us3 festvox-en1 festvox-us1 festvox-us2 festvox-us3
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:
 
The following packages have unmet dependencies:
 mbrola-en1 : Depends: mbrola but it is not installable
 mbrola-fr1 : Depends: mbrola but it is not installable
 mbrola-fr4 : Depends: mbrola but it is not installable
 mbrola-us1 : Depends: mbrola but it is not installable
 mbrola-us2 : Depends: mbrola but it is not installable
 mbrola-us3 : Depends: mbrola but it is not installable
E: Unable to correct problems, you have held broken packages.

Without the mbrola voices, the output from espeak and festival was quite likely to be rather poor in quality. So I introduced a new contender, this time an external service: the Google Text-to-Speech API.

Here’s a little benchmark, where the speech outputs from each engine are compared, given the same quote from 2001: A Space Odyssey.

Benchmark #1: espeak

Getting a .wav file from plain text is quite easy:

espeak "Look Dave, I can see you're really upset about this" --stdout > espeak.wav

Here’s the .wav output from espeak:

http://quantum-bits.org/wp-content/uploads/2013/02/espeak.wav

espeak

As expected, it is really bad. It reminds me of the speech synthesizer I used to play with on my Atari 1040STF in the ’80s 🙁

Benchmark #2: festival

Getting a .wav file from plain text is also easy:

echo "Look Dave, I can see you're really upset about this" | text2wave -o festival.wav

And the resulting .wav output is:

http://quantum-bits.org/wp-content/uploads/2013/02/festival.wav

festival

Less robotic, but still very far from what I need for Jarvis 🙁

Benchmark #3: flite

Getting a speech output from flite is as simple as it is from espeak and festival:

echo "Look Dave, I can see you're really upset about this" | flite -o flite.wav

And the resulting .wav goes like this:

http://quantum-bits.org/wp-content/uploads/2013/02/flite.wav

flite

Better. It’s getting HAL-like, but I really need something closer to a real human voice.
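
To rerun the three local benchmarks in one go, the commands above can be wrapped in a small Python script. This is only a sketch: the function names `build_jobs` and `run_jobs` are hypothetical, and the commands simply mirror the ones shown in this post (text passed as an argument to espeak, piped on stdin to text2wave and flite).

```python
import subprocess

QUOTE = "Look Dave, I can see you're really upset about this"


def build_jobs(text):
    """Return (engine, argv, stdin_text, stdout_file) tuples mirroring the commands above."""
    return [
        # espeak takes the text as an argument and writes the WAV data to stdout.
        ("espeak", ["espeak", text, "--stdout"], None, "espeak.wav"),
        # text2wave and flite read the text from stdin and write to the -o file.
        ("festival", ["text2wave", "-o", "festival.wav"], text, None),
        ("flite", ["flite", "-o", "flite.wav"], text, None),
    ]


def run_jobs(jobs):
    """Run each job, feeding stdin and redirecting stdout where requested."""
    for engine, argv, stdin_text, stdout_file in jobs:
        data = stdin_text.encode("utf-8") if stdin_text is not None else None
        if stdout_file:
            with open(stdout_file, "wb") as out:
                subprocess.run(argv, input=data, stdout=out, check=True)
        else:
            subprocess.run(argv, input=data, check=True)
        print(f"{engine}: done")
```

Calling `run_jobs(build_jobs(QUOTE))` on a machine with the three engines installed should leave espeak.wav, festival.wav and flite.wav in the current directory.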

Benchmark #4: Google TTS

Google Text-To-Speech is a private REST API. Getting results is less straightforward, but nonetheless very manageable. Here’s a little PHP script:

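<?php
// URL-encode the sentence so it can be passed as the q= query parameter.
$voice = urlencode("Look Dave, I can see you're really upset about this");
// Fetch the MP3 with curl, sending a browser-like User-Agent, and save it to google.mp3.
$cmd = '/usr/bin/curl -A "Mozilla" "http://translate.google.com/translate_tts?tl=en_gb&ie=UTF-8&q='.$voice.'" > google.mp3';
shell_exec($cmd);
?>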
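
The same request can be sketched in Python without shelling out to curl. The endpoint and the `tl`, `ie` and `q` parameters are taken from the PHP script above; the function names `build_tts_url` and `fetch_tts` are hypothetical, and since this is a private API there is no guarantee that it keeps answering such requests.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint taken from the PHP script above (private API, may change or refuse requests).
TTS_ENDPOINT = "http://translate.google.com/translate_tts"


def build_tts_url(text, lang="en_gb"):
    """Build the request URL with the same tl, ie and q parameters as the PHP script."""
    query = urlencode({"tl": lang, "ie": "UTF-8", "q": text})
    return f"{TTS_ENDPOINT}?{query}"


def fetch_tts(text, path, lang="en_gb"):
    """Download the audio for text to path, sending a browser-like User-Agent."""
    request = Request(build_tts_url(text, lang), headers={"User-Agent": "Mozilla"})
    with urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())
```

With this, `fetch_tts("Look Dave, I can see you're really upset about this", "google.mp3")` would correspond to the curl call, and passing a different `lang` value changes the `tl` parameter.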

And here’s the result (converted to the same .wav format):

http://quantum-bits.org/wp-content/uploads/2013/02/google1.wav

Google (en_gb)
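
This post does not say which tool was used to convert the MP3 to .wav. One possible way, assuming ffmpeg is installed (an assumption, not necessarily what was used here), is sketched below; the helper names are hypothetical.

```python
import subprocess


def mp3_to_wav_command(src, dst):
    """Return an ffmpeg argv converting src to dst; -y overwrites dst if it exists."""
    # Assumption: ffmpeg is available and picks the output format from the .wav extension.
    return ["ffmpeg", "-y", "-i", src, dst]


def convert(src, dst):
    """Run the conversion and raise CalledProcessError if ffmpeg fails."""
    subprocess.run(mp3_to_wav_command(src, dst), check=True)
```

For example, `convert("google.mp3", "google1.wav")` would produce a .wav file next to the downloaded MP3.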

Much, much better 😎. Maybe a little too slow. Let’s try to play with localizations and switch from British English to US English:

http://quantum-bits.org/wp-content/uploads/2013/02/google2.wav

Google (en_us)

Surprisingly, the US voice is female 😀
Not bad. Now, let’s try a French version:

http://quantum-bits.org/wp-content/uploads/2013/02/google3.wav

Google (fr_fr)

Really good. Also a female voice. It is actually very close to the synthetic voice used at SNCF (French Railroads) stations. Kind of a scary voice. It feels like … I’m gonna miss a f**king train.

I think I’m gonna settle for the British voice from Google’s Text-To-Speech Engine.

I’ll have to rely (once more) on an external service, but an electronic butler has to be British :p

3 thoughts on “Project “Jarvis”: step two (speak to me)”

coconox, 19 March 2013

Hi, nice benchmark! Can you tell me how you installed Google TTS? Thanks 🙂

Fred (post author), 5 May 2013

Hi, Google TTS is actually not installed, but used as a service via a REST API. See http://geeknizer.com/text-to-speech-api/ for example.

kevin, 4 April 2017

Using the google option isn’t a long term solution. Once they figure out you’re using it in any automated way, you’ll get a check and after that, no more google tts.
