In my previous post, I conducted a few experiments with speech recognition via Google’s Speech API and got enough results to push the project “Jarvis” a bit further.
Now it is time for Jarvis to speak!
There are many “Text-To-Speech” engines already packaged for the Raspberry Pi. Namely:
- espeak: eSpeak is a compact Open Source speech synthesizer (for English and other languages). It is available as a shared library and as a command-line program that speaks from a file or from stdin. It can be used as a front-end to mbrola diphone voices.
- festival: Festival Speech Synthesis System is a multi-lingual Open Source speech synthesizer which offers Text-To-Speech capabilities through various APIs.
- flite: festival-lite is a small run-time speech synthesis engine developed at Carnegie Mellon University, derived from Festival.
Let’s install and try these three engines:
apt-get install espeak
apt-get install festival
apt-get install flite
Unfortunately, I ran into a set of broken packages when I tried to install the mbrola voices for espeak and festival:
root@applepie ~ # apt-get install mbrola-en1 mbrola-fr1 mbrola-fr4 mbrola-us1 mbrola-us2 mbrola-us3 festvox-en1 festvox-us1 festvox-us2 festvox-us3
Reading package lists... Done
Building dependency tree
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 mbrola-en1 : Depends: mbrola but it is not installable
 mbrola-fr1 : Depends: mbrola but it is not installable
 mbrola-fr4 : Depends: mbrola but it is not installable
 mbrola-us1 : Depends: mbrola but it is not installable
 mbrola-us2 : Depends: mbrola but it is not installable
 mbrola-us3 : Depends: mbrola but it is not installable
E: Unable to correct problems, you have held broken packages.
It meant that the outputs from espeak and festival would quite probably be rather poor in quality. Thus, I introduced a new contender as an external service: Google Text-to-Speech API.
Here’s a little benchmark comparing the speech output of each engine on the same quote from 2001: A Space Odyssey.
Benchmark #1: espeak
Generating a .wav file from plain text is quite easy:
espeak "Look Dave, I can see you're really upset about this" --stdout > espeak.wav
The .wav output from espeak:
As expected, it is really bad. It reminds me of the speech synthesizer I used to play with on my Atari 1040STF in the ’80s.
Benchmark #2: festival
Generating a .wav file from plain text is also easy:
echo "Look Dave, I can see you're really upset about this" | text2wave -o festival.wav
And the resulting .wav output is:
Less robotic, but still very far from what I need for Jarvis.
Benchmark #3: flite
Getting speech output from flite is as simple as it is from espeak and festival:
echo "Look Dave, I can see you're really upset about this" | flite -o flite.wav
And the resulting .wav goes like this:
Better. It’s getting HAL-like, but I really need something closer to a real human voice.
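The three local benchmarks above can be replayed in one go. Here is a minimal shell sketch that reuses the exact command lines from each benchmark and simply skips any engine that is not installed. The helper names engine_binary and synthesize are made up for this illustration; they are not part of any of the packages.

```shell
#!/bin/sh
# Same sentence for every engine, as in the benchmarks above.
QUOTE="Look Dave, I can see you're really upset about this"

# engine_binary ENGINE: name of the executable to look for on PATH.
# festival is driven through its text2wave script. (Hypothetical helper.)
engine_binary() {
    case "$1" in
        festival) echo text2wave ;;
        *)        echo "$1" ;;
    esac
}

# synthesize ENGINE TEXT: write ENGINE.wav in the current directory,
# using the same invocations as the benchmarks. (Hypothetical helper.)
synthesize() {
    case "$1" in
        espeak)   espeak "$2" --stdout > espeak.wav ;;
        festival) echo "$2" | text2wave -o festival.wav ;;
        flite)    echo "$2" | flite -o flite.wav ;;
        *)        echo "unknown engine: $1" >&2; return 2 ;;
    esac
}

# Run every engine that is actually available on this machine.
for engine in espeak festival flite; do
    if command -v "$(engine_binary "$engine")" >/dev/null 2>&1; then
        synthesize "$engine" "$QUOTE" && echo "$engine: wrote $engine.wav"
    else
        echo "$engine: not installed, skipped"
    fi
done
```

Keeping the per-engine command inside a single case statement makes it easy to swap the sentence and listen to the three .wav files side by side.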
Benchmark #4: Google TTS
Google Text-To-Speech is a private REST API. Getting results is less straightforward, but nonetheless very manageable. Here’s a little PHP script:
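<?php
// URL-encode the sentence so it can be passed as the q parameter
$voice = urlencode("Look Dave, I can see you're really upset about this");
// tl selects the language/locale, ie declares the input encoding
$url = 'http://translate.google.com/translate_tts?tl=en_gb&ie=UTF-8&q=' . $voice;
// Fetch the MP3 with curl, sending a browser-like User-Agent
$cmd = '/usr/bin/curl -A "Mozilla" ' . escapeshellarg($url) . ' > google.mp3';
shell_exec($cmd);
?>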
And here’s the result (converted to the same .wav format):
Much, much better 😎. Maybe a little too slow. Let’s play with the localization and switch from British English to US English:
Surprisingly, the US voice is female 😀
Not bad. Now, let’s try a French version:
Really good. Also a female voice. It is actually very close to the synthetic voice used at SNCF (French Railroads) stations. Kind of a scary voice. It feels like … I’m gonna miss a f**king train.
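Switching voices only means changing the tl parameter of the request. Here is a minimal shell sketch of that idea, assuming the endpoint still answers the way it did for the PHP script above. Only en_gb appears in that script; the en_us and fr codes are assumptions, and the helper names tts_lang and fetch_tts are invented for this example.

```shell
#!/bin/sh
# tts_lang VOICE: map a voice name to the tl request parameter.
# Only en_gb comes from the PHP script; en_us and fr are assumed values.
tts_lang() {
    case "$1" in
        british)  echo en_gb ;;
        american) echo en_us ;;   # assumption
        french)   echo fr ;;      # assumption
        *)        echo "unknown voice: $1" >&2; return 1 ;;
    esac
}

# fetch_tts VOICE TEXT OUTFILE: download the spoken TEXT as an MP3.
# curl's -G option appends the --data-urlencode pairs to the URL as a
# URL-encoded query string, so no manual encoding is needed.
fetch_tts() {
    lang=$(tts_lang "$1") || return 1
    curl -s -A "Mozilla" -G "http://translate.google.com/translate_tts" \
        --data-urlencode "tl=$lang" \
        --data-urlencode "ie=UTF-8" \
        --data-urlencode "q=$2" \
        -o "$3"
}
```

For example, fetch_tts american "Look Dave, I can see you're really upset about this" google-us.mp3 would request the US voice and save it next to the British one for comparison.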
I think I’m gonna settle for the British voice from Google’s Text-To-Speech engine.
I’ll have to rely (once more) on an external service, but an electronic butler has to be British :p