I was tasked with creating a text-to-speech (TTS) engine. After researching the latest technology, I found that the state of the art in TTS had moved to Deep Neural Network (DNN) schemes instead of insisting on Hidden Markov Models (HMM). The main reason for migrating to DNNs was simply how powerful they are.
One of the frameworks for DNN TTS is IDLAK. It was fresh from the oven: the team had just presented their paper at Interspeech 2016. IDLAK is a branch of the well-known automatic speech recognition (ASR) engine KALDI.
Here I document the whole process of building the engine.
Step 1
Clone IDLAK from the GitHub repository.

git clone https://github.com/bpotard/idlak.git
Step 2
Open the idlak-trunk directory, in this case simply idlak. Then read README and INSTALL.

cd idlak
cat README.md
cat INSTALL
Step 3
Compile the tools. I worked on a 48-core machine, so I employed almost all of its resources. Adjust the job count to your machine's capacity.

cd tools
make -j 40
cd ..
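Instead of hard-coding the job count, you can derive it from the machine itself. A minimal sketch, assuming GNU coreutils' nproc is available:

```shell
# Pick a parallel job count from the core count, leaving two cores
# free for other users (but never going below one job).
JOBS=$(nproc)
if [ "$JOBS" -gt 2 ]; then
    JOBS=$((JOBS - 2))
fi
echo "building with $JOBS parallel jobs"
# Then: cd tools && make -j "$JOBS" && cd ..
```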
Step 3.5
In case of errors because the ATLAS library has not been installed on the system, install it first. If there were no errors, skip this step. Assuming you have root access, run the code below.

sudo apt-get install libatlas-base-dev
For further information on installing ATLAS, see the developer website here.
Step 4
Now we need to compile the source. Go to the src directory.

cd src
./configure
make depend -j 40
make -j 40
cd ..
Step 4.5
I encountered a problem with the GPU architecture setting. The error message said something like CUDA_ARCH is undefined. Hence, I applied a small trick to the Makefile in the cudamatrix directory.

cd cudamatrix
vim Makefile
Find line 24, or code like:

ifeq ($(CUDA), true)
  ifndef CUDA_ARCH
    $(error CUDA_ARCH is undefined, run 'src/configure')
  endif
endif
Change the code block to:

ifeq ($(CUDA), true)
  ifndef CUDA_ARCH
    CUDA_ARCH = -gencode arch=compute_20,code=sm_20 \
                -gencode arch=compute_30,code=sm_30 \
                -gencode arch=compute_35,code=sm_35 \
                -gencode arch=compute_50,code=sm_50 \
                -gencode arch=compute_53,code=sm_53 \
                -gencode arch=compute_60,code=sm_60 \
                -gencode arch=compute_61,code=sm_61 \
                -gencode arch=compute_62,code=sm_62
  endif
endif
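If you already know your GPU's compute capability (from the CUDA deviceQuery sample or NVIDIA's product pages), a single matching -gencode pair is enough instead of the whole list. A sketch, where the capability value 61 is only an example and must be replaced with your own:

```shell
# Compute capability of the local GPU, e.g. 61 for a GTX 1080 (assumed here).
CAP=61
CUDA_ARCH="-gencode arch=compute_${CAP},code=sm_${CAP}"
echo "CUDA_ARCH = $CUDA_ARCH"
```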
Save and exit vim with :wq!, return to the src directory, and repeat Step 4. If Step 4 produced no errors, you may skip this step.
Step 5
Go to the examples directory and open tts_dnn_arctic.

cd egs/tts_dnn_arctic/s1
Run the demo. (I have only tested the 48k demo.)
./run_48k.sh
Step 6
Once the training is done, we are able to synthesize speech from text by entering the following command in the terminal. The synthesized wav will be placed in the synth folder, inside the wav_mlpg sub-folder.

echo "Some arbitrary text spoken in English" | utils/synthesis_voice.sh slt_mdl synth
Try listening to the wav:

cd synth/wav_mlpg
play test001.wav
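The play command is provided by the SoX package; on a fresh system it may be absent. A small check, assuming a Debian/Ubuntu machine:

```shell
# play ships with SoX; detect it and suggest the install command if missing.
if command -v play >/dev/null 2>&1; then
    echo "sox is available"
else
    echo "sox is missing; run: sudo apt-get install sox"
fi
```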
To make our lives easier, I wrote a shell script as follows:

#!/bin/sh
# demo_synthesis.sh
# Description : Synthesize English text from the command line
# Author      : Tirtadwipa Manunggal <tirtadwipa.manunggal@gmail.com>
# Jan 3 2017
if [ $# -eq 1 ]; then
    echo "$1" | utils/synthesis_voice.sh slt_mdl synth
    cd synth/wav_mlpg
    play test001.wav
    cd ../..
else
    echo "Usage: ./demo_synthesis.sh \"Some arbitrary text spoken in English\""
fi

Make it executable with chmod +x demo_synthesis.sh before running it.
Those are all the steps for installing IDLAK. It is still a long way to a reliable, high-quality TTS engine. One concern of mine is the vocoder used by IDLAK: it still relies on the old-fashioned MFCC (Mel-Frequency Cepstral Coefficient) representation. The problem with MFCC is that it was originally designed for ASR and is not invertible back to a waveform (Fourier transform issues): the mel filterbank and truncated DCT map hundreds of spectral bins down to around a dozen coefficients, discarding the information needed for exact reconstruction.
Now the vocoder trend for TTS has started to move to STRAIGHT, originally created by Professor Hideki Kawahara. STRAIGHT has top-notch quality in terms of F0 prediction accuracy and the naturalness of the resynthesized voice. For a more complete overview of STRAIGHT, open the official website here. But, unfortunately, STRAIGHT is not open source.
The good news is that a researcher from the University of Yamanashi, Japan, Professor M. Morise, has created a STRAIGHT-like vocoder called WORLD. The method works in a similar fashion to STRAIGHT, and the developer also claims that WORLD reduces the computational cost compared with TANDEM-STRAIGHT.
Correct me if I am wrong.
Sincerely,
Tirtadwipa Manunggal
tirtadwipa.manunggal@gmail.com