Thursday, 5 January 2017

Installing IDLAK Deep Neural Network Text to Speech Synthesizer

I was tasked with building a text-to-speech (TTS) engine. After researching the latest technology, I found that the state of the art in TTS had moved to Deep Neural Network (DNN) schemes instead of sticking with Hidden Markov Models (HMM). The main reason for migrating to DNNs is simply their modelling power.

One framework for DNN-based TTS is IDLAK. It was fresh from the oven: the team behind it had just presented their paper at Interspeech 2016. IDLAK is a fork of the well-known automatic speech recognition (ASR) engine Kaldi.

Here I document the whole process of building the engine.

  • Step 1
    Clone IDLAK from the Github repository.

    git clone https://github.com/bpotard/idlak.git
  • Step 2
    Enter the cloned directory (named idlak by default), then read README.md and INSTALL.

    cd idlak
    cat README.md
    cat INSTALL
  • Step 3
    Compile the tools. I worked on a 48-core machine, so I employed almost all of its resources. Adjust the job count to your machine's capacity.

    cd tools
    make -j 40
    cd ..
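
    If you would rather not hard-code the job count, a minimal alternative (assuming GNU coreutils' nproc is available) derives it from the machine:

    cd tools
    make -j "$(nproc)"   # one job per available core
    cd ..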
  • Step 3.5
    In case there were errors because the ATLAS library was not installed on the system, install it first. If there were no errors, skip this step. Assuming you have root access, run the command below.

    sudo apt-get install libatlas-base-dev

    For further information on installing ATLAS, see the developer's website.
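
    To check whether ATLAS is already present before installing, a quick sketch using the dynamic linker cache:

    ldconfig -p | grep -i atlas   # prints libatlas entries if ATLAS is installed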

  • Step 4
    Now, we need to compile the source. Go to the src directory.

    cd src
    ./configure
    make depend -j 40
    make -j 40
    cd ..
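
    After ./configure, you can inspect what it detected. In Kaldi-based trees the result is written to src/kaldi.mk (an assumption carried over from upstream Kaldi):

    grep -E 'CUDA|ATLAS' src/kaldi.mk   # shows whether CUDA and ATLAS were picked up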
  • Step 4.5
    I encountered a problem with the GPU architecture setting: the error message said something like "CUDA_ARCH is undefined". Hence, I applied a small workaround to the Makefile in the cudamatrix directory.

    cd cudamatrix
    vim Makefile

    Find line 24, or code like:

    ifeq ($(CUDA), true)
        ifndef CUDA_ARCH
            $(error CUDA_ARCH is undefined, run 'src/configure')
        endif
    endif

    Change that block to:

    ifeq ($(CUDA), true)
        ifndef CUDA_ARCH
        CUDA_ARCH = -gencode arch=compute_20,code=sm_20 \
                    -gencode arch=compute_30,code=sm_30 \
                    -gencode arch=compute_35,code=sm_35 \
                    -gencode arch=compute_50,code=sm_50 \
                    -gencode arch=compute_53,code=sm_53 \
                    -gencode arch=compute_60,code=sm_60 \
                    -gencode arch=compute_61,code=sm_61 \
                    -gencode arch=compute_62,code=sm_62
        endif
    endif

    Save and exit vim with :wq!, return to the src directory, and repeat Step 4. (If Step 4 completed without errors, you can skip this step entirely.)
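
    An alternative sketch that avoids editing the Makefile: define CUDA_ARCH on the make command line, since command-line variables satisfy the ifndef guard. The gencode pair below is an assumption; pick the one matching your GPU's compute capability (e.g. sm_61 for a GTX 1080):

    cd ..          # back to src
    make -j 40 CUDA_ARCH="-gencode arch=compute_61,code=sm_61"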

  • Step 5
    Go to the tts_dnn_arctic example directory.

    cd egs/tts_dnn_arctic/s1

    Run the demo. (I have only tested the 48 kHz demo.)

    ./run_48k.sh
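
    Training takes a long time, so it is worth keeping the run alive across SSH disconnects and logging its output. A minimal sketch with plain POSIX tools, nothing IDLAK-specific:

    nohup ./run_48k.sh > run_48k.log 2>&1 &
    tail -f run_48k.log   # follow progress; Ctrl-C stops the tail, not the training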
  • Step 6
    Once the training is done, we are able to synthesize speech from text by entering the following command in the terminal. The synthesized wav will be placed in the wav_mlpg sub-folder of the synth folder.

    echo "Some arbitrary text spoken in English" | utils/synthesis_voice.sh slt_mdl synth

    Listen to the resulting wav:

    cd synth/wav_mlpg
    play test001.wav

    To make life easier, I wrote a small shell script as follows:

      #!/bin/sh
      # demo_synthesis.sh

      # Description : Synthesize English text from the command line
      # Author : Tirtadwipa Manunggal <tirtadwipa.manunggal@gmail.com>
      # Jan 3 2017

      # Refuse to run without exactly one argument
      if [ $# -ne 1 ]; then
          echo "Usage: ./demo_synthesis.sh \"Some arbitrary text spoken in English\"" >&2
          exit 1
      fi

      # Synthesize the text, then play the resulting wav
      echo "$1" | utils/synthesis_voice.sh slt_mdl synth
      cd synth/wav_mlpg
      play test001.wav
      cd ../..
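
    For synthesizing many sentences at once, a hypothetical batch wrapper could look like the sketch below. It is not part of IDLAK, and it assumes (as above) that each run writes its output to synth/wav_mlpg/test001.wav.

      #!/bin/sh
      # batch_synthesis.sh (hypothetical helper, not shipped with IDLAK)
      # Synthesize every line of a text file, copying each result out of
      # synth/wav_mlpg before the next run overwrites it.

      if [ $# -ne 1 ]; then
          echo "Usage: ./batch_synthesis.sh lines.txt" >&2
          exit 1
      fi

      n=0
      while IFS= read -r line; do
          n=$((n + 1))
          echo "$line" | utils/synthesis_voice.sh slt_mdl synth
          cp synth/wav_mlpg/test001.wav "batch_${n}.wav"
      done < "$1"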

Those are all the steps for installing IDLAK. It is still a long way to a reliable, high-quality TTS engine. One concern of mine is the vocoder used by IDLAK: it still relies on the old-fashioned MFCC (Mel-Frequency Cepstrum Coefficient) representation. The problem with MFCCs is that they were originally designed for ASR and are not invertible back to a waveform (the phase and fine spectral detail of the Fourier transform are discarded).

The current trend in TTS vocoders has started to move to STRAIGHT[1], originally created by Professor Hideki Kawahara. STRAIGHT has top-notch quality in terms of F0 prediction accuracy and the naturalness of the resynthesized voice. For a more complete overview of STRAIGHT, see the official website. Unfortunately, STRAIGHT is not open source.

The good news is that a researcher at the University of Yamanashi, Japan, Professor M. Morise, has created a STRAIGHT-like vocoder called WORLD. The method works in a similar fashion to STRAIGHT, and the developer claims that WORLD reduces the computational cost compared with TANDEM-STRAIGHT.

Correct me if I am wrong.
Sincerely,

Tirtadwipa Manunggal
tirtadwipa.manunggal@gmail.com


Footnote:
[1] Abbreviated from Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum



1 comment:

  1. Hi Tirtadwipa,

    I followed your tutorial for implementing IDLAK for arctic data and I got the output speech signal.
    Now I want to implement it using my local language data. I have both wav files and text files.
    So, can you please guide me on how to do this with IDLAK?
    Thanks in advance.

    Regards
    Giridhar
