Linerocks: June 2019

Most automatic speech recognition (ASR) applications need a language model to work properly. Why? Because the ASR engine turns the audio into the most likely sequence of phonemes and then forms words from them. Up to this stage the engine pays no attention to language context. This is where language modelling comes into play. Right after the words are generated, the words or sentences are fed to the language model, for example KenLM in the case of DeepSpeech. And voilà, language-aware sentences are produced.

DeepSpeech uses KenLM under the hood, as it is currently one of the fastest language model libraries in the wild. We need two files: the language model binary (lm.binary), which KenLM builds, and the trie, which DeepSpeech's generate_trie tool builds from it. Let’s see how they are built.

Steps

0. Cloning and building KenLM

Clone the code from the repository at https://github.com/kpu/kenlm:

git clone https://github.com/kpu/kenlm

Then build it:

cd kenlm
mkdir -p build
cd build
cmake ..
make -j 4

1. Providing the corpus

Let’s assume we want to build a Javanese language model. We use the data from openslr.org. Both the audio and the utterances are publicly available at https://www.openslr.org/35/. For the utterances, we need utt_spk_text.tsv. The file has three tab-separated columns (utterance ID, speaker ID and sentence) and looks like this:

00004fe6aa	a4815	Kanthong semar minangka tanduran kang mrambat lan njulur
00005fb7fb	ede87	Rudy uga misuwur amerga inovasi anggoné nggawé bumbu dhasar
0000e5df79	ffe12	Banjur saluran bakal mbelok ning lateral tengen
0001418491	ce6f9	Ana telu seksi tradhisional ya iku Tripolitania Fezzan lan Cyrenaica
0001bbbc2e	0a834	Iwak Mas iki bisa kanggo praktik cuba cuba ana leb

Keep only the sentence in the third column and lowercase the text:

cut -f3 utt_spk_text.tsv | tr A-Z a-z > corpus.txt
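To see what this step produces, here is a minimal sketch that runs the same extraction on two of the sample rows above. The temporary directory and the name `workdir` are only for illustration, and `cut` is left on its default delimiter, which is the tab.

```shell
# Work in a throwaway directory so nothing real is overwritten.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1

# Two rows copied from the sample above: utterance ID, speaker ID, sentence.
# printf reuses the format for each group of three arguments.
printf '%s\t%s\t%s\n' \
  '00004fe6aa' 'a4815' 'Kanthong semar minangka tanduran kang mrambat lan njulur' \
  '0000e5df79' 'ffe12' 'Banjur saluran bakal mbelok ning lateral tengen' \
  > utt_spk_text.tsv

# Keep the third column and lowercase the ASCII letters.
cut -f3 utt_spk_text.tsv | tr A-Z a-z > corpus.txt

cat corpus.txt
```

Note that `tr A-Z a-z` only touches ASCII letters, so an uppercase `É` would pass through unchanged. If the corpus contains such characters, it is worth checking corpus.txt for leftovers.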

Let’s assume the letters used in corpus.txt are the following. Save them, one per line, as alphabet.txt, which is needed in step 4:

a
b
c
d
e
é
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
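Rather than typing the list by hand, the set of characters can probably be derived from the corpus itself. The following is a sketch, not part of the original recipe. It assumes a UTF-8 locale (so that `grep -o .` treats `é` as one character), and the two-line corpus is a stand-in made up for the demo.

```shell
# Work in a throwaway directory so nothing real is overwritten.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1

# Stand-in corpus for the demo; in practice this is the corpus.txt built above.
printf '%s\n' 'banjur saluran' 'lan njulur' > corpus.txt

# grep -o . prints every character on its own line (the space included);
# sort -u keeps one copy of each. LC_ALL=C makes the sort order byte-wise.
grep -o . corpus.txt | LC_ALL=C sort -u > alphabet.txt

cat alphabet.txt
```

The first line of the result is a single space, since the corpus separates words with spaces. As far as I know DeepSpeech's stock alphabet.txt also lists the space as a label, but treat that as an assumption and compare with the file shipped with your DeepSpeech version.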

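2. Creating the `arpa` file

From the build directory, run lmplz to estimate the n-gram model. Here ${CORPUS_DIR} is the directory that holds corpus.txt, and -o 3 sets the order to 3 (trigrams):

./bin/lmplz --text ${CORPUS_DIR}/corpus.txt --arpa ${CORPUS_DIR}/words.arpa -o 3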
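A quick sanity check on the result is to read the header of the ARPA file, which lists how many n-grams of each order the model contains. The sketch below uses a hand-written stand-in for the top of words.arpa, with invented counts, and a hypothetical helper named `arpa_counts`. Only the layout (`\data\`, then one `ngram N=count` line per order) follows the ARPA format.

```shell
# Work in a throwaway directory so nothing real is overwritten.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1

# Hand-written stand-in for the top of words.arpa; the counts are invented.
cat > words.arpa <<'EOF'
\data\
ngram 1=6
ngram 2=7
ngram 3=5

\1-grams:
EOF

# Hypothetical helper: print one line per n-gram order found in the header,
# and stop at the first n-gram section so a large file is not read in full.
arpa_counts() {
  awk -F'[ =]' '
    /^ngram /      { print "order " $2 ": " $3 " n-grams" }
    /^\\1-grams:/  { exit }
  ' "$1"
}

arpa_counts words.arpa
```

With an order of 3, the header should list exactly three orders. A missing order or a suspiciously small count usually points at a problem with the corpus.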

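3. Building the language model binary

build_binary sits next to lmplz in bin/ and converts the ARPA file into KenLM's binary format, which loads much faster:

./bin/build_binary -T -s ${CORPUS_DIR}/words.arpa ${CORPUS_DIR}/lm.binary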

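4. Building the trie

generate_trie comes from DeepSpeech's native client, not from KenLM. It takes alphabet.txt (the letters from step 1), the language model binary and the corpus, and writes the trie:

./tensorflow/bazel-bin/native_client/generate_trie ${CORPUS_DIR}/alphabet.txt ${CORPUS_DIR}/lm.binary ${CORPUS_DIR}/corpus.txt ${CORPUS_DIR}/trie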
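Steps 2 to 4 can be chained in one small script. This is a sketch under stated assumptions: the default paths, the variable names `KENLM_BIN`, `GENERATE_TRIE` and `DRY_RUN`, and the helpers `run` and `build_lm` are all hypothetical and need adjusting to your own checkout. By default it only prints the commands it would run.

```shell
# Hypothetical locations; adjust to your own checkout.
CORPUS_DIR="${CORPUS_DIR:-$HOME/corpus}"
KENLM_BIN="${KENLM_BIN:-$HOME/kenlm/build/bin}"
GENERATE_TRIE="${GENERATE_TRIE:-$HOME/tensorflow/bazel-bin/native_client/generate_trie}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Each step runs only if the previous one succeeded.
build_lm() {
  # Step 2: estimate a trigram model (-o 3 is the short form of --order 3).
  run "$KENLM_BIN/lmplz" --text "$CORPUS_DIR/corpus.txt" \
      --arpa "$CORPUS_DIR/words.arpa" -o 3 &&
  # Step 3: convert to binary; the flags are copied as-is from step 3.
  run "$KENLM_BIN/build_binary" -T -s \
      "$CORPUS_DIR/words.arpa" "$CORPUS_DIR/lm.binary" &&
  # Step 4: build the trie with DeepSpeech's generate_trie.
  run "$GENERATE_TRIE" "$CORPUS_DIR/alphabet.txt" "$CORPUS_DIR/lm.binary" \
      "$CORPUS_DIR/corpus.txt" "$CORPUS_DIR/trie"
}

build_lm
```

Once the printed commands look right, set `DRY_RUN=0` and run the script again to build the files for real.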

Wednesday, 19 June 2019

Building KENLM Language Model for Mozilla Baidu DeepSpeech
