Most automatic speech recognition (ASR) applications need a language model to work properly. Why? Because the ASR engine turns the audio sequence into the most likely phonemes and then forms words from them. At this stage the engine pays no attention to language context. This is where language modelling comes into play. Right after the words are generated, the words or sentences are fed to the language model, for example KenLM in the case of DeepSpeech. And voilà, language-aware sentences are produced.
DeepSpeech uses KenLM under the hood because it is currently one of the fastest language model libraries available. To use it we need two files: the language model binary (lm.binary) and the trie. Let’s see how they are built.
Steps
0. Cloning and building KenLM
Clone the code from the repository at https://github.com/kpu/kenlm:
git clone https://github.com/kpu/kenlm
Then build it:
cd kenlm
mkdir -p build
cd build
cmake ..
make -j 4
1. Providing the corpus
Let’s assume we want to build a Javanese language model. We use the data from openslr.org; both the audio and the utterances are publicly available at https://www.openslr.org/35/. For the utterances we only need utt_spk_text.tsv. The file is structured as follows:
00004fe6aa a4815 Kanthong semar minangka tanduran kang mrambat lan njulur
00005fb7fb ede87 Rudy uga misuwur amerga inovasi anggoné nggawé bumbu dhasar
0000e5df79 ffe12 Banjur saluran bakal mbelok ning lateral tengen
0001418491 ce6f9 Ana telu seksi tradhisional ya iku Tripolitania Fezzan lan Cyrenaica
0001bbbc2e 0a834 Iwak Mas iki bisa kanggo praktik cuba cuba ana leb
Keep only the sentence in the third column and lowercase it:
cut -d $'\t' -f3 utt_spk_text.tsv | tr 'A-Z' 'a-z' > corpus.txt
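The tr step only lowercases the ASCII letters A to Z, and the pipeline does not check whether a row really has three columns. As a sketch, the same extraction could be written with awk so that rows without exactly three tab-separated fields are skipped. The function name extract_corpus is hypothetical, and whether tolower also handles accented capitals such as É depends on the awk implementation and locale, so treat that as an assumption to verify on your system.

```shell
# Hypothetical helper: extract_corpus <input.tsv> <output.txt>
# Writes the lowercased third column of every well-formed row.
extract_corpus() {
    # -F'\t' splits on tabs; NF == 3 drops rows with missing or extra fields,
    # and the $3 test drops rows whose sentence is empty.
    awk -F'\t' 'NF == 3 && $3 != "" { print tolower($3) }' "$1" > "$2"
}
```

Calling extract_corpus utt_spk_text.tsv corpus.txt should then give the same corpus.txt as the pipeline above for well-formed ASCII input.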
Let’s assume corpus.txt uses the following letters, which are listed one per line in alphabet.txt:
a
b
c
d
e
é
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
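The trie step later takes an alphabet.txt file as input. One way to derive it is to collect every distinct character that occurs in corpus.txt, as in the sketch below. The function name make_alphabet is hypothetical. The sketch assumes a UTF-8 locale (otherwise grep may split or reject multi-byte letters such as é) and a one-character-per-line layout for the alphabet file. The space character ends up in the output as well, and the order is byte-wise, so é comes after z.

```shell
# Hypothetical helper: make_alphabet <corpus.txt> <alphabet.txt>
# Writes every distinct character of the corpus, one per line.
make_alphabet() {
    # grep -o . emits each character on its own line (line breaks are skipped).
    # LC_ALL=C applies to sort only and gives a stable byte-wise order.
    grep -o . "$1" | LC_ALL=C sort -u > "$2"
}
```

For example, make_alphabet corpus.txt alphabet.txt writes the alphabet next to the corpus.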
2. Creating the ARPA file
From the KenLM build directory, estimate a 3-gram model (-o 3) from the corpus:
./bin/lmplz --text ${CORPUS_DIR}/corpus.txt --arpa ${CORPUS_DIR}/words.arpa -o 3
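Before converting the ARPA file, it can be worth checking that it holds what you expect. An ARPA file starts with a \data\ header that lists the number of n-grams per order as "ngram N=count" lines, so a model of order 3 should show three of them. The sketch below prints those lines; the function name arpa_counts is hypothetical.

```shell
# Hypothetical helper: arpa_counts <model.arpa>
# Prints the "ngram N=count" lines from the ARPA header.
arpa_counts() {
    # Header lines look like "ngram 2=7". The n-gram entries themselves start
    # with a log probability, so this pattern only matches the header.
    grep -E '^ngram [0-9]+=[0-9]+$' "$1"
}
```

Running arpa_counts ${CORPUS_DIR}/words.arpa gives a quick view of how many unigrams, bigrams and trigrams were estimated.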
3. Building the language model binary
./bin/build_binary -T -s ${CORPUS_DIR}/words.arpa ${CORPUS_DIR}/lm.binary
4. Building the trie
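The generate_trie tool comes from the DeepSpeech native client build. It takes the alphabet, the language model binary and the corpus, and writes the trie:
./tensorflow/bazel-bin/native_client/generate_trie ${CORPUS_DIR}/alphabet.txt ${CORPUS_DIR}/lm.binary ${CORPUS_DIR}/corpus.txt ${CORPUS_DIR}/trie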
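Once all steps have run, ${CORPUS_DIR} should contain the files used above. As a quick sanity check, a sketch like the following reports any of them that is missing or empty. The function name check_artifacts is hypothetical, and the file names are simply the ones used in this walkthrough.

```shell
# Hypothetical helper: check_artifacts <corpus_dir>
# Returns 0 when alphabet.txt, lm.binary and trie all exist and are non-empty.
check_artifacts() {
    dir=$1
    status=0
    for name in alphabet.txt lm.binary trie; do
        # -s is true only when the file exists and has a size greater than zero.
        if [ ! -s "$dir/$name" ]; then
            echo "missing or empty: $dir/$name" >&2
            status=1
        fi
    done
    return "$status"
}
```

For example, check_artifacts "${CORPUS_DIR}" && echo "language model files are in place".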