Wednesday, 19 June 2019

Building a KenLM Language Model for Mozilla/Baidu DeepSpeech

Most automatic speech recognition (ASR) applications require a language model to run properly. Why? Because the ASR engine turns the voice sequence into the most likely phonemes and then forms words; at that stage the engine pays no attention to language context. This is where language modelling comes into play. Once the words are generated, the word sequence is fed to the language model, in our case KenLM for DeepSpeech, and voilà, a language-aware sentence is produced.

DeepSpeech uses KenLM under the hood, as it is currently the fastest language model library in the wild. From KenLM we need two files: the language model binary (lm.binary) and the trie. Let's see how they are built.

Steps

0. Cloning and building KenLM

Clone the code from the repository https://github.com/kpu/kenlm:

git clone https://github.com/kpu/kenlm

Then build it:

cd kenlm
mkdir -p build
cd build
cmake ..
make -j 4
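
If the build succeeds, the tools used in the following steps (lmplz, build_binary, and the query utility) should appear under build/bin. A quick check, run from the build directory:

ls bin/lmplz bin/build_binary bin/query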

1. Providing the corpus

Let's assume we want to build a Javanese language model. We are using openslr.org; the data, both the audio and the utterances, are publicly available at https://www.openslr.org/35/. For the utterances specifically, we need the file utt_spk_text.tsv. The structure of the file looks like the following:

00004fe6aa	a4815	Kanthong semar minangka tanduran kang mrambat lan njulur
00005fb7fb	ede87	Rudy uga misuwur amerga inovasi anggoné nggawé bumbu dhasar
0000e5df79	ffe12	Banjur saluran bakal mbelok ning lateral tengen
0001418491	ce6f9	Ana telu seksi tradhisional ya iku Tripolitania Fezzan lan Cyrenaica
0001bbbc2e	0a834	Iwak Mas iki bisa kanggo praktik cuba cuba ana leb

Take only the sentence in the third column and lowercase the text:

 cat utt_spk_text.tsv | cut -d $'\t' -f3 | tr A-Z a-z > corpus.txt
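
Based on the five sample rows above, the beginning of corpus.txt should look like this (note that tr A-Z a-z only lowercases ASCII letters, which is enough here since the uppercase letters in the sample are plain Latin):

head -n 2 corpus.txt
# kanthong semar minangka tanduran kang mrambat lan njulur
# rudy uga misuwur amerga inovasi anggoné nggawé bumbu dhasar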

Let's assume the set of letters used in corpus.txt comprises:

a
b
c
d
e
é
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
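
generate_trie in step 4 expects these characters in an alphabet.txt file, one character per line (the exact format is defined by DeepSpeech; compare with the alphabet.txt shipped with the release you use, which also contains a line holding a single space as the word separator). A minimal sketch to derive a candidate file from the corpus, assuming a UTF-8 locale so that multi-byte letters such as é stay intact:

# list every distinct character appearing in the corpus, one per line,
# then review the result by hand before using it as alphabet.txt
grep -o . corpus.txt | sort -u > alphabet.txt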

2. Creating the ARPA file

./bin/lmplz --text ${CORPUS_DIR}/corpus.txt --arpa ${CORPUS_DIR}/words.arpa --o 3
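
lmplz estimates an ARPA-format n-gram model from the corpus; --o 3 sets the order to 3, i.e. a trigram model. The resulting words.arpa is plain text, so its header, a \data\ section listing the n-gram counts, can be inspected directly:

# the first lines show how many unigrams, bigrams and trigrams were estimated
head ${CORPUS_DIR}/words.arpa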

3. Building the language model binary

./bin/build_binary -T -s ${CORPUS_DIR}/words.arpa ${CORPUS_DIR}/lm.binary
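
To sanity-check the binary, KenLM's bundled query tool can score a few sentences (the exact output format may differ between KenLM versions):

# query prints per-word log10 probabilities and an overall perplexity
echo "kanthong semar minangka tanduran" | ./bin/query ${CORPUS_DIR}/lm.binary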

4. Building the trie

./tensorflow/bazel-bin/native_client/generate_trie ${CORPUS_DIR}/alphabet.txt ${CORPUS_DIR}/lm.binary ${CORPUS_DIR}/corpus.txt ${CORPUS_DIR}/trie
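
With alphabet.txt, lm.binary, and trie in place, a typical inference call with the DeepSpeech client of that era (roughly the 0.4/0.5 releases; the flags changed in later versions, and output_graph.pbmm and test.wav here are placeholder names) looked like:

deepspeech --model output_graph.pbmm \
           --alphabet ${CORPUS_DIR}/alphabet.txt \
           --lm ${CORPUS_DIR}/lm.binary \
           --trie ${CORPUS_DIR}/trie \
           --audio test.wav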

Monday, 25 March 2019

Indonesian Presidential Debate Transcription Service in a Nutshell

I wrote this article when I was a software engineer at Bahasa Kita and Indonesia was about to hold its 2019 presidential election. As a way to educate voters about their choices, KPU (the General Election Commission) held debates. The debates were held five times, each covering different topics. The topics were as follows:

  1. Law, Human Rights, Corruption, and Terrorism (17 January 2019 at Hotel Bidakara)
  2. Energy and Food, Natural Resources and the Environment, and Infrastructure (17 February 2019 at Hotel Sultan)
  3. Education, Health, Employment and Social and Culture (17 March 2019 at Hotel Sultan)
  4. Ideology, Government, Defense and Security and International Relations (30 March 2019)
  5. Social Economy and Welfare, Finance and Investment and Trade and Industry

As a machine learning startup focused on voice, we at Bahasa Kita wanted to contribute to the election through voice technology. While we were looking for ideas, we remembered that our founder had once used voice technology to help the deaf, so we decided to do the same for the presidential debates. This was still relevant, since we also found many comments from people with disabilities on YouTube videos without captions, hoping such technology existed to help them. Surprisingly, the idea also turned out to be useful for hearing people: when they miss a certain part while the debate is live, they want to review it and dig deeper into a candidate's speech. In short, we proposed to transcribe the debates using our automatic speech recognition service and to present the transcription quickly, while the debate was still ongoing.

We did not have much time to implement the idea, because we only started discussing how to contribute eight hours before the first debate. During the discussion, however, another thought came up: it felt incomplete to just show the debate verbatim. We searched for references from the previous election and found an analytics company that had done analytics on debate transcripts. After analyzing the content, we realized this kind of analytics could be used as long as it stayed neutral. Neutrality was important, because we were aware that in a political year anything can be assumed to be partisan. The interesting thing was that just hours or days after our website (debatcapres.bahasakita.co.id) went online, one of the candidates' parties contacted us wanting to claim the website for their own needs. That was surely a huge amount of money before our eyes, but we chose to decline the offer and keep providing the presidential debate coverage independently and for free.

Actually, calling this analytics is too much, so I prefer to call it a summary. The summary only computes word statistics, without any prior knowledge of politics, economics, or other domains. It initially comes in three forms: the total word count, the topic-specific word count, and a word cloud. The total word count simply counts the words uttered by each candidate. The topic-specific word count is tied to the debate topics: each topic has a set of associated words, and we count how often those associated words occur. Finally, the word cloud is a self-explanatory visualization resembling a cloud. A rough sketch of the word counting is shown below.
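
For the total word count, a plain shell pipeline over a candidate's transcript gives the idea (a minimal sketch, not our production pipeline; candidate1.txt is a hypothetical transcript file):

# split the transcript into one word per line, then count and rank the words
tr -s '[:space:]' '\n' < candidate1.txt | sort | uniq -c | sort -rn | head -20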

If we look at the state of the art in Natural Language Processing (NLP) and Natural Language Understanding (NLU), there are still many kinds of analysis that we could do and are able to implement. For example, we could determine whether an utterance tends to polarize toward a particular opinion or the opposite one. We could display facts, based on news from the internet, about the statements made by the candidates. The most extreme option is to show a candidate's personality and a description of it based on the given speech. All of these can be done with the artificial intelligence we work on every day. But once again, it is too risky to release such information to the public, because judging the quality of that analysis is too subjective. So we stuck to the summary I just mentioned and left further interpretation to the reader.

Later we were informed that our website was used by journalists to research and verify the statements made during the debates. We were quite happy that our website was useful to others.