Wednesday, 28 November 2018

CMU Dictionary Adaptation to Bahasa Indonesia Lexicon Building

While building a lexicon for speech recognition in Bahasa Indonesia, we need to define the phoneme set ourselves, since there is no widely used standard phoneme set for the language. Instead of creating a new definition from scratch, one idea is to adapt an existing phoneme set. Working from the Kaldi resources, we can adapt the English phoneme set of the Carnegie Mellon University Pronouncing Dictionary (CMU Dictionary), which contains about 134,000 words.

Bahasa Indonesia is quite simple compared to English, since in most cases the pronunciation matches the written letters. Building a lexicon based on the CMU Dictionary is therefore not tedious work, although we need to add some new phonemes and leave some out.

Table 1 CMU Dictionary Phoneme Set

Phoneme  IPA Symbol  Indonesian Words Example  English Words Example
AA       ɑ           ternak, gembala           odd, balm
AE       æ           -                         at, bat
AH       ʌ           ambil                     hut, butt
AO       ɔ           bakpao                    ought, story
AW       aʊ          saung                     cow, bout
AY       aɪ          kait, senarai             hide, bite
B        b           lebah, beli               bee, buy
CH       tʃ          cuka, ceri                cheese, china
D        d           diam, duduk               dump, did
DH       ð           ridho                     the, thy
EH       ɛ           enak, sepak               education, bet
ER       ɝ           bageur*, reueus*          hurt
EY       eɪ          -                         ate, bait
F        f           faedah, fana              fee, forest
G        g           gerbang, guna             green, gate
HH       h           hampar, unggah            he, hair
IH       ɪ           singgah, ikatan           it, implication
IY       i           -                         eat, sheep
JH       dʒ          adiraja, keganjilan       genuine, jimmy
K        k           kenangan, batuk           key, camp
L        l           lingkaran, betul          luck, love
M        m           minum, temaram            mama, mine
N        n           naik, menikah             knee, nice
NG       ŋ           yang, ngengat             bank, sink
OW       oʊ          bongkar, bogor            oat, boat
OY       ɔɪ          -                         toy, boy
P        p           pulsa, peluh              pulp, pen
R        ɹ           ranjau, rintangan         right, row
S        s           sakit, sayang             sea, sun
SH       ʃ           masyarakat, syaikh*       shine, she
T        t           tikung, timpa             tea, tone
TH       θ           rabiul tsani*             thug, theta
UH       ʊ           -                         hood, book
UW       u           kuku, suku                two, coup
V        v           viral, vas                vee, vocal
W        w           wejangan, wayang          we, wide
Y        j           yakin, yoga               yam, yield
Z        z           zaman, zamrud             zoo, zee
ZH       ʒ           jangkrik, jerapah         seizure, pleasure
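
As a minimal sketch, the rows of Table 1 that have no Indonesian example (marked "-") can be collected with a few lines of Python. The dictionary below is a hand-typed, partial copy of the table containing only a handful of rows for illustration, and the names `TABLE_1` and `unused_phonemes` are hypothetical.

```python
# Partial, hand-typed copy of Table 1: phoneme -> Indonesian example words.
# An empty list stands for a "-" entry in the table.
TABLE_1 = {
    "AA": ["ternak", "gembala"],
    "AE": [],
    "AH": ["ambil"],
    "EY": [],
    "IY": [],
    "OY": [],
    "UH": [],
    "UW": ["kuku", "suku"],
}


def unused_phonemes(table):
    """Return the phonemes that have no Indonesian example, sorted by name."""
    return sorted(p for p, examples in table.items() if not examples)


print(unused_phonemes(TABLE_1))  # ['AE', 'EY', 'IY', 'OY', 'UH']
```

These five symbols are the candidates to leave out when adapting the CMU set.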

From the table above, it is clear that some phonemes are not (commonly) used in Bahasa Indonesia. Yet this phoneme set does not cover every Bahasa Indonesia lemma, given that the vocabulary is rooted mainly in Malay, Dutch, Arabic, Chinese, Javanese, and Sundanese. To keep things short, here is the list of phonemes that need to be added to the CMU Dictionary set for Bahasa Indonesia:

Table 2 Additional Phoneme Set

Phoneme  IPA Symbol  Indonesian Words Example  Notes
NY       ɲ           kenyang, nyamuk           alveolo-palatal nasal
KH       x           kholifah                  voiceless velar fricative
Q        ʔ           qurban                    glottal stop
KX       ʕ           sa’at                     voiced pharyngeal fricative
DL       zˤ          dhuhur                    voiced alveolar sibilant with pharyngealization
GH       ɣ           ghaib                     voiced velar fricative
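
To make the adaptation concrete, here is a rough Python sketch that assembles an Indonesian phoneme set from the two tables and produces dictionary-style entries (a word followed by its phonemes) with a naive letter-to-phoneme rule. The letter mappings in `DIGRAPHS` and `LETTERS` are my own simplifying assumptions for illustration, not a vetted rule set; in particular, the letter "e" stands for more than one vowel in Bahasa Indonesia, and Arabic loanwords would need hand-written entries. All names are hypothetical.

```python
# The 39 CMU phonemes listed in Table 1.
CMU_PHONEMES = {
    "AA", "AE", "AH", "AO", "AW", "AY", "B", "CH", "D", "DH", "EH", "ER",
    "EY", "F", "G", "HH", "IH", "IY", "JH", "K", "L", "M", "N", "NG", "OW",
    "OY", "P", "R", "S", "SH", "T", "TH", "UH", "UW", "V", "W", "Y", "Z", "ZH",
}
# Rows of Table 1 with no Indonesian example ("-").
UNUSED = {"AE", "EY", "IY", "OY", "UH"}
# Phonemes added in Table 2.
ADDED = {"NY", "KH", "Q", "KX", "DL", "GH"}

INDONESIAN_PHONEMES = (CMU_PHONEMES - UNUSED) | ADDED

# Assumed spelling rules (illustrative only): two-letter spellings first.
DIGRAPHS = {"ny": "NY", "ng": "NG", "sy": "SH", "kh": "KH", "gh": "GH"}
LETTERS = {
    "a": "AA", "b": "B", "c": "CH", "d": "D", "e": "EH", "f": "F", "g": "G",
    "h": "HH", "i": "IH", "j": "JH", "k": "K", "l": "L", "m": "M", "n": "N",
    "o": "OW", "p": "P", "q": "Q", "r": "R", "s": "S", "t": "T", "u": "UW",
    "v": "V", "w": "W", "y": "Y", "z": "Z",
}


def to_phonemes(word):
    """Convert a word to phonemes, preferring digraphs over single letters."""
    word = word.lower()
    phones = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            phones.append(DIGRAPHS[pair])
            i += 2
        elif word[i] in LETTERS:
            phones.append(LETTERS[word[i]])
            i += 1
        else:
            raise ValueError(f"no rule for {word[i]!r} in {word!r}")
    return phones


def lexicon_line(word):
    """Format one entry as the word followed by space-separated phonemes."""
    return " ".join([word] + to_phonemes(word))


print(len(INDONESIAN_PHONEMES))  # 40
print(lexicon_line("nyamuk"))    # nyamuk NY AA M UW K
```

Entries produced this way are only a starting point: words whose spelling does not follow the assumed rules still have to be corrected by hand.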

Sources:
[1] http://kaldi-asr.org/doc/examples.html
[2] http://www.speech.cs.cmu.edu/cgi-bin/cmudict
[3] https://en.wikipedia.org/wiki/ARPABET
[4] https://open-dict-data.github.io/ipa-lookup/ma/
