Indonesian has no special rules for converting a word into phonemes. In most cases, each letter of a word is one phoneme. For the word "cinta", the phonemes are simply c + i + n + t + a (Indonesian phonemes) or ch + ih + n + t + aa (English-equivalent phonemes). The only exceptions are diphthongs and nasals, and even with these the conversion remains simple. For example, the word "sayang", which contains the nasal ng, converts to s + a + y + a + ng (Indonesian phonemes) or s + aa + y + aa + ng (English-equivalent phonemes). Likewise, the word "aura", which contains the diphthong au, converts to au + r + a (Indonesian phonemes) or aw + r + aa (English-equivalent phonemes).
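The letter-by-letter rule and its two exceptions can be sketched in a few lines of C++. This is only a minimal illustration, not the full routine. The function name toPhonemes is hypothetical, and the mapping tables contain only the letters that appear in the three examples above. Any other letter is passed through unchanged, and every "ng" or "au" pair is treated as a single unit, which is a simplifying assumption.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: converts one Indonesian word into
// English-equivalent phonemes, using only the mappings shown in the text.
std::vector<std::string> toPhonemes(const std::string& word) {
    // Two-letter units: the nasal "ng" and the diphthong "au".
    // Assumption: these pairs always form a single phoneme.
    static const std::unordered_map<std::string, std::string> digraphs = {
        {"ng", "ng"}, {"au", "aw"}};
    // Single letters taken from the examples; the table is deliberately incomplete.
    static const std::unordered_map<char, std::string> letters = {
        {'a', "aa"}, {'c', "ch"}, {'i', "ih"}, {'n', "n"},
        {'r', "r"},  {'s', "s"},  {'t', "t"},  {'y', "y"}};

    // Lowercase the input and drop anything that is not a letter.
    std::string clean;
    for (unsigned char ch : word) {
        if (std::isalpha(ch)) {
            clean += static_cast<char>(std::tolower(ch));
        }
    }

    std::vector<std::string> phonemes;
    std::size_t i = 0;
    while (i < clean.size()) {
        // Try the two-letter exceptions first.
        if (i + 1 < clean.size()) {
            auto d = digraphs.find(clean.substr(i, 2));
            if (d != digraphs.end()) {
                phonemes.push_back(d->second);
                i += 2;
                continue;
            }
        }
        // Otherwise each letter is its own phoneme.
        auto l = letters.find(clean[i]);
        if (l != letters.end()) {
            phonemes.push_back(l->second);
        } else {
            // Unmapped letter: pass it through as-is (assumption).
            phonemes.push_back(std::string(1, clean[i]));
        }
        ++i;
    }
    return phonemes;
}
```

Under these tables, calling toPhonemes("sayang") yields s, aa, y, aa, ng, and toPhonemes("aura") yields aw, r, aa, matching the examples above. Checking the two-letter units before single letters is what keeps the exceptions from being split apart.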
Here I wrote a C++ routine to do this job. It may not be the most efficient one, but it works. I designed the routine so that it can also clean out unnecessary characters. Some parts of the routine may seem useless; that is because I originally designed it for many tasks but, in the end, left those tasks to a shell script. Number tokenization is not implemented yet. The routine outputs English-equivalent phonemes.