Sunday, 15 January 2017

Indonesian Word to English Equal Phoneme Routine

Indonesian has no special rules for converting a word into phonemes. In most cases, each letter of a word is one phoneme. The word cinta, for example, is simply c+i+n+t+a (Indonesian phonemes) or ch+ih+n+t+aa (English-equivalent phonemes). The only exceptions are diphthongs and nasals, and even these remain simple. The word sayang, which contains the nasal ng, converts to s+a+y+a+ng (Indonesian phonemes) or s+aa+y+aa+ng (English-equivalent phonemes). The word aura, which contains the diphthong au, converts to au+r+a (Indonesian phonemes) or aw+r+aa (English-equivalent phonemes).
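
The rule above can be sketched as a greedy match that tries multi-letter units before single letters. This is only a minimal illustration, not the routine presented in this post: the names kTable and toPhonemes are hypothetical, the table covers just the letters needed for the three example words, and C++11 is assumed for the initializer list.

```cpp
#include <string>
#include <utility>
#include <vector>

// Illustrative subset of the letter-to-phoneme table (an assumption, not the
// full table). Multi-letter units (nasal "ng", diphthong "au") come first so
// that they win over the single letters they contain.
static const std::vector<std::pair<std::string, std::string> > kTable = {
    {"ng", "ng"}, {"au", "aw"},
    {"a", "aa"}, {"i", "ih"}, {"c", "ch"}, {"n", "n"},
    {"r", "r"}, {"s", "s"}, {"t", "t"}, {"y", "y"},
};

// Convert a lowercase word into space-separated English-equivalent phonemes.
// Characters that are not in the table are skipped.
std::string toPhonemes(const std::string& word) {
    std::string out;
    std::size_t i = 0;
    while (i < word.size()) {
        bool matched = false;
        for (const auto& entry : kTable) {
            // Compare the table key with the same number of characters at i.
            if (word.compare(i, entry.first.size(), entry.first) == 0) {
                if (!out.empty()) out += ' ';
                out += entry.second;
                i += entry.first.size();
                matched = true;
                break;
            }
        }
        if (!matched) ++i;  // unknown character: skip it
    }
    return out;
}
```

With this table, toPhonemes("cinta") gives "ch ih n t aa", toPhonemes("sayang") gives "s aa y aa ng", and toPhonemes("aura") gives "aw r aa", matching the examples above.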

Here is a C++ routine that does this job. It may not be the most efficient one, but it works. I designed it to also clean out unnecessary characters. Some parts of the routine may seem useless; that is because I originally designed it for several tasks, but in the end I left those tasks to a shell script. Number tokenization is not implemented yet. The output of the routine is the English-equivalent phonemes.

I have split the routine into three parts: the main program (main.cc), the library header (wordproc.hpp), and the library implementation (wordproc.cc). Let's start with the library.

wordproc.hpp

/***********************************************************/
/* wordproc.hpp       version 0.1           28 Dec 2016    */
/***********************************************************/
/* DESCRIPTION          :                                  */
/*  Library of text and word processing routines.          */
/*  Examples are provided in each function documentation   */
/*  below                                                  */
/* DEFINED FUNCTIONS    :                                  */
/*   void writeToFile(const char *, const char*, bool);    */
/*   void writeToFile(const string, const char*, bool);    */
/*   string readFromFile(const char*);                     */
/*   void cleanText(string&, bool);                        */
/*   int countWords(string);                               */
/*   vector <string> tokenizeString( const string, \       */
/*                                   const string, \       */
/*                                   bool, const char*);   */ 
/* AUTHOR               :                                  */
/*  Tirtadwipa Manunggal  <tirtadwipa.manunggal@gmail.com> */
/***********************************************************/

// User Defined Function Prototype

#ifndef WORDPROC_HPP_
#define WORDPROC_HPP_

#include <iostream>
#include <cstdlib>
#include <fstream>
#include <cstring>
#include <sstream>
#include <algorithm>
#include <cctype>
#include <vector>

using namespace std;

void writeToFile(const char *, const char*, bool);

void writeToFile(const string, const char*, bool);

string readFromFile(const char*);

void cleanText(string&, bool);

int countWords(string);

vector <string> tokenizeString(const string, const string, bool, const char*);

#endif

wordproc.cc

/***********************************************************/
/* wordproc.cc       version 0.1           28 Dec 2016     */
/***********************************************************/
/* DESCRIPTION          :                                  */
/*  Library of text and word processing routines.          */
/*  Examples are provided in each function documentation   */
/*  below                                                  */
/* DEFINED FUNCTIONS    :                                  */
/*   void writeToFile(const char *, const char*, bool);    */
/*   void writeToFile(const string, const char*, bool);    */
/*   string readFromFile(const char*);                     */
/*   void cleanText(string&, bool);                        */
/*   int countWords(string);                               */
/*   vector <string> tokenizeString( const string, \       */
/*                                   const string, \       */
/*                                   bool, const char*);   */ 
/* AUTHOR               :                                  */
/*  Tirtadwipa Manunggal  <tirtadwipa.manunggal@gmail.com> */
/***********************************************************/

#include <iostream>
#include <cstdlib>
#include <fstream>
#include <cstring>
#include <sstream>
#include <algorithm>
#include <cctype>
#include <vector>
#include "wordproc.hpp"

using namespace std;

// User Defined Function Implementation
void writeToFile(const char *fdata, const char *fname, bool isOverwrite){
    ofstream outfile;
    if(isOverwrite){
        outfile.open(fname, ios::out);
    }
    else {
        outfile.open(fname, ios::out | ios::app);
    }
    outfile << fdata << endl;
    outfile.close();
}

void writeToFile(const string fdata, const char *fname, bool isOverwrite){
    ofstream outfile;
    if(isOverwrite){
        outfile.open(fname, ios::out);
    }
    else {
        outfile.open(fname, ios::out | ios::app);
    }
    outfile << fdata << endl;
    outfile.close();
}

string readFromFile(const char *fname) {
    ifstream infile;
    string stemp, fcontent;
    infile.open(fname, ios::in);
    while(getline(infile, stemp)) {
        fcontent += stemp;
        fcontent.push_back('\n');
    }   
    return fcontent;
}

void cleanText(string &fcontent, bool isSpace){
    // Drop every character that is not a letter, digit, space or newline,
    // and collapse "  ", " \n" and "\n\n" by erasing the first character.
    for(int i = 0; i < (int) fcontent.length(); i++) {
        unsigned char c = fcontent[i];
        char next = (i + 1 < (int) fcontent.length()) ? fcontent[i+1] : '\0';
        bool isJunk = !isalnum(c) && c != ' ' && c != '\n';
        bool isExtraSpace = (c == ' ' && (next == ' ' || next == '\n'));
        bool isExtraNewline = (c == '\n' && next == '\n');
        if(isJunk || isExtraSpace || isExtraNewline) {
            fcontent.erase(i, 1);
            // Step back so the previous character is checked again against
            // its new neighbour, without letting the index go below -1.
            i = max(i - 2, -1);
        }
    }
    if(!isSpace) {
        // Replace each space between two alphanumeric characters with "[spasi]".
        for(int i = 1; i + 1 < (int) fcontent.length(); i++) {
            if(fcontent[i] == ' '
               && isalnum((unsigned char) fcontent[i-1])
               && isalnum((unsigned char) fcontent[i+1])) {
                fcontent.replace(i, 1, "[spasi]");
            }
        }
    }
    std::transform(fcontent.begin(), fcontent.end(), fcontent.begin(), ::tolower);
}

int countWords(string fcontent) {
    int count(0);

    cleanText(fcontent, 1);
    for(int i = 0; i < (int) fcontent.length(); i++ ) {
        if(fcontent[i] == ' ' || fcontent[i] == '\n') {
            count++;
        }
    }
    return count;
}

vector <string> tokenizeString(const string fcontent, const string dlm, bool isWrite, const char *fname){
    // isWrite and fname are currently unused.
    vector <string> retString; // return vector string
    string s = fcontent;
    cleanText(s, 1);
    size_t pos(0);
    string token;
    while((pos = s.find(dlm)) != string::npos){
        token = s.substr(0, pos);
        retString.push_back(token);
        s.erase(0, pos + dlm.length());
    }

    // Strip trailing newlines from the last token.
    while(!s.empty() && s[s.length()-1] == '\n') {
        s.erase(s.length()-1);
    }
    retString.push_back(s);
    return retString;
}

main.cc

#include "wordproc.hpp"

// g++ -Wall -o word2lex wordproc.cc main.cc

int main(int argc, char **argv) {
    if(argc < 2) {
        cerr << "Usage: " << argv[0] << " <word>" << endl;
        return EXIT_FAILURE;
    }

    // Indonesian phonemes and their English equivalents, in matching order.
    // Multi-letter phonemes come first so that they are matched before
    // single letters. cleanText() strips the brackets from "[spasi]".
    string s_ID = "[spasi] eu au ai ei oi kh "
                  "ng ny sy a i o u e b c d "
                  "f g h j k l m n p q r s t w y z";
    string s_EN = "pau ah0 aw0 ay0 ey0 oy0 k ng "
                  "ny sh aa0 ih0 ow0 uw0 ah1 b ch "
                  "d f g hh jh k l m n p q r s t w y z";
    vector <string> vs;
    vector <string> vs_EN;
    vector <string> lxc;
    ostringstream ss;

    cleanText(s_ID, 1);
    cleanText(s_EN, 1);

    vs    = tokenizeString(s_ID, " ", 0, NULL);
    vs_EN = tokenizeString(s_EN, " ", 0, NULL);

    string ins = argv[1];
    cleanText(ins, 0);

    // Split the input into Indonesian phonemes; unknown characters are skipped.
    for(size_t ii = 0; ii < ins.length(); ii++) {
        for(vector <string>::iterator it = vs.begin(); it != vs.end(); it++) {
            size_t ukuran = it->length();   // ukuran = size
            if(ins.substr(ii, ukuran) == *it) {
                lxc.push_back(*it);
                ii += ukuran - 1;
                break;
            }
        }
    }

    // Translate each Indonesian phoneme into its English equivalent.
    for(size_t ii = 0; ii < lxc.size(); ii++) {
        for(size_t jj = 0; jj < vs.size(); jj++) {
            if(lxc[ii] == vs[jj]) {
                lxc[ii] = vs_EN[jj];
                break;
            }
        }
        if(lxc[ii] == "ny") {
            ss << "n y";
        }
        else if(lxc[ii] == "q") {
            ss << "k";
        }
        else if(lxc[ii] == "pau") {
            continue;   // word boundary: print nothing
        }
        else {
            ss << lxc[ii];
        }
        if(ii != lxc.size()-1){
            ss << " ";
        }
    }
    cout << ss.str() << endl;
    return EXIT_SUCCESS;
}

Compile it with g++ or an equivalent compiler as follows:

g++ -Wall -o word2lex wordproc.cc main.cc

The program should produce output like the following:

$ ./word2lex magetan
m aa0 g ah1 t aa0 n

$ ./word2lex pernikahan
p ah1 r n ih0 k aa0 hh aa0 n

$ ./word2lex rindu
r ih0 n d uw0

$ ./word2lex kholifah
k ow0 l ih0 f aa0 hh

$ ./word2lex wanita[spasi]solehah
w aa0 n ih0 t aa0 s ow0 l ah1 hh aa0 hh

Note: the trailing number after a vowel phoneme means nothing; it is there only for compatibility with my other program.
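
If the trailing numbers get in the way, they can be removed in a separate step. The sketch below is one possible way to do that, assuming the phonemes are separated by whitespace as in the output above; the function name stripTrailingDigits is hypothetical and is not part of word2lex.

```cpp
#include <cctype>
#include <sstream>
#include <string>

// Remove trailing digits from each whitespace-separated phoneme,
// e.g. "aa0" becomes "aa". Phonemes without digits are left unchanged.
std::string stripTrailingDigits(const std::string& phonemes) {
    std::istringstream in(phonemes);
    std::string token, out;
    while (in >> token) {
        while (!token.empty() &&
               std::isdigit(static_cast<unsigned char>(token[token.size() - 1]))) {
            token.erase(token.size() - 1);
        }
        if (token.empty()) continue;  // the token consisted of digits only
        if (!out.empty()) out += ' ';
        out += token;
    }
    return out;
}
```

For example, stripTrailingDigits("r ih0 n d uw0") gives "r ih n d uw".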

Besides the basic phonetization above, Indonesian also has words that are not pronounced as they are spelled. For example, the word tepung is not read t+e+p+u+ng but t+e2+p+u+ng, with the same e sound as in the English word learn. There is no way to phonetize such words other than preparing a special word list beforehand. However, these words are few in number. Moreover, mispronouncing such a word does not change its meaning. I will improve this program eventually.
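
A special word list of this kind could be consulted before falling back to the letter-by-letter result. The sketch below only illustrates the idea and is not implemented in the program above: the function name phonetizeWithExceptions is hypothetical, and the caller is assumed to supply both the exception list and the rule-based result.

```cpp
#include <map>
#include <string>

// Return the phonemes stored in the exception list for `word` if it has an
// entry; otherwise return the rule-based result passed in by the caller.
std::string phonetizeWithExceptions(
        const std::map<std::string, std::string>& exceptions,
        const std::string& word,
        const std::string& ruleBased) {
    std::map<std::string, std::string>::const_iterator it = exceptions.find(word);
    return it != exceptions.end() ? it->second : ruleBased;
}
```

With an exception list that maps tepung to "t e2 p u ng", a call with the word tepung returns the listed phonemes, while a word that is not in the list, such as rindu, keeps its rule-based result.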


Correct me if I am wrong.
Sincerely,

Tirtadwipa Manunggal
tirtadwipa.manunggal@gmail.com



