Automatic Syllable Segmentation of Myanmar Texts using Finite State Transducer

 Abstract — Automatic syllabification lies at the heart of script processing especially for the South East Asian scripts like Myanmar. Myanmar syllabification algorithms implemented so far are either rule-based or dictionary-based approach. This paper proposes a new method for Myanmar syllabification which deploys formal grammar and un-weighted finite state transducers (FST). Our proposed method focuses on orthographic way of syllabification for the input texts encoded in Unicode. We tackle syllabification of Myanmar words with standard syllable structure as well as words with irregular structures such as kinzi, consonant stacking which have not been resolved by previous methods. Our FST based syllabifier was tested on 11,732 distinct words contained in Myanmar Orthography Corpus. These words yielded 32,238 syllables and are compared with correctly hand syllabified words. Our FST based syllabification method performs with 99.93% accuracy on Stuttgart FST (SFST) tools.



Abstract -Automatic syllabification lies at the heart of script processing especially for the South East Asian scripts like Myanmar. Myanmar syllabification algorithms implemented so far are either rule-based or dictionary-based approach. This paper proposes a new method for Myanmar syllabification which deploys formal grammar and un-weighted finite state transducers (FST). Our proposed method focuses on orthographic way of syllabification for the input texts encoded in Unicode. We tackle syllabification of Myanmar words with standard syllable structure as well as words with irregular structures such as kinzi, consonant stacking which have not been resolved by previous methods. Our FST based syllabifier was tested on 11,732 distinct words contained in Myanmar Orthography Corpus. These words yielded 32,238 syllables and are compared with correctly hand syllabified words. Our FST based syllabification method performs with 99.93% accuracy on Stuttgart FST (SFST) tools.

Index Terms-Automatic Syllabification, Finite State
Transducer, Myanmar syllabification, formal description of syllable structure

I. INTRODUCTION
Syllabication is the task of breaking words into syllables. This is called automatic syllabication when performed using computer algorithms instead of linguists. Knowledge of the syllable boundaries in words is very useful in a number of areas because it is impossible to store syllable information for all words in a language (new vocabulary are continually being added), and thus automatic syllabication algorithms are necessary [3]. Languages differ considerably in the syllable structures that they permit. For most languages, syllabification can be achieved either by writing a set of rules which explain the location of syllable boundaries of words step-by-step or using annotated corpus. Syllabification algorithms have been proposed for different languages by using different approaches. Up to our knowledge, rule-based approaches have been used for Asian scripts, for example, Lao [16] and Sinhala [2]. And also, corpus-based syllabification approach is done for Uyghur [12] and Urdu [1]. Moreover, many language-specific syllabification methods have been modeled by using finite state machines or neural networks and Finite State Transducers for multilingula syllabifiation [9].
These algorithms are mainly used in text-to-speech (TTS) systems in producing natural sounding speech, and in speech recognizers for dealing with out-of-vocabulary words. Also for Myanmar script, syllabification algorithm for Myanmar Text-to-Speech (TTS) system has been developed [20]. However, such phonetic syllabification cannot be used for some major Myanmar language processing tasks such as lexicographic sorting, word breaking, line breaking and spelling checking. In other words, it is necessary to use orthographic syllabification for these tasks.
Although a few researchers have documented attempts at syllabifying Myanmar words orthographically, this is the first known documented method for Myanmar orthographic syllabification using finite state transducers (FST). The objectives of this study are to represent Myanmar syllable structure in formal description (e.g, regular grammar) by taking advantage of its structuredness and unambiguity. Another objective of our study is to achieve correct syllabification without applying step-by-step rules and the need of corpus. In other words, our proposed method is neither heuristic approach nor annotated corpus-based approach.
Thus, in this research, we explain Myanmar syllabification in both orthographic and phonetic views. Un-weighted finite state transducer to divide Myanmar words into syllables is proposed. Syllable struture model is represented in Chomsky`s regular grammar and deploy finite state transducers for automatic syllabification of Myanmar Unicode texts. Our transducer accepts input string (also known as surface string) in Unicode encoding and in generation mode, the transducer produces syllabified texts with boundary marker notation.
The method was tested by using a text corpus containing 11,732 distinct words yielding 32,238 syllables on Stuttgart Finite State Transducer Tool (SFST). And its performance was then meaured in terms of the percentage of correctly syllabified words. Having limited resources for Myanmar language processing, the syllabified results are checked manually. Our result reports 99.93% of accuracy for the test set of 32,238 syllables. The rest of this paper is organized as follows: section 2 introduces syllable segmenation of Myanmar script by highlighting its challenges. We cover related works in section 3 and then discuss overview of Myanmar syllable strucutre in both phonetic and orthographic ways and Unicode cannonical order are in section 4. Section 5 describes our proposed Finte State Transducer approach, and our experiments and results are in section 6.

II. CHALLENGES IN MYANMAR SYLLABIFICATION
Myanmar language, formerly known as Burmese, is an official language of Myanmar and a member of Burmese-Lolo group of the Sino-Tibetan language spoken by about 32 million as a first language and as a second language by 10 million. The Burmese script, attested in stone inscriptions at least as far back as the early twelfth century C.E., is a phonologically based script, adapted from Mon and ultimately based on an Indian (Brahmi) prototype [ 15].
In Myanmar language, syllable is the smallest linguistic unit and one word consists of one or more syllables. Generally, Myanmar words can be classified into (1) Standard words (i.e, words with standard syllable structure) and (2) irregular words (i.e, words with abbreviated characters or words written in special traditional writing forms which is discussed in detailed in section 4.3). And each word type needs different orthographic ways of syllabifications as follows.

Word Type Example Word
Meaning Syllabified Output

Standard Word
Daughter #

Irregular Word
University # # In this example, the word "Daughter, " has two syllables, and where the former syllable has only one sub-syllabic element, consonant but the latter has three sub-syllabic elements of consonant , vowel and diacritic . For instance, syllables in alphabetic script are composed of only two alphabets, consonants and vowels whereas in Myanmar script, each syllable is composed of at most five sub-syllabic groups and each group has its own members. We will address about this in later section.
For the word "University, " has three syllables and it is written in one of the special traditional writing formats known as consonant stacking. It is syllabified by extra insertion of Myanmar Sign ASAT, U+103A between two stacked consonants to get the syllable boundary. For instance, the first character is combined with the upper character from the stack and form one syllable and the lower character in the stack itself becomes a syllable. Thus such kind of language-specific features make Myanmar syllable segmentation task complicated. Myanmar language has five different forms of irregular words which are explained in section 4.3. Besides, many Pali words and English loan words that refer to people, places, abbreviation of foreign words, currency units etc., can be found in Myanmar texts and we need to tackle segmentation of such words.
In handling such irregular words under-resourced language, Myanmar, rule-based approach and corpus-based approach have shown some failures. Nevertheless, in our FST-based approach, both standard and irregular words are correctly syllabified.

III. RELATED WORK
Scripts can be classified into five categories namely: alphabet, abjad, abugida, syllabary, and logosyllabary. There is no need to explain the first category, alphabet. The second and third categories do not sound familiar to us, but abjad corresponds to Semitic scripts, such as Arabic script and Hebraic script, while abugida corresponds to Ethiopian script and Indian-based script. The third category, abugida, "consists of consonant letters with specific vowels attached to them, and expresses other vowels (or a syllable with vowels) by modifying consonant letters in a consistent manner.
The fourth category, syllabary, can be expressed as "letters with no graphical relationship or rules between letters with similar sounds." A typical example is Kana in Japanese. The fifth category is logosyllabary. Of course, a typical example is Chinese characters. Chinese characters are so classified because they are both logography and syllabary at the same time [22]. syllable segmentation of most Alphabetic and Arabic scripts is done phonetically, i.e, the input surface string is first converted into phonetic symbols and then syllabify by using suitable approach such as rule-based or statistical approach. For the Indian-based script Abugida, syllabification can be done either phonetically or orthographically.
There are many approaches tackling the syllable segmentation task. Generally, these can be divided into two broad categories namely rule-based and data-driven approaches. The rule-based method effectively embodies some theoretical position regarding the syllable, whereas the data-driven paradigm infers "new" syllabifications from examples assumed to be correctly-syllabified already. However, it is difficult to determine correct syllabification in all cases and so to establish the quality of the "gold-standard" corpus used either to quantitatively evaluated the output of an algorithm or as the example-set on which data-driven methods crucially depend [11].
Besides these two categories, statistical methods and finite state methods are also applied for automatic syllabification and we will introduce previous approaches briefly.
In syllabification of written Uyghur [12], a rule-based approach that uses the Principle of Maximum Onset is applied. Experiment on a random sample shows that the syllabification algorithm achieves 98.7 percent word accuracy on word tokens, 99.2 percent on word types, and 99.1 percent syllable accuracy.
In [17], the authors described rule based syllabification algorithm for Sinhala after analyzing the syllable structure and linguistic rules for syllabification of Sinhala words. Rule-based syllabification algorithm for Malay is proposed based on Maximum Onset principle for text to speech system in [21].
In [11], one rule-based approach, and three data-driven approaches are evaluated (A Look-up Procedure, an Exemplar-based generalization technique and the syllabification by Analogy (SbA)). The results on the three databases show consistent and robust patterns: the data-driven techniques outperform the rule-based system in word and juncture accuracies by a very significant margin and best results are obtained with SbA. In [9], a weighted finite-state-based approach to syllabification is presented. Their language-independent method builds an automaton for each of onsets, nuclei, and codas, by counting occurrences in training data. These automatons are then composed into a transducer accepting sequences of one or more syllables. They do not report quantitative results for their method. Syllabification of Middle Dutch texts is done by the method which combines a rule-based finite-state component and data-driven error-correction rules. The authors adapt an existing method for hyphenating (Modern) Dutch words by modifying the definition of nucleus and onset, and by adding a number of rules for dealing with spelling variation [4].
Regarding Myanmar syllabification, two of the above mentioned approaches have already been done. In the corpus-based longest matching approach [6], the authors collected 4,550 syllables from different resources. The input texts are syllabified by using longest matching algorithm over their syllable list. They observed that only 0.04% of the actual syllables were not detected and described their failures as three facts: • differing combinations of writing sequences • loan words borrowed from foreign languages • rarely used syllables not listed in their syllable list Rule-based Myanmar syllable segmentation is done by [23] in which input text strings are converted into equivalent sequence of category form (e.g. CMCACV for the word "Myanmar") and compares the converted character sequence with the syllable rule table to determine syllable boundaries. The authors tested 32,238 syllables in the Myanmar Orthography [14] and the experimental results show an accuracy rate of 99.96% for segmentation. However, their approach cannot solve for the segmentation of irregular words with traditional writing forms namely kinzi, consonant stacking, great SA and English loan words with irregular forms as shown in table 1. However, in our approach, these kinds of failures in [6] and [23] are addressed the syllabification of irregular words and managed correctly. In [13] Unicode Technical Note, diacritic storage order of Myanmar characters in Unicode (which we refer as sub-syllabic components) is explained in detail and it is highlighted that diacritic storage order does not define a phonetic syllable. Further, automated syllable breaking approach for Myanmar script is mentioned. It is said that the syllable break may occur before any character cluster so long as the kinzi, asat and stacked slots remain empty in the cluster following the possible break point. It is mentioned that their algorithm does not require dictionary but still needs more refinement, for example, sequence of digits should be kept together and visible virama needs more complex analysis. The author also stated that the result of syllable breaking can be applied for line breaking and the same syllable breaking rules can be applied for lexicographic sorting.
Finite state methods have been applied in syllabification of alphabetic scripts but it has not yet been done in Myanmar script. Therefore, in our study, we describe syllable structure model in regular grammar and write regular expressions which are used as input to Finite State Transducer. In our experiment, we use 32,238 syllables covering all possible syllable structure in standard words and irregular words from Myanmar Orthography published by Myanmar Language Authority [14] and achieved the accuracy of 99.93%. We implemented the syllabification system by using the programming language for finite state transducer [18].

IV. MYANMAR SYLLABLE STRUCTURE
According to the Unicode standard version 6.2, Myanmar characters range from U+1000 to U+109F. Basically, Myanmar script, formerly known as Burmese, has 33 consonants, 8 vowels (free standing and attached), 2 tone marks, 11 medial consonants, a vowel killer or ASAT, 10 digits, 5 abbreviated syllables and 2 punctuation marks. Among them, consonants, medial consonants and attached vowels, a vowel killer or ASAT and tone marks can be combined in different ways to form a syllable and thus we refer these groups as sub-syllabic elements. Others, free standing vowels, digits and abbreviated syllables can be syllables by themselves. Further, Myanmar syllable structure can be represented in two ways namely phonetic and orthographic representations which are briefly explained in following section.

Myanmar Syllable Structure in Phonetic Representation
Myanmar script is known as diacritically modified consonant syllabic scripts or alphasyllabaries as it is derived from Brahmi and it inherited systemic features of Brahmi. The Brahmi system based on the unit of the graphic "syllable" or aksara, which by definition always ends with a vowel (type V, CV, CVV, etc). Further, the basic consonantal character without any diacritic modification is understood to automatically denote the consonant with the inherent vowel [15].
A Myanmar sentence is composed of words, and words written in Myanmar script are made of a series of distinct single syllables. In phonological representation, each syllable is made up of two parts: (a) A consonant or sometimes two consonant together and (b) A vowel or a vowel and a final consonant together. The first part of the syllable is known as head and the second part the rhyme. The rhyme may also contain tone. As an example, here is the same principle applied to some English words: (1) an "initial consonant"; for example, the consonants written: ပ လ န pronounced: p-l-n-thor or (2) an initial consonant combined with a second consonant, referred to below as a "medial consonant"; e.g. written: ပ လ pronounced: py-ly-hn-thw-There are only four medial consonants in Burmese script.
The rhyme of a syllable may be written with either (1) an attached vowel symbol; e.g. written: ပ လ န pronounced: pi lu na tho or (2) a consonant marked as a final consonant by carrying the "killer" symbol ; e.g. written: ပန လန န pronounced: pan lan naq theq or (3) a combination of an attached vowel symbol and a final consonant;e.g. written: ပ န လ န န pronounced: poun lein naun thaiq In addition, tones are part of the rhyme and are mostly represented by the two tone marks and ; e.g. written: ပ န လ န န pronounced: pou´n leı´n na´ tho´ Other ways of representing tone are used for certain rhymes.
It is also good to note that vowels in the Myanmar script are always attached to their heads (consonants) they are not letters in their own right, like the a, e, i, o, u of the roman alphabet, and they are not normally written independently. However, a Myanmar letter A "အ" (U+1021) is used to write syllables that have no initial consonant, such as "i" written အ , "an" written အန , "oun" written အ န The Myanmar letter A "အ" occupies the position of the initial consonant in the written syllable, but is read aloud as "no initial consonant" [8]. From the view of phonology, Myanmar syllable has the phonemic shape of C(M)V(CK)D | V(CK)D where C refers an initial consonant, M for medial consonant, V for attached vowels, CK stands for a final consonant ( a combination of consonant C and vowel killer K), and D for the tones respectively. The symbol ( ) means optional. Therefore, minimal syllable in phonetic view is CVD or VD.

Myanmar Syllable Structure in Orthographic Representation
As An additional complexity to Myanmar syllable structure is that there are syllables (X) containing up to 5 sub-syllabic groups namely consonant (C), medial consonant (M), dependent vowel (V), Asat or a vowel killer (K) and tones (D) and these groups can appear in a syllable as one of the following combinations. CMVCK CMVCKD CMCK CMCKD These combinations can be described as a regular expression as where the symbol " ? " stands for 0 or 1 occurrence of the character. In this expression, the combination of consonant and Asat or vowel killer (CK) is called final consonant and the syllables end with this combination are known as closed syllables. Further, as with the multiple-component vowels, the user reads the entire syllable as an entity. In Myanmar script, 3 vowels and 7 medials which are formed by the combination of other vowels or medials defined in the Unicode character chart. And, if we use this characteristics of vowels and medials in writing regular expression, the expression for syllable structure becomes like this X= C M* V* (CK)? D? where the notation ? means 0 or 1 occurrence and * means 0 or more occurrence [19].

Myanmar Words in Irregular Forms
There are also traditional as well as particular writing forms for some Myanmar words which are commonly used in the literature. The Burmese script is adapted from Mon, and ultimately based on Indian (Brahmi) prototype. Burmese scribes have the convention of preserving the original spelling of (mostly) Indian loan words and also follow the Indian practice of stacking geminate and homorganic consonants [15]. Therefore, we generally named the group of words with the above mentioned particular forms and English loan words as Irregular throughout the paper for readablilty and simplicity. The details of each irregular form and the correct syllabification for these forms are discussed as follows. 1) Kinzi. As mentioned in the previous section, final consonant (CK) can follow the main consonant and the resulting syllable is called closed syllable (CCK). In a few words (many of them loanwords), the combination of consonant Nga " " and vowel killer " " is " " which is not written on the line as the usual way, but is placed above the first consonant of the next syllable. The name of the reduced form is called kinzi and meaning "forehead rider" for obvious reasons [7]. Some examples are shown in the table below. From the phonetic view, sometimes, the kinzi is pronounced with high tone, for example, the word (ship) is pronounced as # by adding diacritic mark " " but in sorting order, it is sorted as without diacritic mark. The spelling of the words gives no indication that they are pronounced with a high tone on the kinzi. 2) Consonant Stacking. There are some final consonants which are not written in the reduced form like kinzi. With all final consonants other than kinzi ( ), instead of the final consonant being forced up and over the next consonant, it is the next consonant that is forced down and under the final consonant. So, if we have a pair of syllables like န and and they appear in a word that requires them to be compressed, then force the main consonant of second syllable under the န , and write as . The same consonants can be stacked and it is known as consonant repetition.
3) Loan Words. Loan words mean the words which adopt the English pronunciation directly and sometimes adding Myanmar pronunciation together with English pronunciation. For example, English word "car" is written as " " where the first syllable " " is English pronunciation and the second syllable " " is Myanmar pronunciation. 4) Great SA. Myanmar consonant Great SA " " is commonly used and the words with great SA is syllabified as + + . For example, the word "problem" in Myanmar language is ပ န which is syllabified as three syllables ပ # # န . 5) Contraction. There are usages of double-acting consonants in Myanmar and as the name give which acts both the final consonant of one syllable and the initial consonant of the following syllable. For example, the word which means "man" is syllabified as # . There are also some words which can be written in both in standard syllable structure and contracted form, for example, the word "daughter" is written as and .

Myanmar Canonical Ordering
It is possible for a Myanmar syllable to have a number of sub-syllabic elements (also known as diacritics in [13]) surrounding a base consonant, independent vowel or digit. Since all these diacritics are not spacing, it is important that there is consistent way of storing strings so that applications can work consistently. Further, our proposed method is for orthographic syllabification and thus canonical order of the script should be taken into account. Myanmar canonical order in Unicode encoding is described in [13]. There is a general order of: initial consonant, medial consonant, vowel, final and tone. The following is the example of how Myanmar strings are stored according to Unicode canonical order and their syllable boundaries. V. OUR APPROACH Automatic syllabification of word is challenging, not least because the syllable is not easy to define precisely. Consequently, no accepted standard algorithm for automatic syllabification exists [10]. However, syllabification can be achieved by writing a declarative grammar of possible locations of syllable boundaries in polysyllabic words [8]. On the other hand, Finite state machines are widely used in the field of natural language processing. Finite state transducers (FSTs) have been used ubiquitously in the domain of phonology as well as in morphology for all sorts of string mapping between string descriptions. Again, Finite state automata and transducers have been used in natural language processing of Asian languages for example, morphological analysis of Urdu [5] and likewise, formal grammar is used to express Sinhala computational grammar [2]. In our approach, first we write syllable structure in regular grammar and then converted it into regular expression as FST program is a set of regular expression. We adopt two-level morphology approach in which the surface string, for example, "boys" is analyzed and produced together with lexical information as "boy<Noun><Plural>". In our case, our syllabification transducer accepts input string/surface string and outputs the string together with syllable boundary information.
Grammar rules for independent vowels, digits and abbreviated syllable can be described as Firstly, based on the regular grammar as mentioned above, we write the regular expression to recognize Myanmar syllables and then construct the orthographic automata for a Myanmar syllable A mm as A mm = C Opt(M) Opt(V) Opt(CK) Opt(D) | P | I | N where Opt is the just acronym for "optional".
The above automaton accepts one syllable at a time and it can check the combinations of sub-syllabic elements orthographically. Then, we construct the syllabification automaton, denoted by A syl , which accepts a sequence of syllables and finds the syllable boundaries correctly. This is achieved by the expression A syl = A mm ( # A mm )* In this expression, syllable structure represented by A mm is followed by zero or more occurrence of the boundary marker (#) and a syllable form A mm . The automaton A mm accepts the sequence of syllables but we need to transform this into a transducer which inserts a boundary marker `#` after each syllable but not after the last syllable. This is simply achieved by computing the identity transducer for A mm and replacing `#` with a mapping ` : # ` in A mm . Now, the syllabification transducer becomes The syllabification Transducer for words with standard syllable structure is shown in figure 1. For irregular words, the stored character sequence is special. It usually uses invisible Myanmar sign VIRAMA (U+1039) in the input sequence encoding but it is required to output different character or characters according to the types of irregular words in section 4.3. Though irregular words are written in tradition forms and complicated, the result of their syllabification turned into standard word syllable structure. We construct the finite state transducers (FST) for each type of syllabification of irregular words respectively and we also show finite state transducer for standard syllable structure in Figure 1.

Experiments and Results
We use Stuttgart Finite State Transducer (SFST) Tools for our syllabification transducers although it is primarily concerned with morphology, for example, SMOR (a large German Morphological Analyzer). The specification of our syllabification transducer is written in SFST-PL, the programming language of the SFST tools which is a set of regular expressions.  By checking manually, we found that only syllabifcation of 9 stacked words out of 11,732 words are erroneous. Therefore, we received 99.93% of overall accuracy covering both types of word, standard and irregular words. By doing error analysis, the errors are caused by those words in which free standing vowels ဣ (U+1023), consonant stacking and other sub-syllabic elements are mixed and our FST cannot find correct syllable boundaries for these words. The free standing vowels ဣ (U+1023) which is the combination of consonant letter အ (U+1021) and the vowel (U+102D). However, this kind of error can be improved in our transducer and we will solve this issue in our future experiment.
We also analyze the syllabification results of our approach and other existing approaches on irregular words. Further, we summarize and compare the accuracy of the developed methods on Myanmar syllabification as follows. Table 6. Summary of Myanmar Syllabification Methods According to the above table, our approach promises the accuracy with 99.93% covering both standard and irregular words.

Conclusion
Syllabification is an important component of many speech and language processing systems, and this FST-based approach is expected to be a significant contribution to the field, and especially to researchers working on various aspects of Myanmar language and other Asian scripts.
For automatic syllabification of alphabetic languages, spelling does not have well-defined structure. Thus, the input texts are transliterated into phonetic symbols and syllabification is done on these transliterated texts, not on the input texts directly. Moreover, it is necessary to apply additional information using dictionary or annotated corpus and even FSA-based approach could be applied in statistical way. Myanmar script is Indic script in origin like Lao and Thai. Myanmar syllable structure is well-defined and unambiguous. And thus, it could be represented in finite state model. In general, finite state language processing becomes popular because of their simple, elegant and efficient computational power. From computational point of view, finite state based methods achieve a good performance and they are suitable for application more interested in speed and memory footprint. Many works have been done for Myanmar syllabification but FSA approach has not yet been applied to Myanmar. This paper proved that FSA approach gives significant solution for syllabification of Myanmar language as we achieve correct syllabification without applying step-by-step rules and the need of corpus. In other words, our proposed method is neither heuristic approach nor annotated corpus-based approach. Further, it could handle both regular and irregular syllable structures of Myanmar with acceptable performance. And we hope that it could be applicable to automatic syllabification of other syllabic writing systems.
In our further study, we will test our syllable segmentation FST on real online texts to evaluate the accuracy of the proposed approach and the evaluation process will be automated as a future improvement.