Neural Machine Translation for Sinhala-English Code-Mixed Text
Archchana Kugathasan, S. Sumathipala
Sri Lanka Institute of Information Technology, LK
University of Moratuwa, LK
Multilingual societies often mix two or more languages when communicating, and this practice has become a popular way of communicating on social media in South Asian communities. Sinhala-English Code-Mixed Text (SCMT) is regarded as the most popular text representation used in Sri Lanka in informal contexts such as social media chats, comments, and small talk. This paper addresses the challenges of utilizing SCMT sentences. The main focus of the study is translating code-mixed sentences written in Sinhala-English into standard Sinhala. Because Sinhala is a low-resource language, we were able to collect only a limited number of SCMT-Sinhala parallel sentences, and creating the SCMT-Sinhala parallel corpus was a time-consuming and costly task. The proposed Neural Machine Translation (NMT) architecture for translating SCMT into Sinhala combines a normalization pipeline, Long Short-Term Memory (LSTM) units, a Sequence-to-Sequence (Seq2Seq) model, and the Teacher Forcing mechanism. The proposed model is evaluated against current state-of-the-art models under the same experimental setup, and the results indicate that the Teacher Forcing algorithm combined with Seq2Seq and normalization improves translation quality. The predicted outputs of the models are compared using the BLEU (Bilingual Evaluation Understudy) metric, on which the proposed model achieved a higher BLEU score of 33.89.
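To illustrate the Teacher Forcing mechanism named in the abstract, the following is a minimal sketch of how decoder training pairs are commonly built for a Seq2Seq model: at each step the decoder receives the ground-truth previous token rather than its own prediction. This is an assumption-laden illustration, not the authors' implementation; the boundary tokens `<s>` and `</s>`, the function name `teacher_forcing_pair`, and the placeholder tokens are hypothetical.

```python
from typing import List, Tuple

# Hypothetical boundary tokens; the paper does not specify which ones it uses.
START, END = "<s>", "</s>"


def teacher_forcing_pair(target_tokens: List[str]) -> Tuple[List[str], List[str]]:
    """Build (decoder_input, decoder_target) for one reference sentence.

    The decoder input is the reference shifted right by one position, so at
    step t the decoder sees the true token t-1 and is trained to emit token t.
    """
    decoder_input = [START] + target_tokens
    decoder_target = target_tokens + [END]
    return decoder_input, decoder_target


# Placeholder tokens standing in for a normalized Sinhala reference sentence.
dec_in, dec_out = teacher_forcing_pair(["w1", "w2", "w3"])
for seen, expected in zip(dec_in, dec_out):
    print(f"decoder sees {seen!r:7} and is trained to predict {expected!r}")
```

At inference time no reference is available, so the decoder would instead be fed its own previous prediction, starting from the start token.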
How to Cite:
Kugathasan, A. and Sumathipala, S., 2022. Neural Machine Translation for Sinhala-English Code-Mixed Text. International Journal on Advances in ICT for Emerging Regions (ICTer), 15(3), pp.60–71. DOI: http://doi.org/10.4038/icter.v15i3.7250
30 Dec 2022.