Optimizing n‑gram Order of an n‑gram Based Language Identification Algorithm for 68 Written Languages
Authors:
Chew Y Choong,
Nagaoka University of Technology, Nagaoka, Niigata, Japan, JP
Yoshiki Mikami,
Nagaoka University of Technology, Nagaoka, Niigata, Japan, JP
CA Marasinghe,
Nagaoka University of Technology, Nagaoka, Niigata, Japan, JP
ST Nandasara,
University of Colombo School of Computing, Colombo, Sri Lanka, LK
Abstract
Language identification technology is widely used in the domains of machine learning and text mining. Many researchers have achieved excellent results on a few selected European languages. However, the majority of African and Asian languages remain untested. The primary objective of this research is to evaluate the performance of our new n‑gram based language identification algorithm on 68 written languages used in the European, African and Asian regions. The secondary objective is to evaluate how n‑gram orders and a mix n‑gram model affect the relative performance and accuracy of language identification. The n‑gram based algorithm used in this paper does not depend on n‑gram frequency. Instead, it uses a Boolean method to determine the outcome of matching target n‑grams against training n‑grams. The algorithm is designed to automatically detect the language, script and character encoding scheme of a written text. All three properties must be identified because a language can be written in different scripts and encoded with different character encoding schemes. The experimental results show that in one test the algorithm achieved a correct identification rate of up to 99.59% on selected languages. The results also show that the performance of language identification can be improved by using a mix n‑gram model of bigrams and trigrams. The mix n‑gram model also consumed less disk space and computing time than a trigram model.
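The Boolean matching idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`ngrams`, `train`, `match_rate`, `identify`), the byte-level n‑grams, the label format, and the scoring rule (the fraction of target n‑grams found in a training set) are all assumptions chosen for illustration. Each target n‑gram contributes either 1 (present in the training set) or 0 (absent), with no frequency weighting, and the default `orders=(2, 3)` stands in for a mix of bigrams and trigrams.

```python
from __future__ import annotations


def ngrams(data: bytes, n: int) -> list[bytes]:
    """Return all contiguous byte n-grams of order n in data."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def train(samples: dict[str, bytes], orders=(2, 3)) -> dict[str, set[bytes]]:
    """Build one n-gram set per label; counts are deliberately discarded.

    Each label is a hypothetical language/script/encoding string, so the
    same language in two encodings would get two separate models.
    """
    return {
        label: {g for n in orders for g in ngrams(text, n)}
        for label, text in samples.items()
    }


def match_rate(target: bytes, model: set[bytes], orders=(2, 3)) -> float:
    """Fraction of target n-grams that appear in the model (Boolean hit or miss)."""
    grams = [g for n in orders for g in ngrams(target, n)]
    if not grams:
        return 0.0
    hits = sum(1 for g in grams if g in model)
    return hits / len(grams)


def identify(target: bytes, models: dict[str, set[bytes]], orders=(2, 3)) -> str:
    """Return the label whose training set matches the most target n-grams."""
    return max(models, key=lambda label: match_rate(target, models[label], orders))
```

Working on bytes rather than decoded characters is a design choice of this sketch: it lets one label carry language, script and encoding together, since the same text yields different byte n‑grams under different encodings. Passing `orders=(3,)` would give a pure trigram model for comparison.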
DOI: 10.4038/icter.v2i2.1385
The International Journal on Advances in ICT for Emerging Regions 2009 02 (02): 21-28
How to Cite:
Choong, C.Y., Mikami, Y., Marasinghe, C. and Nandasara, S., 2009. Optimizing n‑gram Order of an n‑gram Based Language Identification Algorithm for 68 Written Languages. International Journal on Advances in ICT for Emerging Regions (ICTer), 2(2), pp.21–28. DOI: http://doi.org/10.4038/icter.v2i2.1385
Published on
08 Dec 2009.
Peer Reviewed