
Optimizing n‑gram Order of an n‑gram Based Language Identification Algorithm for 68 Written Languages

Authors:

Chew Y Choong
Nagaoka University of Technology, Nagaoka, Niigata, Japan

Yoshiki Mikami
Nagaoka University of Technology, Nagaoka, Niigata, Japan

CA Marasinghe
Nagaoka University of Technology, Nagaoka, Niigata, Japan

ST Nandasara
University of Colombo School of Computing, Colombo, Sri Lanka

Abstract

Language identification technology is widely used in the domains of machine learning and text mining. Many researchers have achieved excellent results on a few selected European languages. However, the majority of African and Asian languages remain untested. The primary objective of this research is to evaluate the performance of our new n-gram based language identification algorithm on 68 written languages used in the European, African and Asian regions. The secondary objective is to evaluate how n-gram orders and a mixed n-gram model affect the relative performance and accuracy of language identification. The n-gram based algorithm used in this paper does not depend on n-gram frequency; instead, it uses a Boolean test to decide whether each target n-gram matches a training n-gram. The algorithm is designed to automatically detect the language, script and character encoding scheme of a written text. Identifying all three properties is important because a language can be written in different scripts and encoded with different character encoding schemes. The experimental results show that in one test the algorithm achieved up to a 99.59% correct identification rate on selected languages. The results also show that the performance of language identification can be improved by using a mixed n-gram model of bigrams and trigrams. The mixed n-gram model consumed less disk space and computing time than a trigram model.
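The abstract describes matching target n-grams against training n-grams with a Boolean presence test rather than n-gram frequencies, optionally using a mixed bigram/trigram model. The Python sketch below illustrates one way such Boolean matching could work; the scoring rule, function names and toy training data are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of Boolean n-gram matching for language identification.
# Assumption: a profile is scored by how many of the target's n-grams it
# contains (presence/absence only, no frequencies).

def extract_ngrams(text, orders=(2, 3)):
    """Return the set of character n-grams of the given orders
    (orders=(2, 3) gives a mixed bigram/trigram model)."""
    grams = set()
    for n in orders:
        grams.update(text[i:i + n] for i in range(len(text) - n + 1))
    return grams

def build_profiles(training_texts, orders=(2, 3)):
    """training_texts: dict mapping (language, script, encoding) -> sample text."""
    return {label: extract_ngrams(text, orders)
            for label, text in training_texts.items()}

def identify(target_text, profiles, orders=(2, 3)):
    """Score each training profile by Boolean matches of the target's
    n-grams and return the best-matching (language, script, encoding) label."""
    target = extract_ngrams(target_text, orders)
    def score(grams):
        return sum(1 for g in target if g in grams)  # match -> 1, miss -> 0
    return max(profiles, key=lambda label: score(profiles[label]))

# Usage with toy training data:
profiles = build_profiles({
    ("English", "Latin", "UTF-8"): "the quick brown fox jumps over the lazy dog",
    ("German", "Latin", "UTF-8"): "der schnelle braune fuchs springt ueber den faulen hund",
})
print(identify("the lazy dog sleeps", profiles))
```

Because only set membership is stored, a mixed bigram/trigram profile of this kind can be smaller and faster to query than a full trigram frequency table, which is consistent with the space and time savings the abstract reports.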

DOI: 10.4038/icter.v2i2.1385

The International Journal on Advances in ICT for Emerging Regions, 2009, 2(2): 21–28

How to Cite: Choong, C.Y. et al. (2009). Optimizing n‑gram Order of an n‑gram Based Language Identification Algorithm for 68 Written Languages. International Journal on Advances in ICT for Emerging Regions (ICTer), 2(2), pp. 21–28. DOI: http://doi.org/10.4038/icter.v2i2.1385
Published on 08 Dec 2009.
Peer Reviewed
