Identification of Abusive Sinhala Comments in Social Media using Text Mining and Machine Learning Techniques

Authors:

H. M. S. T. Sandaruwan,

University of Ruhuna, LK
Department of Computer Science

S. A. S. Lorensuhewa,

University of Ruhuna, LK
Department of Computer Science

M. A. L. Kalyani

University of Ruhuna, LK
Department of Computer Science

Abstract

With the technology revolution, most natural languages used around the world have entered the digital world, and people now use modern technologies such as social media and the Internet in their native languages. As a result, people who take excessive pride in their tradition, race, caste, religion or other social factors tend to abuse, in their native languages, others who do not belong to the same social group. Because social media platforms have no centralized control, they have become a convenient place to spread such backward ideas without being governed or monitored. The Sinhala language is now supported on most popular social media platforms. Although Sinhala has more than 2500 years of history, it lacks rich resources for computer-based natural language processing, which makes it very difficult to automatically detect the abusive Sinhala comments that are published and shared on social media platforms. In this work, we used a corpus of 2000 Sinhala comments, evenly distributed between offensive and neutral classes, to train three models: Multinomial Naïve Bayes (MNB), Support Vector Machine (SVM) and Random Forest Decision Tree (RFDT). Features were extracted using the Bag of Words model, word n-gram model, character n-gram model and word skip-gram model. After training, each model was tested on an evenly distributed corpus of 200 comments, and MNB showed the highest accuracy of 96.5%, with 96% average recall, for both the character tri-gram and character four-gram models. In addition, two lexicon-based approaches, a cross-lingual lexicon approach and a corpus-based lexicon approach, were considered for detecting abusive Sinhala comments. Of these two, the corpus-based lexicon gave the higher accuracy of 90.5%, with an average recall of 90.5%.
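To make the best-performing configuration concrete, the following is a minimal sketch of a Multinomial Naïve Bayes classifier over character tri-gram counts. It is not the authors' implementation: the class name `CharNgramMNB`, the Laplace smoothing parameter `alpha`, and the four English placeholder comments are all assumptions made for illustration, and the paper's Sinhala corpus and preprocessing are not reproduced here.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class CharNgramMNB:
    """Hypothetical helper: Multinomial Naive Bayes on character n-gram counts."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n          # n-gram length (3 = tri-gram)
        self.alpha = alpha  # Laplace smoothing constant (an assumption)

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.gram_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.gram_counts[label].update(char_ngrams(text, self.n))
        # Vocabulary = every n-gram seen in any training comment.
        self.vocab = set()
        for counts in self.gram_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.gram_counts[label]
            total = sum(counts.values())
            # Log prior plus smoothed log likelihood of each known n-gram.
            score = math.log(n_docs / total_docs)
            for gram in char_ngrams(text, self.n):
                if gram in self.vocab:  # ignore n-grams never seen in training
                    score += math.log(
                        (counts[gram] + self.alpha)
                        / (total + self.alpha * vocab_size)
                    )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy placeholder data (English, invented for illustration only).
train_texts = ["you are a fool", "stupid idiot",
               "have a nice day", "thank you friend"]
train_labels = ["offensive", "offensive", "neutral", "neutral"]

model = CharNgramMNB(n=3).fit(train_texts, train_labels)
print(model.predict("what an idiot"))
print(model.predict("have a nice friend"))
```

Character n-grams operate on raw character sequences, so a sketch like this needs no tokenizer or stemmer, which may be one reason such features suit a language with few NLP resources. Setting `n=4` gives the four-gram variant.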
How to Cite: Sandaruwan, H.M.S.T., Lorensuhewa, S.A.S. and Kalyani, M.A.L., 2020. Identification of Abusive Sinhala Comments in Social Media using Text Mining and Machine Learning Techniques. International Journal on Advances in ICT for Emerging Regions (ICTer), 13(1), pp.13–25. DOI: http://doi.org/10.4038/icter.v13i1.7213
Published on 29 Jan 2020.
Peer Reviewed
