Identification of Abusive Sinhala Comments in Social Media using Text Mining and Machine Learning Techniques

— With the technology revolution, most of the natural languages that are used all over the world have won the digital world. Therefore, people use modern technologies such as Social Media and the Internet with their native languages. As a result, people who are with self-ego on their tradition, race, caste, religion and other social factors, tend to make abusiveness on others who do not belong to the same social group by their native languages. Since the Social Media platforms do not have centralized control, it has become a good platform to advertise their backward ideas without being governed and monitored. The Sinhala language has also been added to most famous Social Media platforms. Though the Sinhala language has more than 2500 years of history, it does not have rich resources for computer-based natural language processing. Therefore, it has been a very difficult task to automatically detect Sinhala abusive comments which are being published and shared among Social Media platforms. Therefore, here, we have used evenly distributed 2000 Sinhala comment corpus among offensive and neutral classes to train three different models: Multinomial Naïve Bayes (MNB), Support Vector Machine (SVM) and, Random Forest Decision Tree (RFDT) and the features were extracted from Bag of Word model, word n-gram model, character n-gram model, and word skip-gram model to automatically detect Sinhala abusive comments. After the training process, each model was tested with 200 evenly distributed comment corpus and MNB showed the highest accuracy of 96.5% with 96% average recall for both character tri-gram and character four-gram models. Further, two lexicon-based approaches called cross-lingual lexicon approach and corpus-based lexicon approach were considered to detect Sinhala abusive comments. From these two lexicon based approaches, the corpus-based lexicon gave the highest accuracy of 90.5% with an average recall of 90.5%.

warmly. Therefore, almost all the people around the world now are connected by some kind of Social Media.
Despite traditional ideologies, the young generation of each country joins the Social Media networks massively day by day. Since non-international languages are facilitated in Social Media, people who only know their mother tongue, have been able to work on these platforms without having any difficulties. It is a great opportunity for the people who do not have a place to share their ideologies, experiences or any other thing like to be shared.
Though the basic and primary concept of Social Media is connecting people, some social media users misuse this primary goal by making offense towards others. Since the number of users who have involved with Social Media is higher than traditional mass media, the impact that can be put on society by these Social Media are very serious.
When we narrow down our focus into Sri Lanka, according to the online statistics [15], more than 16 million people are using Sinhala as their communication language and more than 7 million people use the Internet. The Sinhala language has a remarkable history and it belongs to the Indo-Aryan language family which is a subfamily of Indo-European [1]. It consists of 18 vowels and 42 consonants, but now only uses 16 vowels and 41 consonants [17]. Since Sinhala is a morphologically rich language [1], it is a very complex task to make algorithmic natural language processing resources.
Like other languages, Sinhala has also been used in Social Media networks to post abusive comments. From 2014 to 2018, Sri Lanka faced two racism based riots due to these kinds of abusive comments and the government as well as Social Media authorities could not control the situation, as there was not a possible method available to identify Sinhala abusive comments which were being shared. In the 2018 incident, the government has to ban Social Media for a few days, but users were able to access their accounts by using third party Virtual Private Networks (VPN). Hence, it is clear that banning Social Media will not be an optimal solution to control such a situation in the future. It was proven by a few subsequent incidents that occurred after the 21 st April Easter bomb attack in Sri Lanka 2019. Even the government banned Social Media at that time, people shared a lot of abusive comments by enabling VPN.
One of the major problems in Social Media is the lack of governance in the content which is being published and shared. Until someone reports that a particular content violates the policies, that content will not be monitored and governed. Also, Social Media providers do not have enough linguistic expertise to handle such kind of situations. Most of the time, when a complaint is received, they say the reported content does not violate the policies even the content violates the policies.
Although everyone has a right to share their opinions and ideologies, that ideas should not invade others' feelings, because we all are human beings and have a right to be prevented from any kind of offensiveness. Since there are no perfect men and women, we should have a way to control these situations, such as social riots which occurred because of a few abusive comments.
A social study [1] done in 2014 shows that Sri Lankan people also use the Sinhala language in Social Media to spread abusive comments and it discusses the importance of identifying online Sinhala abusive comments in an automated manner. In this research, we focused on the identification of Sinhala abusive comments using text analytical models and lexicon models, hence it is a binary classification.
II. LITERATURE REVIEW Abusive speech detection on web content is an ongoing research area nowadays. There is a considerable amount of researches has been done for the English language ( [3], [7], [13]), but very few for other languages such as Arabic [16], German [7], and Chinese. Since 2015 to 2018, a lot of researches has been carried to detect abusive speeches, therefore significant attention has been attracted by the researchers who involve with the text mining area.
Though it is difficult to compare different methodologies that have been used for hate speech detection due to the different data sets they used, we can identify the main approaches that were used to detect online abusive comments. According to the literature that we studied, abusive speech identification has been carried through two different approaches namely, lexicon-based abusive speech detection and machine learning based abusive hate speech detection. Further, some researchers have used a combination of these two approaches to detect abusive comments.
According to the research [3] which was published in 2015, the lexicon-based approach has been considered to detect online hate speeches. They started the research with comments collection phase and concerned online forums, blogs and comment sections in news reviews to gather the comments. As the research data sources, they have created two data sets. In the first source, it includes 100 blog postings from different 10 web sites and each site, 10 blog postings were collected. These 10 web sites were taken from a list provided in the Hate Directory [4]. The next source was the web page content which was related to the Israel-Palestinian conflict and it was a 150-page web document. Annotation of content had been done by two graduate students in their university and 30% of each source was annotated.
This research has been conducted in three phases, as the first phase subjectivity detection has been done. In that case, they needed a sentiment lexicon and therefore they have used [5] and SentiWordNet [6] as resources. To determine whether a given sentence is subjective or not, they extracted the positive or negative scores from each sentiment word in the sentence. Then they calculated both positive and negative scores and subtract the negative score from a positive score. If the synset score is greater than 0.5 or less than -0.5, the given sentence is considered as positive or negative respectively. Therefore, if a sentence has either a positive or a negative score according to the given rules, that sentence was considered as subjective and not objective. About 56% of the first corpus and 75% of the second corpus were subjective in this research.
The lexicon of hate speech was built as the second phase. Therefore, they have used three different sets of features. They identified negative opinionated words from the subjectivity analysis and that words were selected as the first feature set of the hate speech lexicon. For the second set, they selected all verbs that are related to their hate corpus but not in the first feature set, then they gathered hypernyms. If those words exist in their corpus, then that words are added into the lexicon. The third set was created with hate nouns, which were related to three types such as religion, race, and nationality. By using Named Entity Recognition (NER) software, they have identified the source and the recipients of opinions. The experiment was done as follows.
At the first stage of the experiment, they only considered negative semantic features, but the accuracy was less than 70%. Then hate verbs were included in the semantic features, so the accuracy was increased above 70%. The best results obtained were an F-Score of 70.83 for the first corpus with three features.
Since we do not have a sentiment lexicon as SentiWordNet for the Sinhala language to evaluate a sentence as positive or negative before the phase of abusive speech detection, we have to directly apply the lexicon with given comments. As Sinhala does not have a good NER, we have not been able to obtain the source and the recipient as they did in their research.
However, the lexicon-based approaches are not enough in the process of the online abusive speech detection process, because the process of identifying abusive comments from the lexicon approach is limit to the words that exist in the pre-built lexicon. Therefore, some researches have focused on machine learning techniques to identify online abusive comments more precisely.
Machine learning allows computers to learn from the data without applying all the steps or the instructions by a programmer that is needed to perform some specific task. The foundation of machine learning is training data. Therefore, training data plays a huge role in machine learning and the learning algorithm generates a new set of rules, based on the inference of the data. These algorithms are formally known as the model and the same algorithms can be used by different data sets to generate different models.
Further, machine learning approaches can be categorized into two strategies as supervised learning and unsupervised learning. Supervised and unsupervised learning is divided based on a feature of the training data. If the training data is labeled with its belonging class, the learning model will be considered as supervised learning. Otherwise, the model is known as an unsupervised model. Among various supervised learning algorithms, Naïve Bayes, Decision Tree algorithms, Support Vector Machine algorithm, and Logistic regression algorithm have been considered in abusive speech classification. Thus unsupervised learning algorithms such as K-means clustering and hierarchical clustering are very few in the abusive speech detection process.
In 2018, research [7] has been done on the topic of hate speech detection and they have concerned the German language as the targeting language and have tried to apply the approaches that were used in English hate speech detection researches to build their models. In that study, they have shown the importance of hate speech detection on other languages as English due to some crisis. Therefore, the main goal of their research was to "investigate the potential value of automatic analytics of German texts to detect hate speech".
As the dataset, the user-generated comments were taken from the news platforms on the Internet and they had to use web scraping technology since most German news platforms do not offer APIs. Therefore they have used a Python framework called Scrapy; with that web scraping technology, 376,143 comments and 21,740 articles have been collected and that comments were annotated by 247 individual participants. In that process, they have only concerned two class labels as "Hate" and "Non-hate". Then they received a total of 11,973 rated comments and they were distributed among three classes as 3875 hate comments, 6073 non-hate, and 2025 unclassified comments. For the feature extraction process, they have used Bag of Words (BOW), N-grams, Linguistic features such as number of punctuation marks, number of words in a comment and Word2Vec/Doc2Vec. From annotated comments, they have used 811 hate and 1561 non-hate comments to create the model. Because of the unbalance of the two classes, comments were used under the sampling method. Then 1622 comments that include both hate and non-hate comments were applied to the logistic regression model and the highest accuracy was taken for the BOW feature with 0.7608.
Since they did not have a hate speech corpus, they used web crawling techniques to extract the comments from online forums. We also had to use the same web scraping approach as we did not have a comment corpus that contains Sinhala abusive speeches. This research [7] was based on the German language and it shows that the BOW features are good at detecting hate speeches as it works with English despite the language difference. Hence it guided us to use the BOW feature with the Sinhala abusive speech detection.
As the very first attempt in hate speech detection in Sinhala comments that are published with the Sinhala UNICODE, research [8] was published in 2018. In that study, they have focused on Sinhala racism based speech detection. However, as the first attempt on Sinhala abusive detection, it comes with few limitations. The first limitation is, though they could get 70.8% accuracy, that accuracy cannot be considered as a good measurement, because they have used an unbalanced data set which contains 73 racism based comments with 111 non-racism based comments. The second limitation is the feature extraction method that they used. Though a lot of similar researches ( [19]) which was done on other languages has tried with several feature extraction methods, this study [8] has only considered the word bi-gram models. From that, we cannot get a clear idea about the best fitting features that can be used in Sinhala abusive comment identification process. Therefore, there is a significant need to identify other best fitting features, which can be used in Sinhala abusive comment detection process.
Since abusive comments are not limited to racism based comments, we need to consider other characteristics of abusive comments such as religion, gender, sexual orientation or any other disability since these types of comments should also be removed from Social Media. As we identified, the third limitation of this research is the model that they have used to train the model. Since SVM is considered as the only one model, we cannot get a clear idea of other existing models that can be applied with Sinhala abusive comments identification process.
Once we went through the literature, we found that all most all the researches have used PYTHON with NLTK libraries [18] for pre-processing and model-building. Further, supervised learning methods are widely used in abusive speech detection and rather than deep learning techniques, machine learning approaches have been used in this task. Therefore, we also considered the statistical model-based machine learning approaches due to the lack of a large number of comments to consider the deep learning approaches.
In [13], it has used natural process feature selection algorithms like character n-gram, word n-gram, word skipgram and brown cluster with the Support Vector Machine classifier. Here, they have obtained 78% accuracy for the character 4-gram feature.
As the research [18] shows, unigram, bigram, and trigram features can also be used with the TF-IDF method and as per the results, they have found that Logistic Regression works better and the best model has resulted with 0.91 precision, 0.90 recall, and 0.90 F1-Score.
Recent research [19] done for the Indonesian language discusses the importance of identifying hate speeches in the Social Media that are published with the Indonesian language. Since they have not much-related researches regarding the Indonesian language, they were guided by other language researches. Therefore, they have used word n-gram and character n-gram feature extraction methods with the Naïve Bayes, SVM, Bayesian Logistic Regression (BLR), and RFDT classifiers. The best performance of 93.5% of F-measure was scored for the word n-gram features with the RFDT algorithm.
By considering all most all the factors, we proceeded the research with a few different feature groups; Character ngram, word n-gram, word-skip gram, and BOW features as experimented in ( [7], [8], [19]). These extracted features were trained and tested with a few statistical models; MNB, SVM, and RFDT as these classifiers have shown good performances ( [8], [18], [19]) despite the language difference to fill the blanks in Sinhala abusive comment identification process. Other researches such as ( [7], [18], [19]) on various languages shows the application of pre-processing techniques such as stemming and removing stop words are best for model training. Therefore, we also have considered these preprocessing techniques with our study.

III. METHODOLOGY
The goal of our research was to compare multiple models and investigate the effect of features that can be used to train models in the Sinhala abusive speech detection process. Therefore a methodology based on training, validation, and testing have been considered in our research. At the training phase, we conducted experiments with three different variations. As the method 1, we built a corpus-based lexicon with hate and offensive words, hence it is an abusive lexicon for Sinhala, and then that lexicon was used for abusive speech classification. As the method 2, machine learning techniques were used. By applying extracted features, three different models: Multinomial Naïve Bayes (MNB), Support Vector Machine (SVM) and Random Forest Decision Tree (RFDT) were trained. As the method 3, using trained models and created lexicon, we tested the testing data set that is containing 200 evenly distributed comments. Since we could not find any Sinhala abusive or offensive related corpus online, we initiated the research with the corpus construction.
International Journal on Advances in ICT for Emerging Regions April 2020

A. Data Set Construction
In supervised learning methods, there is a need for a labeled data set that is required for the training. The resulting system's ability depends on the content of the data set and its annotation. When the data set is highly correlated with the considering topic, the predicting results can be guaranteed. Though there is a lot of hate speech annotated data set available on the Internet for many languages such as English, German and Arabic, there is no such resource available for the Sinhala language. Therefore, we extracted and annotated the comments according to the details given in the following section.

1) Collection of Sinhala comments:
Here we considered two online Social Media platforms: Facebook and YouTube with a Sri Lankan gossip site. Since Facebook and YouTube have APIs, we could access them by creating API keys. But for Sinhala gossip site, we had to build an in-house web crawler to extract the comments.

2) Comments filtering process:
Since some of the collected comments were not in Sinhala Unicode encoding, we filtered them out from our corpus. Further, some special characters such as emoji, URLs and other non-Sinhala characters were removed from our data corpus by using regular expressions.

B. Comments Annotation.
After the comment extraction process, extracted comments were annotated with the help of three annotators, since we are focusing on the supervised learning approach. Based on the majority vote, the class of comment was elected. If a comment was elected as a neutral one it was annotated with the label "0" and else if it is an abusive comment, it was labeled as "1". From the annotated comments, 2000 comments were selected to train the model while selecting 200 comments for testing the model randomly.  From the annotated comments, randomly selected 100 comments from each class and that comments set was used as the testing data set.
After the annotation process, we did a corpus analysis of collected data.

C. Corpus Analysis
To identify the characteristics of our corpus, we applied a few corpora analyzing techniques as follows.

1) Length Analysis:
At the first step of the corpus analysis, we did a length analysis of the corpus. The annotated data set was used for the analysis and we divided the comments into two corpora such as offensive and neutral. This comment separation was done based on the comments' class and it proceeded with a Python script. If the comment's label is "1" that comment is put to the offensive corpus and if the label is "0", it is put to the neutral corpus by this script.
Each corpus contained 1000 comments. At the length analysis process, we considered the following criteria and the results obtained are listed in TABLE II.
• Average word length In this analysis, we considered the average word length of words in each corpus. It is important to identify the number of characters that are used in each corpus, because every language has a fixed set of the abusive word list, though they may vary according to the context. • Average sentence length We analyzed sentence length to check whether offensive and neutral speeches use a small number of words or not, as we wanted to discover the average sentence length that is used in the offensive and neutral comments. According to the analysis, we found that neutral comments use more words than offensive comments and Sinhala abusive words have fewer characters than normal words.

2) Vocabulary Analysis:
Here, we have considered the number of words used in each corpus. Our objective was to identify the number of words used in all two types and we wanted to analyze the word count behavior in both corpora. Therefore, we did the following two analyses for each offensive and neutral corpus. In TABLE III, analyzed results are listed.

• Total number of words
Here we considered the total number of words used in each corpus. Our objective was to compare and contrast the total number of words used in an offensive speech and a neutral speech.

• Total number of unique words
We found the number of unique words in each corpus. We did this analysis to check whether people use the same set of words to make comments or not. Though each corpus contained 1000 comments, from these results, we can conclude that offensive speeches use only a few words since it has less unique words than the neutral comments corpus and these words are used again and again.

3) Zipf's Law Analysis:
This law was invented in 1935 and it was the first academic study of word frequencies. It states that the frequency (f) of any word in a given natural language corpus is inversely proportional to its rank (r) in the frequency table. The formula of Zipf's law is as follows.
Where 'f' is the frequency of a word ranked 'rth' and the exponent 'a' is almost 1.
Zipf's law behavior on the Sinhala language has been experimented in 2004 [9] and it shows that the Sinhala language almost follows the law. Since their corpus is based on grammatically correct Sinhala sentences, we decided to study the behavior of Zipf's law on an ungrammatical, neutral and offensive corpus. Therefore, we divided our sentences into two corpora as we used in the above analysis. Each corpus consisted of 1000 comments and that comments were tokenized. The tokenization process was done in two approaches. In the first approach, stop words removing and stemming processes were not considered and in the second approach, words were tokenized after applying stop word removal and stemming techniques.
• Zipf's law behavior of abusive comments without stemming and stop word removing We took 1000 comments to study the behavior of Zipf's law in Sinhala offensive comments. In this analysis, we did not consider any stemming techniques and stop word removing techniques. Top 10 words with the highest frequency are shown in TABLE IV and some letters of some words are replaced by the "#" symbol due to the abusiveness of the words. අමන් 50 Then we plotted the log value of offensive words frequency against the log value of word ranks. Following Fig.  1 shows the retrieved graph.
According to the graph that we obtained, we can identify that even the corpus is not grammatically correct; it follows Zipf's law.
When we further investigated, we found that there are a lot of words that frequently occurred are just used to construct the order of a sentence. In other words, some of the frequently occurred words are not useful in the separation of comment class since they are used in both abusive and neutral comments. Here, we considered them as stop words. After removing stop words from the corpus, we did another experiment to analyze Zipf's law behavior.
• Zipf's law behavior of abusive comments with stemming and stop word removing Here, we applied the same approach in the offensive corpus. After applying the stemming and stop word removing techniques on the offensive corpus, we got word tokens with their frequencies. Obtained results are shown in TABLE V. Some characters of words in the table are replaced by the "#" symbol since they are disgraceful words. According to the results, we observed that applying stemming as pre-processing is also a better practice since it reduces the feature vector by replacing words by their root word.
As done previously, we plotted the log value of offensive words frequency against the log value of word ranks. Following Fig. 2 shows the graph obtained.
According to the graph, we can conclude that, even after applying pre-processing techniques, the corpus follows Zipf's law.

D. Sinhala Abusive lexicons construction
Since there is no lexicon resource for Sinhala abusive speech detection, we made two lexicons by following two approaches. In the following sub-sections, these approaches are discussed.
International Journal on Advances in ICT for Emerging Regions April 2020

1) Dictionary-based abusive lexicon creating:
As there is no online dictionary for Sinhala hate speeches, we used "google bad word list" [10] based on the user policies of them and also other resources that contain Social media banned English words. Therefore, we used an online Sinhala English dictionary to map each banned word in English list to Sinhala. Though we sent 1703 English bad words to the online Sinhala-English dictionary, it gave 1128 translated words. Then that translated words were given to the annotators to get classified. Among 1128 words, only 157 words were annotated as abusive.  Since dictionary-based lexicon does not cover all words that can be used in the context of Sinhala abusive comments, we decided to consider corpus-based lexicon construction.

2) Corpus-based abusive lexicon creating:
In this approach, we used our 2000 comment corpus as the resource to identify words that are specific to Sinhala abusive speeches. We separated the comment corpus into two corpora called abusive and neutral by considering their class. Each corpus contains 1000 comments and using a seed word set taken from an online source, we searched their variations in the offensive corpus. Fig. 4 shows the steps that were used to build the corpus-based lexicon. This seed word set contains base forms of Sinhala abusive words. By adding suffixes to these seeds words, their variations were identified. Annotators' agreement on each word is listed in TABLE VII. Then both dictionary-based and corpus-based lexicons were used to identify Sinhala abusive speeches.

E. Text Pre-processing.
To reduce the dimensionality of the feature vector, we applied several famous pre-processing techniques, before the models are trained. All non-Sinhala characters, emojis, and URLs were removed as the basic step of pre-processing by using regular expressions. Further, the following standard techniques were used.

1) Stop words and Stop word removing:
Stop words are words that have a little meaning but they are essential to maintain the structure and grammatical relationship among other words in a sentence. In general, stop words are known as the most common words in a given language. In Natural Language Processing, these words are dropped to reduce the dimensionality of the feature vector. Since the sense of these words affects the sentiment of a given sentence, before the classification, it is essential to decide whether these stop words should be removed or not.
Even though Sinhala is a less resource language, in this study we have used a stop word list which is compiled and published by [11]. That list contains 425 words. But some words in that list are important as they are the cause for the negativity of a sentence as well as their presence in Sinhala abusive comments. Therefore those words were removed from the standard stop word list before the classification. The removed words from that stop word list are listed in the following TABLE VIII.

2) Stemming:
Stemming is the process of reducing a word into its word root by stripping recognized prefixes and suffixes. It is very important to identify stems of words because it reduces the dimensionality of the feature vector by converting words into its relevant word stem. Since Sinhala is a less resource language in natural language processing, it does not have a stemmer such as Porter's Stemmer for English. Therefore, in this study, we used a shallow stemming method proposed [12] for Sinhala. The main problem it has is, if the document does not contain the stem word itself, the algorithm is unable to find the stem of a particular word. Since its programmatic implementation is not available publically, we implemented the concept and applied it to our corpus. TABLE IX shows few stem roots that were found through our corpus. Identified stems were listed alphabetically and saved in a text file and it was used as a stemming dictionary in the process of feature extractions. The steps that we followed to apply stemming are shown in the following figure, fig. 5. Since this algorithm is based on removing suffixes, we needed a standard suffixes list for Sinhala. Therefore, we used a standard list that contains 413 suffixes, published by [11]. Some sample suffixes are listed in TABLE X. In our study, we use machine learning algorithms to detect abusive Sinhala comments and therefore identified features through the literature could be used to train these models. Since Sinhala abusive comment detection is a novel area, it is a challenging task to identify best fitting feature types. However, in this study, we considered four different feature types: Bag of Words (BOW), word n-gram, character n-gram, and word skip-gram to train the models.

1) Bag of Words (BOW) features:
BOW is the most famous [7] and the simplest feature that can be used in natural language processing. Therefore, it has been used in many text mining related researches. In this approach, a text is represented as a bag of its words disregarding grammar and the word order. Though it neglects the grammar and the word order, the frequency of each word's occurrence is recorded with the word.

2) Word n-gram features:
Word n-gram is a word model that captures the structure of sentences or corpus. Though BOW is good at feature extraction, it is not sufficient since natural languages do not contain just words but words with some structure. These ngram features can be a unigram, bigram, trigram or combination of these features. Similar researches in other languages such as English [13] show that considering word n-gram features to identify abusive speeches makes good results. Therefore we considered several groups of word ngram features and they are listed in TABLE XI. The main difference between word unigram and BOW is the word count that is considered in each approach. In BOW, all the words are considered when the feature vector is constructed. But International Journal on Advances in ICT for Emerging Regions April 2020 in word unigram, it does not consider all the words in the corpus to make the feature vector.

3) Character n-gram features:
Most of the languages such as English and Sinhala are composed of characters or letters, digits, punctuations, and spaces. In Social Media comments, where many words very often to be misspelled, character n-grams are especially powerful at detecting patterns in such things and substantially less sparse than previously introduced word n-gram features.
Though many kinds of research in abusive speech detection domain have used word n-grams for feature extraction, character n-grams were considered by less number of researches. A hate speech identification study [7] in the German language has used character 2-gram and 3-gram features to identify hate speeches with 0.62 and 0.65 accuracies respectively.
In our study, we used up to 4-gram character levels separately and combined them to identify hate, neutral and offensive speeches. TABLE XII shows the feature groups that were considered in our study.

4) Word skip-gram features:
These features are similar to word n-gram features and the difference is that skip-gram models extract features from a text by parsing some words from its current position. As an example, consider the sentence "I went to school in the morning", if the features are taken with 1-skip gram then the features will be as this: "I to", "went school", "to in", and "school the", "in morning". Here we considered 1-skip gram and 2-skip grams for bigrams and these were combined with normal bigram features and with unigram features. TABLE XIII shows the word skip-gram feature groups that were considered in our study. These features were extracted by using SciKit learns count vectorizer and skip-gram features were obtained by customizing the count vectorizer.

G. Feature Vectorization.
The process of converting natural language text into numbers is called as vectorization in machine learning. Machines are not able to identify natural languages as a human does and, also statistical methods that are used for classification of given texts require input data in the form of numeric. Therefore feature vectorization plays a major role in natural language processing. In this study, we used two feature vectorization methods, Scikit learns CountVectorizer [14] and TfidfTransformer to vectorize extracted features.

1) CountVectorizer:
It is an implementation of SciKit learn's machine learning package and it is being used in a lot of ongoing researches all over the world. It provides a simple way to tokenize a collection of text documents and build a vocabulary of known words. When new documents are needed to be classified using that vocabulary, these documents are encoded.

2) TfidfVectorizer:
TfidfVectorizer is another implementation of feature vectorization. It consists of the term frequency-inverse document frequency (TF-IDF) concept as the basis. Typically TF-IDF is composed of two terms and calculates a weight. These two terms can be explained as follows.
• TF: Term Frequency, measures how frequently a term occurs in a document. Since every document is in different lengths, it is possible that a term would appear much more times in a lengthy document than shorter ones. Therefore, the term frequency is always divided by the document length (total number of words) for normalization

Tf(t,d) = frequency of term (t) in document (d) / total number of terms in (d)
• IDF: Inverse Document Frequency, measures the importance of a term. TF considers all terms are in an equal manner. But it is known that some terms have less importance while classifying sentences. Therefore, we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:

IDF (t, D) = log_e (Total number of documents (D) / Number of documents with term (t) in it).
The TF-IDF score of a term (t) is calculated according to the following equation.

Tf-idf (t, d, D) = TF (t, d). IDF (t, D)
Since we use Scikit learn CountVectorizer to vectorize features, we did not use TfidfVectorizer directly, but TfidfTransformer, another implementation in Scikit learn package is used to convert CountVectorizer into TfidfVectorizer.

H. Classifiers and Machine Learning.
Machine learning is a necessary component of advanced text classification. The primary aim of machine learning is to allow computers to learn automatically without involving humans.. In this study, we focused on supervised machine learning algorithms for the process of identifying offensive and neutral comments. Therefore, we have used three different machine learning algorithms: Support Vector Machine (SVM), Multinomial Naïve Bayes (MNB), and Random Forest tree as per the analysis we did in the literature review, most of the researches have shown best performance for these classifiers. In the following subsections, these algorithms are described separately. Here we used the algorithm implementations of Scikit learn.

1) Naïve Bayes (NB):
Naïve Bayes classifiers are simple probabilistic classifiers based on the application of Bayes theorem with strong independence assumptions between the features. In this study, Multinomial Naïve Bayes (MNB) classification technique has been used. It considers word frequency information in the document for analysis, where a document is considered to be an ordered sequence of words obtained from the vocabulary. The main difference between Multinomial NB and Bernoulli NB is that Bernoulli NB cares only the presence or absence of a particular feature (word) while Multinomial NB considers the occurrence (frequency count) of the features (words). Here we used the Scikit learn's implementation of MNB to classify a given sentence.

2) Support Vector Machine (SVM):
Support Vector Machine is a very famous classification method that is being used in the area of natural language processing. Unlike the Naïve Bayes algorithm, SVM is a non-probabilistic classifier algorithm. It is an efficient classification method when the feature vector is high dimensional. SVM separates data points using a hyperplane with the largest amount of margin. It constructs a hyperplane in multidimensional space to separate different classes. One of the main advantages of SVM is the robustness in general and effectiveness when the number of dimensions is greater than the number of samples. Here we used the Scikit learn's SVM implementation to achieve the classification goal.

3) Random Forest algorithm (RFDT):
Random forests also known as random decision forests are famous as an ensemble method that can be used to build classification models. It consists of several decision trees and based on the majority vote, a particular sentence or a document is classified. This algorithm differs from previous MNB and SVM since this decides by considering the majority vote from several trees. Since more trees are there, the random forest will not overfit the model and it is a reason for using the random forest in text mining researches as well as in hate speech detection.

I. Experiments
We started our experiments with generated lexicons and later we used feature extraction methods with classifiers to identify abusive speeches in Sinhala. Therefore, we selected a new 200 comments set randomly, that are annotated by previous annotators. This comment set also balanced among abusive and neutral classes.

1) Experiment 01:-Dictionary-based lexicon approach for abusive speech detection:
Despite the inability to find context-based opinion words in the dictionary, we made the first experiment for 200 annotated comments. Therefore, we selected 100 neural comments and 100 abusive comments which were not used to construct the corpus-based lexicon. Finally, we got 200 comments, 100 as neutral and 100 as abusive.
We used the dictionary that we built to classify the given comments. That dictionary contains 157 abusive words, together all of them we considered as an abusive word dictionary. After that, we built an algorithm to check whether any comment consists of an offensive word or not. If a sentence contains an offensive word that is listed in the dictionary, the algorithm identifies it as an offensive that is abusive, otherwise as a neutral comment.
In the first phase of the algorithm, we tokenized the sentences and sent them through the algorithm.

2) Experiment 02: Corpus-based lexicon approach for abusive speech detection:
We did the same experiment here with the same algorithm by changing the lexicon to a corpus-based lexicon.

3) Experiment 03:-word n-gram features for abusive speech identification:
Here we used feature groups that are listed in TABLE XI with three classifiers: MNB, SVM, and RFDT. To extract features, we use the comment corpus (TABLE I) which contains evenly distributed 2000 comments. Before the word n-gram extraction, we applied previously described preprocessing techniques. Then the extracted features were vectorized and weighted. Thereafter, each classifier was trained separately using those weighted features.

4) Experiment 04:-character n-gram features for abusive speech identification:
We trained each classifier by using character n-gram feature groups that are introduced in TABLE XII. In this experiment, characters with word boundaries were considered. Therefore CountVectorizer's "char_wb" controlling value was used.

5) Experiment 05:-word skip-gram features for abusive speech identification
As the final experiment of our study, we used word skipgram features as listed in TABLE XIII to train each classifier. Since we do not have prior knowledge on best features that can be used to identify Sinhala abusive speeches, we used five-word skip-gram feature groups here.
In experiments 03, 04 and 05, features were extracted by using the comment corpus that is listed in TABLE I. Then extracted features were applied to three classifiers and they were trained. Evenly distributed 200 test comment corpus was used to test every model that is trained through the experiments. Results obtained through these experiments are discussed in the next section.
International Journal on Advances in ICT for Emerging Regions April 2020 IV. EXPERIMENTAL RESULTS AND EVALUATION As our research follows a quantitative approach, we can use statistical and mathematical techniques to evaluate what is done in the study. Therefore, we have used precision, recall, and accuracy as performance evaluation measurements. Since binary classification is considered, two confusion matrixes that are with two class labels are constructed in this study. The TABLE XIV gives the structure of the confusion matrix and based on that matrix, precision, recall, and accuracy are obtained.
Here we considered the binary classification's confusion matrix for describing the precision, recall, and accuracy since it is the standard way to demonstrate confusion. At each experiment, a unique confusion matrix is built and based on its true positive, false positive, false negative and true negative values, accuracy, precision, and recall are calculated. Accuracy of a model is calculated according to the above equation and it is known as the measurement of the fraction of correct predictions. But it has some very common problems. Major problem is that accuracy is not a good measure when the data set is unbalanced through the classes. But our data set is balanced through all two classes, accuracy can be used as a performance evaluation method.
Precision is the fraction of relevant instances among the retrieved instances and here precision for offensive classification is known as the fraction of actual offensives among predicted offensives.
The recall is the fraction of the relevant instances that are successfully retrieved. Since it does not contain true negatives, it is a good measurement in our study. It measures the predicted actual instances among all actual instances for a particular class label. Therefore it can be taken as the main measurement here. F1-Score is the harmonic mean of precision and recall. It ensures that there will be no overly rely on either precision or recall. Therefore F1-Score is considered as another performance measurement in this study.
Results that are obtained through every experiment is discussed below.

A. Experiment 01 Results
As experiment 01 described, we tested the dictionarybased lexicon with 200 evenly distributed comment corpus. Results obtained through the experiment are listed in TABLE XV and its confusion matrix is listed in TABLE XVI. Though the model has 0.54 accuracy, the recall of abusive is 0.10. As described previously, recall plays a major role among other performance measurements, and having high recall value makes trust in the model. Therefore, with this less recall value of the offensive class, we cannot conclude that a cross-lingual dictionary is a good approach for Sinhala abusive speech detection. Translated words contain the definitions of words than their practical forms. That is why these measurements gave low values. Therefore, we can conclude that abusive speech identification cannot be done efficiently by translating one language's hate or offensive words into another language.

B. Experiment 02 Results
Since experiment 01 is not efficient and sufficient in Sinhala abusive speech identification, the constructed corpusbased lexicon was tested with the same 200 comment corpus. Results obtained through the process are listed in following  The corpus-based lexicon gives 0.905 accuracies in the process of abusive speech identification and this accuracy is greater than the dictionary-based lexicon approach. Therefore, it is clear that corpus-based lexicons are effective than the dictionary-based translated lexicons in the process of Sinhala hate speech detection. Though the corpus-based approach gave 90.5% accuracy, it is not sufficient in the process of abusive language detection, since the lexicon approaches depend on already classified words. Once these approaches meet strange abusive words, they will not be able to identify them as abusive. Therefore we need to consider machine learning approaches.

C. Experiment 03 Results
In this experiment, we trained two different classifiers with five different feature groups. Therefore, we obtained ten models to test. Since we have limited space here, we considered only one confusion matrix. Performance measurements for each feature group with each classifier are listed in TABLE XIX.
As TABLE XIX shows, word n-gram feature with MNB gives the highest accuracy for the W01 feature group which contains unigram features. Also feature group W04 and W05 have shown better accuracies and recall values. Though W02 and W03 contain more information than W01, they show low accuracies in Sinhala hate speech detection.
Although W01 shows the highest accuracy for RFDT as well as MNB, W04 gave the highest accuracy for the SVM classifier. Therefore, it is clear that the same feature groups may not be suitable when classifiers are changed. Among three classifiers, MNB has given the highest accuracy with the highest f1-score for the W01 feature group.

D. Experiment 04 Results
According to the performance measurements, we can conclude that all the feature groups that we considered in character n-gram are good at Sinhala abusive comments identification process as it gives higher Precision, Recall, and F1-Score values regardless of the model that we used to train.
When we compare the precision, recall, F1-score, and accuracy values of the MNB, SVM, RFDT classifiers from the TABLE XXI, it is clear that the MNB has outperformed the other two models for the character n-gram model.

E. Experiment 05 Results
In this section, we discuss the results obtained through the models that were trained by word skip-gram features. As previously described in two subsections, here we also tested ten models with a test comment corpus which has 200 evenly distributed comments. TABLE XXIII shows the performance measurements that were obtained.

N-Neutral
Though all skip-gram feature groups show more than 50% accuracy, recall for Neutral comments are less than 40% in both S01 and S02 feature groups. According to the results, feature groups S01 and S02 with MNB tend to classify neutral comments as abusive comments. From all measurements, it shows that feature group S03 is the best to identify offensive speeches in Sinhala with Multinomial Naïve Bayes classifier.
By focusing on the S01 and S02 feature groups, we can conclude that S01 and S02 are not good to detect Sinhala offensive speeches with SVM since their recall measurement of Neutral speech is less than 20%. These two models tend to classify neutral comments wrongly as offensive comments. Both S04 and S05 feature groups show the best accuracy as 92.5% with the SVM classifier.
Random forest classifier performs worst with S01 and S02 feature groups since they have low recall values for neutral speeches. The highest average f1-score and accuracy has obtained for the S03 feature group. Therefore S03 ssssis the best word skip-gram feature that can be used to identify Sinhala offensive speeches.
From these experiments, it is clear that character fourgram (C03), feature group C04 and feature group C05 are the best features for Sinhala abusive speech detection since it gives high accuracies and f1-scores for all classification models. Therefore, we can conclude that these features can be taken in the process of Sinhala abusive speech identification effectively though the classification models are changed.
As the test results show, we can conclude that MNB performs well with each feature group than the other two classifiers.
V. CONCLUSION AND FUTURE WORKS In this study, we have identified the best features and classification models that can be used in Sinhala abusive speech identification with machine learning. Further, we built two new lexicons which contain Sinhala abusive words. Abusive speech detection was done by using these two lexicons and we observed that corpus-based lexicons are the best approaches in Sinhala abusive speech detection process.
As per the results we obtained through the experiments, character four-gram (C03), C04 and C05 features have outperformed all other feature types that were considered in this study. Since many abusive speeches are published with spelling mistakes and substituted with similar characters, we can conclude that character n-gram features perform well in abusive Sinhala speech detection.
Regardless of the feature extraction method, when we focus on the performance measurement values of TABLE  XIX, TABLE XXI, and TABLE XXIII, it is clear that the RFDT classifier has the least accuracy values for the Sinhala abusive comments detection.
Since we considered only supervised methods, it is an open area to apply unsupervised learning techniques to identify Sinhala hate speeches.
As the features of this research, we only considered macro features such as word n-grams and character n-grams. Therefore, micro features such as patterns of the speeches using POS tags, presence and frequencies of punctuation marks and word counts in a sentence can be used to identify Sinhala hate speeches.
Here we only considered abusive speeches which are published using Sinhala Unicode, because we did not have enough comments to consider Singlish (Sinhala words are written in English) speeches. Therefore, Singlish is also an open area to be considered in hate speech detection domain. Working with these transliterated forms (Singlish words) may be a challenging task as Singlish comments contain both pure English and Sinhala comments. It will make the stemming and other pre-processing techniques as well as feature vectorization techniques too complex.
Further, comparison between stop word removing and without removing stop words should also be investigated, since Sinhala is a rich morphological language. s ACKNOWLEDGMENT We would like to thank the people who support us to make research data set available.