Facebook for Sentiment Analysis: Baseline Models to Predict Facebook Reactions of Sinhala Posts

— Research on natural language processing (NLP) in most regional languages is hindered by resource poverty. A possible solution is the utilization of social media data in research. For example, the Facebook network allows its users to record their reactions to text via a typology of emotions. Taken at scale, this network is therefore a prime source of annotated sentiment data. This paper uses millions of such reactions, derived from a decade's worth of Facebook post data centred on a Sri Lankan context, to model an "eye of the beholder" approach to sentiment detection for online Sinhala textual content. Three different sentiment analysis models are built: one taking into account a limited subset of reactions, one using all reactions, and one that derives a positive/negative star rating value. The efficacy of these models in capturing the reactions of the observers is then computed and discussed. The analysis reveals that the Star Rating Model, for Sinhala content, is significantly more accurate (0.82) than the other approaches. The inclusion of the Like reaction is discovered to hinder the capability of accurately predicting other reactions. Furthermore, this study provides evidence for the applicability of social media data to alleviate the resource poverty surrounding languages such as Sinhala.


Vihanga Jayawickrama 1, Gihan Weeraprameshwara 2, Nisansa de Silva 3
Correspondence: Vihanga Jayawickrama (E-mail: vihangadewmini.17@cse.mrt.ac.lk)
Received: 10-08-2022; Revised: 25-10-2022; Accepted: 28-10-2022
International Journal on Advances in ICT for Emerging Regions, October 2022

I. INTRODUCTION

UNDERSTANDING human emotions is an interesting yet complex process which researchers and scientists around the world have been attempting to standardize for a long period of time. In the computational sciences, sentiment analysis has become a major research topic, especially in relation to textual content [1,2]. Several fields, scattered across diverse arenas from product marketing to political manipulation, benefit from advancements in sentiment analysis. The authors of [3], Aguwa et al. [4], and Zobal [5] have described the potential of sentiment analysis, attempted to introduce useful tools for the field, and discovered new knowledge. Sentiment analysis of textual content can be approached in two ways: 1) through the perspective of the creator, or 2) through the perspective of the observer. Many research projects follow the first approach, but only a few, such as Hui et al. [6], have followed the second. Exploring the perspective of the observer is quite important since the emotional reactions of the author and the reader to the same content are not necessarily identical. For certain fields, such as movie reviews [7] or product reviews [8], the perspective of the author is much more valuable than that of the reader; however, this relationship does not always hold true. Much effort is expended in the field of political polling, for example, where the public perception of a speech is studied to assess its impact.
To the extent of our knowledge, no attempt has been made to do such analysis in Sinhala, the subject of this study. Sinhala, similar to many other regional languages, suffers from resource poverty [9]. Previous research and resources available for NLP in Sinhala are limited and isolated [10,11]. This study is therefore an experimental attempt at bridging this knowledge gap. The objective is to predict the sentimental reaction of Facebook users to textual content posted on Facebook. This study uses a raw corpus of Sinhala Facebook posts scraped through Crowdtangle 1 by Wijeratne and de Silva [12], and analyses the user reactions therein as a sentiment annotation that reflects the emotional reaction of a reader to the said post [13]. The Facebook reactions Like, Love, Wow, Haha, Sad, Angry, and Thankful are utilized as the sentiment annotation of a post within the scope of this project. Figure 1 illustrates the visual representations of the Facebook reactions as presented to users and included in the dataset. Overall, three models were created and tested. For the first model, a reaction vector was created for each post from the normalized reaction counts belonging to the Love, Wow, Haha, Sad, and Angry categories. The Like and Thankful reactions, which are outliers at the high and low ends of the reaction count spectrum respectively, were ignored. The results showed that the procedure could predict reaction vectors with F1 scores ranging between 0.13 and 0.52. The second model was highly similar to the first, the only difference being the inclusion of the Like and Thankful reactions in the prediction. The resultant F1 scores ranged between 0.00 and 0.96. In the third model, the reactions were combined to create a positivity/negativity value for each post, following the procedure presented by De Silva et al. [8]. Here, Love and Wow were considered positive, Sad and Angry were considered negative, and Haha was ignored due to its conflicting use cases.
The normalization was carried out as before for the four included reactions, and the difference between the positive and negative values was re-scaled into the range 1 to 5, in order to map onto the popular star rating system utilized by De Silva et al. [8]. The F1 score of this star rating value ranged between 0.29 and 0.30. In contrast, the binary categorization of reactions as Positive and Negative exhibited promising results, with F1 scores in the range 0.70-0.71 for Positive and 0.41-0.42 for Negative.
Thus, it can be concluded that such a binary categorization system captures the sentimental reaction to a Facebook post more efficiently than the multi-category reaction value system, and presents a measure of reasonable accuracy in the imputation of such sentiment.
It should be re-iterated that the values used here are completely independent of the intended or perceived sentiment of the original posts, and depend solely on the sentiment expressed by the audience reactions. Further, the model only attempts to predict the positivity or negativity of the Facebook reactions added to a post by users, and not the actual emotion elicited in the users by the post. While the two might be correlated, the exact nature of the relation would have to be further explored before reaching a definitive conclusion. Figure 2 illustrates the scope of this research, where arrows indicate the influences among intended and perceived sentiments. This journal paper is an extension of our previously published conference paper [14].

II. BACKGROUND
Many of the studies on sentiment analysis are focused on purposes such as understanding the quality of reviews given for products presented in e-commerce sites [8,15,16] or understanding the political preferences of people [3,17].
Among the research on review analysis, the work of De Silva et al. [8] is prominent. Rather than conducting sentiment analysis following the more traditional procedure of identifying sentiments at the sentence or document level, which assumes each sentence or document to reflect a single emotion, that study determined sentiments at the aspect level. Different aspects were extracted from the review, and a sentiment value was calculated for each aspect. Further, the study provides a set of guidelines to determine the semantic orientation of a subject using a sentiment lexicon, covering how to handle negations, words that intensify sentiment, words that shift the sentiment of a sentence, and groups of words used to express an emotion, all of which are important when converting sentiment in text into mathematical figures. The methodology presented by De Silva et al. [8] is crucial for this study since it provides the basis of one of the two workflows we discuss here to predict reactions for Sinhala text.
The work by Martin and Pu [16], research on creating a prediction model that could identify helpful reviews not yet voted on by other users, emphasizes the value of sentiment analysis. Rather than relying solely on structural aspects of a review, such as its length and readability score, the emotional context was also utilized in rating the reviews, with the support of the GALC lexicon, which represents 20 different emotional categories. One of the most important findings of the project was that the emotion-based model outperforms the structure-based model by 9%. The work of Singh et al. [15] has likewise used several textual features, such as ease of reading, subjectivity, polarity, and entropy, to predict the helpfulness ratio. The model is intended to assist the process of assigning a helpfulness value to a review as soon as it is posted, thus giving the spotlight to useful reviews over irrelevant ones. Both studies highlight the usefulness of understanding the reaction of the reader to different content. Research on political preferences covers a broad area; many governments and political parties use social media to understand their audiences. Therefore, the power vested in sentiment analysis cannot be ignored.
The research done by Caetano et al. [17] and Rudkowsky et al. exemplifies such studies. Facebook data plays a major part in our research; therefore, it is vital to explore previous research done on Facebook data. The work by Pool and Nissim [18] and Freeman et al. [19] uses datasets obtained from Facebook for emotion detection. The data scope covered by the work of Freeman et al. lacks diversity, since that research is solely focused on scholarly articles. However, Pool and Nissim have attempted to maintain a general dataset by using a variety of sources, ranging from the New York Times to SpongeBob. The motivation behind this wide range of sources was to pick the best sources to train ML models for each reaction. Pool and Nissim have also looked into developing models with different features such as TF-IDF, embeddings, and n-grams. This comparison provides useful guidelines for selecting features in data. One of the most important aspects of the work by Pool and Nissim is that they have taken the extra step of testing their models with external datasets, namely AffectiveText [20], Fairy Tales [21], and ISEAR [22], to establish the validity of the developed model, since those are widely used datasets in the field of sentiment analysis. This provides a common ground to compare different sentiment analysis models. The work of Graziani et al. [13] follows the same procedure in comparing their model to those of others. While all the papers mentioned above provide quite useful information, almost all of them relate to English, which is a resource-rich language. In contrast, our project is based on the Sinhala language, which is a resource-poor language in the NLP domain [9]. Very few attempts have been made to detect sentiments in Sinhala content, and most of the attempts made were either abandoned or not released to the public [10].
This poses a major challenge to our work due to the scarcity of similar work in the domain.
Among the currently available research in this arena, Senevirathne et al. [23] is, to the best of our knowledge, the state-of-the-art Sinhala text sentiment analysis attempt. Through this paper, Senevirathne et al. have introduced a study of sentiment analysis models built using different deep learning techniques, as well as an annotated sentiment dataset consisting of 15,059 Sinhala news comments. The work was done to understand the reactions of the readers. Furthermore, earlier attempts such as Medagoda et al. [24] provide insight into utilizing resources available for languages such as English to drive progress in sentiment analysis for Sinhala. The partially automated framework for developing a sentiment lexicon for Sinhala presented by Chathuranga et al. [25] is a noteworthy attempt at using a Part-of-Speech (PoS) tagged corpus for sentiment analysis. The authors proposed the use of adjectives tagged as positive or negative to predict the sentiment embedded in textual content.
Obtaining a corpus that would fit our purposes was the second major challenge we faced when working with Sinhala, given that, as Caswell et al. [26] observe, the majority of publicly available datasets for low-resource languages are not of adequate quality. Fortunately, the work of Wijeratne and de Silva [12] provided an adequate dataset. The authors presented Corpus-Alpha, a collection of Sinhala Facebook posts; Corpus-Sinhala-Redux, containing only the posts with Sinhala text; and a collection of stop words. Both the raw corpus created by the authors and the stop words are used in our work.

III. METHODOLOGY
This study was conducted using the raw Facebook data corpus developed by Wijeratne and de Silva [12] through Facebook Crowdtangle. The corpus consists of 1,820,930 Facebook posts created between 01-01-2010 and 02-02-2020 by pages popular in Sri Lanka [12]. Table I describes the columns of the corpus that were utilized for the purpose of this study. The Facebook reactions, which are emotional reactions of Facebook users to content, are utilized as sentiment annotations within this study. Taken collectively, these user annotations can be considered an effective representation of the public perception of the given content.

A. Pre-processing
The corpus was pre-processed by cleaning the Message column and normalizing reaction counts. Cleaning the Message column began with removing control characters from the text. Characters belonging to the Unicode categories Cc, Cn, Co, and Cs were replaced with a space [27]. The character with the Unicode code point 8205 (U+200D), also known as the Zero Width Joiner, was replaced with a null string, while the other characters in category Cf were replaced with a space. The reason for this is that the Zero Width Joiner was often present in the middle of Sinhala words, especially when the Sinhala characters rakāransaya (රකාරාාංශය), yansaya (යාංසය), and rēpaya (රේඵය) were used.
From the resulting text, URLs, email addresses, user tags (of the format @user), and hashtags were removed. Since only Sinhala and English words are considered in this study, any words containing characters that are neither Sinhala nor ASCII were removed. The Sinhala stop words developed from this corpus by Wijeratne and de Silva [12] were removed next. English letters in the corpus were then converted to lowercase, and all remaining characters that belong to neither the Sinhala nor the English alphabet were replaced with white space. Numerical content was removed, since numerical sequences are highly unlikely to be repeated in the same order. Finally, runs of consecutive white-space characters were collapsed into a single space. Once cleaned, entries whose Message column was a null or empty string were removed from the corpus. The final cleaned corpus consisted of 526,732 data rows.
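The cleaning steps above can be sketched as follows. This is a minimal sketch, not the authors' exact implementation: the order of a few operations is approximated, and the stop-word list is left empty here since the actual list comes from Wijeratne and de Silva [12].

```python
import re
import unicodedata

# Placeholder; the real list is the one published by Wijeratne and de Silva [12].
SINHALA_STOP_WORDS: set[str] = set()

def clean_message(text: str) -> str:
    """Approximate the pre-processing pipeline described in Section III-A."""
    out = []
    for ch in text:
        if ch == '\u200d':                      # Zero Width Joiner -> null string
            continue
        if unicodedata.category(ch) in ('Cc', 'Cn', 'Co', 'Cs', 'Cf'):
            out.append(' ')                     # other control/format chars -> space
        else:
            out.append(ch)
    text = ''.join(out)

    # Remove URLs, e-mail addresses, user tags, and hashtags.
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)
    text = re.sub(r'\S+@\S+\.\S+', ' ', text)
    text = re.sub(r'[@#]\w+', ' ', text)

    # Lowercase English; keep only Sinhala (U+0D80-U+0DFF) and English letters.
    text = text.lower()
    text = re.sub(r'[^a-z\u0D80-\u0DFF ]', ' ', text)

    # Drop stop words and collapse runs of white space.
    tokens = [t for t in text.split() if t not in SINHALA_STOP_WORDS]
    return ' '.join(tokens)
```

Entries for which `clean_message` returns an empty string would then be dropped from the corpus.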

B. Core Reaction Set Model
In selecting the core reaction set, the Like and Thankful reactions were excluded because their counts are outliers in comparison to the other reactions; Like is an outlier on the higher end, and Thankful on the lower end. The total count of each reaction in the corpus, along with their percentages, is given in Table II. A probable reason for the abnormal behaviour of those reactions is the duration for which they have been present on Facebook. Like was the first reaction introduced to the platform, back in 2009 [28]. Love, Wow, Haha, Sad, and Angry were introduced in 2016 [29], while Thankful was first made available around the Mother's Day celebrations of that year [30]. The reaction was removed from the platform after a few days, and was reintroduced in May 2017, to be removed again after the Mother's Day celebrations [31]. Thus, the core reaction set was defined considering only the Love, Wow, Haha, Sad, and Angry reactions. The percentages of the core reactions are also shown in Table II, and Fig. 3 shows the core reaction percentages as a pie chart. Initially, therefore, the normalization was done considering only the core reactions. The dataset was then divided into train and test subsets for the purpose of calculating and evaluating the accuracy of vector predictions. The Message column of the train set was tokenized into individual words, and a set operation was used to obtain the collection of unique words for each entry. Then, a dictionary was created for each entry by assigning the normalized reaction vector of the entry to each word. The dictionaries thus created were merged vertically, taking the average of the vectors assigned to a word across the dataset as the aggregate reaction vector of that word. Equation 3 describes this process, where $\vec{v}_W$ is the aggregate reaction vector for the word $W$, $\vec{r}_i$ is the reaction vector of the $i$th entry ($E_i$), $n$ is the number of entries, and $\emptyset$ is the empty vector:

$$\vec{v}_W = \underset{1 \le i \le n}{\operatorname{avg}} \begin{cases} \vec{r}_i & \text{if } W \in E_i \\ \emptyset & \text{otherwise} \end{cases} \qquad (3)$$
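The normalization and dictionary-merging steps can be sketched as below. The field names and data layout (`message`, `reactions`) are assumptions made for illustration; only the averaging logic reflects the description above.

```python
# Core reaction set: Like and Thankful excluded as count outliers.
CORE = ['love', 'wow', 'haha', 'sad', 'angry']

def normalize(counts):
    """Normalize a post's core reaction counts so that they sum to 1."""
    vals = [float(counts[r]) for r in CORE]
    total = sum(vals)
    return [v / total for v in vals] if total > 0 else vals

def build_dictionary(train_posts):
    """Average the reaction vectors of every post a word appears in (Eq. 3)."""
    sums, hits = {}, {}
    for post in train_posts:
        vec = normalize(post['reactions'])
        for word in set(post['message'].split()):   # unique words per entry
            if word in sums:
                sums[word] = [a + b for a, b in zip(sums[word], vec)]
                hits[word] += 1
            else:
                sums[word] = list(vec)
                hits[word] = 1
    return {w: [x / hits[w] for x in sums[w]] for w in sums}
```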
The dictionary thus created was used to predict the reaction vectors of the test dataset. Entries in the test set were tokenized and converted to unique word sets in the same manner as the training set. Then, for each word of a message that also exists in the dictionary created above, the corresponding reaction vector was obtained from the dictionary. For entries in which none of the words were found in the dictionary, the mean vector of the train dataset was assigned. Equation 4 shows the calculation of the predicted vector $\vec{p}_M$ for a message $M$, where $\vec{v}_W$ is taken from the dictionary (which was populated as described in Equation 3) and $m$ is the number of words in the message $M$:

$$\vec{p}_M = \frac{1}{m} \sum_{W \in M} \vec{v}_W \qquad (4)$$
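The prediction step can be sketched with a hypothetical helper like the one below; averaging over the dictionary hits, with the train-set mean as a fallback, is our reading of the procedure around Equation 4.

```python
def predict(message, dictionary, fallback):
    """Predict a reaction vector by averaging the dictionary vectors of the
    message's unique words; fall back to the train-set mean if none match."""
    words = set(message.split())
    vecs = [dictionary[w] for w in words if w in dictionary]
    if not vecs:
        return list(fallback)        # mean vector of the train dataset
    n = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(n)]
```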

C. Defining the Evaluation Statistics
To evaluate the performance of the prediction process, a number of statistics were calculated. Equation 5 shows the calculation of the accuracy $A_r$ for reaction $r$, where $e_r$ is the expected (actual) value for the entry as calculated in Equation 2, and $p_r$ is the predicted value calculated in Equation 4:

$$A_r = \min(e_r, p_r) \qquad (5)$$

The accuracy can be defined this way since we are effectively solving a bin packing problem in which the vector values sum to 1. Equations 6, 7, and 8 show the calculation of recall ($R_r$), precision ($P_r$), and the F1 score ($F1_r$) respectively, where the notation is the same as in Equation 5:

$$R_r = \frac{\min(e_r, p_r)}{e_r} \qquad (6)$$

$$P_r = \frac{\min(e_r, p_r)}{p_r} \qquad (7)$$

$$F1_r = \frac{2 \cdot P_r \cdot R_r}{P_r + R_r} \qquad (8)$$
The above measures were calculated for each entry of the dataset and the average value of each measure was assigned as the resultant performance measure of the dataset. Those values were then averaged across 5 runs of the code.
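These per-entry measures can be sketched for a single reaction value as follows; the small `eps` guard against division by zero is our addition, not part of the original definitions.

```python
def entry_scores(expected, predicted, eps=1e-9):
    """Per-entry accuracy, recall, precision, and F1 for one reaction.

    `expected` and `predicted` are the normalized (0-1) reaction values of
    one entry; the overlap min(e, p) is the shared probability mass."""
    overlap = min(expected, predicted)            # accuracy (Eq. 5)
    recall = overlap / (expected + eps)           # Eq. 6
    precision = overlap / (predicted + eps)       # Eq. 7
    f1 = 2 * precision * recall / (precision + recall + eps)  # Eq. 8
    return overlap, recall, precision, f1
```

Averaging these values over all test entries, and then over the 5 runs, yields the reported dataset-level measures.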

D. All Reaction Set Model
The All Reaction Set Model was developed following the same procedure as the Core Reaction Set Model. In addition to the reactions included in the core reaction set, Like and Thankful were considered in this step. Equation 9 depicts how the sum of reactions $s^*_i$ for entry $i$ is obtained over the full reaction set $R$, while Equation 10 gives the normalized value $c^*_{r,i}$ for each reaction $r$ with raw count $c_{r,i}$:

$$s^*_i = \sum_{r \in R} c_{r,i} \qquad (9)$$

$$c^*_{r,i} = \frac{c_{r,i}}{s^*_i} \qquad (10)$$

The sentiment vector for each entry was then generated following the same procedure as in Section III-B, and the evaluation was done as described in Section III-C.
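Equations 9 and 10 amount to re-normalizing over all seven reactions instead of the core five; a minimal sketch (dictionary layout assumed):

```python
# All seven reactions, including the Like and Thankful outliers.
ALL_REACTIONS = ['like', 'love', 'wow', 'haha', 'sad', 'angry', 'thankful']

def normalize_all(counts):
    """Normalize a post's counts over the full reaction set (Eqs. 9-10)."""
    total = sum(counts[r] for r in ALL_REACTIONS)        # Eq. 9
    if total == 0:
        return {r: 0.0 for r in ALL_REACTIONS}
    return {r: counts[r] / total for r in ALL_REACTIONS}  # Eq. 10
```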

E. Star Rating Model
The next step of the study was inspired by the procedure proposed by De Silva et al. [8], who use the star rating associated with Amazon customer reviews to generate sentiment vectors. The star rating takes a value between 1 and 5, where 3 is considered neutral, and values above 3 and below 3 are considered positive and negative respectively. To adjust Facebook reactions to this scale, we classified Love and Wow as positive and Sad and Angry as negative. The Haha reaction is considered uncertain due to its conflicting use cases: the reaction is often used both genuinely and sarcastically on the platform [32]. Therefore, the experiment was carried out considering only the Love, Wow, Sad, and Angry reactions. The normalization process described in Section III-B for the Core Reaction Set Model was updated by modifying Equation 1 as shown in Equation 11 and Equation 2 as shown in Equation 12, where $\acute{s}_i$ is the modified sum of reactions of entry $i$. Figure 4 presents the distribution of the selected reactions in the corpus.
The positive sentiment value $S_{(+,i)}$ for entry $i$ was calculated by summing the normalized Love ($\acute{c}_{Love,i}$) and normalized Wow ($\acute{c}_{Wow,i}$) values, while the negative sentiment value $S_{(-,i)}$ was calculated by summing the normalized Sad ($\acute{c}_{Sad,i}$) and normalized Angry ($\acute{c}_{Angry,i}$) values, as shown in Equations 13 and 14. Using $S_{(+,i)}$ and $S_{(-,i)}$, the aggregated sentiment $S_i$ for entry $i$ was calculated as shown in Equation 15:

$$S_{(+,i)} = \acute{c}_{Love,i} + \acute{c}_{Wow,i} \qquad (13)$$

$$S_{(-,i)} = \acute{c}_{Sad,i} + \acute{c}_{Angry,i} \qquad (14)$$

$$S_i = S_{(+,i)} - S_{(-,i)} \qquad (15)$$
The Star Rating Value $SR_i$ for entry $i$, which is calculated over the entire dataset, was computed as shown in Equation 16, where $D$ is the set of entries in the dataset:

$$SR_i = 1 + 4 \cdot \frac{S_i - \min_{j \in D} S_j}{\max_{j \in D} S_j - \min_{j \in D} S_j} \qquad (16)$$
The sentiment vector $\vec{S}_i$ for entry $i$ is defined in Equation 17, where $S_{(+,i)}$, $S_{(-,i)}$, and $SR_i$ were calculated as described above:

$$\vec{S}_i = \left( S_{(+,i)},\; S_{(-,i)},\; SR_i \right) \qquad (17)$$
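The sentiment aggregation and star-rating steps can be sketched as follows. The min-max rescaling of the aggregated sentiment onto the 1-5 range is our reading of the description above (a difference re-scaled over the whole dataset), and the field names are assumptions.

```python
def star_ratings(posts):
    """Compute positivity/negativity sums and a 1-5 star value per post.

    Haha is excluded due to its conflicting (genuine/sarcastic) use cases."""
    sentiments = []
    for p in posts:
        total = p['love'] + p['wow'] + p['sad'] + p['angry']
        if total == 0:
            sentiments.append(0.0)
            continue
        pos = (p['love'] + p['wow']) / total    # positive sentiment
        neg = (p['sad'] + p['angry']) / total   # negative sentiment
        sentiments.append(pos - neg)            # aggregated sentiment
    # Re-scale the aggregated sentiment onto the 1-5 star range.
    lo, hi = min(sentiments), max(sentiments)
    span = (hi - lo) or 1.0                     # guard against a constant column
    return [1 + 4 * (s - lo) / span for s in sentiments]
```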
Once the vectors were computed, the processing of the test and train sets, the building of the dictionary, and the evaluation of the model were conducted akin to Sections III-B and III-C. The performance measures of the model were calculated using Gaussian distances.

1) Accuracy:
The accuracy of prediction for each post was measured in terms of the True Gaussian Distance of the post, which is defined as the Gaussian distance to the predicted Star Rating Value of the post from its true Star Rating Value, on a distribution centred on the true Star Rating Value. It should be noted that the raw star rating values, before discretization into classes, are utilized here. The accuracy $\acute{a}_p$ of a post $p$ with True Gaussian Distance $g_p$ is calculated as shown in Equation 18. Equation 19 then describes the calculation of the accuracy $\acute{A}_k$ for a class $k$ containing $n_k$ posts.
2) Precision: In order to calculate the precision of predictions, the Gaussian Trespass of each post into its predicted class was considered. The trespass was measured as the Gaussian distance from the boundary of the true class of the post to the midpoint of its predicted class, on a Gaussian distribution centred around the midpoint of the true class. Equation 20 shows the calculation of the precision of each star rating class, where $\acute{P}_k$ represents the precision value of class $k$, $n_k$ represents the number of correctly classified posts in class $k$, and $t_p$ represents the trespass value of post $p$ in class $k$.
3) Recall: The recall value was calculated for each post in terms of its Class Gaussian Distance, which is defined as the Gaussian distance to the midpoint of the predicted Star Rating Class of the post from the midpoint of its true Star Rating Class, on a distribution centred on the midpoint of the true class. The recall value $\acute{R}_k$ for a class $k$ consisting of $n_k$ Facebook posts, each with a recall of $\acute{r}_p$, was obtained as depicted by Equation 21.
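The Gaussian-distance equations themselves are not reproduced above. One plausible reading, sketched below under stated assumptions, is that a prediction's score is the height of a normal curve centred on the true value, scaled so a perfect prediction scores 1; the choice of σ = 0.5 (matching the 0.5-interval class width) is our assumption, not a value given in the text.

```python
import math

def gaussian_similarity(x, center, sigma=0.5):
    """Height of a unit-peak Gaussian centred on `center`, evaluated at `x`.

    Returns 1.0 for a perfect prediction and decays smoothly with distance."""
    return math.exp(-((x - center) ** 2) / (2 * sigma ** 2))

def post_accuracy(predicted_star, true_star, sigma=0.5):
    """Assumed form of the per-post accuracy: the True Gaussian Distance
    between the raw predicted and true Star Rating Values."""
    return gaussian_similarity(predicted_star, true_star, sigma)
```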

5) Overall Performance:
The overall performance measures for star rating were calculated by taking a weighted mean of performance measures of classes, with weights assigned based on the class size.
IV. RESULTS

Table III shows the results obtained for the performance measures defined in Section III-C, for the Core Reaction Set Model introduced in Section III-B and the All Reaction Set Model introduced in Section III-D. All reactions except Sad reach their highest F1 score at the 95%-05% train-test division, while the Sad reaction reaches its peak F1 score at the 80%-20% division. Interestingly, the performance of the model in predicting each reaction roughly follows a specific pattern: reactions used more often in the dataset tend to have a higher F1 score than reactions used less often, with the exception of the F1 score of Wow being higher than that of Sad. Figure 5 portrays the F1 score for each reaction as the train-test division varies for the Core Reaction Set Model. In the case of the All Reaction Set Model, as shown in Table III, while the F1 score of Like was much higher than those of the other reactions, its inclusion brought significant reductions in the F1 scores of the other reactions. The Thankful reaction had an F1 score of almost zero.
The overall results obtained for the Star Rating Model introduced in Section III-E are shown in Table V. In contrast to the results obtained for the Positive and Negative components, aggregating the reactions into a single Star Rating value causes a significant decrease in precision, possibly due to the discrete nature of the Star Rating value, which is divided into bins at 0.5 intervals. Figure 6 portrays the change in F1 score with the train-test division.
The class-wise results for the Star Rating Model are shown in Table VI. It can be observed that the model exhibits better performance when predicting more neutral star rating values. While the accuracy and recall measures show comparable performance across all classes, the difference becomes much more prominent in precision. Consequently, a notable increase in performance is observed for the more neutral classes in terms of F1 score. Further exploration revealed that the root cause of this behaviour is that the predictions of the model tend to lean towards the more neutral classes, as portrayed in Table VII. It should be noted that the extremely positive and extremely negative classes are significantly larger than the comparatively neutral classes.
As portrayed by Figure 5, the performance of the models remains largely unaffected by the chosen train-test division. The reason could be the large size of the dataset: the number of unique words in the train dataset does not change significantly across different train-test divisions.

V. CONCLUSION
Upon comparing the Star Rating Model with the Core Reaction Set Model, it becomes evident that the F1 scores improve significantly when the separate reaction values are accumulated into the two categories Positive and Negative. A possible reason is that intra-category measurement errors are eliminated by the merging. However, merging all reactions into a single Star Rating value accentuates errors, which can be attributed to the additional error margin introduced by discretization. Further, the model predictions for Star Rating Classes closer to the median prove to be better than those for the edge classes. The negative effect of the Like and Thankful reactions, which were eliminated in the Core Reaction Set Model due to their abnormal counts, was demonstrated as well: the inclusion of those reactions caused significant reductions in the F1 scores of the other reactions, as can be seen from the results of the All Reaction Set Model.
This study represents modelling efforts that may be considered classical and limited in nature. Recent years have seen significant growth in machine learning algorithms delivering exceptional results in many domains of text analysis, especially in finding non-linear relationships in data. The work in [33] highlights a number of preprocessing steps (such as dimensionality reduction using topic modelling or principal component analysis) and algorithms that may be combined with the feature engineering work presented here (especially the selection of useful data classes and the reduction to a star rating) for potentially more accurate models in the future. As noted therein, deep learning techniques hold particular promise. This is further explored in the work of Weeraprameshwara et al. [34], [35], which can be considered a continuation of this research; it tests new models and develops a new embedding system using the Facebook data.
This study uses a word embedding developed by the work of Senevirathne et al. [23] for the Facebook dataset. However, developing an embedding structure based on the dataset itself may provide better sentiment annotation. Further enhancements could be made by introducing granularity to the embedding structure, such as sentence embeddings.
An alternative to more sophisticated modelling would be to examine pre-processing techniques that are not yet possible in Sinhala as of the time of writing, due to limited or missing language resources and tooling, as noted by de Silva [10]; building these tools may yield further increases in accuracy even with a simplistic model.