A Deep CNN Architecture with Novel Pooling Layer Applied to Two Sudanese Arabic Sentiment Datasets

Arabic sentiment analysis has become an important research field in recent years. Initially, work focused on Modern Standard Arabic (MSA), which is the most widely used form. Since then, work has been carried out on several different dialects, including Egyptian, Levantine and Moroccan. Moreover, a number of datasets have been created to support such work. However, up until now, less work has been carried out on Sudanese Arabic, a dialect which has 32 million speakers. In this paper, two new publicly available datasets are introduced, the 2-Class Sudanese Sentiment Dataset (SudSenti2) and the 3-Class Sudanese Sentiment Dataset (SudSenti3). Furthermore, a CNN architecture, SCM, is proposed, comprising five CNN layers together with a novel pooling layer, MMA, to extract the best features. This SCM+MMA model is applied to SudSenti2 and SudSenti3, with accuracies of 92.75% and 84.39%, respectively. Next, the model is compared to other deep learning classifiers and shown to be superior on these new datasets. Finally, the proposed model is applied to the existing Saudi Sentiment Dataset and to the MSA Hotel Arabic Review Dataset, with accuracies of 85.55% and 90.01%, respectively.


Introduction
Sentiment analysis is an important field because it enables us to discover voices and opinions relating to topics of interest in a particular context, for example, views about political issues in elections, or opinions about products or ways of providing services. With the emergence of participatory web services in areas such as education and health, there is a need for sentiment analysis to identify problems and hence upgrade quality standards. Recently, the spread of Arabic content, especially on social media, and the application of artificial intelligence and deep learning to the analysis of Arabic sentiment, have led researchers to delve deeper into Arabic text. Initially, this work was carried out on Modern Standard Arabic (MSA). However, more recent work has also been concerned with the regional Arabic dialects that are often used in everyday informal communication. There are a significant number of linguistic differences between the dialects and MSA:
• MSA has a dual form, in addition to the singular and plural forms, for masculine and feminine, with short vowels omitted in written text. Dialects often do not make such distinctions between the genders; instead they have an open system which is more complex than MSA, allowing prefixes and suffixes to be attached to a base, and pronouns to function as indirect objects.
• MSA has a complex system of grammar, which dialects sometimes lack. Dialects may not use diacritics, whereas most instances in MSA can be represented with directly written diacritics, since adverbs and objects are expressed using suffixes.
• Arabic vocabulary varies according to the dialect used. For example, the word for 'money' in MSA differs from that used in the dialects.
• There are differences in the conjugation of verbs, even though the root is retained. For example, the conjugations of the same root differ between MSA and the dialects, as can be seen in the word for 'he plays'.
The World Arabic Language Dialects map3 indicates 21 Arabic dialects and shows the different regions of the world in which they are spoken. This can give us hints about how dialects are related to each other. Table 1 (derived from istizada.com4) shows the number of speakers for eight of the most important dialects. As can be seen, Sudanese Arabic is the fifth most widely spoken dialect, with 32 million speakers. This is why we have concentrated on Sudanese in this work. The main contributions of this paper are as follows:
• We build a public 2-class Sudanese sentiment dataset, SudSenti2, from Facebook and YouTube.
• We build a public 3-class Sudanese sentiment dataset, SudSenti3, from Twitter 5.
• We design a Sudanese stop-word list, and use it for text normalization in the preprocessing phase.
• We propose a Sentiment Convolutional Model (SCM) which is a five-layer Convolutional Neural Network incorporating our Mean Max Average (MMA) pooling layer.
• We compare SCM with other machine learning and deep learning methods, and show that it gives a high classification performance.
• We also compare the proposed MMA pooling layer to the standard pooling layer used in other works, and show that it gives the best performance.
• Finally, we apply SCM to datasets in both MSA and Saudi dialect; the proposed approach once again shows a high performance.
The paper is organized as follows. Section 2 reviews previous work on sentiment analysis for Arabic. Section 3 describes the creation of the SudSenti2 and SudSenti3 datasets. Section 4 outlines the proposed model architecture. Section 5 presents our experiments, including preprocessing steps, experimental settings, baselines, results and discussion. Finally, Section 6 draws conclusions and suggests future work.

Related work
As we have mentioned, the Arabic language is very widespread in the world and is spoken in many dialects. Some of these have already been the subject of sentiment analysis research (Table 4). Here, we discuss which dialects have been studied, what datasets were used, and what sentiment analysis techniques were adopted. For a recent review of Arabic sentiment analysis, please also refer to [10].
[6] applied NB, SVM, Decision Tree (DT), and K-Nearest Neighbor (KNN) algorithms to four Arabic datasets: the Opinion Corpus for Arabic (OCA) [36], a Modern Standard Arabic (MSA) dataset, the Crawler tweets 2014 dataset [2], and the Large-scale Arabic Book Reviews dataset (LABR) [13]. The aim was to determine the emotions of the Arabic text, using methods based on bigrams and voting combinations. Accuracy was 93.4%, better than the individual classifiers.
[5] used Logistic Regression (LR) with a Term Frequency-Inverse Document Frequency (TF-IDF) weighting model for feature extraction, on Arabic Services Reviews in Lebanon (ASRL), collected from Google reviews and the Zomato website in the Lebanese dialect. For positive classifications, P = 0.88 and R = 1.00, and for negative, P = 0.80 and R = 0.80. Thus the positive result is better than the negative.
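The TF-IDF weighting used for feature extraction in [5] can be illustrated with a short from-scratch sketch (a simplified computation, not the authors' implementation; the toy documents are invented):

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.
    TF is the raw count normalized by document length; IDF is
    log(N / df) with N the number of documents and df the number
    of documents containing the term."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        length = len(doc)
        weights.append({t: (c / length) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights

docs = [["good", "service"], ["bad", "service"], ["good", "food"]]
w = tfidf(docs)
# "service" appears in 2 of 3 documents, so its IDF is log(3/2)
```

The resulting sparse weight vectors would then be fed to a classifier such as LR.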
[1] utilized SVM and LSTM on both Modern Standard Arabic and the Algerian dialect [30]. The results for SVM and LSTM on the Algerian dataset were 86% and 81%, respectively.
[25] presented a comprehensive Arabic preprocessing approach, then designed two architectures, MC1 and MC2. On the difficult ASTD dataset [32], for the 4-class task, accuracy was 73.17%, on 3-class it was 78.62%, and on 2-class it was 90.06%. On the large 2-class ATDFS dataset [11], their model worked effectively, with performance up to 92.96%.
Finally, the task of Arabic preprocessing has received attention in previous work [37, 38, 18], including an approach presented by the authors [25]. Here we create two Sudanese Arabic sentiment datasets, one 2-Class and one 3-Class.
After detailed preprocessing, we apply the proposed classifier SCM+MMA and compare its performance to ML and NN classifiers.

Dataset Creation
In this work, two new datasets for Sudanese are proposed. The 2-Class Sudanese Sentiment Dataset (SudSenti2) was created from Facebook and YouTube. The 3-Class Sudanese Sentiment Dataset (SudSenti3) was created from Twitter posts. Table 6 shows summary information for the two datasets.

SudSenti2 Dataset
The following steps were carried out:
1. Texts were collected from Facebook 6 and YouTube 7.
2. All texts matching queries such as the following 8 were downloaded, using the Orange Data Mining software 9. This resulted in 4,544 matching posts.
3. Three judges were chosen to classify the posts. All were university teachers who were native speakers of Sudanese Arabic. All judges judged all posts.
4. Posts which were not considered Sudanese by at least two of the three judges were deleted.
5. Each post was then classified as Negative, Positive or Neutral. (Neutral posts were subsequently deleted.) A text is considered positive if it contains joyful, happy, or amusing vocabulary, if there is a positive emoji, or if there is more than one emotion but the positive feeling is dominant. A text is negative if it contains negative, disappointed, sad, or disturbing vocabulary, if there is a negative emoji, or if there is more than one feeling and the negative emotion is dominant. Finally, a text is considered neutral if it is not clearly positive or negative.
6. Judges worked independently. If at least two of the three judges classified a post as negative, it was labelled Negative, and similarly for Positive. Neutral posts were deleted from the collection, resulting in a 2-class dataset.
7. By following the above procedure, 4,000 posts were selected from the original 4,544. The final SudSenti2 dataset contains 2,027 positive posts and 1,973 negative posts.
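The 2-of-3 judge agreement in steps 4-6 amounts to a simple majority vote. A minimal sketch (the post IDs and votes below are invented examples, not data from the corpus):

```python
from collections import Counter

def majority_label(votes, threshold=2):
    """Return the label assigned by at least `threshold` judges,
    or None when no label reaches agreement (the post is discarded)."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= threshold else None

posts = {
    "p1": ["Positive", "Positive", "Negative"],
    "p2": ["Neutral", "Neutral", "Neutral"],
    "p3": ["Positive", "Negative", "Neutral"],   # no agreement
}
labels = {p: majority_label(v) for p, v in posts.items()}
# Drop undecided posts and Neutral posts to obtain the 2-class dataset.
dataset = {p: l for p, l in labels.items() if l not in (None, "Neutral")}
```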

SudSenti3 Dataset
The following steps were carried out to produce the 3-class dataset.

A stopword list has been produced containing MSA and colloquial Arabic stopwords used in Sudan. The list contains 269 words and 2,095 characters (Table 5 11).
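Stopword removal in the normalization step can be sketched as follows (the entries shown are a tiny invented stand-in for the actual 269-word Sudanese list):

```python
import re

# Hypothetical entries standing in for the real 269-word stopword list.
SUDANESE_STOPWORDS = {"في", "من", "على", "دا", "دي"}

def normalize(text, stopwords=SUDANESE_STOPWORDS):
    """Minimal normalization: strip punctuation, drop digits and
    stopwords, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text)   # \w matches Arabic letters too
    tokens = [t for t in text.split()
              if t not in stopwords and not t.isdigit()]
    return " ".join(tokens)
```

For example, `normalize("دا الفيلم جميل!")` removes the colloquial stopword and the punctuation, keeping only the content words.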

Text Encoding
Input layer: The input layer receives text data as X = (x_1, x_2, ..., x_n), where n is the number of words and each word is represented by a vector of dimension m. Each word vector thus lies in the dimensional space R^m, and R^(m×n) is the input text dimension space.
Word embedding layer: Let the vocabulary size be d for the text representation used to carry out word embedding. The term embedding matrix is then A ∈ R^(m×d). The input text X = (x_I), where I = 1, 2, ..., n and X ∈ R^(m×n), is now passed from the input layer to the embedding layer to produce the term embedding vectors for the text. For the dialects we use TF-IDF, and for MSA we use the AraVec [42] word embeddings pre-trained with Word2Vec [27] on Twitter text.
The representation of the input text X = (x_1, x_2, ..., x_n) ∈ R^(m×n) as numerical word vectors is then fed into the model, where x_1, x_2, ..., x_n are the word vectors, each lying in the space R^m of the embedding vocabulary.
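The encoding above can be sketched as follows. The toy vocabulary, the fixed length n = 5, and the random embedding matrix A are invented stand-ins for the TF-IDF / AraVec representations used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4                                           # embedding dimension m
vocab = {"<pad>": 0, "الفيلم": 1, "جميل": 2}     # toy vocabulary, size d = 3
A = rng.normal(size=(m, len(vocab)))            # embedding matrix A in R^(m x d)

def encode(tokens, n=5):
    """Map a token sequence to the matrix X in R^(m x n) fed to the
    model, padding/truncating to a fixed length n. Unknown tokens
    fall back to the <pad> index."""
    ids = [vocab.get(t, 0) for t in tokens][:n]
    ids += [0] * (n - len(ids))
    return A[:, ids]        # columns are the word vectors x_1 .. x_n

X = encode(["الفيلم", "جميل"])
```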

Mean Max Average Pooling
In CNNs, the pooling function is essential for extracting specific features from the feature map. The aim of pooling is to determine the output e_k of each pooling region P_k for k = 1, ..., K, where the set of activations in P_k is c_1, ..., c_|P_k| and |P_k| denotes the number of activations. By collecting the outputs of all the pooling regions, the pooled feature map E = (e_1, ..., e_K) is obtained. We will start with a quick overview of standard pooling strategies:
Max pooling [34]: This takes the biggest activation in the pooling region, e_k = max_{1≤i≤|P_k|} c_i. Max pooling is ideal for extracting local characteristics from a feature map, such as edges, lines, and textures.
Average pooling [22]: This calculates the mean value of the activations in the pooling region, e_k = (1/|P_k|) Σ_{i=1}^{|P_k|} c_i. By smoothing the pooling region in this way, it is possible to extract global characteristics.
Min pooling [41]: This calculates the minimum value of the activations in the pooling region, e_k = min_{1≤i≤|P_k|} c_i.
Our proposed Mean Max Average (MMA, Figure 3) pooling calculates the mean of the max value and the average value, e_k = (max_{1≤i≤|P_k|} c_i + (1/|P_k|) Σ_{i=1}^{|P_k|} c_i) / 2. MMA aims to combine the advantages of Max pooling and Average pooling.
The SCM architecture comprises five convolutional layers with kernel size = 3, padding = 'valid', activation = 'ReLU', and strides = 1. These are followed by the proposed Mean Max Average (MMA) pooling function with 1D pool size = 2, then a Dense layer of 32 units with activation = 'ReLU', then Dropout 0.5, then batch normalization and another Dropout 0.5, then Flatten, and finally a softmax layer. This is fully connected to predict the sentiment between three classes (Positive, Negative, Neutral) or two classes (Positive, Negative).
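The pooling computations just described can be illustrated with a NumPy sketch over a 1-D feature map (invented values; the actual model implements this as a TensorFlow layer, which is not reproduced here). For each pooling region, MMA returns the mean of the region's max and its average:

```python
import numpy as np

def pool(c, size=2, mode="mma"):
    """1-D pooling over non-overlapping regions of width `size`.
    'mma' returns (max + average) / 2 for each region, i.e. the
    mean of the max and the average activations."""
    c = np.asarray(c, dtype=float)
    regions = c[: len(c) // size * size].reshape(-1, size)
    mx, avg = regions.max(axis=1), regions.mean(axis=1)
    if mode == "max":
        return mx
    if mode == "avg":
        return avg
    if mode == "min":
        return regions.min(axis=1)
    return (mx + avg) / 2.0          # MMA

c = [1.0, 3.0, 2.0, 2.0]
# region [1, 3]: max = 3, avg = 2, so MMA = 2.5
# region [2, 2]: max = avg = 2, so MMA = 2.0
```

Note how MMA keeps some of the sharpness of Max pooling while being damped toward the region mean, which is the combination of local and global behaviour the layer aims for.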

Experiments
Our experiments include four aspects (Figure 1): 1. Preprocessing the datasets and checking the steps.
2. Utilizing existing machine learning and deep learning methods to verify performance.
3. Applying the proposed method.

Datasets
For sentiment classification on Sudanese Arabic text, machine learning and deep learning models are trained using the new SudSenti2 and SudSenti3 datasets introduced in Section 3.
For the Saudi dialect we use the Saudi Sentiment Dataset (SSD) [14] 12 .
SSD consists of two classes, 2,436 positive tweets and 1,816 negative tweets.
For sentiment classification in MSA, the models are trained using the Hotel Arabic Reviews Dataset (HARD) 13. It is a rich dataset, with more than 370,000 reviews expressed in MSA. Here we utilized two classes, with 5,857 positive and 6,353 negative reviews.
Table 6 shows the details of the datasets.

Experimental Settings
We used machine learning algorithms and deep learning models for training with all the Arabic sentiment datasets, for both 2-way and 3-way classification. The machine learning algorithms were Naive Bayes (NB) [28], Logistic Regression (LR) [8], Support Vector Machines (SVM) [29] and Random Forest (RF) [35].
For the SSD and HARD datasets, we applied the proposed approach and existing deep learning models. The settings for the experiments are shown in Table 7. We used our own tuning and hyperparameter values and chose the TensorFlow framework for the implementation.

Experiment 1: Two-way Sentiment Classification
The aim was to evaluate the proposed SCM+MMA model in 2-way sentiment classification, working with the SudSenti2, SSD and HARD datasets. SudSenti2 was introduced in Section 3. As baselines there are four ML models (LR, RF, NB, SVM) and three NN models (RNN, CNN, CNN-LSTM). The configuration of SCM+MMA is shown in Table 7. Ten-fold cross-validation was used for all models and the average performance reported. The results are shown in Table 8.
On the SSD dataset (Saudi dialect), the best model is SCM+MMA (84.02%) and the best baseline is CNN-LSTM (83.55%). Finally, on the HARD dataset (MSA), the best model is again SCM+MMA (88.37%), as against the best baseline, CNN (87.06%).
In summary, the experiment showed that the proposed model performed well on two-class datasets.
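The ten-fold evaluation protocol used in these experiments can be sketched as follows (`train_and_eval` is a placeholder for fitting any of the above classifiers and returning its test accuracy; this is not the authors' actual harness):

```python
import numpy as np

def kfold_accuracy(X, y, train_and_eval, k=10, seed=0):
    """Average accuracy over k folds. `train_and_eval` is any callable
    (X_tr, y_tr, X_te, y_te) -> accuracy; each example is held out
    exactly once."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    accs = []
    for i in range(k):
        te = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        accs.append(train_and_eval(X[tr], y[tr], X[te], y[te]))
    return float(np.mean(accs))
```

The reported numbers are then the mean of the per-fold accuracies.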

Experiment 2: Three-way Sentiment Classification
The aim was to evaluate SCM+MMA once again, this time on the new 3-way Sudanese dataset, SudSenti3 (Section 3).
3-way classification is known to be a harder task than 2-way, particularly as the Neutral class can contain examples with both positive and negative aspects, a factor which may confuse the model. As baselines there are the ML models from Experiment 1 and three NN models (RNN, CNN, CNN-LSTM). The configuration of SCM+MMA was the same as in Experiment 1 (Table 7), except that there were three outputs, not two. Once again, ten-fold cross-validation was used for all models. The results are shown in Table 9. The best performing model is SCM+MMA (85.23%). The best ML baseline is LR (79.37%) and the best NN baseline is CNN (83.61%).

Experiment 3: Evaluation of MMA Pooling
A key part of the proposed SCM+MMA model is the MMA pooling layer. The aim of this experiment, therefore, was to compare MMA with three commonly-used pooling layers: Max, Avg and Min. Firstly, in 2-way classification, the performance of SCM+Max, SCM+Avg and SCM+Min was compared with SCM+MMA on SudSenti2, SSD and HARD (compare with Experiment 1). Fifteen-fold cross-validation was used and the average performance reported. Results are shown in Table 10. SCM+MMA is the best performing model on all three datasets (SudSenti2 92.75%, SSD 85.55%, HARD 90.01%). The best baseline pooling layer varies between Avg and Max by dataset (SudSenti2: Avg 91.75%; SSD: Max 84.72%; HARD: Max 89.27%).
Secondly, in 3-way classification, the performance of SCM+Max, SCM+Avg and SCM+Min was compared with SCM+MMA on SudSenti3 (compare with Experiment 2). Fifteen-fold cross-validation was again used. Results are shown in Table 11. Once again, SCM+MMA is the best performing model (84.39%), with the best baseline being Max (84.11%).
In conclusion, the MMA pooling layer performs well compared to Max, Avg and Min.

Accuracy during training
Figure 4 shows the accuracy and validation accuracy of the NN baseline models and the proposed method with the SudSenti2 dataset. After 50 epochs, the SCM+MMA model shows the highest performance, reaching 92.25%. Figure 5 shows the same information for the SSD dataset (SCM+MMA reaches 84.02%), while Figure 6 is for the HARD dataset (SCM+MMA reaches 88.37%).
Figure 7 shows the accuracy and validation accuracy for the NN models and the proposed model with the SudSenti3 dataset. After 50 epochs, SCM+MMA reaches 85.23%. We note that the proposed method was stable over epochs for training and validation with different datasets.

Conclusion and Future Work
In this paper, we first presented two new sentiment datasets for the Sudanese dialect of Arabic. SudSenti2 was collected from Facebook and YouTube, while SudSenti3 was based on Twitter tweets. Following a discussion of Arabic pre-processing methods appropriate to sentiment classification, we proposed a new model for this task, SCM+MMA. This comprises five convolutional layers plus MMA, our proposed pooling layer. In 2-way sentiment classification using the SudSenti2 (Sudanese), SSD (Saudi) and HARD (MSA) datasets, SCM+MMA gave the best performance relative to ML and NN baselines. In 3-way classification using SudSenti3, SCM+MMA was also superior to the baselines. Finally, the proposed MMA pooling was compared to Max, Avg and Min baselines and shown to perform better than them in both 2-way and 3-way classification.
In future work, we plan to use an attention mechanism as part of a more complex deep learning method, to extract features from a huge corpus covering all Arabic sentiment dialects.

Data Availability
The SudSenti2 and SudSenti3 datasets are publicly available 15 .

Acknowledgments
The research for this paper is supported by the Science Support Agency (Grant no. 12345) and the Projects Fund (Grant no. 23456). Many thanks to George Kour for the ArXiv style: https://github.com/kourgeorge/arxiv-style.

Figure 1: Overall description of work.

For 2-way classification, Figures 8, 9 and 10 show the validation accuracy during training for the SudSenti2, SSD and HARD datasets. Finally, for 3-way classification, Figure 11 shows the validation accuracy during training for SudSenti3.

Figure 4: Accuracy and validation accuracy with the SudSenti2 dataset.

Figure 5: Accuracy and validation accuracy with the SSD dataset.

Figure 6: Accuracy and validation accuracy with the HARD dataset.

Figure 7: Accuracy and validation accuracy with the SudSenti3 dataset.

Figure 9: Validation accuracy with the SSD dataset.

Figure 10: Validation accuracy with the HARD dataset.

Table 3: Examples of differences between MSA and the Sudanese dialect. The English glosses of the examples are:
• Be careful today, don't go out of the room.
• Keep going down this road until you find a pharmacy.
• Leave drinking too much coffee, it is not good for your health.

Table 4: Previous work on sentiment analysis for different Arabic dialects.

Table 5: Examples from the Sudanese stopword list.

Table 6: Datasets for our experiments.

Table 8: Experiment 1: Accuracy of ML and NN sentiment classifiers on 2-class datasets. SudSenti2 is a new 2-class dataset for Sudanese, created from Facebook and YouTube (Section 3). SCM+MMA is the proposed model.

Table 9: Experiment 2: Accuracy of NN sentiment classifiers on the SudSenti3 3-class dataset, created from Sudanese Twitter posts (Section 3). SCM+MMA is the proposed model.

Table 10: Experiment 3: Accuracy of the SCM model with different pooling layers. The task is 2-class sentiment classification, applied to the SudSenti2, SSD, and HARD datasets. MMA is the proposed pooling layer.

Table 11: Experiment 3: Accuracy of the SCM model with different pooling layers. The task is 3-class sentiment classification, applied to the SudSenti3 dataset. MMA is the proposed pooling layer.