Depression Detection on Twitter Social Media Using Decision Tree

Depression is a major mood disorder whose symptoms can significantly interfere with daily activities. As technology has developed, people now frequently express themselves through social media, especially Twitter, a platform that allows users to post tweets and communicate with each other. Detecting depression from social media can therefore support early help for sufferers before further treatment is needed. This study built a system to detect whether a person indicates depression, using labels from the Depression Anxiety and Stress Scale - 42 (DASS-42) and the person's tweets, with the Classification and Regression Tree (CART) method and TF-IDF feature extraction. The best model achieved an accuracy score of 81.25% and an f1 score of 85.71%, higher than the baseline accuracy score of 62.50% and f1 score of 66.66%. In addition, we found that changing the maximum number of features in TF-IDF and the maximum depth of the tree had a significant effect on model performance.


Introduction
Depression is a mood disorder that causes severe symptoms affecting daily activities such as eating, sleeping, and working, as well as how a person feels and thinks [1]. According to WHO, depression affects 3.8% of the population worldwide, including 5.0% of adults and 5.7% of adults over 60 years old, or approximately 280 million people. Depression can cause great suffering and poor performance in daily activities; it can even lead to suicide. People with depression are frequently misdiagnosed, while people who are not depressed are prescribed antidepressants [2].
With the development of technology, people often express themselves through posts on social media. Budiman et al. [3] therefore collected Twitter data using keywords that indicate depressive disorders and involved psychiatrists in labeling whether the data indicated depression. That study suggests that it is possible to identify whether a person indicates depression through social media, especially Twitter.
Social media is an online platform where users with similar interests, backgrounds, or activities can socialize and interact without restrictions. Social media makes it possible for people to communicate with each other wherever they are and whenever they want [4]. According to Kepios, as of April 2022, 58.7% of people worldwide have social media accounts [5]. Twitter is a social media platform for connecting and communicating through the quick and frequent exchange of messages. Users can post tweets containing text, photos, videos, and links. Tweets are shown on the user's profile, can be seen by followers, and can be searched on Twitter [6]. The Statista Research Department reports that in January 2022, Twitter had 342.75 million monetizable daily active users worldwide, with Indonesia ranked fifth [7].
Many studies have been done to detect depression through social media, especially Twitter. Nugroho, K. S. et al. [8] studied the potential for depression and anxiety disorder on Twitter using BiLSTM and obtained an accuracy score of 94.12%. However, although the accuracy is high, BiLSTM can overfit if the dataset is not big enough. Ahmed Husseini et al. [9] studied depression detection for Twitter users using several methods and reported that a Recurrent Neural Network (RNN) reached an accuracy score of 91.245% but has limitations with long sentences. A study by Rizki, A. et  A. et al. [11] defined a binary classification task that identifies whether a person indicates depression based on their Twitter activity, using Support Vector Machine (SVM), Naive Bayes (NB), and Decision Tree (DT) with all possible combinations of feature values. The SVM model achieved the best accuracy at 82.5%, compared with 80% for NB and 77.5% for the DT model, which can fail when exposed to brand-new data. Le Yang et al. [12] classified depression from audio and video information using a Decision Tree; almost 100% of the samples were classified correctly, and the f1 score on the test set was 72.4%, which is higher than the baseline.
If we can detect through social media whether someone indicates depression, help can be given early, either professionally or as moral support from the people closest to them, before further treatment is needed. A system that detects whether a person indicates depression based on their tweets can therefore assist in providing treatment. This research builds a classification model that classifies tweet data to detect whether someone indicates depression. We propose the Decision Tree method because, in the study by Le Yang et al. [12], a Decision Tree classified almost 100% of the samples correctly with an f1 score of 72.4%, which suggests that a Decision Tree model can perform well on this task. In addition, we focus on increasing the accuracy and f1 score through hyperparameter tuning to obtain a model that makes better predictions.

Research Methods
This research on detecting whether users indicate depression builds on several previous studies. We propose a Decision Tree (DT) based model, namely Classification and Regression Tree (CART), that detects which users indicate depression from their tweets, using Term Frequency-Inverse Document Frequency (TF-IDF) for feature extraction. Figure 1 shows the flowchart of the system. This section explains the methods used in this research.

Data Collection
The dataset was obtained by crawling Twitter. Before crawling the tweets, we shared the Depression Anxiety Stress Scale 42 (DASS-42) questionnaire with respondents in order to label the dataset. DASS-42 is a psychological assessment scale that measures a person's level of depression, anxiety, and stress based on 42 questions. Each scale (depression, anxiety, and stress) contains 14 items. Table 1 shows the distribution of items [13]. Respondents assess themselves by giving each item a value from 0 to 3, where 0 means does not occur, 1 means rarely occurs, 2 means sometimes occurs, and 3 means often occurs. The DASS-42 is scored by calculating the total for each disorder, so the maximum score for each disorder is 3 x 14 = 42. Table 2 shows the severity of the disorder [13]. In this research, we use only the depression scale: a respondent is labeled as indicating depression if their score is above 9 (that is, 10 to 42), regardless of the severity of the disorder. Table 3 shows the 14 questions. After respondents completed the questionnaire and filled in their Twitter usernames, we crawled their tweets without date or keyword limits. The result of data collection contains the username, tweets, and label in CSV format and serves as the dataset for the next process. Table 4 shows an example of the data collection result.
Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) Vol.
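Table 3. DASS-42 depression scale questions (original Indonesian wording in parentheses)
4. Feeling sad and depressed (Merasa sedih dan depresi)
5. Losing interest in many things, for example eating, ambulation, socializing (Kehilangan minat pada banyak hal (misal: makan, ambulasi, sosialisasi))
6. Feeling unworthy as a person (Merasa diri tidak layak)
7. Feeling that life is not worthwhile (Merasa hidup tidak berharga)
8. Being unable to enjoy the things I do (Tidak dapat menikmati hal-hal yang saya lakukan)
9. Feeling hopeless and in despair (Merasa hilang harapan dan putus asa)
10. Finding it hard to be enthusiastic about many things (Sulit untuk antusias pada banyak hal)
11. Feeling worthless (Merasa tidak berharga)
12. Having no hope for the future (Tidak ada harapan untuk masa depan)
13. Feeling that life is meaningless (Merasa hidup tidak berarti)
14. Finding it hard to work up the initiative to do things (Sulit untuk meningkatkan inisiatif dalam melakukan sesuatu)
The dataset contains 157 users with usernames, tweets, and labels. Figure 2 shows the distribution of the dataset labels. There are two labels: "1" means the user indicates depression, and "0" means the user does not indicate depression. There were 92 users who indicated depression and 65 users who did not.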
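The labeling rule described in this section can be sketched in a few lines of Python. This is only an illustration of the stated rule (14 depression items scored 0 to 3, label 1 when the total is above 9); the function names and the two example respondents are hypothetical and do not come from the study's data.

```python
# Depression-scale labeling rule: 14 items, each scored 0 to 3.
DEPRESSION_ITEMS = 14
MAX_ITEM_SCORE = 3
THRESHOLD = 9  # totals above 9 (10 to 42) are labeled as indicating depression


def depression_score(answers):
    """Sum the 14 depression-item answers after validating them."""
    if len(answers) != DEPRESSION_ITEMS:
        raise ValueError("expected 14 depression-scale answers")
    if any(a not in range(MAX_ITEM_SCORE + 1) for a in answers):
        raise ValueError("each answer must be an integer from 0 to 3")
    return sum(answers)


def label_respondent(answers):
    """Return 1 (indicates depression) or 0 (does not indicate depression)."""
    return 1 if depression_score(answers) > THRESHOLD else 0


# Hypothetical respondents, for illustration only.
low = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 2]   # total 9, label 0
high = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 2]  # total 10, label 1
print(label_respondent(low), label_respondent(high))
```

The threshold is applied to the total only, which matches the statement that severity levels are not taken into account.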

Data Preprocessing
Data preprocessing improves data quality and model performance [14]. In this research, the preprocessing techniques are case folding, data cleaning, tokenization, stop word removal, and stemming. Case folding changes uppercase letters into lowercase letters [15]. Data cleaning removes noise such as numbers, emoticons, and punctuation, which carry unnecessary information [16]. Tokenization splits sentences into word tokens. Stop word removal removes unimportant words to reduce the word dimensions. Finally, stemming reduces affixed words to their base form [15]. Table 5 shows an example of data preprocessing.
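The five preprocessing steps can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: the stop word list is a tiny placeholder, the stemmer is a toy that strips two English suffixes, and the removal of URLs, mentions, and hashtags is an added assumption that the text does not state. A real system would use a full stop word list and a proper stemmer for the language of the tweets.

```python
import re

# Tiny placeholder stop word list (assumption); not the list used in the study.
STOP_WORDS = {"dan", "yang", "di", "the", "and", "a"}


def simple_stem(token):
    """Toy stemmer (assumption): strips the English suffixes 'ing' and 'ed'."""
    for suffix in ("ing", "ed"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(tweet, stop_words=STOP_WORDS, stem=simple_stem):
    """Apply case folding, cleaning, tokenization, stop word removal, stemming."""
    text = tweet.lower()                                 # case folding
    text = re.sub(r"https?://\S+|[@#]\w+", " ", text)    # cleaning: URLs, mentions, hashtags (assumption)
    text = re.sub(r"[^a-z\s]", " ", text)                # cleaning: numbers, emoticons, punctuation
    tokens = text.split()                                # tokenization
    tokens = [t for t in tokens if t not in stop_words]  # stop word removal
    return [stem(t) for t in tokens]                     # stemming


print(preprocess("Feeling sad and hopeless again... 24/7 :("))
```

For the example tweet, the sketch returns the tokens feel, sad, hopeless, and again.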

Feature Extraction with TF-IDF
Machine learning algorithms cannot process raw text directly. Instead, they need feature extraction to convert text into a matrix or vector [17]. Feature extraction also removes irrelevant data features to reduce the dimensions of the data space [18].
In this research, we propose Term Frequency-Inverse Document Frequency (TF-IDF) for feature extraction. TF-IDF is a technique that calculates a weight for each word. TF measures how often a word appears in a document, while IDF lowers the weight of words that appear in many documents. The more often a word appears in a document and the fewer documents contain it, the higher its weight [19].
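To make the weighting concrete, the following Python sketch computes TF-IDF in its textbook form, with TF as the relative frequency of a word in a document and IDF as log(N / df). The function name and the three toy documents are hypothetical. The `max_features` argument is an assumption modeled on the maximum features setting discussed in this paper: it keeps only the words that are most frequent across the corpus. Library implementations usually add smoothing and normalization, so their values can differ from the ones computed here.

```python
import math
from collections import Counter


def tf_idf(documents, max_features=None):
    """Return (vocabulary, matrix) of textbook TF-IDF weights.

    documents is a list of token lists. Each row of the matrix holds the
    weights of one document, in the order of the returned vocabulary.
    """
    n_docs = len(documents)
    corpus_counts = Counter(t for doc in documents for t in doc)
    doc_freq = Counter(t for doc in documents for t in set(doc))
    # Keep the most frequent words across the corpus (ties broken alphabetically).
    ranked = sorted(corpus_counts, key=lambda t: (-corpus_counts[t], t))
    vocabulary = sorted(ranked[:max_features])
    matrix = []
    for doc in documents:
        counts = Counter(doc)
        row = [
            (counts[t] / len(doc)) * math.log(n_docs / doc_freq[t]) if doc else 0.0
            for t in vocabulary
        ]
        matrix.append(row)
    return vocabulary, matrix


# Hypothetical preprocessed tweets, for illustration only.
docs = [["sad", "tired", "sad"], ["happy", "tired"], ["sad", "alone"]]
vocab, weights = tf_idf(docs, max_features=2)
print(vocab)
print(weights[0])
```

With `max_features=2`, only the words sad and tired are kept, and a word that appears in every document would receive a weight of zero because log(N / N) is 0.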

Modeling with Decision Tree
Decision Tree (DT) is an algorithm that represents data as a set of decision rules arranged in a tree [20]. DT is a tree-like classification model in which each branch represents a choice and each leaf represents the outcome of a decision. The advantage of this method is that it breaks a complex decision-making process into simpler and more specific decisions. In addition, DT is flexible in selecting features at its internal nodes. The selected features differentiate one criterion from the other criteria in the same node. This flexibility can improve the quality of the decision's results [21].
A tree starts with a root node that represents a decision. The root node is then split into branches representing the possible decisions. Finally, each leaf node represents a resulting class [21]. DT splits a node based on the best value of a splitting criterion, and different DT algorithms use different calculations to obtain it. Table 6 shows the calculation comparison [22].
CART is a term introduced by Breiman et al. (1984) to refer to DT algorithms for classification or regression modeling. CART uses the Gini index as its split criterion [23]. The Gini index is defined as:
$$Gini(D) = 1 - \sum_{j} p_j^2$$
where D is a dataset containing n samples and p_j is the relative probability that a sample of category j appears in D. The Gini index measures how mixed the categories are at a node. The lower the Gini index, the more uneven the category distribution of the samples, that is, the purer the node. A splitting point is therefore better at separating the categories when the subsets it creates have higher category purity [23]. When a splitting point divides D into two subsets D_1 and D_2, the Gini index of the split is:
$$Gini_{split}(D) = \frac{n_1}{n} Gini(D_1) + \frac{n_2}{n} Gini(D_2)$$
where Gini_{split}(D) is the Gini index of the attribute used for the split, n_1 is the number of samples in D_1, and n_2 is the number of samples in D_2 [23].
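As a worked example, the Gini index of a set of labels (one minus the sum of squared class proportions) and the weighted Gini index of a binary split can be computed as in the sketch below. The function names are illustrative. The only figures taken from this paper are the class counts of the dataset (92 users labeled 1 and 65 users labeled 0).

```python
from collections import Counter


def gini(labels):
    """Gini(D) = 1 - sum_j p_j^2 over the class proportions in D."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(left, right):
    """Weighted Gini index of a binary split of D into D1 (left) and D2 (right)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Class distribution reported for the dataset: 92 users labeled 1, 65 labeled 0.
dataset = [1] * 92 + [0] * 65
print(round(gini(dataset), 4))          # impurity before any split
print(gini_split([1, 1, 1], [0, 0]))    # a perfectly pure split has index 0.0
```

The whole dataset has a Gini index of about 0.485, close to the maximum of 0.5 for two classes, so the root node is far from pure and CART looks for the split that lowers the weighted index the most.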

Evaluation
In this research, we used the accuracy score and f1 score to evaluate the system's performance. Accuracy represents the proportion of data that is classified correctly. Accuracy is obtained from the True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) counts of the confusion matrix. The confusion matrix represents the actual and predicted classes in a square matrix [24]. Accuracy is defined as:
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
The f1 score is the harmonic mean of precision and recall. A high f1 score means the model has good precision and recall values. Precision is the ratio between TP and the total data predicted to be positive, and recall is the ratio between TP and the total data that is actually positive [25]. F1 score, precision, and recall are defined as:
$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
$$Precision = \frac{TP}{TP + FP}$$
$$Recall = \frac{TP}{TP + FN}$$
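The sketch below computes accuracy, precision, recall, and f1 score from confusion matrix counts, using the standard definitions. The paper does not report a confusion matrix, so the counts in the example are hypothetical: they are one set of values for 16 test users that happens to reproduce the accuracy score of 81.25% and f1 score of 85.71% reported for the best model, and the actual split between false positives and false negatives may differ.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and f1 from confusion matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts (assumption): not taken from the paper.
metrics = classification_metrics(tp=9, tn=4, fp=2, fn=1)
print({name: round(value * 100, 2) for name, value in metrics.items()})
```

Because the f1 score ignores true negatives, it can move differently from accuracy when the two classes are unbalanced, which is why both scores are reported.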

Dataset
In this research, we shared the DASS-42 questionnaire with respondents to label the dataset. Table 7 shows the top five rows of the DASS-42 result. After that, we preprocessed the dataset, from case folding through stemming. Table 8 shows an example of the data preprocessing result. Then, we split the dataset and performed feature extraction using TF-IDF. To determine the baseline, we varied both the data split ratio (70:30, 80:20, and 90:10) and the maximum number of features in TF-IDF (5000, 7000, and 10000) before modeling.
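The baseline search described above covers nine configurations (three split ratios times three maximum feature values). The sketch below enumerates them and estimates the test set size for each ratio from the 157 users in the dataset. Rounding the test set size up is an assumption, since the paper does not state how the split was rounded, and the names used here are illustrative.

```python
import math
from itertools import product

N_USERS = 157  # dataset size reported in the paper
TEST_FRACTIONS = {"70:30": 0.30, "80:20": 0.20, "90:10": 0.10}
MAX_FEATURES = [5000, 7000, 10000]


def holdout_size(n_samples, test_fraction):
    """Test set size, rounded up (assumption; the paper does not say)."""
    return math.ceil(n_samples * test_fraction)


configs = [
    {"ratio": ratio, "n_test": holdout_size(N_USERS, TEST_FRACTIONS[ratio]), "max_features": mf}
    for ratio, mf in product(TEST_FRACTIONS, MAX_FEATURES)
]

for config in configs:
    # With n_test users, one additional correct prediction changes
    # accuracy by 100 / n_test percentage points.
    step = 100 / config["n_test"]
    print(config["ratio"], config["max_features"], config["n_test"], round(step, 2))
```

Under this assumption the 90:10 split leaves 16 test users, so a single prediction changes accuracy by 6.25 percentage points. Differences between configurations on such a small test set should therefore be read with caution.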

Experimental Result
In this research, we conducted three experiments: the CART algorithm with various data split ratios and various maximum features in TF-IDF to determine the baseline; the CART algorithm with hyperparameter tuning of the maximum depth of the tree to increase performance; and other DT-based algorithms for comparison with our model.
Figure 3 shows the result of the 70:30 ratio of data split with 5000, 7000, and 10000 maximum features in TF-IDF. The best result is 5000 maximum features with a 56.25% accuracy score and 55.31% f1 score. Figure 4 shows the result of the 80:20 ratio of data split with 5000, 7000, and 10000 maximum features in TF-IDF. The best result is 5000 maximum features with a 59.37% accuracy score and 43.47% f1 score. Finally, Figure 5 shows the result of the 90:10 ratio of data split with 5000, 7000, and 10000 maximum features in TF-IDF. The best result is 5000 maximum features with a 62.50% accuracy score and 66.66% f1 score. Based on these results, the best configuration is the 90:10 ratio of data split with 5000 maximum features in TF-IDF, which gives a 62.50% accuracy score and 66.66% f1 score. This result is the baseline for the next experiment. The comparison of the data split ratios and the number of maximum features can be seen in Table 9.
The number of features in TF-IDF has a significant effect on the performance. More features in TF-IDF decrease the accuracy and increase the f1 score, while fewer features increase the accuracy and decrease the f1 score. At the 70:30 ratio of data split, when the maximum features are increased from 5000 to 10000, the accuracy decreases by 4.17 percentage points and the f1 score increases by 1.29 percentage points. At the 80:20 ratio of data split, when the maximum features are increased from 5000 to 10000, the accuracy decreases by 6.25 percentage points and the f1 score increases by 13.67 percentage points.
At the 90:10 ratio of data split, when the maximum features are increased from 5000 to 10000, the accuracy decreases by 6.25 percentage points, but the f1 score does not change. Based on these results, we concluded that a larger amount of training data enhances the model's performance, while a higher number of features in TF-IDF decreases the accuracy score but increases the f1 score.
Figure 6 shows the accuracy on the train and test data for various values of the maximum depth of the tree. Figure 7 shows the trendline of the accuracy on the train and test data. As can be seen, increasing the maximum depth raises the accuracy on the training data but lowers the accuracy on the test data. The trendline for the training data rises while the trendline for the test data falls, so the gap between train and test accuracy widens. This indicates overfitting, in which the model predicts almost perfectly on the training data but fails to predict on the test data. We pre-pruned the tree by stopping its growth early. The most optimal value of the maximum depth is 4. With this value, the accuracy is 82.26% on the train data and 81.25% on the test data, an increase of 18.75 percentage points over the baseline. Tuning the maximum depth of the tree can therefore improve the model's performance, but a depth that is too large leads to overfitting, where the model predicts the test data less well than the train data.
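The effect of the maximum depth can be illustrated with a very small CART-style classifier written from scratch. This is a teaching sketch, not the implementation used in the study: it chooses each split by the lowest weighted Gini index and stops growing once `max_depth` is reached, which is the pre-pruning idea described above. The function names and the four-row toy dataset are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini index of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (weighted_gini, feature, threshold) of the best split, or None."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must send samples to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best


def build_tree(rows, labels, max_depth, depth=0):
    """Grow a tree, stopping early (pre-pruning) once max_depth is reached."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth >= max_depth or gini(labels) == 0.0:
        return majority  # leaf node: predict the majority class
    split = best_split(rows, labels)
    if split is None:
        return majority
    _, feature, threshold = split
    left = [(r, y) for r, y in zip(rows, labels) if r[feature] <= threshold]
    right = [(r, y) for r, y in zip(rows, labels) if r[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([r for r, _ in left], [y for _, y in left], max_depth, depth + 1),
        "right": build_tree([r for r, _ in right], [y for _, y in right], max_depth, depth + 1),
    }


def predict(tree, row):
    """Follow the splits from the root until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree


def accuracy(tree, rows, labels):
    return sum(predict(tree, r) == y for r, y in zip(rows, labels)) / len(labels)


# Hypothetical toy data: the label depends on both features together.
rows = [[0, 0], [0, 1], [1, 0], [1, 1]]
labels = [0, 1, 1, 0]
for depth in (0, 1, 2):
    print(depth, accuracy(build_tree(rows, labels, depth), rows, labels))
```

On this toy data, training accuracy is 0.5 at depths 0 and 1 and reaches 1.0 at depth 2. A tuning loop for real data would compute the same accuracy on held-out test data for each depth and choose the depth where test accuracy is highest and the gap to training accuracy is small.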
The third experiment compares the CART algorithm with other DT-based algorithms: AdaBoost Decision Tree, Gradient Boosted Decision Tree, and Random Forest. The comparison results can be seen in Table 10. The CART algorithm with hyperparameter tuning gives the best result among the DT-based algorithms and also outperforms the baseline: the accuracy increases by 18.75 percentage points and the f1 score increases by 19.05 percentage points. This means that tuning the tree's maximum depth significantly affects the model's performance.

Conclusion
In this research, we created a detection model that predicts from a user's tweets whether the user indicates depression. We developed a CART algorithm with TF-IDF feature extraction and hyperparameter tuning of the maximum depth of the tree. Our model outperforms the baseline result and other DT-based algorithms such as AdaBoost, Gradient Boosting, and Random Forest. The best model is the CART with a 90:10 ratio of data split, 5000 maximum features in TF-IDF, and a maximum tree depth of 4, with an accuracy score of 81.25% and an f1 score of 85.71%. Furthermore, our experiments show that changing the amount of train data, the number of features in TF-IDF, and the depth of the tree has a significant effect on the model, but these changes must be made carefully so that the model does not overfit. Based on the results, the classification model can detect from a person's tweets whether the person indicates depression with good performance, which can assist in providing treatment for a person who indicates depression.
For future work, the model can be tested with more data, with tuning of other parameters, and with other feature extraction methods as a comparison, in order to obtain better classification and detection performance.