DHF Incidence Rate Prediction Based on Spatial-Time with Random Forest Extended Features

This study proposes a classification prediction of the spread of dengue hemorrhagic fever (DHF) using Random Forest (RF) with feature expansion based on space and time. The RF classification model was developed by expanding the features with data from the previous 2 to 4 years. The three best RF models achieved accuracies of 97%, 93%, and 93%, respectively. Meanwhile, the best kriging models achieved RMSE values of 0.762 for 2022, 0.996 for 2023, and 0.953 for 2024. The model predicted a dengue incidence rate (IR) class distribution of 33% medium and 67% high for 2022. For 2023, the medium class is predicted to decrease by 6 percentage points to 27%, raising the high class to 73%. For 2024, the medium class is predicted to increase by 10 percentage points, from 27% to 37%, with the high class at around 63%. The contribution of this research is predictive information on the classification of the spread of DHF in the Bandung area for three years, obtained with time-based feature expansion.


Introduction
Dengue Hemorrhagic Fever (DHF) is a dangerous disease that can be fatal. It is transmitted through the bite of Aedes aegypti and Aedes albopictus mosquitoes, which carry the dengue virus and pass it to humans, resulting in dengue symptoms [1]. The spread of dengue cases is influenced by several factors, including rainfall [2], temperature, altitude, the proportion of men in the population [3], population mobility, population density, level of community knowledge, wind speed [4], and humidity [2] [5].
DHF spreads in tropical climates such as Indonesia's. One of the areas with the highest incidence rate of DHF is Bandung City. According to [6] and [7], DHF cases in Bandung City fluctuated from 2017 to 2021. In 2017, 1,786 cases were recorded; in 2018 the number rose to 2,826 cases [6]; in 2019 it rose sharply to 4,424 cases; in 2020 it fell to 2,790 cases; and in 2021 it rose again to 3,743 cases [7]. The highest number of cases occurred in 2019, a drastic increase of 56.54% over 2018. The three sub-districts with the most DHF cases were Arcamanik with 241 cases, Coblong with 263 cases, and Kiaracondong with 308 cases. The sub-districts with the fewest cases were Sumur Bandung with 49 cases, Bandung Wetan with 62 cases, and Cinambo with 70 cases [6]. This shows that DHF is difficult to control: the number of cases fluctuates every year and no optimal solution has been found. The government therefore hopes for a way to reduce dengue cases in each sub-district. One approach is to display the distribution of cases in each sub-district as classification prediction maps for the next few years, so that the community and government can take appropriate action to reduce the spread of dengue. In information technology, machine learning can be used to predict and classify dengue incidence rates from historical data of previous years. In addition, kriging interpolation can predict the incidence rate of DHF in areas where the value is unknown. One such method is Ordinary Kriging.
Because the number of DHF cases fluctuates every year, predicting the distribution of the DHF incidence rate in each sub-district is even more challenging.
Elqi Ashok, Sri Suryani Prasetiyowati, Yuliant Sibaroni. Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) Vol.
Few studies have discussed classification prediction of the DHF incidence rate distribution based on the expansion of spatial and time-based features. Most studies focus only on predicting or classifying DHF, not on the spread of cases in an area. Prediction and classification of DHF have been carried out by several researchers [8], [9], [10], and [11]. The study [8] applied the Random Forest algorithm with 10-fold cross-validation to predict dengue fever from hospital and laboratory patient data. The model used 500 trees, with 5 features tried at each split. This study achieved an accuracy of 92.34%, recall of 94.04%, and specificity of 92.19%.
Research [10] built a dengue virus diagnostic system by combining the Random Forest algorithm with Raman spectroscopy. The data were 100 samples collected from patients exposed to the dengue virus, 45 of which were labeled positive. The study reduced the dimensionality of the data using Principal Component Analysis (PCA) and evaluated the Random Forest classification model with 5-fold cross-validation. The resulting diagnostic system achieved 91% accuracy, 91% recall, and 91% specificity. Furthermore, the study [11] applied the Random Forest algorithm and an Artificial Neural Network (ANN) to predict the clinical degree of DHF, using patient and laboratory data. Both models were evaluated using 5-fold and 10-fold cross-validation. The highest accuracy was 58% for the Random Forest model with 10-fold cross-validation and 57% for the ANN model with 5-fold cross-validation.
Studies [9], [4], [13], and [14] used Random Forest. The study [4] applied the Random Forest and K-Nearest Neighbor (KNN) algorithms in two scenarios. In the first scenario, the model was built with a patterned model based on data from the previous 2 years and produced the lowest Root Mean Squared Error (RMSE) of 29.25. In the second scenario, the model was built with a random data model and the lowest RMSE was 45.48. However, this research has shortcomings: the RMSE is still quite high and the map covers only 1 year. In research [13], the map was developed with too few features and without the features that cause DHF. That study reported very good accuracy but did not state the accuracy value, and the resulting map was difficult for readers to understand.
The study [14] implemented the Random Forest algorithm to predict the transmission of dengue fever in Shenzhen City, China, and to determine the most important factors. The risk of dengue transmission was mapped with the ArcGIS software. The study used 65% of the data for model training and the remaining 35% for model testing. The resulting AUC was 0.8, and the most important features were average rainfall, maximum temperature, and workplace density.
Research on kriging has been carried out by [15], [16], and [17]. Studies [18], [19], and [20] discussed the application of the Ordinary Kriging method to predict the spread of DHF.
Based on the reviewed research and its advantages and disadvantages, no study has combined the Random Forest and Ordinary Kriging algorithms. This study therefore proposes combining these two methods to predict the classification of the spread of dengue incidence with time-based feature expansion. Feature expansion is carried out using feature data from the previous 2, 3, and 4 years. The data used are climate data, population, education history, and blood type. No previous study has used blood type as a feature. It is added here on the consideration that dengue is transmitted by mosquitoes that bite patients and feed on their blood. The purpose of this study is to determine the distribution of DHF in each sub-district over the next three years and to find the features that most influence the spread of DHF, based on the most optimal feature expansion, so that the community and government can take appropriate prevention and treatment measures to reduce the spread of DHF in each sub-district of Bandung City.

Research Methods
The methods used are Random Forest and Ordinary Kriging. The Random Forest algorithm was used to predict the incidence rate class in 30 sub-districts based on feature expansion from the previous 2 to 4 years. The results were then interpolated with Ordinary Kriging to predict the spread of dengue in the Bandung area for the next three years. The design of the system is shown in Figure 1.

Dataset
This study uses data on DHF cases obtained from the Bandung City Health Office, climate data from the Bandung Meteorology, Climatology and Geophysics Agency, and population, educational history, and blood type data obtained from the Bandung City Central Statistics Agency. The data were collected for 30 sub-districts in the city of Bandung from 2017 to 2021, giving 150 samples (30 sub-districts over 5 years) and 13 features. Feature names are denoted by X1, ..., Xn. Table 1 presents the feature name notation and a description of each feature.

Data Preprocessing
The dataset obtained is still raw, so data preprocessing is needed to produce quality data that is ready for building a classification prediction model.
Table 2 explains that an area is categorized as low if its IR is less than 55 per 100,000 population, as medium if its IR is in the range of 55 to 100 per 100,000 population, and as high if its IR is more than 100 per 100,000 population.
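The class thresholds of Table 2 can be illustrated with a minimal sketch. The IR formula used here (cases per 100,000 population) is the usual epidemiological definition and is an assumption, since the formula itself is not reproduced in this text; the function names and the example numbers are hypothetical.

```python
def incidence_rate(cases: int, population: int) -> float:
    """Cases per 100,000 population (assumed standard IR definition)."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases / population * 100_000


def ir_class(ir: float) -> str:
    """Map an IR value to the low / medium / high classes of Table 2."""
    if ir < 55:
        return "low"
    if ir <= 100:
        return "medium"
    return "high"


# Hypothetical sub-district: 120 cases among 150,000 residents.
example_ir = incidence_rate(120, 150_000)
example_class = ir_class(example_ir)
```

In this hypothetical case the IR is about 80 per 100,000 population, which falls in the medium class.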
This study uses stratified k-fold cross-validation to divide the data into 2 parts: training data and test data. Stratified k-fold cross-validation is used to reduce bias in the model [22] and to avoid errors caused by unbalanced classes [23]. In general, the method divides the dataset into k folds, and each fold takes a turn as the test data while the model is trained on the remaining folds [22]. This study uses k=10. This value was chosen because the dataset is small, so more rounds of model training are needed for the model to be accurate and to predict well.
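The idea of stratification can be sketched in plain Python as follows. This is an illustrative sketch, not the implementation used in the study: the helper name, the class counts, and the round-robin assignment are assumptions, and shuffling is omitted to keep the result deterministic. In practice a library routine such as scikit-learn's StratifiedKFold would normally be used.

```python
from collections import defaultdict


def stratified_k_fold(labels, k=10):
    """Yield (train_idx, test_idx) pairs with class proportions kept per fold."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0
    # Deal the indices of each class round-robin, continuing where the
    # previous class stopped so that fold sizes stay balanced overall.
    for label in sorted(by_class):
        for j, idx in enumerate(by_class[label]):
            folds[(offset + j) % k].append(idx)
        offset += len(by_class[label])

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx


# Hypothetical labels for 150 samples: 10 low, 50 medium, 90 high.
labels = ["low"] * 10 + ["medium"] * 50 + ["high"] * 90
splits = list(stratified_k_fold(labels, k=10))
```

With these hypothetical counts, every test fold holds 15 samples and contains exactly one sample of the rare low class, which an unstratified split would not guarantee.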

Random Forest
Random Forest is an ensemble method that can be used to classify large amounts of data by building a collection of decision trees. Each decision tree is built from a random sample of the training data, and the trees are then combined using Breiman's bagging method. After that, majority voting over the decision trees gives the prediction [11].
The performance of the Random Forest model has been tested in predicting and classifying various types of datasets, even for unbalanced classes [24]. This is influenced by the use of random sampling and the principle of the ensemble technique [11]. According to [25] the Random Forest algorithm can naturally adjust to unbalanced classes by down-sampling the majority class and constructing each tree for the minority class so that the dataset becomes more balanced.
The development of the Random Forest model can be carried out in three steps. First, the data is divided into 2 parts, training data and test data: 2/3 of the data is used as training data and the remaining 1/3 as test data for validating the model learned from the training data. Second, a decision tree is created from a random data set drawn as a bootstrap sample. The branching of each tree is determined by predictors chosen at random at each node. Third, the predictions of all the decision trees are aggregated: averaged for a numeric target, or combined by majority voting for classification. This aggregate is the prediction of the random forest model. Therefore, each individual decision tree influences the final predicted value [24]. In mathematical terms, the majority voting formula is given in [25], where $n$ is the number of training samples, $M$ is the number of decision trees built, $m_n(x; \Theta_j, \mathcal{D}_n)$ is the predicted value of the $j$-th tree at point $x$, and $\Theta_1, \ldots, \Theta_M$ are independent random variables [25].
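The aggregation step can be illustrated with a short sketch. A commonly used way of writing the aggregated forest estimate is $m_{M,n}(x) = \frac{1}{M}\sum_{j=1}^{M} m_n(x; \Theta_j, \mathcal{D}_n)$, where $\mathcal{D}_n$ denotes the training set, with the class assigned by majority vote in the classification case; whether this matches the exact form printed in [25] is an assumption. The function names, the tie-breaking rule, and the tree outputs below are hypothetical.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by most trees for one sample.

    Ties are broken by the alphabetically first label, an arbitrary
    choice made here only to keep the result deterministic.
    """
    counts = Counter(tree_predictions)
    top = max(counts.values())
    return min(label for label, c in counts.items() if c == top)


def forest_predict(per_tree_predictions):
    """per_tree_predictions[j][s] is the class tree j gives to sample s."""
    n_samples = len(per_tree_predictions[0])
    return [
        majority_vote([tree[s] for tree in per_tree_predictions])
        for s in range(n_samples)
    ]


# Hypothetical outputs of M = 5 trees for 3 sub-districts.
votes = [
    ["high", "medium", "high"],
    ["high", "medium", "medium"],
    ["medium", "medium", "high"],
    ["high", "high", "high"],
    ["high", "medium", "low"],
]
predicted = forest_predict(votes)
```

Library implementations may aggregate differently, for example by averaging class probabilities rather than counting hard votes, so this sketch only illustrates the voting principle.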

Random Forest Prediction Model
The Random Forest model was developed by expanding the feature columns with the features of the previous few years, drawn from the 5 years of collected data.
Table 3 shows the feature expansion scenarios of the random forest prediction model, based on feature data from the previous 2, 3, and 4 years. Predictions are made for 2019 to 2021. For example, to predict 2021 under the 2-year scenario, the model uses the feature columns of 2019 and 2020 as input and the 2021 class as the target. Examples of feature expansion combinations based on the previous 2 years are presented in Table 4.
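The time-based feature expansion can be sketched as building one row per sub-district from the features of the preceding years. The data layout, function name, and values below are hypothetical and serve only to illustrate the scenario described above.

```python
def expand_features(features, targets, target_year, n_lags):
    """Build one row per sub-district for `target_year`.

    features[district][year] -> list of feature values for that year
    targets[district][year]  -> IR class for that year
    Each row concatenates the features of the `n_lags` preceding years.
    """
    lag_years = range(target_year - n_lags, target_year)
    rows, labels = [], []
    for district in sorted(features):
        row = []
        for year in lag_years:
            row.extend(features[district][year])
        rows.append(row)
        labels.append(targets[district][target_year])
    return rows, labels


# Hypothetical data: 2 sub-districts, 2 features (rainfall, population).
features = {
    "A": {2019: [210.0, 75000], 2020: [180.0, 76000]},
    "B": {2019: [205.0, 98000], 2020: [190.0, 99500]},
}
targets = {"A": {2021: "medium"}, "B": {2021: "high"}}

X, y = expand_features(features, targets, target_year=2021, n_lags=2)
```

With the 13 features per year described in the Dataset section, a 4-year expansion built this way would have 13 × 4 = 52 columns, which is consistent with the 52 features reported for the 4-year model.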

Random Forest Model Selection
The best random forest prediction model is selected based on the highest accuracy value and the most optimal number of expanded features. Accuracy is the percentage of correct predictions on the test data, calculated as the number of correct predictions divided by the total number of predictions [8]:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP (True Positive) is the number of samples whose actual class is positive and which are correctly predicted as positive, TN (True Negative) is the number whose actual class is negative and which are correctly predicted as negative, FP (False Positive) is the number whose actual class is negative but which are wrongly predicted as positive, and FN (False Negative) is the number whose actual class is positive but which are wrongly predicted as negative [26].
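As a small worked example, accuracy can be computed either from the four confusion counts or directly from label lists, the latter also covering the three-class case (low, medium, high). The function names and the labels below are hypothetical.

```python
def accuracy_from_counts(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no predictions to evaluate")
    return (tp + tn) / total


def accuracy_from_labels(actual, predicted):
    """Share of samples whose predicted class equals the actual class."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("label lists must be non-empty and equally long")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Hypothetical test set of 30 samples with 2 medium samples misclassified.
actual = ["high"] * 20 + ["medium"] * 10
predicted = ["high"] * 20 + ["medium"] * 8 + ["high"] * 2
acc = accuracy_from_labels(actual, predicted)
```

Here 28 of 30 predictions are correct, giving an accuracy of about 93.33%.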
Meanwhile, the feature expansion scenario uses the SelectKBest class of the scikit-learn library, which can improve the accuracy and performance of the prediction model [27]. This technique selects the k features with the highest scores, where each score is computed by a univariate statistical test on that variable [28]. This study uses the f_classif score function, which computes the F criterion by analysis of variance, based on the differences between the class means of a feature, to find dependencies in the data. The f_classif score is calculated using the following formula [29]:
$$F = \frac{\sum_{i=1}^{C} n_i (\bar{x}_i - \bar{x})^2 / (C - 1)}{\sum_{i=1}^{C} \sum_{j=1}^{n_i} (x_{i,j} - \bar{x}_i)^2 / (N - C)}$$
where $C$ is the number of classes, $N$ is the number of samples in the dataset, $n_i$ is the number of samples with class label $i$, $x_{i,j}$ is the $j$-th feature value in class $i$, $\bar{x}_i$ is the average feature value in class $i$, and $\bar{x}$ is the average feature value in the dataset.
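The scoring idea can be sketched in plain Python as a one-way ANOVA F statistic per feature, followed by keeping the k highest-scoring features. This is an illustrative re-implementation, not the scikit-learn code; the function names and data are hypothetical, and the sketch assumes every feature has some within-class variation (otherwise the denominator is zero).

```python
def anova_f(values, labels):
    """One-way ANOVA F statistic of a single feature across classes."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    n_total = len(values)
    n_classes = len(groups)
    grand_mean = sum(values) / n_total
    between = 0.0  # class-size-weighted squared deviation of class means
    within = 0.0   # squared deviation of samples from their class mean
    for g in groups.values():
        mean_g = sum(g) / len(g)
        between += len(g) * (mean_g - grand_mean) ** 2
        within += sum((v - mean_g) ** 2 for v in g)
    return (between / (n_classes - 1)) / (within / (n_total - n_classes))


def select_k_best(columns, labels, k):
    """Return the names of the k features with the highest F score."""
    scores = {name: anova_f(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical data: 9 samples, 3 per class, 2 candidate features.
labels = ["low"] * 3 + ["medium"] * 3 + ["high"] * 3
columns = {
    "rainfall": [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    "humidity": [1.0, 5.0, 9.0, 2.0, 5.0, 8.0, 3.0, 5.0, 7.0],
}
f_rainfall = anova_f(columns["rainfall"], labels)
best = select_k_best(columns, labels, k=1)
```

In this toy data the hypothetical rainfall column has clearly separated class means and scores F = 13, while the humidity column has identical class means and scores 0, so rainfall is the feature kept for k = 1.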

Class Prediction
At this stage, incidence rate class predictions are made for 2022, 2023, and 2024. The predictions use the best Random Forest models developed from data of the previous 2, 3, and 4 years. Each model was selected based on the highest accuracy and the most optimal number of features.

Theoretical Semivariogram Model
The theoretical semivariogram is the model used as input to ordinary kriging interpolation, which predicts the incidence of dengue fever in the 30 sub-districts and in other locations whose values have not been recorded. The semivariogram model is obtained from the distance between 2 points, the range value, and the sill value [30]. This study uses 3 semivariogram models: spherical, exponential, and Gaussian. The general forms of the three models are taken from [30] and are stated as follows.
The general form of the spherical model is shown in equation (5), that of the exponential model in equation (6), and that of the Gaussian model in equation (7), where $\gamma(h)$ is the theoretical semivariogram, $c$ is the sill value, $a$ is the range value, and $h$ is the distance between 2 points [30].
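The semivariogram models are compared using the Root Mean Squared Error (RMSE):
$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}$$
where $Y_i$ is the actual value of the test data, $\hat{Y}_i$ is the predicted value of the test data, and $n$ is the number of test data.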
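The three semivariogram models and the RMSE can be sketched as below. The parameterizations are common textbook forms without a nugget term and are an assumption: the exact forms of equations (5) to (7) in [30] are not reproduced in this text, and some references scale the exponent of the exponential and Gaussian models by a factor of 3. The sketch is also isotropic, whereas the study applies anisotropy through major and minor ranges.

```python
import math


def spherical(h, sill, rng):
    """Spherical model: rises to the sill at h = rng, flat beyond it."""
    if h >= rng:
        return sill
    r = h / rng
    return sill * (1.5 * r - 0.5 * r ** 3)


def exponential(h, sill, rng):
    """Exponential model: approaches the sill asymptotically."""
    return sill * (1.0 - math.exp(-h / rng))


def gaussian(h, sill, rng):
    """Gaussian model: parabolic near the origin, asymptotic sill."""
    return sill * (1.0 - math.exp(-((h / rng) ** 2)))


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    n = len(actual)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / n)


# Hypothetical parameters: sill c = 10, range a = 5, distance h = 2.5.
gamma_values = {
    "spherical": spherical(2.5, 10.0, 5.0),
    "exponential": exponential(2.5, 10.0, 5.0),
    "gaussian": gaussian(2.5, 10.0, 5.0),
}
example_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Model selection as described in this study would then amount to computing the RMSE of each fitted model on held-out values and keeping the model with the lowest RMSE.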

Ordinary Kriging
Ordinary Kriging is a kriging technique based on stochastic interpolation [31]. This technique is most often used to estimate a value at the location point of an area based on a known variogram and use data in the surrounding environment to make predictions [32].
The incidence rate of DHF at the point $X_0$ can be predicted from the data values of $n$ neighboring samples $X_i$ by combining them linearly with weights $\lambda_i$ [15]:
$$\hat{Z}(X_0) = \sum_{i=1}^{n} \lambda_i Z(X_i)$$
where $\hat{Z}(X_0)$ is the predicted value at the location point $X_0$, $Z(X_i)$ is the IR value of DHF in each sub-district, $X_0$ is the predicted sub-district location, $X_i$ is the observed sub-district location, $\lambda_i$ is the weight of the observed sub-district location, and $n$ is the number of sample data.
Ordinary kriging is an exact interpolator, which means that if $X_0$ coincides with an observed sub-district location, the predicted value is exactly equal to the data value at that location [32].
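A minimal sketch of how the weights can be obtained is given below. It solves the standard ordinary kriging system, in which the weights are constrained to sum to 1 through a Lagrange multiplier, using an exponential semivariogram without nugget. The semivariogram form, its parameters, the coordinates, and the IR values are all assumptions for illustration, and the sketch is isotropic, unlike the anisotropic models used in the study.

```python
import math


def exponential_gamma(h, sill=1.0, rng=1.0):
    """Exponential semivariogram without nugget (an assumed form)."""
    return sill * (1.0 - math.exp(-h / rng))


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def ordinary_kriging(points, values, target, gamma=exponential_gamma):
    """Return (prediction, weights) at `target` from observed points."""
    n = len(points)
    # Semivariances between observations, bordered by the unbiasedness
    # constraint (weights sum to 1).
    a = [[gamma(math.dist(p, q)) for q in points] + [1.0] for p in points]
    a.append([1.0] * n + [0.0])
    b = [gamma(math.dist(p, target)) for p in points] + [1.0]
    solution = solve(a, b)
    weights = solution[:n]  # the last entry is the Lagrange multiplier
    prediction = sum(w * z for w, z in zip(weights, values))
    return prediction, weights


# Hypothetical sub-district centroids on a unit square and their IR values.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
ir_values = [80.0, 120.0, 100.0, 150.0]
center_pred, center_weights = ordinary_kriging(points, ir_values, (0.5, 0.5))
exact_pred, exact_weights = ordinary_kriging(points, ir_values, (1.0, 0.0))
```

At the center of the square all four observations are equidistant, so they receive equal weights and the prediction is their mean. At the observed location (1, 0) the prediction reproduces the observed value of 120, which illustrates the exact-interpolator property.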

Results and Discussions
This research uses the random forest method, experimenting with 100, 200, 300, 400, and 500 trees and with feature expansion from the previous 2 to 4 years. Meanwhile, the ordinary kriging experiments apply anisotropy through the major and minor range parameters of each semivariogram model. The best random forest model was chosen as the one with the highest accuracy across the parameter experiments and feature expansions. For ordinary kriging, the semivariogram model with the lowest RMSE was chosen as the best model for predicting the IR distribution of DHF over the next three years.

IR DHF Classification using Random Forest
The performance of the random forest classification prediction model was measured using accuracy, and the best model for each of the 2-year, 3-year, and 4-year scenarios was selected based on the highest accuracy. Table 5, Table 6, and Table 7 show the test accuracy of each developed model.
Furthermore, in Table 7, the 4-year scenario has only 1 model, model 4A, with an accuracy of 93.33% using 500 trees. Its accuracy is the same as that of model 3A, because both models learn the same target class, namely 2021. However, model 3A needs far fewer expanded features, 5 features, whereas model 4A requires 10 features. Details of the feature expansion are presented in Figure 2, Figure 3, and Figure 4.
In Figure 4, the 4-year model also adds features, so that 52 features are used. As before, the test compared the accuracy of each model from 3 to 52 features and selected the most optimal feature expansion. With 5 expanded features, the model already covers features from all 4 years, with an accuracy of 90%. With 10 features, the accuracy is higher and the features still cover the previous 4 years. Therefore, the best 4-year model uses 10 features, with an accuracy of 93.33%. These features comprise the population, the proportion of the male population, elementary school graduates, rainfall, temperature, humidity, blood type A, and blood type O.
Thus, it can be concluded that feature expansion strongly affects the performance of the model and can improve accuracy. In addition, the features that appear most often in the expansions and have the most influence on the spread of dengue incidence rates are population size, proportion of male population, elementary school graduation, rainfall, blood type B, and blood type O.
The three best models are used to predict the incidence rate of DHF in 2022, 2023, and 2024: model 2C for 2022, model 3A for 2023, and model 4A for 2024.
Table 8 and Table 9 present the theoretical semivariogram and RMSE calculation results, which are used in predicting the incidence rate of DHF with the ordinary kriging method. Table 8 shows that the data are anisotropic, as indicated by the presence of major and minor range parameters. The best theoretical semivariogram models for the incidence rate in 2022, 2023, and 2024 are the Exponential, Spherical, and Gaussian models, with RMSE values of 0.762, 0.996, and 0.953, respectively. The best semivariogram model thus differs from year to year. This difference is influenced by the different classification results predicted by the random forest models. The pattern of the distribution of semivariogram values is shown in Figure 5, Figure 6, and Figure 7. In Figure 5, the data distribution pattern tends toward the Southwest-Northeast direction, with a value of 147.5. In Figure 6 and Figure 7, with values of 171.4 and 172.3 respectively, the data distribution pattern tends toward the West-East direction. In addition, the incidence rate semivariogram values in Figure 6 and Figure 7 are closer to the average than those in Figure 5, which tend to be spread out.

Prediction of IR DHF using Ordinary Kriging
The predicted pattern of the spread of the DHF incidence rate is displayed as color maps in Figure 8, Figure 9, and Figure 10. The color shows the interval of incidence rate values in the area. The color gradation starts with dark blue for the lowest incidence rate values, followed by light blue, yellow, orange, pink, and dark red for the highest incidence rate values. Based on the lowest RMSE results in Table 9, the semivariogram model used to predict the spread of the incidence rate is Exponential for 2022, Spherical for 2023, and Gaussian for 2024.
On the contour maps of Figures 8 and 9, Cimahi City and the western part of Bandung City are colored light blue to dark blue, indicating an incidence rate in the range of 71 to 149 per 100,000 population. On the other hand, West Bandung Regency, Bandung Regency, and the northern and eastern parts of Bandung City are colored yellow to dark red.
In Figure 10, Cimahi City and the western part of Bandung City are colored light blue and yellow, indicating an incidence rate in the range of 83 to 151 per 100,000 population. West Bandung Regency, Bandung Regency, and the northern and eastern parts of Bandung City are colored yellow to dark red, indicating an incidence rate in the range of 129 to 297 per 100,000 population. The predicted incidence of dengue fever in sub-districts that have not been recorded is presented in Table 10.

Discussion
Based on the results of the study, prediction maps for the classification of the DHF incidence rate distribution in each sub-district of Bandung City were made using Random Forest and Ordinary Kriging, each with its own advantages and disadvantages. Figure 11, Figure 12, and Figure 13 show maps created with Ordinary Kriging based on the best semivariogram model: Figure 11 with the Exponential model, Figure 12 with the Spherical model, and Figure 13 with the Gaussian model. Figure 14, Figure 15, and Figure 16 show maps created with Random Forest based on the feature expansion of the best prediction model.
Figure 11 shows the predicted distribution of the dengue incidence rate in 2022: 33% in the medium category and 67% in the high category. Figure 12 shows the distribution in 2023, with around 27% predicted for the medium category. This is a decrease of 6 percentage points in the medium category and raises the high category to 73%. In Figure 13, the medium category in 2024 is predicted to rise by 10 percentage points, from 27% to 37%, while the high category is predicted to fall to 63%.
Figure 14. Prediction map of dengue incidence rate classification in 2022 with Random Forest
Figure 14 shows a map of the predicted distribution of dengue incidence rates in 2022 with Random Forest. Of the 30 sub-districts, 3 are categorized as medium class and the remaining 27 as high class. The three medium-class sub-districts are Andir, Bandung Kulon, and Babakan Ciparay. Thus, the predicted incidence rate distribution of DHF in 2022 is 10% medium class and 90% high class.
This shows that the distribution of DHF is higher from the Northwest-Southeast to the eastern part of Bandung City. Figure 14, Figure 15, and Figure 16 show maps of the distribution of incidence rates predicted by Random Forest. The three maps show incidence rate predictions with distribution patterns that tend to be the same. However, in 2023 and 2024, there are differences in the pattern of regional distribution. In 2022 the incidence rate in Andir District is predicted to be 79, which belongs to the medium category, whereas in 2023 and 2024 it is predicted to be around 124, which belongs to the high category. This is an increase in the incidence rate of DHF in Andir District of 49%. Meanwhile, in Batununggal District, the incidence rate decreases by 64% and is predicted to be in the medium category in 2023 and 2024.
Comparing the classification prediction maps of random forest and ordinary kriging, the incidence rate is medium in the western and southwestern areas of Bandung City and high in the eastern area of Bandung City. In addition, for the medium and high categories, ordinary kriging has a lower distribution pattern than random forest. However, ordinary kriging has the advantage that it can predict the incidence rate of DHF in areas whose values have not been recorded.
The prediction results of random forest classification and ordinary kriging are good enough to be displayed as maps. The random forest classification prediction model developed in this study performs better than those in studies [8], [9], [10], and [11], because it applies feature expansion based on several previous years and reached an accuracy of 97% in model testing, whereas the model evaluations in [8], [9], [10], and [11] report accuracies below 97%. Thus, feature expansion strongly affects the performance of the random forest model and can increase accuracy. In addition, the map produced by this research improves on those of research [4], [13], and [15]. This study combines random forest and ordinary kriging to produce prediction maps of the distribution of dengue incidence rates for the next three years, whereas studies [4] and [13], using the random forest method, produced prediction maps for one year only. Research [15] developed a map for the next few years using the kriging method, but the resulting map is not based on classification, so it does not show which areas have a DHF incidence rate in the low, medium, or high category. This study produces a map that predicts the classification of the distribution of dengue incidence rates into the low, medium, and high categories.

Conclusion
Based on the research results, it can be concluded that time-based expansion of attributes in developing a random forest classification prediction model affects the resulting accuracy. The best random forest classification prediction models for DHF, based on the previous 2, 3, and 4 years, achieved accuracies of 97%, 93%, and 93%, respectively. The model predicted a DHF incidence rate class distribution of 10% medium and 90% high for 2022, 2023, and 2024. Furthermore, Ordinary Kriging predicted the distribution of incidence rates in the 30 sub-districts and in other locations with RMSE values of 0.762 for 2022, 0.996 for 2023, and 0.953 for 2024. The features with the most influence on the spread of dengue, obtained through time-based feature expansion, are population, the proportion of the male population, rainfall, blood type B, blood type O, and elementary school graduation. Overall, this research can serve as a reference for reducing the spread of DHF, so that related parties can devise optimal solutions that address the most influential causal factors identified in this study and reduce the incidence rate of DHF in each sub-district of Bandung City. Further research could add datasets, especially village-level data and other factors that cause DHF, and use other methods for comparison in prediction and classification to obtain better accuracy.