QSAR Study of Larvicidal Phytocompounds as Anti-Aedes Aegypti by using GA-SVM Method

Aedes aegypti is one of the most dangerous mosquitoes because it transmits several deadly diseases with high mortality rates, such as dengue fever, chikungunya, Zika, and yellow fever. To date, no specific drug has been found that can cure the diseases transmitted by Aedes aegypti. One possible solution to this problem is to inhibit the growth and development of Aedes aegypti larvae. This study implements a Genetic Algorithm-Support Vector Machine (GA-SVM) approach to develop a Quantitative Structure-Activity Relationship (QSAR) model for identifying larvicidal phytocompounds active against Aedes aegypti. Hyperparameter tuning was performed to improve the performance of the models. The best model was obtained with the RBF kernel, with $R^2$ and $Q^2_{LOO}$ scores of 0.64 and 0.64, respectively.


Introduction
Mosquitoes are the main vectors of several diseases that affect humans and animals and cause thousands of deaths every year. Aedes aegypti is one of the most dangerous mosquitoes because it transmits several diseases, such as dengue fever, chikungunya, Zika, and yellow fever [1], [2]. Dengue fever is considered one of the most dangerous of these diseases because its mortality rate is high and continues to increase worldwide every year [1]. It is estimated that nearly thirty-nine million people worldwide are infected annually [1]. Dengue fever is characterized by high fever accompanied by severe headache, muscle and joint pain, nausea or vomiting, and swollen glands [3].
To date, no specific drug has been found that can cure dengue fever [3]. One possible way to control dengue fever is to inhibit the growth and development of Aedes aegypti larvae. Several chemical products have been tested against Aedes aegypti larvae, such as phenolic acids, spinosyns, and coumarins [2]. These compounds can help inhibit the growth and development of the larvae [2], [4], [5]. However, some chemical products are toxic and harmful to the environment [2]. Plant-derived larvicides, such as larvicidal phytocompounds [8], are therefore expected to be a safer alternative source of raw materials that produces little waste and is non-toxic to non-target organisms [6], [7]. Conventional drug design is inefficient because each new compound must be synthesized before its biological activity can be determined [9]. Hence, we need a model that can predict the activity of drug candidates, such as a Quantitative Structure-Activity Relationship (QSAR) model.
QSAR is an alternative method for linking chemical molecules to their biological activity based on their chemical structure [10]. QSAR uses chemometric methods to describe biological activity or physicochemical properties as a function of molecular descriptors that characterize the structure of the molecule [11]. Therefore, computed descriptors can be used to predict the activity of new compounds [11]. One of the challenges in a QSAR study is obtaining an optimal feature set. One solution is to use a metaheuristic algorithm, such as the Genetic Algorithm (GA), to select the optimal features. Bahesthi performed QSAR modeling to analyze the activity of 68 urea derivatives as antimalarials using the GA-multiple linear regression method [12]. The model was validated using leave-one-out (LOO) cross-validation and a y-randomization test, obtaining squared correlation coefficients of 0.801 and 0.803, respectively [12]. In 2017, Doucet et al. used QSAR models to predict the toxicity of 33 piperidine derivatives against Aedes aegypti [13]. The toxicity was predicted using Ordinary Least Squares-Multiple Linear Regression from QSARINS and Support Vector Machine (SVM), with coefficients of determination ($r^2$) of 0.85 and 0.80, respectively [13].
In 2020, Javidfar modeled larvicidal phytocompounds against Aedes aegypti using the index of ideality of correlation (IIC) [2]. They developed three QSAR models to predict the pLC50 of 62 plant-derived compounds against Aedes aegypti using a Monte Carlo method based on the IIC criterion, and the models showed excellent predictive performance ($r^2_{Val}$ = 0.856 to 0.977) [2]. In 2020, Farisi Rahman et al. built a QSAR model of fusidic acid derivatives as antimalarial agents using the Simulated Annealing (SA)-SVM method [14]. The results showed that SA as a feature selection method produced a satisfactory combination of features, and that the best validation results were obtained with the RBF kernel [14].
In 2021, Fajar et al. predicted the activity of indenopyrazole derivatives as anti-cancer drugs using a QSAR model built with the SA-SVM method and three SVM kernels, namely the RBF, linear, and polynomial kernels [15]. Of the three kernels, the RBF kernel produced the best $R^2$ scores on the training and test sets, 0.79 and 0.60, respectively [15]. QSAR models have also been applied to other diseases [16], [17], [18], [19], [20]. However, to the best of our knowledge, there is no report on the use of a metaheuristic, such as GA, to select features for larvicidal phytocompounds.
In this study, we aim to build a QSAR model to predict the activity of larvicidal phytocompounds against Aedes aegypti using the Genetic Algorithm-Support Vector Machine (GA-SVM) method. GA is a search-based algorithm built on the concepts of natural selection and inheritance [21]. It belongs to a much larger area of computing known as evolutionary computing [21]. Meanwhile, SVM is a supervised learning technique used for classification and regression analysis on data from many disciplines [22].

Research Methods
In this research, we build a QSAR model to predict the activity of larvicidal phytocompounds against Aedes aegypti using the Genetic Algorithm-Support Vector Machine (GA-SVM) method. GA is used as the feature selection technique, while SVM is used as the prediction model. The flowchart of the research procedure is depicted in Figure 1.

Datasets
The 62 samples used in this study were obtained from Ref. [2]. The observed larvicidal activity against Aedes aegypti is the LC50, which is converted into molar units and expressed on a negative logarithmic scale ($-\log LC_{50}$), called pLC50 [2]. The molecular descriptors of the phytocompounds were calculated from their SMILES structures using the PaDEL application. The observed pLC50 value is used as the target value for developing the QSAR model. We reduced the dataset by calculating the variance of each feature and removing features with a variance of less than 0.5. The dataset was then split into training and test sets with a ratio of 70:30.
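The preprocessing steps described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation. The function names, the use of the population variance, the base-10 logarithm, and the random seed are assumptions made for the example; only the 0.5 variance cut-off and the 70:30 ratio come from the text.

```python
import math
import random
from statistics import pvariance


def p_lc50(lc50_molar):
    """Convert an LC50 in molar units to pLC50 = -log10(LC50)."""
    return -math.log10(lc50_molar)


def variance_filter(rows, threshold=0.5):
    """Remove descriptor columns whose variance is below `threshold`.

    Returns the reduced rows and the indices of the kept columns.
    Population variance is an assumption; the text does not specify it.
    """
    columns = list(zip(*rows))
    kept = [j for j, col in enumerate(columns) if pvariance(col) >= threshold]
    reduced = [[row[j] for j in kept] for row in rows]
    return reduced, kept


def train_test_split(rows, targets, test_ratio=0.3, seed=0):
    """Shuffle the sample indices and split them 70:30 by default."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_ratio)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(data, idx):
        return [data[i] for i in idx]

    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(targets, train_idx), pick(targets, test_idx))
```

With 62 samples and a test ratio of 0.3, this sketch would place 19 samples in the test set and 43 in the training set; the exact partition used in the study is not stated.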

Feature Selection
Feature selection reduces dimensionality by removing irrelevant features. The feature selection technique used in this study is the Genetic Algorithm. In computer science, GA is a metaheuristic inspired by natural processes that belongs to the larger class of evolutionary algorithms [21]. GAs are commonly used to generate high-quality solutions to optimization and search problems using selection, crossover, and mutation operators [23]. The flowchart of GA is shown in Figure 2.
GA starts by initializing a population and then runs for several iterations. At the end of each iteration, a new generation is obtained and passed to the next iteration. The algorithm ends when it reaches the maximum number of iterations or finds the best solution. To evaluate the solutions generated by GA, we use a fitness function. The fitness function is formulated in Equation (1), where the first component represents the coefficient of determination ($R^2$) of the selected features and the second component is the number of selected features. The variables RSS and TSS represent the residual sum of squares and the total sum of squares, respectively. The parameters used in the genetic algorithm are presented in Table 1.
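The loop described above (initialization, selection, crossover, mutation, and a fixed number of iterations) can be sketched as a binary-mask GA. This is a hedged illustration rather than the authors' code. Tournament selection, one-point crossover, elitism, and all default parameter values are assumptions; the values actually used are those in Table 1. The `toy_fitness` function is a hypothetical stand-in for the fitness in Equation (1), which in the study depends on the $R^2$ of a model trained on the selected descriptors.

```python
import random


def tournament(population, fitness, rng):
    """Pick two random individuals and return the fitter one."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b


def run_ga(fitness, n_features, pop_size=20, generations=30,
           crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Search for a 0/1 feature mask that maximizes `fitness`.

    Requires n_features >= 2 so that a crossover point exists.
    """
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        next_gen = [best[:]]  # elitism: carry the best mask forward
        while len(next_gen) < pop_size:
            p1 = tournament(population, fitness, rng)
            p2 = tournament(population, fitness, rng)
            if rng.random() < crossover_rate:
                cut = rng.randint(1, n_features - 1)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation, applied independently to each gene.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            next_gen.append(child)
        population = next_gen
        best = max(population, key=fitness)
    return best


# Hypothetical fitness: reward three "informative" features and
# penalize the total number of selected features.
INFORMATIVE = {0, 3, 5}


def toy_fitness(mask):
    return sum(mask[i] for i in INFORMATIVE) - 0.1 * sum(mask)


best_mask = run_ga(toy_fitness, n_features=8)
```

Because the best mask of each generation is copied into the next one, the best fitness in this sketch never decreases from one iteration to the next.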

Prediction Model
We developed an SVM model as the prediction model in this study. Support Vector Regression (SVR) is one of the popular options for prediction and curve fitting in both linear and non-linear regression [22]. SVR uses the same basic elements as the Support Vector Machine (SVM). SVM works by separating classes with a hyperplane in N-dimensional space [22]. The essence of SVM is to find the optimal hyperplane, that is, the one that maximizes the margin [24]. The general equation of the hyperplane is given in Equation (4),

$$y = wX + b \quad (4)$$

where w is the weight vector and b is the intercept at X = 0. In this case, we use the SVR model with three kernels, namely the linear, polynomial, and RBF kernels. Then, to improve the performance of the model, we perform a hyperparameter tuning procedure. The ranges of parameter values used in the hyperparameter tuning are presented in Table 2.
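For reference, the three kernels named above can be written in their commonly used forms. The sketch below is an illustration in plain Python and does not reproduce the authors' code. The parameter names `gamma` and `degree` match the tuned parameters reported later in the text, while `coef0` and all default values are assumptions taken from common SVM conventions.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    """K(u, v) = <u, v>"""
    return dot(u, v)


def polynomial_kernel(u, v, gamma=1.0, coef0=0.0, degree=3):
    """K(u, v) = (gamma * <u, v> + coef0) ** degree"""
    return (gamma * dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=1.0):
    """K(u, v) = exp(-gamma * ||u - v||^2)"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)
```

The RBF kernel equals 1 for identical inputs and decays toward 0 as the distance between the two descriptor vectors grows, with `gamma` controlling how quickly it decays.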

Model Validation
The QSAR model is validated in two stages, namely internal validation and external validation, and the resulting values are compared with threshold values that determine whether the model is accepted.
Internal validation was carried out by calculating the coefficient of determination ($R^2$) and the leave-one-out (LOO) cross-validation coefficient ($Q^2_{LOO}$) on the training data. External validation was carried out by calculating the coefficient of determination ($R^2$) on the test data [25]. These parameters are formulated in Equation (5), where $y$ and $\hat{y}$ represent the actual and predicted pLC50 values, respectively, while $\bar{y}$ and $\bar{\hat{y}}$ represent the averages of the actual and predicted values, respectively. A model is considered valid if it meets the criteria shown in Table 3.
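The two internal validation statistics can be illustrated with a short sketch. It assumes the usual definitions $R^2 = 1 - RSS/TSS$ and $Q^2_{LOO} = 1 - PRESS/TSS$, where PRESS is the sum of squared leave-one-out prediction errors. To keep the example self-contained, a one-descriptor least-squares line stands in for the SVR model; `fit_line` and the sample data are hypothetical and are not taken from the study.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for a single descriptor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - RSS / TSS."""
    mean_y = sum(y_true) / len(y_true)
    rss = sum((a - p) ** 2 for a, p in zip(y_true, y_pred))
    tss = sum((a - mean_y) ** 2 for a in y_true)
    return 1 - rss / tss


def q2_loo(x, y):
    """Leave-one-out Q^2: refit without sample i, then predict sample i."""
    predictions = []
    for i in range(len(x)):
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        predictions.append(slope * x[i] + intercept)
    return r_squared(y, predictions)


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1]
slope, intercept = fit_line(x, y)
r2 = r_squared(y, [slope * v + intercept for v in x])
q2 = q2_loo(x, y)
```

For a least-squares fit, $Q^2_{LOO}$ cannot exceed the training $R^2$, because each held-out sample is predicted by a model that has not seen it.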
The Applicability Domain (AD) of the model is determined to ensure that the data lie within the model domain. The AD is calculated using the leverage method, formulated in Equation (17),

$$h = X(X^{T}X)^{-1}X^{T} \quad (17)$$

where $X$ represents the score matrix obtained from the PLSR procedure and $X^{T}$ represents the transpose of $X$. The critical leverage ($h^*$) is given in Equation (18), where p is the number of attributes and n is the number of data points involved in the training process. A prediction is accepted if its calculated leverage is less than the critical leverage [25].
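The leverage calculation can be sketched in plain Python as follows. The per-sample leverage is computed as $h_i = x_i (X^{T}X)^{-1} x_i^{T}$, with $X$ the training matrix. The critical leverage is taken here as $h^* = 3(p + 1)/n$, a form commonly used in QSAR studies; it is an assumption of this sketch and may differ from Equation (18). All function names are hypothetical.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt]
            for row in a]


def inverse(m):
    """Gauss-Jordan inverse with partial pivoting."""
    n = len(m)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def leverages(X_train, X_query):
    """Leverage of each query row with respect to the training matrix."""
    xtx_inv = inverse(matmul(transpose(X_train), X_train))
    result = []
    for x in X_query:
        row = matmul([x], xtx_inv)[0]
        result.append(sum(a * b for a, b in zip(row, x)))
    return result


def critical_leverage(n_samples, n_features):
    """Assumed warning threshold h* = 3 (p + 1) / n."""
    return 3 * (n_features + 1) / n_samples
```

A compound whose leverage exceeds `critical_leverage(n, p)` would be flagged as lying outside the applicability domain in this sketch.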

Feature Selection
Genetic algorithms are probabilistic in nature, so different runs can return different solutions. Nevertheless, GA generally performs well compared with a local random search, which tries random solutions and may fail to identify the best one. Therefore, feature selection using the genetic algorithm was repeated 20 times to ensure that the objective scores obtained are consistent, which can be confirmed from the standard deviation over the 20 runs for each model. The distribution of the objective scores for each kernel is shown in Figure 4.
Based on Figure 4, for the linear kernel, the highest objective scores most often fall in the range 0.81 to 0.82 and the lowest in the range 0.79 to 0.80. For the polynomial kernel, the highest objective scores most often fall in the range 0.76 to 0.80 and the lowest in the range 0.72 to 0.74. For the RBF kernel, the highest objective scores most often fall in the range 0.79 to 0.80 and the lowest in the range 0.77 to 0.79. The feature selection results are summarized in Table 4. We found that the optimal numbers of features for the linear, polynomial, and RBF kernels are 234, 212, and 207, respectively. We also found that the linear and RBF kernels produced lower standard deviations than the polynomial kernel. This indicates that the GA solutions for the linear and RBF kernels were nearly the same across the repeated runs.
The convergence plot of GA is shown in Figure 5. The highest objective score is obtained by the linear kernel, which also reaches its best objective score in the fewest iterations. The lowest objective score is obtained by the RBF kernel, although it reaches its best objective score in fewer iterations than the polynomial kernel.

Hyperparameter Tuning
The hyperparameter tuning results are summarized in Table 5. Hyperparameter tuning is used to obtain the best parameters for every kernel in the SVR model. For the parameters 'C' and 'gamma', each kernel has a different value, while the 'degree' parameter has the same value for every kernel. A comparison of the $R^2$ scores of the untuned and tuned kernels is presented in Figure 6. The results show a significant difference between the tuned and untuned kernels. The $R^2$ scores of the linear, polynomial, and RBF kernels increased by 0.331, 0.389, and 0.309, respectively. The largest increase occurs in the polynomial kernel, which we attribute to the polynomial kernel using three parameters, more than the other kernels.
The comparison of the predicted and actual pLC50 values for each kernel is shown in Figure 7. The x-axis and y-axis represent the actual value and the predicted value, respectively. Each model shows a strong relationship between the predicted and actual values. For all kernels, the data points lie close to the diagonal line, with only small deviations.
The validation results are summarized in Table 6 and Table 7. To validate the QSAR model, several statistical parameters were calculated and compared with the threshold values [25]. On the training set, every model met the threshold criteria. On the test set, however, the linear model is the only invalid model, because some of its values do not meet the threshold criteria. The best $R^2$ score is obtained by the RBF kernel, which we attribute to its smaller number of features (207), and the worst $R^2$ score is obtained by the linear kernel, which has the largest number of features (234).
The Williams plot, which represents the applicability domain (AD) of the model, is shown in Figure 8. For every kernel, no point in the training or test data has a leverage higher than the critical leverage ($h^*$), meaning that the predictions are reliable. The Williams plot also shows that our models are consistent with the leverage approach and that their predictions are mostly acceptable for all responses. The analysis in Figure 8 therefore indicates that the predictions of each model are most likely correct.

Conclusion
We have developed a QSAR model using the GA-SVM method to identify larvicidal compounds active against Aedes aegypti. The number of features was first reduced with a variance threshold, and feature selection was then carried out with the genetic algorithm. We improved model performance with a hyperparameter tuning procedure. Based on the validation results, the best model was obtained with the RBF kernel, which satisfies all criteria with $R^2$ and $Q^2_{LOO}$ scores of 0.64 and 0.64, respectively.