DPP IV Inhibitors Activities Prediction as An Anti-Diabetic Agent using Particle Swarm Optimization-Support Vector Machine Method

Diabetes mellitus is a chronic illness that can affect anyone, and a medicine that cures diabetes entirely has not yet been discovered. Dipeptidyl peptidase IV (DPP IV) inhibitors are one class of agents with potential as an anti-diabetic treatment. In this work, we used a machine learning method to predict the activity of DPP IV inhibitors as anti-diabetic agents. We combined the Particle Swarm Optimization (PSO) method for feature selection with the Support Vector Machine (SVM) for the prediction model. Three SVM kernels, i.e., radial basis function (RBF), polynomial, and linear, were used, and their performance was compared. A hyperparameter tuning procedure was conducted to improve the performance of the models. According to the results, the best model was obtained from the SVM with the RBF kernel, with $R^2$ values of 0.79 and 0.85 for the train and test sets, respectively.


Introduction
Diabetes mellitus (usually known simply as diabetes) is a metabolic disorder caused by a loss of β-cells in the pancreas that affects insulin production [1]. Diabetes can be detected from a prolonged high blood sugar level. In general, diabetes can be divided into three types: type 1 diabetes, type 2 diabetes, and gestational diabetes [1], [2]. Type 1 diabetes is due to the loss of β-cells in the pancreas, which causes a deficiency in the insulin produced by the body. Type 2 diabetes is caused by the failure of cells to respond properly to insulin. Gestational diabetes occurs in pregnant women and is caused by sudden weight gain during the gestational period [3].
Diabetes can be treated with widely available oral anti-diabetic drugs, such as metformin [4]. Unfortunately, such drugs can have side effects, for example gas (flatulence) and diarrhea with metformin [5]. Therefore, research on new anti-diabetic agents is needed to overcome the problems of current diabetic treatments. One agent with potential for controlling blood sugar levels is the dipeptidyl peptidase IV (DPP IV) inhibitor. DPP IV inhibitors are a class of oral anti-diabetic drugs that inhibit the DPP IV enzyme [6], [7]. Several studies of DPP IV inhibitors show their potential as a treatment for diabetes [8], [9]. To increase the effectiveness of DPP IV inhibitors as anti-diabetic agents, a structural optimization process is needed [10]. The drug design process can be accelerated by using a Quantitative Structure-Activity Relationship (QSAR) approach [11]. The QSAR method has proven effective in drug design because it builds a relationship between the activities of the tested compounds and their molecular structures. Several kinds of models, such as regression and classification models, can be used to build an efficient QSAR model for drug design [12].
In [13], Sharma et al. developed a QSAR model for several trifluorophenyl derivatives as DPP IV inhibitors by using 3D-QSAR Comparative Molecular Field Analysis (CoMFA) and Comparative Molecular Similarity Indices Analysis (CoMSIA). In their study, the model developed from the structural alignment shows good predictions, with $r^2$ values of 0.963 and 0.934 for CoMFA and CoMSIA, respectively. Their model is useful for designing new DPP IV inhibitors. In [14], Jiang et al. developed a QSAR model for a set of arylmethylamines as DPP IV inhibitors by using a CoMFA approach with an $r^2$ of 0.953. In [15], [16], 45 derivatives of triazolopiperazine amide were analyzed as DPP IV inhibitors by using 3D-QSAR, which gave an $r^2$ of 0.868 for both the CoMFA and CoMSIA approaches and $r^2_{pred}$ values of 0.816 and 0.863 for CoMFA and CoMSIA, respectively, for the best model. The studies in [12]-[16] all show that the CoMFA and CoMSIA approaches can be used effectively in designing new anti-diabetic agents. One of the challenges in a QSAR study is to obtain an optimal number of features. This issue can be addressed by using a metaheuristic method to select features. However, to the best of our knowledge, there is no report of a metaheuristic method being applied in a QSAR study of DPP IV inhibitors as anti-diabetic agents.
In this work, we aim to develop a QSAR model to predict the activities of DPP IV inhibitors by using Particle Swarm Optimization (PSO)-Support Vector Machine (SVM). In the first step, the PSO method is used for feature selection to produce the best combination of features [17]. In the second step, the SVM method is used to obtain the most accurate model. The SVM method has already been used in many QSAR studies as a trusted method for building accurate models [18].

Research Method
In this work, a dataset of DPP IV inhibitor compounds is used to build a model via a two-step process: (i) feature selection by a PSO method, and (ii) model building by an SVR method. The model is then optimized by a hyperparameter tuning process. Finally, the optimized model is validated by using a Leave-One-Out Cross-Validation (LOO-CV) method.

Dataset
The dataset used in this work consists of 134 compounds that act as DPP IV inhibitors, together with their half-maximal inhibitory concentration values, IC50, in nanomolar (nM), collected from the literature [12]. The IC50 values are converted from nanomolar (nM) to molar (M) and then into pIC50, the negative logarithm of IC50, so that a larger pIC50 value indicates a more potent inhibitor. After that, molecular descriptors are calculated for all compounds by using the PaDEL-Descriptor software, resulting in 1875 molecular descriptors. All compounds are then divided randomly into a training dataset and a test dataset with a ratio of 70:30 (107 compounds in the training dataset and 27 compounds in the test dataset). These conversion, descriptor calculation, and random splitting steps are common in QSAR calculations, such as in [19]. The distribution of the pIC50 values of all compounds can be seen in Figure 1.
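The unit conversion described above can be illustrated with a minimal sketch. The function name and the sample IC50 values are hypothetical and are not taken from the dataset; only the definition pIC50 = -log10(IC50 in molar) comes from the text.

```python
import math

def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 value given in nanomolar into pIC50.

    pIC50 is the negative base-10 logarithm of IC50 expressed in molar.
    """
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    ic50_molar = ic50_nm * 1e-9  # nM -> M
    return -math.log10(ic50_molar)

# Hypothetical IC50 values in nM (illustration only).
sample_ic50_nm = [1.0, 10.0, 1000.0]
sample_pic50 = [ic50_nm_to_pic50(v) for v in sample_ic50_nm]
for ic50, pic50 in zip(sample_ic50_nm, sample_pic50):
    print(f"IC50 = {ic50} nM -> pIC50 = {pic50:.2f}")
```

A smaller IC50 gives a larger pIC50, which is why a higher pIC50 corresponds to a more potent inhibitor.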

Features Selection
In the feature selection step, the Pearson correlation coefficient (PCC) is used to reduce the number of descriptors. PCC is a statistical metric that measures the linear correlation between two sets of data. In this work, PCC is used as the criterion that determines the optimal reduction filter for descriptors [20]. Before the PCC values are calculated, descriptors with zero variance and with a standard deviation of less than 0.95 are removed. After that, PCC analysis is used to remove descriptors that carry similar information to other descriptors. Descriptors with a weak correlation with the target (PCC value < 0.1) and a strong correlation with other descriptors (PCC value > 0.9) are removed [21]. From the 1875 descriptors, the 100 descriptors with the highest correlation are chosen.
After that, a Particle Swarm Optimization (PSO) algorithm is used to select the best descriptors. PSO is an optimization algorithm invented by Kennedy and Eberhart based on the behavior of a swarm of birds [22]. Here, the performance of a swarm of particles is evaluated at every iteration by using equations (1) and (2) as follows [23]:
$$p_{best}(i,t) = \arg\min_{\Pi_i} f(\Pi_i), \qquad \Pi_i \in \{p_{best}(i,t-1),\; P_i(t)\} \qquad (1)$$
$$g_{best}(t) = \arg\min_{i} f\big(p_{best}(i,t)\big) \qquad (2)$$
where $p_{best}(i,t)$ is the best-known position of particle $i$ at iteration step $t$, $g_{best}(t)$ is the best position of the entire swarm at iteration step $t$, $\Pi_i$ is the set of arguments for the fitting function, $f$ is the fitting function, and $P_i(t)$ is the position of particle $i$ at a given iteration $t$. The position and velocity of particle $i$ are updated by using the following equations:
$$V_i(t+1) = \omega V_i(t) + c_1 r_1 \big(p_{best}(i,t) - P_i(t)\big) + c_2 r_2 \big(g_{best}(t) - P_i(t)\big) \qquad (3)$$
$$P_i(t+1) = P_i(t) + V_i(t+1) \qquad (4)$$
where $V_i(t)$ is the velocity of particle $i$ at a given iteration $t$, $\omega$ is an inertia weight, $c_1$ and $c_2$ are the acceleration coefficients of the cognitive and social terms, and $r_1$ and $r_2$ are random numbers drawn uniformly from $[0, 1]$. The first term of equation (3) represents the inertia-weighted velocity from the previous iteration. The second term of (3), called the cognitive term, provides momentum for each particle to move toward the best-known position in its own search history. The third term of (3), called the social term, guides the movement of each particle by the swarm's best-known position [24]. Equation (4) updates the position of each particle by using its velocity from equation (3).
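The personal-best, swarm-best, velocity, and position updates described above can be sketched as follows. This is a minimal continuous PSO that minimizes a toy function, not the descriptor-selection implementation used in this work; the function names, the sphere test function, and the values of the inertia weight and acceleration coefficients are assumptions chosen only for illustration.

```python
import random

def pso_minimize(f, dim, n_particles=20, n_iter=100,
                 w=0.7, c1=1.5, c2=1.5, bounds=(-5.0, 5.0), seed=0):
    """Minimize f with a basic PSO (w, c1, c2 are illustrative values)."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                # best-known position per particle
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # best position of the swarm

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # inertia term + cognitive term + social term
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]         # position update
            val = f(pos[i])
            if val < pbest_val[i]:             # keep the better of old best and new position
                pbest[i], pbest_val[i] = pos[i][:], val
        # swarm best is refreshed once per iteration from the personal bests
        g = min(range(n_particles), key=lambda i: pbest_val[i])
        if pbest_val[g] < gbest_val:
            gbest, gbest_val = pbest[g][:], pbest_val[g]
    return gbest, gbest_val

def sphere(x):
    """Toy fitting function with its minimum at the origin."""
    return sum(v * v for v in x)

best_x, best_val = pso_minimize(sphere, dim=3)
print("best position:", best_x)
print("best value:", best_val)
```

For descriptor selection, the fitting function would instead score a candidate subset of descriptors, for example by the error of a model trained on that subset; how particle positions are mapped to subsets is left open here.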

Support Vector Machine (SVM)
The Support Vector Machine (SVM) method is a learning algorithm based on statistical learning frameworks that constructs, from a given set of samples, a hyperplane that can be used for classification or regression [25]. There are two types of SVM: Support Vector Classification (SVC) and Support Vector Regression (SVR) [26]. In classification, the goal of SVM is to construct the best hyperplane that divides a given set of samples into two different classes in the n-dimensional space. The best hyperplane maximizes the distance (margin) between the hyperplane and the nearest training data points (called support vectors) on both sides [27]. An example of the classification of data by using a linear hyperplane can be seen in Figure 2. The class to which a point $x$ belongs can be determined by the following equation:
$$f(x) = \sum_{i=1}^{n} \alpha_i y_i \langle x_i, x \rangle \qquad (5)$$
where $x_i$ is a training data point, $y_i = 1$ or $y_i = -1$ indicates the class to which the point $x_i$ belongs, $n$ is the number of training data points, $\alpha_i$ is the Lagrange multiplier for point $x_i$, and $\langle \cdot, \cdot \rangle$ is the inner product operator. The sign of $f(x)$ indicates the class of the point $x$. In many cases, the classification cannot be done correctly in the original dimension of the space, so it needs to be done in a higher dimension. In those cases, the class of the point $x$ is calculated by the following equation:
$$f(x) = \sum_{i=1}^{n} \alpha_i y_i \langle \phi(x_i), \phi(x) \rangle \qquad (6)$$
where $\phi$ is a mapping function from the original dimension to the higher dimension. To add another dimension to a data point, several types of functions, called kernel functions, can be used. Several popular kernel functions are the radial basis function (RBF), polynomial, and linear kernels [28]. In Support Vector Regression (SVR), the goal is to build a model such that no output falls outside a specified margin $\varepsilon$ from the model [29]. The three kernel functions mentioned above, which are commonly used in SVR, can be written as:
$$K(X_i, X_j) = X_i^T X_j \qquad (7)$$
$$K(X_i, X_j) = (\gamma X_i^T X_j + r)^d; \quad \gamma > 0; \quad d = (1, 2, \ldots) \qquad (8)$$
$$K(X_i, X_j) = \exp\big(-\gamma \lVert X_i - X_j \rVert^2\big); \quad \gamma > 0 \qquad (9)$$
where (7)-(9) are the linear, polynomial, and RBF kernel functions, respectively. The optimum model is the one with the smallest value of Root Mean Square Error (RMSE) [18].
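The three kernel functions named above can be written in a few lines. This is a small sketch using their standard textbook forms; the parameter names `gamma`, `r`, and `d` and their default values are assumptions for illustration and are not the tuned values of this work.

```python
import math

def linear_kernel(x, z):
    """Linear kernel: the inner product of x and z."""
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, gamma=1.0, r=1.0, d=2):
    """Polynomial kernel: (gamma * <x, z> + r) ** d."""
    return (gamma * linear_kernel(x, z) + r) ** d

def rbf_kernel(x, z, gamma=1.0):
    """RBF kernel: exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

x, z = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(x, z))      # 1*3 + 2*4 = 11.0
print(polynomial_kernel(x, z))  # (1*11 + 1)**2 = 144.0
print(rbf_kernel(x, x))         # identical points give 1.0
```

The RBF kernel depends only on the distance between two points, so its value decays toward zero as the points move apart, while the linear and polynomial kernels grow with the inner product.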

Hyperparameter Tuning
Hyperparameter tuning is used to improve the performance of the model [21]. In SVR, hyperparameter tuning on the dataset and the selected features is used to maximize the prediction performance [29]. To optimize the parameters of SVR, a Particle Swarm Optimization (PSO) method is used [30]. The SVR parameters that need to be optimized are listed in Table 1. The kernel parameter determines the prediction model used by the SVR method; the options are the RBF, linear, and polynomial functions. The C parameter is the cost value, which determines the penalty for data located outside the margin area. The gamma parameter is the coefficient in the kernel function. The degree parameter is the degree of the polynomial kernel. The epsilon parameter is the error margin allowed between the data and the regression line [31].
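The roles of the C and epsilon parameters can be illustrated with the standard epsilon-insensitive loss of SVR. This is a hedged sketch: the function names and all numeric values are hypothetical, and only the data-fit part of the SVR objective is shown (the regularization term on the model weights is omitted).

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the epsilon margin, linear in the excess outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

def svr_penalty(y_true, y_pred, C, epsilon):
    """Cost-weighted sum of the losses of all points outside the margin."""
    return C * sum(epsilon_insensitive_loss(t, p, epsilon)
                   for t, p in zip(y_true, y_pred))

# Hypothetical pIC50 values and predictions (illustration only).
y_obs = [7.0, 8.0, 6.5]
y_hat = [7.05, 8.5, 6.0]
penalty = svr_penalty(y_obs, y_hat, C=10.0, epsilon=0.1)
print(penalty)
```

The first prediction lies within the margin and contributes nothing; the other two each exceed it by 0.4. A larger C penalizes such points more strongly, and a larger epsilon tolerates larger deviations before any penalty is applied.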

Model Validation
In this work, the generated model is validated by using a Leave-One-Out Cross-Validation (LOO-CV) method. The parameter $r_0^2$, shown in (12), represents the squared correlation coefficient with the predicted activity values without intercept. Following [34], the $R^2$ parameter based on the test-set predictions is calculated as shown in (13). Equation (14) shows the squared correlation coefficient with the predicted activity data. Equations (15)-(16) show the parameters that represent the overall internal and external contributions to the validation techniques and are used to check the external predictability of the QSAR model. Equation (17) shows the difference of the average of the randomized squared correlation coefficients.
The thresholds that each validation parameter must satisfy for a model to be accepted are shown in equations (18)-(23).
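The leave-one-out procedure used for internal validation can be sketched as follows. To keep the example self-contained, a one-descriptor least-squares line stands in for the SVR model, and the cross-validated coefficient is computed with the commonly used form $Q^2 = 1 - \mathrm{PRESS}/\mathrm{SS}_{tot}$; both choices, the function names, and the data are assumptions for illustration rather than the exact expressions of this work.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x

def loo_q2(xs, ys):
    """Leave-one-out cross-validated Q^2 for the line model."""
    n = len(xs)
    preds = []
    for i in range(n):
        # Remove sample i, refit on the remaining samples, predict sample i.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        preds.append(slope * xs[i] + intercept)
    mean_y = sum(ys) / n
    press = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss_tot

# Hypothetical descriptor values and activities (illustration only).
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0]
exact_activity = [2.0 * x + 1.0 for x in descriptor]
noisy_activity = [3.1, 4.8, 7.3, 8.9, 11.2]
q2_exact = loo_q2(descriptor, exact_activity)
q2_noisy = loo_q2(descriptor, noisy_activity)
print(q2_exact, q2_noisy)
```

Each sample is predicted by a model that never saw it, so $Q^2$ is a stricter measure than the $R^2$ of a fit to the full training set.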

Results and Discussions
To determine the best model, the accuracy of the model is used as the main criterion. In this work, QSAR modeling is done with several different numbers of descriptors (5, 10, 15, 20, and 25) for each of the RBF, polynomial, and linear kernels. The mean squared error (MSE) of each model for each number of descriptors can be seen in Figure 3. From Figure 3, we can see that all three models tend toward a smaller MSE as the number of descriptors increases, which is desirable. All three models show the smallest MSE when the number of descriptors is set to 25. This indicates that an increase in the number of descriptors corresponds to an improvement in model performance. However, we limit the number of descriptors to 25 to avoid an overly complex model.
The profile of the feature selection process is presented as a plot of MSE against iteration, as shown in Figure 5. We found that the MSE decreased significantly in the first six iterations and then decreased gradually in the following iterations. Figure 5 also shows that the optimization process proceeds as expected, as indicated by the decrease of MSE over the iterations. We plot the predicted values of pIC50 against the actual ones to get an insight into the model performance, as shown in Figure 6. The deviation of the points from the diagonal line indicates the magnitude of the error. We found that the deviation of the data for the RBF kernel is relatively smaller than for the other kernels. Meanwhile, the deviations of the data for the polynomial and linear kernels are quite similar. This deviation corresponds directly to the values of the validation parameters. We calculated several validation parameters to evaluate the model performance and compared them with the threshold values, as shown in Table 3. For the train set, we found that all models satisfied the thresholds, with the SVM with RBF kernel giving the best value of the $Q^2$ parameter. This indicates that the RBF kernel is suitable for transforming the train set into a new dimension that is more linearly separable. However, for the test set, we found that the SVM with a polynomial kernel is not valid because the $(r^2 - r_0^2)/r^2$ parameter does not satisfy the threshold. In this study, we use the $R^2$ value of the test set to determine the best model. By comparing the $R^2$ values, we found that the SVM with RBF kernel also gives the best performance on the test set. This points to the general ability of the RBF kernel to map the dimensions of both the train and test sets. The superior performance of the RBF kernel is related to the flexibility of this kernel in transforming the data set.

Conclusion
We have developed models to predict the activities of DPP IV inhibitors as anti-diabetic agents using the Particle Swarm Optimization-Support Vector Machine (PSO-SVM) method. According to the results, the PSO algorithm can be used to obtain the optimal number of features. The performance of the model improved after a hyperparameter tuning procedure was conducted. Based on the validation results, the SVM model with the RBF kernel gives the best results, with $R^2$ scores of 0.79 and 0.85 for the train and test sets, respectively.

Figure 1. The Distribution of pIC50 Activities

Figure 2. Example of Linear Classification of Data into Two Classes
The Leave-One-Out Cross-Validation (LOO-CV) method works by removing one molecule from the original training dataset and then generating the QSAR model again based on the remaining dataset. The activity of the removed molecule is then predicted from the equations produced by the QSAR model. This cycle is repeated until every molecule in the training dataset has been removed once and the activities of all molecules in the training dataset have been calculated; these values are used in the calculation of the internal validation parameters [32]. This model is used to predict the pIC50 of all molecules in the training dataset [33]. The $R^2$ value represents the level of correlation between the observed and predicted activity data, which enter (10) as $(y_i - \bar{y})$ and $(\hat{y}_i - \bar{y})$, respectively. Here $\bar{y}$ is the average of the molecular activities (taken over the training dataset for internal validation and over the test dataset for external validation), and $y_i$ and $\hat{y}_i$ are the experimental and predicted pIC50 values of a molecule. $k$ is the slope of the regression data, shown in (11).
Reza Rendian Septiawan, Bambang Hadi Prakoso, Isman Kurniawan. Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) Vol. 6 No. 6 (2022). DOI: https://doi.org/10.29207/resti.v6i6.4470. Creative Commons Attribution 4.0 International License (CC BY 4.0)

Figure 3. MSE of Each Model for Each Number of Descriptors.

Figure 4. The Graph of MSE vs Iteration
After performing feature selection, we optimize the model by conducting a hyperparameter tuning process. The optimal parameters of the SVM model for each kernel are presented in Table 2. We found that the value of the C parameter of the polynomial and linear kernel is

Figure 6. The Graph of Experimental pIC50 vs Predicted pIC50 for (a) RBF, (b) Polynomial, and (c) Linear Model
We analyzed the applicability domain (AD) of the model by evaluating leverage values, as shown in Figure 7. The rectangle in the figure indicates the domain of model applicability. According to the figure, only one train data point and two test data points lie outside this region for the RBF kernel. This indicates that the model is applicable to almost all of the data set. Finally, we evaluated the probability of a systematic error occurring in the RBF model by presenting a plot of the residual error, as shown in Figure 8. According to the pattern presented in the figure, we confirmed that no systematic error is found in the model.
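The leverage values used in the applicability domain analysis can be sketched for the simplest case of a single descriptor with an intercept, where the leverage of a point has the closed form $h = 1/n + (x - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$. The one-descriptor simplification, the warning threshold $h^* = 3(p+1)/n$ (a common convention, not stated in this work), the function names, and the data are all assumptions for illustration.

```python
def leverages(train_x, query_x):
    """Leverage of each query point relative to a one-descriptor training set."""
    n = len(train_x)
    mean_x = sum(train_x) / n
    sxx = sum((x - mean_x) ** 2 for x in train_x)
    return [1.0 / n + (x - mean_x) ** 2 / sxx for x in query_x]

def warning_leverage(n_train, n_descriptors):
    """Assumed threshold h* = 3 * (p + 1) / n, with p descriptors and n training points."""
    return 3.0 * (n_descriptors + 1) / n_train

# Hypothetical descriptor values (illustration only).
train_descriptor = [1.0, 2.0, 3.0, 4.0, 5.0]
h_train = leverages(train_descriptor, train_descriptor)
h_far = leverages(train_descriptor, [10.0])[0]
h_star = warning_leverage(len(train_descriptor), 1)

print(h_train)         # leverages of the training points
print(h_far > h_star)  # a point far from the training data exceeds the threshold
```

A compound whose leverage exceeds the threshold lies far from the descriptor space covered by the training data, so its prediction is an extrapolation and should be treated with caution.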

Table 2. The Results of the Hyperparameter Tuning Process

Table 3. Testing Result of QSAR Model