Cancer Detection based on Microarray Data Classification Using FLNN and Hybrid Feature Selection

Cancer is the second deadliest disease in the world after heart disease. According to the WHO's report on cancer, in 2018 there were around 18.1 million cancer cases worldwide and a total of 9.6 million deaths. Given these figures and the growth of bioinformatics technology, early detection is needed: bioinformatics can be used to diagnose cancer so that patients are treated promptly, which helps reduce the number of cancer deaths. DNA microarray data, one such bioinformatics technology, is becoming popular for the analysis and diagnosis of cancer in the medical world. DNA microarray data contains a very large number of genes, so a dimension reduction method is needed to select the most influential features for the classification process. The selected features are then used to classify whether a person has cancer or not. In this research, dimension reduction is performed with a hybrid of Information Gain as the filter method and Genetic Algorithm as the wrapper method, and a Functional Link Neural Network (FLNN) is used as the classification method. In the tests, the highest accuracy obtained was 90.26% for colon cancer data, 85.63% for breast cancer, 100% for lung cancer and ovarian cancer, and 94.10% for prostate cancer.


Introduction
Cancer is the second deadliest disease in the world after heart disease. According to the WHO report on cancer [1], around one in six deaths worldwide is caused by cancer. In 2018, around 18.1 million cancer cases were recorded worldwide, with a total of 9.6 million deaths. WHO predicts that by 2040 the number of cancer cases may increase to 29.4 million, with total deaths predicted to be almost double the number of deaths in 2018. Based on these data, technology that is able to detect cancer early is needed in order to reduce the number of cancer cases in the future.
As bioinformatics technology advances, microarray data has become popular for the analysis and diagnosis of cancer in the medical world. DNA microarray data is often used to examine how large numbers of genes are expressed simultaneously. By using the results of gene expression analysis, determining whether a person has cancer is more efficient than the traditional approach, in which the medical team has to check the patient for symptoms or signs of cancer [2].
DNA microarray data has a very high dimensionality, which can affect accuracy when searching for informative genes in the data [3]. A dimension reduction method is therefore needed to identify informative genes that can be used to predict cancer. Mukesh Kumar et al. [4] studied leukemia, ovarian, and breast cancer datasets using the t-test as the dimension reduction method and Functional Link Neural Network (FLNN) as the classifier. Based on those results, they report that the Legendre Polynomial gave the best performance compared with the other three FLNN techniques, and they suggest using hybridization in dimension reduction to reduce the complexity of the classification model. In that research, Kumar et al. obtained accuracy values of 97.22%, 98.42%, and 85.57%. Putri Tsatsabilla Ramadhani et al. [5], using colon and leukemia cancer datasets, obtained accuracy values of 92.3% and 87.5%.
In this research, the author proposes early cancer detection that uses hybridization for dimension reduction with the IG-GA method and a Functional Link Neural Network based on the Legendre Polynomial for classification. The aim is to determine how much hybridization in dimension reduction affects the required computation time and the classification performance, and how the Learning Rate parameter of the FLNN method affects the obtained performance. The results are expected to help the medical world diagnose early symptoms or signs of cancer.
As shown in Figure 1, the system is divided into several stages. First, the available dataset is pre-processed; then feature selection is carried out using Information Gain and Genetic Algorithm. With the hybrid method, fewer features are passed to the FLNN classification stage than with the non-hybrid method.

Preprocessing Data
Data pre-processing consists of two stages: handling missing values, if any are found in the dataset, and normalizing the data. Missing values are handled in order to maintain good performance results; because the available techniques vary, several scenarios are tried. Normalization is done using the min-max scaler function in Equation 1.
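The two pre-processing stages can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: mean imputation is only one of the possible missing-value techniques (an assumption here), the helper names `impute_mean` and `min_max_scale` are hypothetical, and the scaling uses the usual min-max form (x - min) / (max - min), which is assumed to match Equation 1.

```python
import math

def impute_mean(rows):
    """Replace missing values (None or NaN) with the mean of their column."""
    n_cols = len(rows[0])
    filled = [list(r) for r in rows]
    for j in range(n_cols):
        present = [r[j] for r in rows if r[j] is not None and not math.isnan(r[j])]
        mean_j = sum(present) / len(present)
        for r in filled:
            if r[j] is None or math.isnan(r[j]):
                r[j] = mean_j
    return filled

def min_max_scale(rows):
    """Rescale every column to [0, 1] with (x - min) / (max - min)."""
    n_cols = len(rows[0])
    scaled = [list(r) for r in rows]
    for j in range(n_cols):
        col = [r[j] for r in rows]
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1.0  # a constant column maps to 0.0 instead of dividing by zero
        for r in scaled:
            r[j] = (r[j] - lo) / span
    return scaled

# Toy data: three samples, two features, one missing value.
raw = [[1.0, 10.0], [None, 20.0], [3.0, 30.0]]
clean = min_max_scale(impute_mean(raw))
print(clean)  # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
```

In a cross-validation setting, the column statistics would normally be computed on the training partition only and then applied to the test partition; the sketch omits this for brevity.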

Split Data
K-Fold Cross Validation is a method of dividing data into training data and test data. The proportion of training and test data depends on the predetermined value of K. In this research, the author uses K = 5, so the data is divided into five partitions, with four used as training data and one as test data. During the process, each partition is used in turn as the test data while the remaining partitions serve as training data, and the reported classification result is the average over all partitions. Figure 2 illustrates K-Fold with K = 5.
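The rotation of partitions described above can be illustrated as follows. This is a sketch under stated assumptions: the samples are shuffled with a fixed seed before splitting (the text does not say whether shuffling is used), and `score_fn` is a hypothetical callback standing in for "train on the training indices, evaluate on the test indices".

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; every partition is the test set exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)           # assumption: shuffle before splitting
    folds = [idx[i::k] for i in range(k)]      # k partitions of nearly equal size
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def cross_validate(n_samples, score_fn, k=5):
    """Average score_fn(train_idx, test_idx) over the k partitions."""
    scores = [score_fn(tr, te) for tr, te in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

# With 20 samples and K = 5, each round uses 16 training and 4 test samples.
splits = list(k_fold_indices(20, k=5))
mean_test_fraction = cross_validate(20, lambda tr, te: len(te) / 20)
```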

Information Gain
Feature selection is part of the dimension reduction process and selects the features that are considered important for the classification process [11]. In this research, the author uses Information Gain as the filter method. According to [9] and [6], a filter method works independently of the classification technique. By ranking each feature, the filter identifies the features that provide the most information. Information Gain is the difference between Entropy(S) and EntropyA(S), where Entropy(S) is the parent entropy shown in Equation 3 and EntropyA(S) is the child entropy shown in Equation 4. In the parent entropy, P(Ci, S) is the probability of class Ci in the set S. In the child entropy, Si is the number of cases in the i-th partition, where Ai is the i-th value of the attribute (feature) A.
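A small Python sketch of this ranking is given below. It computes the parent entropy, the size-weighted child entropy, and their difference. Gene expression values are continuous, so the sketch splits each feature at its median to obtain two partitions; this discretization is an assumption made for illustration, since the text does not state how partitions are formed. The function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum over classes of P(Ci, S) * log2 P(Ci, S)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Parent entropy minus the size-weighted entropy of the partitions induced by `values`."""
    n = len(labels)
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    child = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - child

def binarize_at_median(column):
    """Assumption for this sketch: split a continuous feature into two partitions at its median."""
    s = sorted(column)
    median = s[len(s) // 2]
    return [int(x >= median) for x in column]

def top_k_features(X, y, k):
    """Rank feature indices by information gain and keep the best k."""
    n_features = len(X[0])
    gains = [information_gain(binarize_at_median([row[j] for row in X]), y)
             for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: gains[j], reverse=True)[:k]

# Toy data: feature 0 separates the classes, feature 1 does not.
X = [[0.1, 5.0], [0.2, 1.0], [0.8, 4.0], [0.9, 2.0]]
y = [0, 0, 1, 1]
selected = top_k_features(X, y, k=1)
print(selected)  # [0]
```

In the study, the Conclusion states that the best 100 features are kept after ranking, which would correspond to `k=100` in this sketch.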

Genetic Algorithm
According to Eric Cantu-Paz [12], Genetic Algorithm (GA) is a feature selection method that gives good results and can produce higher performance on certain datasets. Before performing feature selection with Genetic Algorithm, the proportion of training data and test data must be determined. Referring to [5], the stages of GA, adapted to the requirements of this study, are as follows.
The first stage is Individual Representation. In this stage, each individual is represented as a string of binary digits (0 or 1). The population is initialized randomly, with one bit per feature for each individual and as many individuals as the population size. Each individual in the population thus represents a candidate set of selected features: if a bit is 0 the corresponding feature is not selected, whereas if the bit is 1 the feature is selected.
The second stage is Fitness Evaluation. In this stage, the FLNN algorithm is used to produce a performance result (F1-score), which serves as the fitness function, as seen in Equation 5.
After the fitness value of each individual is obtained, the individual with the highest fitness value is stored as the Elitism individual so that the best fitness value is not lost during the subsequent genetic operations.
The third stage is Genetic Operation, which has three sub-stages. The first sub-stage is Parent Selection. Referring to [6], the two individuals with the highest fitness values in the last generation are selected as parents for the next generation. The second sub-stage is Crossover, which is applied to the chromosomes selected as parents to produce offspring, commonly called children. Each offspring chromosome inherits genes from the parent chromosomes. Referring to [5], the crossover probability used is 0.8. The last sub-stage is Mutation. For each gene of an offspring chromosome, a random number is generated and compared with the predetermined mutation probability; if the number is less than the mutation probability, the bit is inverted. Referring to [5], the mutation probability used is 0.01.
The fourth stage is Survivor Selection. Generational Replacement is used, so that the next generation contains the new chromosomes resulting from crossover and mutation, as well as the best chromosome stored in the Elitism.
The last stage is the Termination Criterion. The GA iteration ends when it reaches the maximum number of generations or the target that has been set.
Listed in Table 2 are the required parameters in the Genetic Algorithm implementation.
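The stages above can be put together in a compact sketch. Several details are assumptions made for illustration: the crossover type (single-point) is not specified in the text, the population size and number of generations are placeholders because Table 2 is not reproduced here, and `toy_fitness` is a hypothetical stand-in for the FLNN F1-score used in the study. Only the crossover probability (0.8) and mutation probability (0.01) come from the text.

```python
import random

def genetic_feature_selection(n_features, fitness, pop_size=10, generations=20,
                              p_crossover=0.8, p_mutation=0.01, seed=0):
    """Return the best bit mask found; fitness(mask) is supplied by the caller."""
    rng = random.Random(seed)
    # Stage 1: random binary individuals, one bit per feature.
    population = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    elite = max(population, key=fitness)
    for _ in range(generations):                      # termination: maximum generations
        # Stage 2: fitness evaluation and elitism.
        ranked = sorted(population, key=fitness, reverse=True)
        elite = max(elite, ranked[0], key=fitness)
        # Stage 3a: the two fittest individuals become the parents.
        parent_a, parent_b = ranked[0], ranked[1]
        children = []
        while len(children) < pop_size - 1:
            child_a, child_b = parent_a[:], parent_b[:]
            if rng.random() < p_crossover:            # Stage 3b: single-point crossover (assumed)
                cut = rng.randrange(1, n_features)
                child_a = parent_a[:cut] + parent_b[cut:]
                child_b = parent_b[:cut] + parent_a[cut:]
            for child in (child_a, child_b):
                for i in range(n_features):           # Stage 3c: bit-flip mutation
                    if rng.random() < p_mutation:
                        child[i] = 1 - child[i]
                children.append(child)
        # Stage 4: generational replacement, keeping the elite individual.
        population = children[:pop_size - 1] + [elite[:]]
    return max(population, key=fitness)

def toy_fitness(mask):
    """Hypothetical fitness: features 0 and 3 are useful, every selected feature has a small cost."""
    return mask[0] + mask[3] - 0.1 * sum(mask)

best_mask = genetic_feature_selection(n_features=6, fitness=toy_fitness)
```

Because the elite individual is carried into every new generation, the best fitness in the population never decreases. With a real classifier as the fitness function, each evaluation trains a model, which is why the wrapper stage dominates the computation time.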

Functional Link Neural Network
After dimension reduction, the microarray data is classified using the FLNN (Functional Link Neural Network) method with the Legendre Polynomial as the basis function. Samples classified as cancer are represented by the value 1 and samples classified as negative by the value 0. FLNN is an artificial neural network with a single-layer architecture, so it has no hidden layer [5]. According to [8], as cited in [5], FLNN is computationally more efficient and faster than a Multilayer Neural Network (MNN), which uses hidden layers. This is supported by [4], which explains that the Legendre Polynomial basis function gives the best results compared with other FLNN basis functions in classifying microarray data. The steps of the FLNN classification algorithm according to [8], as cited in [11], are as follows. Based on Figure 3, the first step is to compute the Legendre Polynomial values using Equation 6 as the basis function.
Here Li is the Legendre Polynomial, i is the order of the polynomial, and x is the original input value.
The second step is to compute the weighted sum of the Legendre Polynomial values, as shown in Figure 3, using Equation 7.
Here Si is the linear sum of the Legendre Polynomial values, wi is the weight, bi is the bias, and n is the number of features in one sample.
In the third step, the linear sum is passed through the sigmoid activation function in Equation 8.
The next step is to evaluate the classification result using the mean square error function in Equation 9.
Here di is the target value and yi is the predicted value.
In the last step, FLNN uses backpropagation learning, which has two calculation stages. The first is a forward calculation of the error between the predicted class and the target class; the second is a backward calculation that propagates the error backwards to update the weight w using Equation 10.
Here the step size is set by the learning rate (LR), m̂ is the first momentum, v̂ is the second momentum, and ε is epsilon.
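The steps of this section can be sketched in plain Python as follows. The sketch makes several assumptions that are not confirmed by the text: the Legendre values are generated with the standard three-term recurrence, the polynomial of order 0 is replaced by a single bias input, the weight update follows the usual Adam form (suggested by the first and second momentum terms and epsilon), and the values of `beta1`, `beta2`, and `eps` are common defaults rather than the study's settings. All function names are hypothetical.

```python
import math
import random

def legendre_expand(x, order):
    """For every feature value v, append L_1(v) ... L_order(v) using the standard recurrence."""
    expanded = []
    for v in x:
        prev, curr = 1.0, v                      # L_0(v) = 1 and L_1(v) = v
        expanded.append(curr)
        for n in range(1, order):
            # (n + 1) L_{n+1}(v) = (2n + 1) v L_n(v) - n L_{n-1}(v)
            prev, curr = curr, ((2 * n + 1) * v * curr - n * prev) / (n + 1)
            expanded.append(curr)
    return expanded

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def train_flnn(X, y, order=2, lr=0.001, epochs=200, seed=0,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """Single-layer network on Legendre-expanded inputs, squared error, Adam-style updates."""
    rng = random.Random(seed)
    Z = [legendre_expand(x, order) + [1.0] for x in X]   # trailing 1.0 carries the bias
    w = [rng.uniform(-0.5, 0.5) for _ in Z[0]]
    m = [0.0] * len(w)                                   # first momentum
    v = [0.0] * len(w)                                   # second momentum
    t = 0
    for _ in range(epochs):
        for z, target in zip(Z, y):
            t += 1
            out = sigmoid(sum(wi * zi for wi, zi in zip(w, z)))   # forward calculation
            # Derivative of 0.5 * (out - target)**2 with respect to the linear sum.
            delta = (out - target) * out * (1.0 - out)
            for i, zi in enumerate(z):                            # backward calculation
                g = delta * zi
                m[i] = beta1 * m[i] + (1 - beta1) * g
                v[i] = beta2 * v[i] + (1 - beta2) * g * g
                m_hat = m[i] / (1 - beta1 ** t)
                v_hat = v[i] / (1 - beta2 ** t)
                w[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

def predict(w, x, order=2):
    """Return 1 (cancer) if the sigmoid output is at least 0.5, otherwise 0."""
    z = legendre_expand(x, order) + [1.0]
    return int(sigmoid(sum(wi * zi for wi, zi in zip(w, z))) >= 0.5)

# Toy one-feature data; lr=0.05 is chosen only so this tiny example converges quickly.
X_toy = [[0.0], [0.1], [0.9], [1.0]]
y_toy = [0, 0, 1, 1]
w = train_flnn(X_toy, y_toy, order=2, lr=0.05, epochs=300)
preds = [predict(w, x, order=2) for x in X_toy]
```

Raising `order` lengthens every expanded input vector by one value per feature, which is the growth of the input space that the study associates with longer computation time.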

Performance Evaluation
The last step in this research is to evaluate how well the system performs with hybrid dimension reduction and FLNN as the classification method. A confusion matrix is used as the basis for comparing the actual data with the predicted data. In Table 3, TP (true positive) is the number of samples that the system correctly classifies as positive for cancer, FP (false positive) is the number of samples that are actually negative but that the system classifies as positive, FN (false negative) is the number of samples that are actually positive but that the system classifies as negative, and TN (true negative) is the number of samples that the system correctly classifies as negative for cancer.
Precision is the proportion of samples predicted as positive that are actually positive, which can be obtained with Equation 11.

\[ \text{Precision} = \frac{TP}{TP + FP} \qquad (11) \]
Recall is the proportion of actually positive samples that the system correctly identifies, which can be obtained with Equation 12.
F1-score is the harmonic mean of precision and recall, which can be obtained with Equation 13.
Accuracy is the proportion of all data that the system predicts correctly (true positives and true negatives), which can be obtained with Equation 14.
The first test scenario compares Information Gain alone with IG+GA to examine the effect of hybridization in the dimension reduction process. The next scenario examines the effect of the learning rate parameter on the FLNN classification model. The limitations of this study are the use of orders 2 to 4 for the Legendre Polynomial and the use of the GA parameters for feature selection shown in Table 2.
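The four evaluation measures used in these scenarios follow directly from the confusion-matrix counts. The sketch below uses the standard definitions, which are assumed to match Equations 11 to 14; the function names and the toy labels are illustrative only.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

# Toy example: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)              # 2 1 2 5
print(accuracy(tp, fp, fn, tn))    # 0.7
```

In this toy case accuracy (0.7) is noticeably higher than the F1-score (4/7, about 0.57), which shows why the study reports both measures.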

Test Result
This section presents the test results obtained with the Information Gain and Genetic Algorithm hybrid feature selection method and the FLNN classification method with predetermined parameters. Table 4 shows the experiments using Information Gain dimension reduction without the wrapper method and a learning rate of 0.6. The best accuracy is 52.53% for breast cancer data at order 2; 59.99%, 84.44%, and 54.39% for colon cancer, ovarian cancer, and prostate cancer data at order 3; and 98.90% for lung cancer data at order 4.
Table 5 shows the performance after wrapping with the Genetic Algorithm method, with a learning rate of 0.6. The best accuracy is 64.62% for colon cancer data and 100% for lung cancer data at all orders, 98.42% for ovarian cancer data at order 4, 53.58% for breast cancer data at both orders 2 and 3, and 61.69% for prostate cancer data at order 2.
Table 6 shows the experiments using Information Gain dimension reduction without the wrapper method and a learning rate of 0.001. The best accuracy is 69.05% for breast cancer data at order 2, 99.61% for ovarian cancer data at both orders 2 and 3, 84.10% and 99.44% for colon cancer and lung cancer data at order 4, and 91.19% for prostate cancer data at order 3.
Table 7 shows the performance after wrapping with the Genetic Algorithm method, with a learning rate of 0.001. The best accuracy is 90.26% and 94.10% for colon cancer and prostate cancer data at order 4, 85.63% for breast cancer data at order 3, and 99.44% and 100% for lung cancer and ovarian cancer data at all orders.

Effect of Hybridization Method on Performance Results
Tests were carried out on the five cancer datasets with predetermined parameters, whose values follow the author's research references. Based on Figures 4 to 13, the use of Information Gain and Genetic Algorithm as a hybrid method increases the accuracy and F1-score in almost all scenarios for each dataset. This is because the Genetic Algorithm, as the wrapper method in the dimension reduction process, is able to optimize the performance of the FLNN model, which is used as the fitness function in the Genetic Algorithm. This can be seen in the colored bars: the red bars are higher than the blue bars in almost all cases, and likewise the green bars compared with the purple bars. However, according to [9], as cited in [6], a weakness of using Genetic Algorithm as the wrapper is its inefficient computation time, because a hypothesis model must be trained and tested on the feature space used. Increasing the order of the FLNN also increases the computation time, because the input space grows.

The Effect of Learning Rate on Performance Results
In the FLNN classification method, the Learning Rate (LR) parameter strongly influences the performance results on a dataset. Putri [5] explains that the Learning Rate parameter plays a role during the training process; that study used Learning Rate values of 0.6 and 0.01 and found that a Learning Rate of 0.6 gave better performance on colon cancer and leukemia data. In contrast to the tests conducted by Putri, the author uses five datasets: colon cancer, breast cancer, lung cancer, ovarian cancer, and prostate cancer. Based on tests on these five datasets, the author obtains results that differ from those of Putri [5], in which colon cancer tended to perform better with an LR of 0.6 than with an LR of 0.01. For the colon, breast, ovarian, and prostate cancer datasets, the author obtains the best performance with an LR of 0.001, as seen in Figures 4 and 5, Figures 8 and 9, Figures 10 and 11, and Figures 12 and 13, respectively. Lung cancer, in contrast, tends to perform better with an LR of 0.6, as seen in Figures 6 and 7. The difference from Putri [5] on the colon cancer dataset may be caused by differences in the parameters used in the FLNN algorithm. As shown in Figures 12 and 13, the performance on prostate cancer data increases significantly with an LR of 0.001 compared with an LR of 0.6. The same holds for breast cancer and ovarian cancer data, although the increase is not as large as for prostate cancer data.
According to [5], differences in performance can occur because each dataset has different characteristics, so the value of the LR parameter should be chosen carefully in the FLNN classification model, since it greatly affects how well the neural network achieves the expected results. The LR value affects backpropagation learning because LR is the parameter used when updating the weights for each input. If the LR is too small, training takes longer because the steps toward the minimum of the loss function are smaller; if the LR is too large, training may diverge. An LR that is too large can also cause very large weight changes, so that the optimizer makes the loss worse.
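The general behavior of small and large learning rates can be seen on a toy problem. The sketch below runs plain gradient descent on f(w) = w², which is not the FLNN loss and uses no data from the study; it only illustrates the slow, fast, and divergent regimes. The value 1.1 is a hypothetical "too large" rate chosen for this function.

```python
def gradient_descent(lr, w0=1.0, steps=50):
    """Minimise f(w) = w**2 (gradient 2*w) and return the final distance from the minimum."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w          # each step multiplies w by (1 - 2 * lr)
    return abs(w)

slow = gradient_descent(0.001)       # factor 0.998 per step: still close to the start
fast = gradient_descent(0.6)         # factor -0.2 per step: converges quickly
diverged = gradient_descent(1.1)     # factor -1.2 per step: moves away from the minimum
print(slow, fast, diverged)
```

Which rate is best depends on the shape of the loss, so the toy result says nothing about the best LR for a particular dataset; in the experiments above, 0.001 worked better than 0.6 on four of the five datasets.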

Conclusion
Based on the tests conducted on the five datasets, the author obtains a cancer detection system that uses hybridization for dimension reduction, in which Information Gain and Genetic Algorithm optimize the performance results and the required computation time. Information Gain reduces computation time by keeping the best 100 features according to the ranking, while Genetic Algorithm improves the performance on the features previously selected by Information Gain. From the test scenarios, the author finds that the Learning Rate parameter has a major influence on performance: an LR of 0.6 gives the best result for lung cancer data, with the highest accuracy of 100%. In contrast, the colon cancer, breast cancer, ovarian cancer, and prostate cancer datasets reach their highest accuracy values of 90.26%, 85.63%, 100%, and 94.10%, respectively, with an LR of 0.001. Referring to [5] and [11], increasing the order of the Legendre Polynomial does not guarantee better performance and enlarges the input space, so it requires a longer computation time. From the results, it can be concluded that the hybridization method optimizes the performance of the FLNN model and the computation time, whereas the Learning Rate is a hyperparameter whose best value can be found by trying different values and seeing which gives the best loss without sacrificing training speed.
Future research could change the combination of dimension reduction methods used in the hybridization, for example t-test and Genetic Algorithm as recommended by Mukesh Kumar [4], or optimize the parameters used in the FLNN algorithm.