K Nearest Neighbor Imputation Performance on Missing Value Data Graduate User Satisfaction

Missing values are a common problem in data processing for scientific research and reduce the accuracy of research results. Several methods have been applied to handle missing values: deleting all records that contain a missing value; replacing missing values with a single calculated statistical estimate such as the mean, median, minimum, maximum, or most frequent value; model-based methods such as maximum likelihood and expectation maximization; and machine learning methods such as K Nearest Neighbor (KNN). This research uses KNN Imputation (KNNI) to predict missing values. The data come from a questionnaire survey of graduate user satisfaction with seven assessment criteria, namely ethics, expertise in the field of science (main competence), foreign language skills, use of information technology, communication skills, cooperation, and self-development. Imputation predictions using KNNI were tested on graduate user satisfaction data for STMIK PPKIA Tarakanita Rahmawati graduates from 2018 to 2021 with five values of k nearest neighbors, namely 1, 5, 10, 15, and 20. With k = 5 the RMSE is 0.316 and the MAPE is 3.33%; both values are smaller than those obtained with the other values of k. k = 5 therefore gives the best imputation prediction under both RMSE and MAPE, and its MAPE is below 10%, which means it is very good.


Introduction
A more specific tracer study on the assessment of graduate users is much needed by universities because it provides feedback from graduate users for improving education systems and management. Tracer studies of this kind, in which graduate users assess a university's graduates, are carried out in almost the same way at every university, namely by distributing questionnaires to the agencies/companies/institutions where the alumni work. The agency/company/institution is asked to assess each alumnus. The assessment is usually carried out by superiors in the alumni's field of work so that it can be more objective. The questionnaire media vary between universities, for example paper questionnaires, Google Forms, or desktop/website/mobile applications owned by the assessed university. STMIK PPKIA Tarakanita Rahmawati is one of several private universities in North Kalimantan that conducts such a tracer study. PPKIA, as STMIK PPKIA Tarakanita Rahmawati is usually called, still evaluates graduates with a paper questionnaire that is folded into an envelope to maintain confidentiality, handed to graduate users, and returned after being filled out. The results of the graduate user assessment questionnaire are usually recapitulated as college evaluation material if there is a bad assessment of graduates.
However, the process of recapitulating the graduate user questionnaire faces an important problem, namely missing values. Missing data, or a missing value, is a condition where values are incomplete or empty on one or more criteria. Missing values are a common problem in most scientific research, in fields such as Biology, Medicine, or Climate Science. They can arise from various sources such as sample handling errors, low signal-to-noise ratio, measurement error, non-response, or deleted deviant values [1]. Rubin (2022) defines missing data based on three loss mechanisms: data are missing completely at random (MCAR) when the probability of a case having a missing value for a variable does not depend on either the known values or the missing data; data are missing at random (MAR) when the probability of a case having a missing value for a variable may depend on the known values but not on the missing data value itself; data are missing not at random (MNAR) when the probability of a case having a missing value for a variable can depend on the value of that variable [2]. Missing values create ambiguity when analyzing data, which can affect the properties of statistical estimators and result in loss of power and misleading conclusions [3]. It is therefore important to handle missing values during data processing before the data are used in other processes to obtain information.
Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) Vol.
Missing values have occurred several times in the graduate user assessment questionnaire. Most often, values are lost on certain assessment attributes; questionnaires that are completely empty or not assessed at all are almost never found. When asked about questionnaires with incomplete scores, most graduate users answered that they either could not judge or forgot. In the case of forgetting, the questionnaire can still be filled in again, but the case of not being able to judge needs another solution. An example is the foreign language proficiency attribute: because not all companies/institutions/agencies use foreign languages, graduate users find this attribute very difficult to assess. The form of missing values can vary; the most frequently encountered are empty/NaN, 0, and -.
There is a large body of literature on methods for dealing with missing observations or missing values [4]. These methods are divided into four main categories [5]. The first is the deletion of all records that have missing values. This is the simplest approach: incomplete data are removed from the data set and only the available data are analyzed. Deletion is done listwise or pairwise [6]. Despite its simplicity, removing too much data can significantly hinder the analysis and reduce the statistical significance of its conclusions, which then adversely affects the prediction process [2]. The second is single imputation, which replaces the missing values with a single calculated statistical estimate such as the mean, median, minimum, maximum, or mode [2]. The third is model-based imputation, such as the maximum likelihood method [7] and expectation maximization [8], where more than one plausible value is used to predict one missing observation.
The fourth is machine learning methods such as K Nearest Neighbor [9]. In this study, the authors use K Nearest Neighbor Imputation (KNNI) to predict missing values in the questionnaire data on graduate user satisfaction. Predicting the missing values is part of pre-processing, so that the graduate user satisfaction data can later be grouped or classified with better results than without imputation. This study uses two evaluation methods, namely Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE), to test the accuracy of the predicted missing values.
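Because missing answers appear in several forms (empty/NaN, 0, and -), they have to be mapped to one marker before any imputation method can be applied. The following is a minimal sketch of such a step, not the authors' procedure; the function name `to_score` and the sample row are hypothetical, and treating 0 as missing assumes a 1 to 5 rating scale.

```python
import math

def to_score(cell):
    """Convert one raw questionnaire cell to a float, or NaN if it is missing."""
    if cell is None:
        return math.nan
    text = str(cell).strip()
    # Empty cells, "-" and the literal text "NaN" all count as missing.
    if text in ("", "-") or text.lower() == "nan":
        return math.nan
    value = float(text)
    # Assumption: on a 1-5 scale a score of 0 cannot be a real answer.
    return math.nan if value == 0 else value

# Hypothetical answers of one respondent on the seven criteria.
raw_row = ["5", "4", "-", "3", "", "0", "4"]
clean_row = [to_score(cell) for cell in raw_row]
print(clean_row)
```

After this step every missing answer is a NaN, which is the marker that array libraries and imputers usually expect.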

Object of research
The object of the research is questionnaire data on the satisfaction level of graduate users, or tracer study data, from users of STMIK PPKIA Tarakanita Rahmawati graduates. The data have seven assessment criteria in the form of questions, as shown in Table 1, with five assessment categories, as shown in Table 2. The criteria are based on the seven aspects that serve as indicators of graduate user satisfaction in the BAN-PT accreditation guidelines [10]; STMIK PPKIA Tarakanita Rahmawati also uses these criteria in its graduate user satisfaction questionnaire. The dataset consists of 100 graduate user records filled in by agencies/institutions/companies from 2018 to 2021.
Table 2 (only the last rows are recoverable): Not enough = score 2; Very less = score 1.

Research Stages
The research begins with preparing the research data set, namely tracer study data on alumni of STMIK PPKIA Tarakanita Rahmawati. The data are divided into two parts: the first part contains complete data used as training data; the second part contains complete data from which some values on certain attributes are removed to form the testing data. This stage is called data preprocessing. KNN imputation is implemented in two ways: a manual calculation in Microsoft Office Excel, used as control data, and the main calculation using the KNN imputer library from scikit-learn. The evaluation tests accuracy by comparing the actual data with the values predicted by the scikit-learn KNN imputer. The evaluation uses two methods, namely Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE).
Imputation with K Nearest Neighbor (KNN) is one of the most popular methods for solving missing value problems [11]. KNN is popular because of its simplicity and proven effectiveness in many missing value imputation problems [12]. Its advantages are that it can predict two types of data, discrete data (using the mode) and continuous data (using the mean), and that it does not require building a forecasting model for each criterion that has missing values. Its weakness is that, when looking for the observations most similar to an observation with missing values, it searches the entire training data or dataset [9]. With a large dataset or training set, this search takes a long time. Even so, KNN imputation is still a good method for imputing missing values [13]. The steps for finding a missing value with KNN imputation are as follows [14].
First, determine K, the number of nearest observations used. Second, calculate the distance between the observation that has a missing value on a given variable and the other observations that do not have a missing value on that variable, using the Euclidean distance in formula 1 [15].
$$d(x_i, x_j) = \sqrt{\sum_{k=1}^{n} (x_{ik} - x_{jk})^2} \qquad (1)$$
Where $d(x_i, x_j)$ is the distance between observation i and observation j, $x_i$ is the training data, $x_j$ is the testing data, n is the number of attributes, k is the attribute index, $x_{ik}$ is the value of the i-th observation on the k-th attribute, and $x_{jk}$ is the value of the j-th observation on the k-th attribute. Third, select the K nearest observations, that is, those with the smallest distance values. The values of these K nearest observations on the variable concerned are used in the imputation process for the observation that has a missing value. Fourth, calculate the weights of the K nearest observations. The closest observation gets the highest weight. Fifth, calculate the weighted average of the K nearest observations that do not have a missing value using formula 2 [16].
Where $X_j$ is the weighted average, $V_{kj}$ is the value of the complete data on the variable that has the missing value, and k is the nearest observation used.
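The five steps above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' spreadsheet: the function name `knn_impute_row` and the small arrays are hypothetical, and the neighbors are given equal weights (a plain mean), which is an assumption because the exact weighting of formula 2 is not recoverable from the text.

```python
import numpy as np

def knn_impute_row(row, train, k):
    """Fill the NaN entries of `row` from the k nearest complete rows of `train`."""
    row = np.asarray(row, dtype=float)
    train = np.asarray(train, dtype=float)
    observed = ~np.isnan(row)                    # attributes that were answered
    # Step 2: Euclidean distance, computed over the observed attributes only.
    diffs = train[:, observed] - row[observed]
    dist = np.sqrt((diffs ** 2).sum(axis=1))
    # Step 3: indices of the k smallest distances (stable sort keeps tie order).
    nearest = np.argsort(dist, kind="stable")[:k]
    # Steps 4-5 (assumed equal weights): mean of the neighbours' values.
    filled = row.copy()
    filled[~observed] = train[nearest][:, ~observed].mean(axis=0)
    return filled

# Hypothetical training rows (complete) and one testing row with a missing value.
train = np.array([[5, 4, 4],
                  [4, 4, 3],
                  [1, 2, 1],
                  [5, 5, 5]], dtype=float)
target = np.array([5, np.nan, 4])

imputed_k1 = knn_impute_row(target, train, k=1)
imputed_k2 = knn_impute_row(target, train, k=2)
print(imputed_k1, imputed_k2)
```

With k=1 only the closest row contributes, while k=2 averages the two closest rows, which shows how the choice of K changes the imputed value.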
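The accuracy of the imputed values is evaluated with RMSE and MAPE:
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (X_i - F_i)^2}$$
$$MAPE = \frac{100\%}{n}\sum_{i=1}^{n} \left| \frac{X_i - F_i}{X_i} \right|$$
Where $X_i$ is the actual data for the i-th period, $F_i$ is the predicted result for the i-th period, and n is the number of time periods.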
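As an illustration of the two evaluation measures, the sketch below computes RMSE and MAPE for a pair of short lists using their standard definitions. The function names and the sample values are hypothetical and are not taken from the study's tables.

```python
import math

def rmse(actual, predicted):
    """Root Mean Squared Error between actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((x - f) ** 2 for x, f in zip(actual, predicted)) / n)

def mape(actual, predicted):
    """Mean Absolute Percentage Error in percent; actual values must be non-zero."""
    n = len(actual)
    return 100.0 * sum(abs(x - f) / abs(x) for x, f in zip(actual, predicted)) / n

# Hypothetical example: four removed scores and their imputed values.
actual = [5, 4, 4, 3]
predicted = [5, 4, 3, 3]

rmse_value = rmse(actual, predicted)
mape_value = mape(actual, predicted)
print(rmse_value, mape_value)
```

Both measures are zero when every imputed value equals the original one, so lower values indicate better imputation.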

Results and Discussions
The dataset in this study consists of 100 sample questionnaire records on the satisfaction level of users of STMIK PPKIA Tarakanita Rahmawati graduates from 2018 to 2021. The authors consider this sufficient to represent the distribution of assessments from respondents. Limiting the data to 100 records also saves time in the KNNI calculation because, as noted, KNNI computes distances over the entire dataset, so more data means a longer calculation. Susanti [14] concluded in her research on imputation that the larger the proportion of missing values in the dataset, the lower the accuracy obtained. The training data consist of 90 records and the testing data of 10 records, which means the authors use only 10% of the data as testing data.

Training Data
The training data are the first 90 records of the graduate user satisfaction data for STMIK PPKIA Tarakanita Rahmawati from 2018 to 2021. A snippet of the training data can be seen in Table 3.

Testing Data
The testing data are the last 10 records of the graduate user satisfaction data. These records do not originally have missing values; instead, some values on several attributes or assessment criteria are removed at random. This is done so that the original values remain known and the prediction accuracy of the imputation method used in this study can be measured. The data before the values are removed can be seen in Table 4, and the data after removal, with missing values written as "NaN", can be seen in Table 5.
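The random removal of values from complete testing records can be sketched as follows. The data here are randomly generated placeholders, not the study's records, and removing exactly one criterion per respondent is an assumed masking scheme chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # fixed seed, only to make the sketch repeatable

# Hypothetical complete testing data: 10 respondents, 7 criteria, scores 1-5.
complete = rng.integers(1, 6, size=(10, 7)).astype(float)

# Assumed scheme: hide one randomly chosen criterion per respondent.
masked = complete.copy()
rows = np.arange(complete.shape[0])
cols = rng.integers(0, complete.shape[1], size=complete.shape[0])
masked[rows, cols] = np.nan

# The original values at the hidden positions are kept for later evaluation.
hidden_truth = complete[rows, cols]
print(np.isnan(masked).sum(), hidden_truth)
```

Keeping `hidden_truth` alongside the masked array is what allows the imputed values to be scored against the original answers afterwards.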

Calculation of KNNI
At this stage, K Nearest Neighbor Imputation is calculated manually using Microsoft Office Excel. This calculation serves as control data for the main calculation using the KNN imputer library from scikit-learn. The first step is to determine the K values of nearest neighbors to be used. This study uses five K values, namely K=1, K=5, K=10, K=15, and K=20, so that the K with the best accuracy can be identified. The second step is to calculate the distance between the testing data and the training data using the Euclidean distance in equation 1. The following is an example of calculating the distance between alternatives R1 and R91. The distances, sorted from smallest to largest, can be seen in Table 6. The next step is to calculate the weight for each predetermined K value using equation 2, which then gives the imputed value. The following is an example of calculating the weight of R91 against the dataset with k=5.
R91K5= 1 ( 7 1 + 23 1 + 28 1 + 32 1 + 39 1)
The imputed value for each K value in the R91 testing data can be seen in Table 7. The distance and weight calculations are carried out for all testing data against all training data to produce the imputed results for all testing data, which can be seen in Table 8.

KNNI Calculation with the KNN Imputer Library
The main calculation uses the KNN imputer library from scikit-learn, as used by Yazan Jian [21] and Laboni Akter [22]. The first step is to import NumPy (Numerical Python), a Python library for creating one-dimensional and multidimensional array objects, and then the KNN imputer from scikit-learn. The command to import NumPy is "import numpy as np", where np is just an alias. The command to import the KNN imputer from scikit-learn is "from sklearn.impute import KNNImputer".
Next, set the desired number of nearest neighbors k; this study uses five values of k, namely 1, 5, 10, 15, and 20. The command that sets k is "imputer = KNNImputer(n_neighbors=20)", where imputer is a variable and 20 is the selected value of k.
Finally, obtain the imputed results from the KNN imputer with the command "imputer.fit_transform(X)", where X is the variable that holds the dataset. The imputation result for data R91 on criterion A1 with 10 nearest neighbors is 5.
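Put together, the three commands quoted above form a short script. The sketch below runs them on a small hypothetical array rather than the study's 100 records, and it uses k values of 1, 2, and 4 only because the toy data has four complete rows; the variable names other than `imputer` and `X` are illustrative.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: four complete rows followed by one row with a missing value.
X = np.array([
    [5, 4, 4],
    [4, 4, 3],
    [1, 2, 1],
    [5, 5, 5],
    [5, np.nan, 4],
], dtype=float)

results = {}
for k in (1, 2, 4):
    imputer = KNNImputer(n_neighbors=k)      # default weights="uniform"
    X_filled = imputer.fit_transform(X)      # returns a copy with NaN replaced
    results[k] = X_filled[-1, 1]             # imputed value of the last row

print(results)
```

By default `KNNImputer` measures distance with the `nan_euclidean` metric, which ignores missing coordinates and rescales the result by the share of coordinates that are present. When all donor rows are complete, this rescaling does not change which neighbors are nearest.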
The recap of the imputation results on the R91 to R100 testing data using 1, 5, 10, 15, and 20 nearest neighbors shows almost the same results as the manual calculations, with only slight differences in the hundredths place or smaller, such as for R92 with k = 20, where the manual calculation gives 4.55 while the KNN imputer gives 4.1. A complete recap of the imputation results using the KNN imputer is given in Table 9. Table 10.

Comparison of KNNI and Statistical Methods
The authors compare the missing value predictions of KNNI with simple statistical methods that are widely used for imputation, namely the mean, median, and most frequent value (mode). The statistical methods are also calculated with the scikit-learn library. The comparison can be seen in graphical form in Figure 2. Figure 2 shows that, under both RMSE and MAPE, all KNNI settings (K1, K5, K10, K15, and K20) give much better accuracy than all statistical methods (mean, median, and most frequent). Among the statistical methods, the median gives the best accuracy, with an RMSE of 0.7071 and a MAPE of 14.5%, which is still not below 10%.
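The three statistical baselines can be reproduced with scikit-learn's `SimpleImputer`, which fills each column with a single statistic of its observed values. The sketch below is an assumption about how such a baseline might be set up, using a small hypothetical array rather than the study's data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: the second criterion of the last respondent is missing.
X = np.array([
    [5, 4],
    [4, 4],
    [1, 4],
    [5, 2],
    [3, np.nan],
], dtype=float)

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    filled[strategy] = imputer.fit_transform(X)[-1, 1]   # value put in the gap

print(filled)
```

Unlike KNN imputation, each of these strategies assigns the same value to every missing entry in a column, regardless of the respondent's other answers.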

Conclusion
This study tested imputation predictions using K Nearest Neighbor Imputation (KNNI) on graduate user satisfaction data from STMIK PPKIA Tarakanita Rahmawati graduates from 2018 to 2021 with five values of k nearest neighbors, namely 1, 5, 10, 15, and 20. With k = 5, the RMSE is 0.316 and the MAPE is 3.33%; both values are smaller than those obtained with the other values of k. This means that k = 5 gives the best imputation prediction under both RMSE and MAPE, and its MAPE is below 10%, which means it is very good. In further research, KNNI will be compared with other imputation prediction methods.