K-Means Algorithm Implementation for Project Health Clustering

Indonesia has several companies in the telecommunications sector. Various projects run in parallel to support the success of these companies. A project's potential can boost the company's revenue and productivity. On the other hand, every project carries risks that need to be considered before it starts. Project data is recorded from start to finish so that the project's progress and improvements can be monitored and analyzed. As projects run, the project team at one of Indonesia's telecommunications companies, which is responsible for the processes leading to project success, requires a project health category. Therefore, this study develops a process for clustering project health. Clustering is a type of unsupervised learning that runs on unlabeled data. One of the clustering algorithms is K-Means, which groups data based on similar criteria. We also apply dimensionality reduction with the Principal Component Analysis (PCA) method to determine its impact on clustering with the K-Means algorithm. The study yields three clusters, or project health categories, labeled clusters 0, 1, and 2. Evaluation with the Calinski-Harabasz Index showed that the K-Means model on the PCA-reduced data performed better than the standard K-Means model, with a Calinski-Harabasz Index value of 55633.12776405707 compared with 25914.578262576793.


Introduction
Project management has developed rapidly in recent years, especially because of digitalization in almost all fields. The number of projects that run in a given period requires managers to identify these projects and assess their potential and risks. Project health is an important component that shows the status of an overall project as it progresses towards project success. This component is important for a company because it can directly affect client satisfaction, productivity, and business success. Project health can be measured using various indicators, including time, financial success, employee productivity, budgeting, and the quality of the project itself [1]. Every project has a baseline or target for project completion in order to make a profit within the expected time period [2].
Numerous businesses in Indonesia operate in the telecommunications industry. One of the main project types of a telecommunications company is in the field of radio access networks (RAN). Such a project can take the form of tower construction, network installation, or capacity increases. The number of projects running in parallel, the complexity of the data, and the long processing time make it difficult for the project team at one of Indonesia's telecommunications companies to determine project health manually.
Machine learning is a computational paradigm in which the capacity for problem-solving is built from previous examples [3]. Unsupervised learning is a type of machine learning that handles unlabeled data [4]. In this type of learning, it is assumed that all training examples are unlabeled, and the examples are learned based on their similarities [3]. Clustering is one such method and includes several algorithms that can be chosen based on the data and the problem. In this study, we used the K-Means algorithm to cluster project health, an algorithm that is widely used and has proven results for clustering problems.
Dimensionality reduction is a technique for finding lower-dimensional representations of data while retaining properties that are key to a particular problem. The classic technique for dimensionality reduction is principal component analysis (PCA).
Based on the literature reviewed and the problems found, we conducted this study to group and determine the project health of the RAN projects at one of Indonesia's telecommunications companies. The aim of this research is to help the project team monitor the progress of RAN projects and follow up on projects that have poor project health, in the hope that these projects can be completed on time. With this study, we hope to help the project team determine project health based on the project baseline in a shorter time and in a more effective way.

Research Methods
To achieve the objectives of this study, we organized the research stages into the flow chart shown in Figure 1.

Problem Identification and Formulation
First, we identified and formulated the problems that exist at one of Indonesia's telecommunications companies, especially in the project team. In accordance with company regulations, this company is hereinafter referred to as PT XYZ. The problem experienced by the project team is that determining project health in a Radio Access Network (RAN) project is ineffective when done manually.
Based on interviews with the project team at PT XYZ, the RAN project at PT XYZ has several stages, from project planning to project closure. These stages are represented in Figure 2.
Purchase Order (PO): The first stage in the RAN project at PT XYZ is the purchase order, a commercial document issued when the vendor and the company have agreed to work together to build the project. POs are used to initiate purchases and provide a means of ensuring that transactions are covered by the right contract.
Kick-Off Meeting (KOM): According to the discussion with the PT XYZ project team, the Kick-Off Meeting (KOM) is a strategy to increase a project's chance of success. The KOM is also key to ensuring a shared understanding of the project objectives. At this stage, stakeholders begin to determine the target or project baseline based on the contract between the two parties, the timeline, the organizational structure, the business processes, and the Responsible, Accountable, Consulted, and Informed (RACI) matrix.
Ready for Installation (RFI): RFI is a project status where the tower is ready for installation of base transceiver station tower equipment.
Material On-Site (MOS): As the name implies, material on site, or MOS, indicates the stage at which construction materials have arrived at the construction site [10].
Installation: Base Transceiver Station (BTS) tower installation stage.
Ready for Service (RFS): Configuration of the installed tower, and if this configuration is successful, it will be connected to the network.
Unlock: When the tower construction is in the unlock stage, the network is ready to use.
Acceptance Test Procedure (ATP): Based on the discussion with the project team at PT XYZ, the Acceptance Test Procedure (ATP) is a stage that can take the form of checking devices, validating sites with field testing, and verifying device performance so that the project can be taken to commercial service. At this stage, various conditions and testing criteria must be met so that the project development results can be accepted and completed. ATP is one of the essential stages because it is the main step in the risk management of a project.
Goods Receipt (GR): Goods Receipt is the stage at which the company has received the goods and/or services that were ordered using a purchase order (PO) [11]. GR is not always at the end of the project; it depends on the vendor contract. Some vendors also have GR stages in the middle of the project, for example, a goods receipt when the materials have been received.
Therefore, based on the problems that have been identified and formulated, this research focuses on how researchers can determine the category of project health in RAN projects at PT XYZ based on project baseline using the K-Means clustering algorithm.

Literature Study
Literature studies were conducted to understand the important terms that were further examined at the research stage. We conducted observations and searches for books, journals, and other research publications related to the research topics and problems to obtain information that can support the research. With this basic knowledge, it is hoped that we can proceed to the next stage of research and complete it properly.

Data Acquisition
In this study, the data used and processed in the data modeling were obtained from PT XYZ's website-based Project Management Information System (PMIS), which is managed by the project team. The data acquired are part of PT XYZ's 2022 Radio Access Network (RAN) project.

Data Preprocessing
The input data in a machine learning process has its own standard and structure, which depend on the problem and the machine learning task to be performed. Data preprocessing is a pivotal step in both data analytics and machine learning. However, it is crucial to understand that the preprocessing performed for data analytics is significantly different from that of machine learning [12]. Missing data and noise are some of the issues addressed by data cleaning [13]. Data preprocessing is performed when the obtained dataset still contains data that need to be cleaned. We also carried out this stage to minimize errors before entering the data modeling process. One commonly used method is label encoding, which encodes all categories into numeric labels so that they can be processed in the clustering stage [14].
In this study, label encoding is required for columns that have text as their data type and that consist of several categorical values. In this stage, we converted each categorical column into numeric codes, one code per category, to facilitate the next process, namely data modeling.
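As a minimal sketch of the label encoding step, the example below encodes the three categorical columns named in this study with Scikit-Learn's LabelEncoder. The sample rows and the "NEW" build type are hypothetical illustrations, not values taken from the PT XYZ data, and the study does not state which encoder implementation was used.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample rows; the real PMIS data are not reproduced here.
df = pd.DataFrame({
    "Region": ["EAST", "WEST", "EAST", "WEST"],
    "NE Type": ["L900", "U2100", "L900", "L900"],
    "Build Type": ["EXISTING", "NEW", "EXISTING", "EXISTING"],
})

# Keep one fitted encoder per column so the codes can be mapped back later.
encoders = {}
for col in ["Region", "NE Type", "Build Type"]:
    enc = LabelEncoder()
    df[col] = enc.fit_transform(df[col])  # categories become integer codes
    encoders[col] = enc

print(df)
print({col: list(enc.classes_) for col, enc in encoders.items()})
```

LabelEncoder assigns codes in sorted order of the category names, so the mapping is reproducible, but the numeric order carries no meaning of its own.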
Feature engineering: Features are numerical representations of an aspect of the raw data. Features sit between the data and the model in the machine learning pipeline. The number of features in the data is also essential to machine learning. If the data do not have enough informative features, the model may not be able to perform the final task. If there are many features but they are not relevant, the model becomes more expensive and difficult to train. Feature engineering is the activity of extracting features from previously acquired raw data and transforming them into a form that is suitable for machine learning models. When it is appropriate to the data, problem, and research objectives, feature engineering can enable machine learning to produce higher quality output [15].
In this stage, we extracted new columns from the existing columns in the raw data and then transformed the columns into a form that was suitable for machine learning models. After preprocessing the data, we conducted experiments under two conditions. In the first experiment, clustering was conducted using the standard K-Means algorithm. In the second experiment, we reduced the data dimension using the Principal Component Analysis (PCA) method and then continued the clustering process using the K-Means algorithm. These two experiments were conducted to determine the effect of PCA on the clustering process using the K-Means algorithm.

Dimensionality Reduction using Principal Component Analysis (PCA)
Dimensionality reduction is a process that reduces the representation of high-dimensional data to lower dimensions while maintaining the main components of the data [5]. The main goal of PCA is to reduce the dimensionality of a dataset that has a large number of interrelated variables while retaining as much of the variation in the data as possible. This reduction is achieved by transforming the variables into a new set of variables called principal components (PCs) that are uncorrelated and ordered so that the first few retain most of the variation in the overall variables [16]. In general, the dimension reduction stage with PCA is described as follows:
Calculate the covariance matrix: The main goal of reducing the dimensions is to obtain the main components without eliminating the data characteristics. In the process of obtaining the principal components, we need a covariance matrix to determine the correlation between the columns.
The simple PCA approach is as follows. Suppose there are data samples $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{d \times n}$, where each sample is in column vector form, with the covariance matrix $C$ defined as in Formula (1).
$\bar{x}$ is the mean or average of the samples. After that, a low-dimensional basis that covers most of the data variance can be found by extracting the most significant eigenvectors from the covariance matrix $C$.
Eigenvalue Decomposition: The covariance matrix obtained in the previous stage was used for the Eigenvalue calculation. An Eigenvalue is a constant that indicates how well a feature or attribute represents the overall set of attributes. The Eigenvalue formula is represented in Formula (2).
We performed dimension reduction on the data using the principal component analysis (PCA) method to compare the performance of the K-Means clustering model on data with PCA and without PCA (standard K-Means).
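The PCA stages described in this section (covariance matrix, eigenvalue decomposition, sorting, and projection) can be sketched with NumPy as follows. This is an illustrative sketch rather than the implementation used in the study: the function name pca_eig and the synthetic data are hypothetical, samples are stored as rows (the usual dataframe layout) rather than as columns, and np.cov normalizes by n - 1, which may differ from the normalization in Formula (1).

```python
import numpy as np

def pca_eig(X, n_components):
    """Project the rows of X onto the top principal components."""
    X_centered = X - X.mean(axis=0)            # subtract the sample mean
    cov = np.cov(X_centered, rowvar=False)     # (d, d) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # sort from largest to smallest
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    W = eigvecs[:, :n_components]              # (d, n_components) eigenvector matrix
    return X_centered @ W, eigvals, W          # new dataset, eigenvalues, basis

# Synthetic data standing in for the project dataset (an assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

Z, eigvals, W = pca_eig(X, n_components=2)
print(Z.shape)                    # rows are kept, columns are reduced to 2
print(eigvals / eigvals.sum())    # share of variance carried by each component
```

Because the covariance matrix is symmetric, np.linalg.eigh is a natural choice here; the projected columns are uncorrelated and their variances equal the leading eigenvalues.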

Data Modelling using K-Means Clustering
K-Means is a clustering algorithm that groups data based on the mean vector of each cluster. The parameter K is the number of clusters, which needs to be determined before the K-Means clustering process begins. The cluster mean vector serves as the cluster prototype during the execution of the algorithm. K-Means is a type of unsupervised learning because the data have no labels and the algorithm "learns" from the similarity of the existing data. This learning process involves optimizing the cluster prototypes based on the similarity between each prototype and the individual items [17].
The K-Means algorithm is an iterative method that partitions a set of n objects into $k \geq 2$ clusters so that objects within a cluster are similar to each other and different from objects in other clusters. In general, the K-Means clustering algorithm consists of four main steps:
Step 1: Determine the desired number of clusters (K) or groups. Then, k points are randomly generated in the data space and used as the initial centroids.
Step 2: Calculate the distance from each object to all the centroids. The distance is commonly measured as the Euclidean distance. The formula for calculating the Euclidean distance between two points is given in Formula (3).
Step 3: Assign each object to the cluster whose centroid is closest.
Step 4: If there is a change in cluster membership, the process continues to the centroid calculation stage. The new centroid is calculated as the average value of the objects that are members of each cluster, and the process repeats from the second step. Otherwise, if there is no change, the K-Means algorithm stops running [18]. The K-Means algorithm is generally represented in Figure 3.
Clean and complete data indicate that the data are ready to be modeled using the K-Means clustering algorithm.
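The four steps above can be sketched in NumPy as follows. This is a minimal illustration, not the Scikit-Learn implementation used in the study: the function name kmeans is hypothetical, the initial centroids are drawn from the data points themselves (one common way to realize the random initialization in Step 1), and the two-blob data are synthetic.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-Means following the four steps described in the text."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k distinct data points as the initial centroids (assumption).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 2: Euclidean distance from every object to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 3: assign each object to the closest centroid.
        new_labels = dists.argmin(axis=1)
        # Step 4: stop when no object changes cluster, otherwise update centroids.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:               # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Two well-separated synthetic groups of 20 points each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```

The distance matrix is built with broadcasting, which keeps the sketch short; for large datasets a library implementation is the more practical choice.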
Subsequently, we started the data processing using machine learning. In the data modeling stage, clustering was performed with the K-Means algorithm using the Scikit-Learn library in the Python programming language.
In this study, the number of clusters is determined using the elbow method. First, we built a K-Means model on the clean data (without the PCA process). Then, we built a model on the clean data that had gone through the PCA process. The results of both experiments were stored in different data frames to evaluate the performance of each model.
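A hedged sketch of the elbow method with Scikit-Learn is shown below. It fits K-Means for several values of k and records the sum of squared errors, which Scikit-Learn exposes as the inertia_ attribute. The synthetic three-group data, the range of k, and the n_init and random_state settings are assumptions made for illustration; the study does not report its parameter values.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three groups, standing in for the project dataset.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 0], [0, 6]],
    cluster_std=0.5,
    random_state=0,
)

# Sum of squared errors (inertia) for each candidate number of clusters.
sse = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = model.inertia_

for k, value in sse.items():
    print(k, round(value, 1))
```

Plotting these values against k gives the line graph used to look for the elbow: the point after which adding another cluster reduces the error only slightly.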

Model Evaluation using Calinski-Harabasz Index
The Calinski-Harabasz Index is a commonly used index for evaluating the quality of clustering results [19]. Formally, cluster evaluation is defined as quantitatively evaluating clustering results. The motivation for this evaluation is that almost every clustering algorithm will find clusters even in data sets that have no natural clusters. Therefore, a validation measure is needed to determine how good the obtained clustering results are.
In this study, we built two models: the standard K-Means model and the K-Means with PCA dimensionality reduction model. To compare the performance of the two models, we used the Calinski-Harabasz index based on Formula (4). CH is the Calinski-Harabasz index, k is the number of clusters, N is the number of data points, SSB is the between-cluster sum of squares, and SSW is the within-cluster sum of squares.
$SSB = \sum_{i=1}^{k} n_i \cdot d^2(c_i, c)$ (5)
We evaluated the machine learning models using the Calinski-Harabasz Index, which shows how close the characteristics of members in the same cluster are and how far one cluster is from another.
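The index can be computed directly from the quantities named above. The sketch below assumes the standard definition CH = [SSB / (k - 1)] / [SSW / (N - k)], since Formula (4) is not legible in this copy, and compares the result with Scikit-Learn's calinski_harabasz_score. The helper name calinski_harabasz and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

def calinski_harabasz(X, labels):
    """CH index from between-cluster (SSB) and within-cluster (SSW) sums of squares."""
    N = len(X)
    clusters = np.unique(labels)
    k = len(clusters)
    overall_mean = X.mean(axis=0)
    ssb = 0.0
    ssw = 0.0
    for c in clusters:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        # Cluster size times squared distance from its centroid to the overall mean.
        ssb += len(members) * np.sum((centroid - overall_mean) ** 2)
        # Squared distances of the members to their own centroid.
        ssw += np.sum((members - centroid) ** 2)
    return (ssb / (k - 1)) / (ssw / (N - k))

# Synthetic data standing in for the project dataset (an assumption).
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

manual = calinski_harabasz(X, labels)
library = calinski_harabasz_score(X, labels)
print(manual, library)
```

A higher value means that clusters are compact relative to how far apart their centroids are, which is the sense in which the two models are compared in this study.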

Conclusion
In the last stage of the study, we analyzed the characteristics of each cluster and examined the advantages and disadvantages of the machine learning model. From the evaluation and analysis, we drew the conclusions of this study.

Data Acquisition
In this study, the dataset was obtained from PT XYZ's website-based Project Management Information System (PMIS) and covers the RAN 2022 projects. The dataset contains information on the RAN projects managed by PT XYZ and related vendors. We checked the number of rows and columns in the dataset using the shape attribute from the Pandas library: the raw data consisted of 65074 rows and 13 columns. Some examples of the data rows are listed in Table 1. The Index column in the table shows the data sequence number or index when the data are first loaded with Python.
Handling duplicate data: The RAN project data have identity attributes drawn from four columns, namely PT Index, Region, NE Type, and Build Type, which identify each RAN project running at PT XYZ. We therefore checked whether more than one row had the same values in these four attributes, so that the data were clean before being processed with machine learning. We checked for duplicates using the duplicated function from the Pandas library. The function returns a Boolean value for each row: true when an earlier row has the same PT Index, Region, NE Type, and Build Type, and false otherwise. To determine the total number of duplicates, we combined the duplicated function with the sum aggregate function, which counts the true values. This check found 4101 rows with duplicate identity attributes. We removed them with the drop_duplicates function from the Pandas library, with the subset parameter set to PT Index, Region, NE Type, and Build Type as the attributes that determine whether a row is a duplicate.
Table 4 shows the values of the Region, NE Type, and Build Type attributes before and after encoding.
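The duplicate check described above can be sketched with Pandas as follows. The five sample rows and the PO column values are hypothetical; only the four identity column names and the functions (shape, duplicated, sum, drop_duplicates with subset and inplace) come from the text, and passing subset to duplicated is an assumption about how the check was restricted to the identity columns.

```python
import pandas as pd

# Hypothetical rows; the real PMIS export has 13 columns.
df = pd.DataFrame({
    "PT Index": ["P1", "P1", "P2", "P3", "P3"],
    "Region": ["EAST", "EAST", "WEST", "EAST", "EAST"],
    "NE Type": ["L900", "L900", "U2100", "L900", "L900"],
    "Build Type": ["EXISTING", "EXISTING", "EXISTING", "EXISTING", "NEW"],
    "PO": ["2022-01-03", "2022-01-10", "2022-02-01", "2022-02-07", "2022-02-09"],
})
id_cols = ["PT Index", "Region", "NE Type", "Build Type"]

print(df.shape)  # (rows, columns) of the raw data

# True for every row whose identity columns repeat an earlier row.
n_duplicates = df.duplicated(subset=id_cols).sum()
print(n_duplicates)

# Keep the first row of each duplicate group and modify df in place.
df.drop_duplicates(subset=id_cols, inplace=True)
print(df.shape)
```

In this toy data only the second P1 row is a duplicate; the two P3 rows differ in Build Type, so both are kept.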
Feature engineering started with the creation of the get_days_between function, which returns the number of days between two stages. The function accepts two parameters: the date a project enters the first stage and the date it enters the second stage. It computes the difference between the two dates and returns the duration in days using the dt.days accessor from Pandas. After the data were preprocessed, we obtained clean data that could be processed using machine learning. However, the dataset still contained the PT Index identity column and the date column of each stage. We excluded these columns from clustering because the clustering process did not require them. To observe patterns in the data, we used the Region, Build Type, and NE Type columns and the duration columns between stages (PO-KOM, KOM-RFI, RFI-MOS, MOS-Install, Install-RFS, RFS-Unlock, Unlock-ATP, and ATP-GR).
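One possible shape of the get_days_between function is sketched below. The function name and the use of dt.days come from the text; the parameter names, the conversion with pd.to_datetime, and the sample dates are assumptions made for illustration.

```python
import pandas as pd

def get_days_between(start, end):
    """Return the number of days from `start` to `end` for two date columns."""
    # Convert both columns to datetimes, subtract, and keep whole days.
    return (pd.to_datetime(end) - pd.to_datetime(start)).dt.days

# Hypothetical stage dates for two projects.
df = pd.DataFrame({
    "PO": ["2022-01-03", "2022-03-10"],
    "KOM": ["2022-01-10", "2022-03-01"],
})
df["PO-KOM"] = get_days_between(df["PO"], df["KOM"])
print(df)
```

The second row yields a negative duration because its KOM date precedes its PO date, which is the kind of value that shows up later as a negative average PO-KOM duration.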
Table 5 shows the clean data processed by machine learning. After the preprocessing stage, we performed dimensionality reduction using the Principal Component Analysis (PCA) method because the dataset has a large number of attributes, which makes it difficult to analyze the characteristics of each resulting cluster. PCA was used to reduce the dimensionality of this dataset to its principal components while retaining as much of the variation in the data as possible.
Calculate the covariance matrix: Table 6 shows the structure of the covariance matrix for this dataset. The column names in Table 6 refer to the column names in Table 5.
For example, we used the Region attribute (a) to calculate the variance and covariance matrix. Table 7 shows an example calculation of the covariance matrix based on Formula (1).

Eigenvalue Decomposition
Based on Formula (2), we obtained eleven Eigenvalues, as shown in Table 8. Table 9 shows the rank of the Eigenvalues that have been obtained.
From Table 9, it can be observed that two principal components can be obtained from the Eigenvalues $\lambda_1$ and $\lambda_2$, with percentages of 43.69% and 34.103%, respectively. The Eigenvector that corresponds to each Eigenvalue $\lambda$ is obtained by solving $(C - \lambda I)a = 0$, where $C$ is the covariance matrix.
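The percentage attached to each Eigenvalue is its share of the total variance, that is, the Eigenvalue divided by the sum of all Eigenvalues. The sketch below illustrates this on synthetic data with eleven attributes and checks it against Scikit-Learn's PCA; the data, the scaling, and the svd_solver setting are assumptions, and the printed shares will not match the values in Table 9.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with 11 attributes of decreasing spread (an assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 11)) * np.linspace(3.0, 0.5, 11)

# Eigenvalues of the covariance matrix, sorted from largest to smallest.
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
share = eigvals / eigvals.sum()          # fraction of variance per component
print(np.round(100 * share[:2], 2))      # percentages of the first two components

# Scikit-Learn reports the same shares and performs the 11 -> 2 projection.
pca = PCA(n_components=2, svd_solver="full").fit(X)
X_reduced = pca.transform(X)
print(pca.explained_variance_ratio_, X_reduced.shape)
```

Summing the first few shares shows how much of the variation is retained after the reduction.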

Transformation of the New Dataset of PCA Results
The initial dataset had dimensions of 36086×11, whereas the Eigenvector matrix had dimensions of 11×2. Thus, we obtained a new dataset with dimensions of 36086×2, as shown in Table 10.
The created model is a machine learning model using the K-Means algorithm. We used the KMeans class from the Scikit-Learn Python library to process the data that had been acquired and cleaned. The K-Means clustering algorithm consists of five steps, as described below.
Determine the number of clusters K and randomly obtain the centroid: The number of clusters is determined using the elbow method. Figure 4 shows a line graph in which each point is the sum of squared errors of the data for a given number of clusters k.
As shown in Figure 4, an elbow was formed when the number of clusters was three. Therefore, in this study, k = 3 clusters were used.
The K-Means package initializes the centroids randomly in the first iteration. An example of the centroid of each cluster in the first iteration is presented in Table 11.
Calculate the new centroid: The new centroid was calculated using the average value of the objects that were members of each cluster. If cluster membership changes, the process repeats from the second step. Otherwise, if there is no change, the K-Means algorithm stops running [18].
Convergence checking: The K-Means algorithm continues to iterate until convergence is achieved. In this case, convergence occurs when no cluster member moves from one cluster to another.
There are two scenarios for processing the data: a standard K-Means model and a model that combines the K-Means algorithm with PCA dimensionality reduction.
Standard K-Means Model: Clusters or groups were obtained for each data row. We also displayed the number of iterations, the number of members in each cluster, and the centroid of each cluster. Figure 5 shows the clustering results using the standard K-Means algorithm with the Scikit-Learn library in Python.
To evaluate the models, we first calculated the between-cluster and within-cluster sums of squares to determine the level of difference between members of different clusters and the level of similarity between members of the same cluster. We evaluated both models, with and without PCA, using the Calinski-Harabasz Index.
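The two scenarios can be sketched side by side with Scikit-Learn as follows. The helper name fit_and_report, the synthetic data with eleven features, and the n_init and random_state settings are hypothetical; the sketch also assumes that each model is scored on the data it was fitted on, which the text does not state explicitly.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import calinski_harabasz_score

# Synthetic data with 11 features, standing in for the clean dataset.
X, _ = make_blobs(n_samples=600, n_features=11, centers=3, random_state=0)

def fit_and_report(data, name):
    """Fit K-Means with three clusters and print the quantities named in the text."""
    model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
    sizes = np.bincount(model.labels_, minlength=3)      # members per cluster
    score = calinski_harabasz_score(data, model.labels_)
    print(name, "iterations:", model.n_iter_, "sizes:", sizes, "CH:", round(score, 2))
    return model, score

# Scenario 1: standard K-Means on all 11 attributes.
std_model, std_score = fit_and_report(X, "standard")

# Scenario 2: reduce to two principal components, then cluster.
X_pca = PCA(n_components=2).fit_transform(X)
pca_model, pca_score = fit_and_report(X_pca, "with PCA")

print(std_model.cluster_centers_.shape, pca_model.cluster_centers_.shape)
```

The fitted models expose the cluster label of each row (labels_), the number of iterations (n_iter_), and the centroids (cluster_centers_), which correspond to the outputs displayed in Figures 5 and 6.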

Result Analysis
We analyzed cluster characteristics from categorical attributes such as Region, NE Type, and Build Type to numerical fields that inform durations such as PO-KOM, KOM-RFI, and so on.
Figure 8 shows that in the standard K-Means model, the members are distributed fairly evenly, with a ratio close to 1:1:1. Cluster 0 had 12256 members, cluster 1 had 13244 members, and cluster 2 had 10586 members. According to Figure 10, across all stages of project development, cluster 2 has a relatively shorter average duration than the other clusters. In contrast, cluster 0 had a negative average PO-KOM duration. This shows that projects in cluster 0 carried out the kick-off meeting (KOM) stage before the purchase order (PO).
The K-Means with PCA model also shows a distribution of cluster members with a ratio close to 1:1:1. The difference is that in the standard K-Means model, the order of clusters by number of members is cluster 1, 0, and 2, whereas in the K-Means with PCA model, the order is cluster 1, 2, and finally cluster 0. Cluster 0 had 10520 members, cluster 1 had 13189 members, and cluster 2 had 12377 members.
Cluster 0 of the K-Means model on the data resulting from dimension reduction with PCA shows that the majority of its cluster members are in the WEST Region, have NE Type U2100, and have Build Type EXISTING.Furthermore, the categorical attribute distribution of cluster 1 in Figure 12 shows that the majority of its members are in the EAST Region, have NE Type L900, and have Build Type EXISTING.Cluster 2 of the K-Means model on the data resulting from dimension reduction with PCA has the same categorical attribute characteristics as cluster 1, namely that the majority of its members are in the EAST Region, have NE Type L900, and have Build Type EXISTING.
Next, we analyzed the average duration between stages in days. Figure 13 shows that cluster 0 had a long ATP-GR duration on average. Overall, compared with the other clusters, cluster 1 members have relatively the shortest average duration between stages, as seen from the duration or aging from each stage to the next.

Conclusion
According to the findings of this study, the project health category of the RAN projects at PT XYZ was effectively determined using the K-Means clustering algorithm, yielding three clusters, namely clusters 0, 1, and 2. We also found that the PCA technique affects the implementation process and performance of K-Means clustering in determining project health in the RAN projects at PT XYZ. Data that go through the PCA dimension reduction process before K-Means clustering produce a higher Calinski-Harabasz Index (CH Index) value of 55633.12776405707.
Future research should consider several suggestions. First, the research can be extended with the latest RAN project data and a wider scope, for example, all active RAN project data, not limited by project year. Furthermore, from the information technology perspective, further exploration is needed of the parameters used in the K-Means algorithm and of tools, supporting algorithms, and other approaches that could improve the results and performance of the clustering model.

Figure 1. Flow Chart of the Research Stages

Purchases of goods or services are processed through the company's financial system and must be preceded by a purchase order given to the vendor [9].
Project Charter (PC): This step can be done in parallel with the PO and kick-off meeting. A project charter is the process of developing a document that

Figure 2. Flow Chart of the Radio Access Network Project Development (PT XYZ's Document)

a is the Eigenvector and λ is the corresponding Eigenvalue. The Eigenvalues describe the variance retained by the Eigenvectors.
Sort the Eigenvalues and obtain the Principal Component (PC): The Eigenvalues obtained in the previous step are sorted from largest to smallest. The greater the Eigenvalue, the more representative the component is of the overall data features.
Transformation of the New Dataset of PCA Results: The principal components (PCs) obtained can be used to transform the dataset into a new dataset with smaller dimensions. The PCs are the main components resulting from the PCA dimension reduction process and are representative of the other attributes. A new dataset can be obtained by multiplying the initial dataset by the Eigenvector matrix.

Figure 4. Optimal Number of Cluster Search Results

Figure 5. Model of Standard K-Means Clustering

K-Means with PCA Dimensionality Reduction Model: Similar to the standard K-Means model, we obtained clusters or groups for each data object. We displayed the cluster name, the number of iterations, the members of each cluster, and the centroid of each cluster in Figure 6.

Figure 6. Model of K-Means with PCA Dimensionality Reduction

We also visualized the clustering results using a scatter plot with three different colors, as shown in Figure 7. Different colors indicate the cluster of each data point. Clusters 0, 1, and 2 are represented by green, yellow, and blue, respectively. The symbol 'x' indicates the center point or centroid of each cluster.

Figure 7. Cluster Visualization of the K-Means with PCA Dimensionality Reduction Model

Figure 8. Cluster Members Distribution on the Standard K-Means Model

In addition to the distribution of the number of members in each cluster, we also visualized the number of members in each cluster by categorical column. As shown in Figure 9, the majority of cluster 0 members are in the EAST Region, have NE Type L900, and have Build Type EXISTING. For cluster 1, the majority of its members are in the EAST Region, have NE Type L900, and have Build Type EXISTING. Unlike clusters 0 and 1, cluster 2 members are mostly located in the WEST Region and have NE Type U2100. All three clusters have a majority of Build Type EXISTING.

Figure 9. Distribution of Categorical Columns on the Standard K-Means Model

Figure 10.

Figure 11. Cluster Members Distribution on the K-Means with PCA Model

Not much different from the standard K-Means model, the K-Means with PCA Dimensionality Reduction Model in Figure 11 also shows a fairly even distribution.

Figure 12. Distribution of Categorical Columns on the K-Means with PCA Model

Furthermore, cluster 2 is the only cluster that has a negative average PO-KOM attribute. This indicates that there are many data anomalies in that column because, by default, RAN projects go through the PO stage before the KOM.

Figure 13. Average of Duration Columns on the K-Means with PCA Model

Table 1. RAN Project Data (PT XYZ's Project Management Information System)

The drop_duplicates function was also called with the inplace parameter set to True, which modifies the initial dataframe in place to reflect the removal of duplicate values. By default, the drop_duplicates function in the Pandas library keeps the first row among duplicate rows. For example, when there are three rows of data with the same PT Index, Region, NE Type, and Build Type, only the first row remains in the dataset. Table 2 shows the data without duplicate values for PT Index, Region, NE Type, and Build Type, with a total of 60973 rows.

Table 4. Label Encoding Result

Table 5. Clean Data After Preprocessing

Table 6. Covariance Matrix Structure

Table 7. Covariance Matrix Calculation for Region Feature

Table 9. Eigenvalue Decomposition List

Table 11. Centroid on the First Iteration

Calculate the Euclidean distance from all objects to the centroid of each cluster: After obtaining the Euclidean distance to each centroid, we compared the three Euclidean distances for each row of the data. At this stage, each row is assigned to the cluster with the smallest Euclidean distance, that is, the closest centroid.