K-Means Clustering Algorithm Approach in Clustering Data on Cocoa Production Results in the Sumatra Region

Cocoa production in Indonesia is currently very low while demand continues to increase every year, so it is important to build a model that can categorize cocoa farming data. The main objective of this research is to analyze agricultural data using data mining techniques, specifically the K-Means Clustering algorithm and Gaussian Mixture Models. We use a quantitative approach because the study relies on number-based measurements. Cocoa production still depends mainly on land area, and the number of cocoa trees also has a significant effect on the amount produced. It is therefore important for the government and researchers to develop technologies that can increase cocoa yields, since worldwide demand for cocoa is currently very high. A clustering model is useful here because it can classify cocoa quality from good to poor. Testing the K-Means Clustering and Gaussian Mixture Model algorithms on cocoa production data from four provinces, namely North Sumatra, West Sumatra, Lampung and Aceh, with the number of clusters optimized by the Silhouette method, produced cluster values of 2, 3 and 4. second with a value of 59.8%.


Introduction
Indonesia is one of the cocoa-producing countries, but among the top five cocoa-producing countries its production growth has decreased by an average of 2.01% over the last ten years [1]. Meanwhile, demand for the product is growing at 3% per year, and the top five cocoa-producing countries supply more than 95% of world cocoa demand. The low growth of cocoa production and the low productivity of cocoa farms in Indonesia have several causes, such as farmers' limited knowledge of cocoa cultivation and management, marketing problems, and others [1]. It is therefore important to understand the trend of cocoa production, in particular by clustering data on farmers and their production, in order to produce accurate information that is useful for decision making.
In recent years, data mining has been widely used to extract useful information from data sets. One family of data mining methods is clustering, which includes K-Means Clustering, DBSCAN, hierarchical clustering, fuzzy c-means and others. Cluster analysis groups records so that data within a cluster are as similar as possible and as different as possible from data in other clusters. Clustering supports in-depth interpretation, for example by indicating which groups of records should be targeted as most likely to be of interest [2]. Partition clustering algorithms, such as K-Means, assign objects to k clusters (k being a predetermined number) and reallocate objects iteratively to improve the quality of the clustering result [3].
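The iterative reallocation described above can be illustrated with a minimal NumPy sketch of the standard K-Means (Lloyd) iteration. This is an illustration only, not the implementation used in this study; the function name `kmeans`, its parameters and the toy (area, production) values are assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd iteration: assign points, then move centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data: two well-separated groups of (area, production) pairs.
X = np.array([[0.5, 300.0], [0.6, 320.0], [0.4, 280.0],
              [2.0, 1500.0], [2.2, 1600.0], [1.9, 1450.0]])
labels, centroids = kmeans(X, k=2)
```

In practice the features would be scaled first, since unscaled Euclidean distance is dominated by the feature with the largest numeric range.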
Hierarchical clustering algorithms assign objects to tree-structured clusters, i.e., a cluster can contain data points from lower-level clusters. The idea of density-based clustering is that, for each point of a cluster, the neighborhood within a given distance must contain at least a minimum number of points, i.e., the density in the neighborhood must reach some threshold. Cluster analysis is applied in many fields, such as customer data clustering [2], [4]-[6], crop productivity mapping [7], agricultural data [3], palm oil production results [8], fruit yield grouping [9], grouping [10] and others. Clustering techniques such as K-Means are widely applied by many researchers, including [11], for data grouping, data mapping, data classification, and so on [12]-[15].
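The density condition can be made concrete with a short NumPy sketch that marks which points satisfy it. It covers only the neighborhood test, not a full density-based algorithm such as DBSCAN; the names `core_point_mask`, `eps` and `min_pts` are assumptions chosen for the example.

```python
import numpy as np

def core_point_mask(X, eps, min_pts):
    """Mark points whose eps-neighborhood holds at least min_pts points."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Each point counts itself as a neighbor (distance 0 <= eps).
    neighbor_counts = (dists <= eps).sum(axis=1)
    return neighbor_counts >= min_pts

# Three close points and one isolated point.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
mask = core_point_mask(X, eps=0.2, min_pts=3)
```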
In this paper, we apply clustering algorithms in the field of agriculture [16]-[18], specifically to mapping cocoa farming data in four provinces in Indonesia, namely North Sumatra, West Sumatra, Lampung and Aceh. The source dataset is a set of survey data that contains information about farmers, land and production. The K-Means Clustering and Gaussian Mixture Model algorithms are used to cluster the cocoa production data [19]-[21].

Research Methods
This research uses a quantitative approach because it relies on number-based measurements. Quantitative research investigates social problems by testing a theory composed of variables that are measured numerically and analyzed with statistical procedures, in order to determine whether the predictive generalizations of the theory hold. The dataset was obtained from surveys and interviews of cocoa farmers in four provinces of Sumatra, namely Aceh, North Sumatra, West Sumatra and Lampung.
The data come from the Swisscontact program, namely the farmer database of the Sustainable Cocoa Production Program (SCPP) for the Sumatra Region, Batch I 2017. It contains 1,492 records for North Sumatra Province, 4,594 for West Sumatra Province, 4,480 for Aceh Province and 2,007 for Lampung Province, for a total of 12,573. Table 1 shows part of the dataset. The research procedure is designed to obtain information relevant to processing this dataset. It begins with collecting the dataset (training data) from the SCPP Sumatra Region Batch I 2017 data, followed by a pre-processing stage. Pre-processing covers data cleaning (handling missing values, smoothing noisy data, identifying and removing outliers, and resolving inconsistencies), integration of data from several databases, and data transformation in the form of normalization and aggregation.
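Two of the pre-processing steps, outlier removal and normalization, can be sketched as follows. The text does not state which outlier rule or normalization was used, so the interquartile-range (IQR) rule and min-max scaling below are assumptions, as are the function names and the toy production values.

```python
import numpy as np

def remove_outliers_iqr(x, factor=1.5):
    """Keep values inside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    keep = (x >= q1 - factor * iqr) & (x <= q3 + factor * iqr)
    return x[keep]

def min_max_normalize(x):
    """Rescale values linearly to the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

# Hypothetical production values with one implausible record.
production = np.array([300.0, 320.0, 280.0, 310.0, 290.0, 5000.0])
cleaned = remove_outliers_iqr(production)
scaled = min_max_normalize(cleaned)
```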
Next, Principal Component Analysis (PCA) is carried out. Its purpose is to simplify the dataset through a linear transformation that forms a new coordinate system whose axes capture maximum variance. This is followed by feature selection, which picks the subset of existing features that is relevant to the problem, without transforming or combining features, in order to improve predictive capability. A further pre-processing step then works on the raw features so that the amount of data retained is appropriate and not always large.
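The linear transformation behind PCA can be sketched with NumPy's singular value decomposition. This is a generic illustration under the assumption of mean-centered data, not the study's own code; the function name `pca` and the synthetic data are made up for the example.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance_ratio = (S ** 2) / np.sum(S ** 2)
    return X_centered @ components.T, explained_variance_ratio[:n_components]

# Synthetic data: three columns that are almost exact multiples of one signal.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + 0.01 * rng.normal(size=200), -t])
projected, ratio = pca(X, n_components=1)
```

Here a single component captures nearly all of the variance, which is the sense in which the new coordinate system simplifies the dataset.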
The data obtained is then segmented so that subsets of the data can be separated and analyzed by segment. The last step is evaluation, which presents the information patterns produced by the data mining process in a form that is easily understood by interested parties.

Result and Discussion
In this section we describe the results of analyzing and clustering cocoa production data in Indonesia, specifically in the provinces of North Sumatra, West Sumatra, Lampung and Aceh. We observe the statistical description of the data set, consider the relevance of each feature, and select a few sample data points to track throughout the analysis. Figure 1 shows the collected cocoa production data from the four provinces. The data set comes from survey data collected in 2017 and has 12,522 rows and 23 columns in total. The data are then checked for missing or null values, which is important so that no errors occur during clustering. In the data set, 0.1% of the values in the Productivity column and 3.5% of the values in the ShadeTreesNr column are missing, and the affected rows are deleted. After removing the missing data, 12,053 rows remain for further analysis.
Figure 2 shows cocoa production in the four provinces: Aceh has the largest annual cocoa production, followed by West Sumatra, North Sumatra and finally Lampung. Cocoa production is strongly influenced by the plantation area in each province, as shown in Figure 3. Aceh has the most extensive plantations among the four provinces, so it is natural that it has the highest cocoa production. Figure 3 also shows that plantation area still dominates cocoa production in every province. This is to be expected, but in the current era of technological development, technology could be used to increase production on a limited plantation area. From this it can be concluded that cocoa farmers in Indonesia currently still use traditional techniques.
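The missing-value check can be sketched with pandas. The column names Productivity and ShadeTreesNr come from the data set described above, while the values in this small frame are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the column names follow the data set.
df = pd.DataFrame({
    "Productivity": [450.0, np.nan, 520.0, 610.0],
    "ShadeTreesNr": [12.0, 8.0, np.nan, 20.0],
    "Production": [300.0, 280.0, 410.0, 500.0],
})

# Percentage of missing values per column.
missing_pct = df.isna().mean() * 100

# Drop every row with a missing value in either column.
cleaned = df.dropna(subset=["Productivity", "ShadeTreesNr"])
```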
Many aspects of the data remain undescribed because the data are available only for 2017, so the year-on-year development of cocoa production cannot be analyzed.
Next, we describe the results of clustering the cocoa production data of the four provinces with the K-Means Clustering and Gaussian Mixture Model (GMM) algorithms. For both algorithms, the Silhouette method is used to determine the optimal number of clusters, which is then applied. Before applying the clustering algorithms, several models were tried in order to determine the optimal features. These experiments identified nine features that have an impact on the analysis, namely 'FarmerID', 'CacaoAge', 'GardenDistance', 'Cocoa Ha', 'Production', 'Productivity', 'Trees', 'Trees Ha' and 'Tree_Productivity'. The correlation between these nine features is analyzed in the form of a heatmap. A correlation heatmap is a graphical representation of a correlation matrix, in which the correlation between each pair of variables takes a value from -1 to 1; the closer the value is to 1, the stronger the positive correlation between the two variables.
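The correlation matrix underlying such a heatmap can be computed as in the sketch below. It assumes the Pearson correlation coefficient, which the text does not state explicitly, and uses three of the feature names with invented values; plotting the matrix as a heatmap is left out.

```python
import numpy as np

# Three of the nine features, with hypothetical values (one row per farmer).
features = ["Cocoa Ha", "Trees", "Production"]
data = np.array([
    [0.5, 400, 300],
    [1.0, 800, 610],
    [1.5, 1150, 880],
    [2.0, 1600, 1250],
], dtype=float)

# rowvar=False treats each column as one variable.
corr = np.corrcoef(data, rowvar=False)

# Print each feature pair with its correlation value.
for i in range(len(features)):
    for j in range(i + 1, len(features)):
        print(f"{features[i]} vs {features[j]}: {corr[i, j]:.3f}")
```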
Figure 4 shows the correlation heatmap of the nine variables. The variables Cocoa Ha, Production, Productivity, Trees and Tree_Productivity have the correlation values closest to 1, so these five variables are used as model features in the K-Means Clustering and GMM algorithms. Before applying K-Means Clustering to the cocoa production data, the value of K is determined using the Silhouette method. Figure 5 shows the results of the Silhouette analysis: K = 2 gives the highest score, 0.37, while K = 3 and K = 4 each score 0.36. Although K = 2 scores highest, the difference from K = 3 and K = 4 is small, so all three values are applied in this study. The K-Means clustering results for the four provinces are shown in Figure 6. For K = 2, the result indicates that production is still highly dependent on plantation area, with the remainder related to the number of cocoa farmers. For K = 3, the data split into 18% in cluster 0, 52% in cluster 1 and 30% in cluster 2, which indicates that cocoa yield depends on the number of trees. Finally, K = 4 gives 45.5%, 11.7%, 14.6% and 28.2% in the respective clusters. The final clustering results are presented in the appendix.
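The selection of K by Silhouette score, and the cluster shares reported as percentages, can be sketched with scikit-learn. The data below are synthetic blobs standing in for the five selected features, and the standardization step, the range of K and the random seeds are assumptions for the example rather than settings taken from this study.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the SCPP data): three compact groups.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Mean Silhouette score for each candidate K; higher is better.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)

# Refit with the chosen K and report each cluster's share of the data in percent.
final_labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X_scaled)
shares = np.bincount(final_labels) / len(final_labels) * 100
```

When several values of K score almost equally, as reported for K = 2, 3 and 4, comparing the resulting cluster shares for each K is a reasonable way to inspect the alternatives.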
The Gaussian Mixture Model (GMM) is applied next to the cocoa production data of the four provinces. As described above, the GMM is also evaluated using the Silhouette method to determine the optimal number of components. The results of the Silhouette analysis can be seen in Figure 7. The resulting cluster labels are attached to the cocoa production data, as shown in Figure 8. Figure 8 shows the result of clustering cocoa production with the GMM algorithm, and it can be seen that the results differ from the K-Means Clustering results.
Testing the K-Means Clustering algorithm and the Gaussian Mixture Model on cocoa production data in the four provinces, namely North Sumatra, West Sumatra, Lampung and Aceh, with the number of clusters optimized by the Silhouette method, resulted in cluster values of 2, 3 and 4. From these results it can be concluded that cocoa production still depends on land area, and that the number of cocoa trees has a significant effect on the amount of production. It is therefore important for the government and researchers to develop technology that can increase cocoa production, since worldwide demand for cocoa is currently very high.
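A GMM evaluated by Silhouette score can be sketched in the same way with scikit-learn. As before, the synthetic blobs, the range of component counts and the random seed are assumptions for illustration and do not reproduce the study's data or settings.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data (not the SCPP data): two separated groups.
X, _ = make_blobs(n_samples=200, centers=[[0, 0], [8, 8]],
                  cluster_std=1.0, random_state=0)

# Silhouette score of the hard assignments for each number of components.
scores = {}
for k in range(2, 5):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    scores[k] = silhouette_score(X, gmm.predict(X))
best_k = max(scores, key=scores.get)

# Refit with the chosen number of components.
gmm = GaussianMixture(n_components=best_k, random_state=0).fit(X)
hard_labels = gmm.predict(X)        # one component index per record
soft_labels = gmm.predict_proba(X)  # membership probability per component
```

Unlike K-Means, a GMM gives each record a probability of belonging to every component, which is one reason its cluster assignments can differ from the K-Means result on the same features.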

Conclusions
Based on the results of testing the K-Means Clustering algorithm and the Gaussian Mixture Model on cocoa production data in four provinces, namely North Sumatra, West Sumatra, Lampung and Aceh, with the number of clusters optimized by the Silhouette method, the resulting cluster values are 2, 3 and 4. Cocoa production still depends on land area, and the number of cocoa trees has a significant effect on the amount of production. It is therefore important for the government and researchers to develop technology that can increase cocoa production, since worldwide demand for cocoa is currently very high.