Antlion Optimizer Algorithm Modification for Initial Centroid Determination in K-means Algorithm

  • Nanang Lestio Wibowo, Universitas Dian Nuswantoro
  • Moch Arief Soeleman, Universitas Dian Nuswantoro
  • Ahmad Zainul Fanani, Universitas Dian Nuswantoro
Keywords: clustering, k-means, centroid, antlion optimizer, sum of intra-cluster distances, initial center

Abstract

Clustering is a data-grouping technique used in data mining. K-means is one of the most popular clustering algorithms because it is easy to use and fast at clustering data. The k-means method groups data into k clusters based on distance and randomly selects the initial centroids as the reference for the process. Careless selection of these centroids can lead to poor clustering and convergence to local optima. One way to improve initial centroid determination in k-means is to apply an optimization method. In this study, a modified Antlion Optimizer (ALO) is used to remedy the poor clustering caused by random initial centroid selection and to serve as an alternative way of determining the initial centroids in k-means for better clustering results. The results show that the proposed method for determining the initial centroids improves clustering compared with the standard k-means and k-means++ methods. This is evidenced by evaluating the sum of intra-cluster distances (SICD) on the UCI iris, wine, glass, ecoli, and cancer datasets: on every dataset the best SICD value was obtained by the proposed method. Ranking the best SICD value of each method on each dataset shows that the proposed method ranks first on the iris, wine, and cancer datasets and ties for first with k-means++ on the ecoli and glass datasets. With the best average rank overall, the proposed method improves clustering results and can serve as an alternative for determining the initial cluster centers in the k-means method.
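
As a rough illustration of the approach described in the abstract, the sketch below shows how k-means can be seeded with externally chosen initial centroids and then scored with SICD. It is not the authors' implementation: the modified ALO search is replaced by a hypothetical random-sampling placeholder, and the sicd helper and the use of scikit-learn's KMeans are assumptions made only for this example.

```python
# Minimal sketch: seed k-means with chosen initial centroids and evaluate SICD.
# The "optimizer" step is a placeholder; a real ALO search would propose and
# refine candidate centroid sets before handing the best one to k-means.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

def sicd(X, labels, centers):
    """Sum of intra-cluster distances: total distance of each point to its centroid."""
    return sum(np.linalg.norm(X[labels == i] - c, axis=1).sum()
               for i, c in enumerate(centers))

X = load_iris().data
k = 3

# Hypothetical stand-in for the modified ALO search over candidate centroids:
# here we simply draw k distinct data points as the "optimized" initial centroids.
rng = np.random.default_rng(0)
init_centroids = X[rng.choice(len(X), size=k, replace=False)]

# Seed k-means with the chosen centroids (n_init=1 keeps exactly this seeding).
km = KMeans(n_clusters=k, init=init_centroids, n_init=1).fit(X)
print("SICD:", sicd(X, km.labels_, km.cluster_centers_))
```

A smaller SICD indicates tighter clusters, which is why it is used here as the comparison metric between seeding strategies.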


Published
2023-08-12
How to Cite
Nanang Lestio Wibowo, Moch Arief Soeleman, & Ahmad Zainul Fanani. (2023). Antlion Optimizer Algorithm Modification for Initial Centroid Determination in K-means Algorithm. Jurnal RESTI (Rekayasa Sistem Dan Teknologi Informasi), 7(4), 870 - 883. https://doi.org/10.29207/resti.v7i4.4997
Section
Information Systems Engineering Articles