High Scalability Document Clustering Algorithm Based On Top-K Weighted Closed Frequent Itemsets
Abstract
Document clustering based on frequent itemsets is a relatively new approach that aims to overcome the curse of dimensionality caused by the large number of items in the documents being clustered. Maximum Capturing (MC) is a frequent-itemset-based document clustering algorithm that produces better clustering quality than other similar algorithms. However, because MC relies on frequent itemsets, it still has several weaknesses: redundant items that can still cause the curse of dimensionality, difficulty in determining a suitable minimum support value for the set of documents to be clustered, and no weighting of the items in the resulting frequent itemsets. To address these weaknesses, this research develops a document clustering algorithm based on weighted top-k closed frequent itemsets, called the Weighted Maximum Capturing (WMC) algorithm. The proposed algorithm uses the frequent pattern tree algorithm to mine closed frequent itemsets from a set of documents without requiring a minimum support value to be specified. Experimental results showed an improvement in clustering accuracy: the average F-measure was 0.713 and the average purity was 0.721, with improvement ratios of 1.4% for F-measure and 2% for purity. The scalability test showed a much larger improvement: the WMC algorithm required an average computing time of 623.77 minutes, 518.05 minutes less than the average computing time required by the MC algorithm.
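To make the notion of top-k closed frequent itemsets concrete, the following is a minimal sketch on toy data. It is not the WMC implementation: it enumerates candidate itemsets by brute force instead of using a frequent pattern tree, and it omits item weighting, which the abstract does not specify. The function name `top_k_closed_itemsets`, the parameter `min_len`, and the sample documents are assumptions made for illustration. An itemset is treated as closed when it equals the intersection of all documents that contain it, so no proper superset has the same support; ranking by support and keeping the first k entries avoids choosing a minimum support value in advance.

```python
from itertools import combinations


def top_k_closed_itemsets(docs, k, min_len=1):
    """Return the k closed itemsets with the highest support.

    docs: iterable of term collections (one per document).
    Brute force over the vocabulary, so only suitable for toy inputs.
    """
    docs = [frozenset(d) for d in docs]
    vocab = sorted(set().union(*docs))
    closed = {}
    for size in range(min_len, len(vocab) + 1):
        for combo in combinations(vocab, size):
            items = frozenset(combo)
            # Documents that contain every term of the candidate itemset.
            covering = [d for d in docs if items <= d]
            if not covering:
                continue
            # The closure is the set of terms shared by all covering documents.
            closure = frozenset.intersection(*covering)
            if closure == items:
                closed[items] = len(covering)
    # Rank by support (descending), then by length (descending),
    # then alphabetically so that ties are resolved deterministically.
    ranked = sorted(
        closed.items(),
        key=lambda kv: (-kv[1], -len(kv[0]), sorted(kv[0])),
    )
    return ranked[:k]


# Hypothetical toy corpus: each document is reduced to its set of terms.
DOCS = [
    {"data", "mining", "cluster"},
    {"data", "mining", "text"},
    {"data", "cluster"},
    {"text", "mining"},
]

for itemset, support in top_k_closed_itemsets(DOCS, k=3):
    print(sorted(itemset), support)
```

In this toy corpus, `{"cluster"}` is frequent but not closed, because every document containing "cluster" also contains "data"; only `{"cluster", "data"}` is kept. Discarding such redundant itemsets is what reduces the number of items carried into the clustering step.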
Copyright (c) 2021 Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi)

This work is licensed under a Creative Commons Attribution 4.0 International License.
Copyright in each article belongs to the author.
- The author acknowledges that Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) is the first publisher of the article, under a Creative Commons Attribution 4.0 International License.
- Authors may enter into separate arrangements for the non-exclusive distribution of the published version of the manuscript (e.g., depositing it in the author's institutional repository or publishing it in a book), with an acknowledgement that the manuscript was first published in Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi).