Analysis of Stylometric Features and Segmentation Strategies in Intrinsic Plagiarism Detection System
Abstract
Plagiarism detection follows two paradigms, which give rise to External Plagiarism Detection (EPD) and Intrinsic Plagiarism Detection (IPD) systems. EPD is the more commonly applied of the two; its algorithm makes a heuristic comparison between a suspicious document and the documents in a corpus. In contrast, an IPD algorithm is given only the suspicious document and should find the plagiarized sections by looking for text segments whose writing style differs from the rest. Previous research on Indonesian texts has addressed only the development of EPD systems. This research therefore contributes experiments on, and an analysis of, stylometric features and segmentation strategies for building an IPD system for Indonesian texts. The experimental results show that the paragraph segment performs better, scoring 0.92 for macro-averaged accuracy and 0.54 for macro-averaged F1. The stylometric features achieving the highest F1 and accuracy scores are the frequency of punctuation, the average paragraph length, and the type-token ratio.
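As an illustration of the three features named above, the following is a minimal sketch of computing them per paragraph segment. It is not the authors' implementation: the function names, the blank-line paragraph splitter, the regex tokenizer, and the normalization choices (punctuation marks per character, paragraph length in word tokens) are all assumptions made for this example.

```python
import re
import string


def split_paragraphs(text):
    """Split a document into paragraph segments on blank lines (assumed convention)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def stylometric_features(paragraph):
    """Compute three simple stylometric features for one paragraph segment."""
    # Assumed tokenizer: lowercase runs of word characters.
    tokens = re.findall(r"\w+", paragraph.lower())
    n_tokens = len(tokens)
    # Count ASCII punctuation marks only (an assumption of this sketch).
    n_punct = sum(1 for ch in paragraph if ch in string.punctuation)
    return {
        # Punctuation marks per character of the segment.
        "punctuation_freq": n_punct / len(paragraph) if paragraph else 0.0,
        # Segment length measured in word tokens.
        "paragraph_length": n_tokens,
        # Distinct word types divided by total word tokens.
        "type_token_ratio": len(set(tokens)) / n_tokens if n_tokens else 0.0,
    }


def document_features(text):
    """Return one feature dictionary per paragraph segment of the document."""
    return [stylometric_features(p) for p in split_paragraphs(text)]
```

In an intrinsic setting, a segment whose feature values deviate strongly from those of the other segments in the same document would be a candidate for closer inspection; how that deviation is measured and thresholded is left open here.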
Copyright (c) 2020 Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi)
This work is licensed under a Creative Commons Attribution 4.0 International License.