Utilization of Optical Character Recognition and Text Feature Extraction to Build a Workforce Complaint Database
Pemanfaatan Optical Character Recognition Dan Text Feature Extraction Untuk Membangun Basisdata Pengaduan Tenaga Kerja
Abstract
Examining complaints of labor violations is one of the main activities of the labor inspection section within the Department of Manpower. Inspectors examine companies alleged to have violated labor laws on the basis of a complaint letter sent by the relevant labor union or legal aid agency. Because electronic communication is now so convenient, complaint letters are often submitted directly as images through channels such as WhatsApp or email. This makes it difficult for administrative staff to recapitulate incoming complaints, because they have to read each letter and enter its data into the system manually. This research therefore set out to build a system that uses optical character recognition (OCR) and text feature extraction to enter complaint data automatically. The result is a prototype for letter input and a letter storage database that can later be used for data mining and business intelligence. OCR is implemented with the Tesseract library, while text feature extraction uses the Natural Language Toolkit (NLTK) library. In testing, the prototype achieved an accuracy of 66.7% on OCR results and 91.67% on manually typed letters.
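The pipeline described in the abstract (image to text, text to fields, fields to database) can be illustrated with a minimal sketch. It is not the authors' implementation. The field names, the regular expressions, the table layout, and the sample letter are all hypothetical, since the abstract does not list them. The paper uses NLTK for feature extraction; this sketch swaps in plain regular expressions so that it runs without downloaded corpora. The OCR step assumes the `pytesseract` wrapper, Pillow, and a Tesseract installation with Indonesian (`ind`) language data.

```python
import re
import sqlite3

# Hypothetical field patterns; the paper does not specify which fields it extracts.
FIELD_PATTERNS = {
    "letter_number": re.compile(
        r"^[ \t]*(?:Nomor|No\.?|Number)[ \t]*:[ \t]*(.+)$", re.I | re.M
    ),
    "sender": re.compile(r"^[ \t]*(?:Dari|From)[ \t]*:[ \t]*(.+)$", re.I | re.M),
    # A company name is assumed to be "PT" followed by capitalized words.
    "company": re.compile(r"\b(PT\.?[ \t]+[A-Z][\w&.]*(?:[ \t]+[A-Z][\w&.]*)*)"),
}


def ocr_letter(image_path, lang="ind"):
    """Run Tesseract on a scanned letter (assumes pytesseract and Pillow are installed)."""
    import pytesseract  # imported lazily so the rest of the sketch runs without it
    from PIL import Image

    return pytesseract.image_to_string(Image.open(image_path), lang=lang)


def extract_fields(text):
    """Return a dict of extracted fields; a field is None when no pattern matches."""
    record = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        record[name] = match.group(1).strip() if match else None
    return record


def store_complaint(conn, record, raw_text):
    """Insert one complaint into a hypothetical `complaints` table and return its row id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS complaints ("
        "id INTEGER PRIMARY KEY, letter_number TEXT, sender TEXT, "
        "company TEXT, raw_text TEXT)"
    )
    cursor = conn.execute(
        "INSERT INTO complaints (letter_number, sender, company, raw_text) "
        "VALUES (?, ?, ?, ?)",
        (record["letter_number"], record["sender"], record["company"], raw_text),
    )
    conn.commit()
    return cursor.lastrowid


# Invented sample text standing in for OCR output.
SAMPLE_TEXT = """Number: 012/SP/2020
From: Serikat Pekerja Contoh
We hereby file a complaint against PT Maju Jaya for unpaid overtime.
"""

if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    fields = extract_fields(SAMPLE_TEXT)
    store_complaint(connection, fields, SAMPLE_TEXT)
    print(fields)
```

For a real scan, `extract_fields(ocr_letter("letter.png"))` would replace the sample text, where `letter.png` is a placeholder path. Keeping the raw OCR text alongside the extracted fields lets staff correct recognition errors by hand.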
Copyright (c) 2020 Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi)
This work is licensed under a Creative Commons Attribution 4.0 International License.
Copyright in each article belongs to the author.
- The author acknowledges that Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) is the first publisher of the work, under a Creative Commons Attribution 4.0 International License.
- Authors may enter into separate arrangements for the non-exclusive distribution of the version of the manuscript published in this journal (e.g., depositing it in the author's institutional repository or publishing it in a book), with an acknowledgement that the manuscript was first published in Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi).