Utilization of Optical Character Recognition and Text Feature Extraction to Build a Workforce Complaint Database

Pemanfaatan Optical Character Recognition Dan Text Feature Extraction Untuk Membangun Basisdata Pengaduan Tenaga Kerja

  • Yan Puspitarani, Universitas Widyatama
  • Yenie Syukriyah
Keywords: OCR, text feature extraction, database

Abstract

Examining complaints about labor violations is one of the main activities of the labor inspection section within the Department of Manpower. Inspectors examine companies suspected of violating labor laws on the basis of complaint letters sent by the relevant labor union or legal aid agency. Because electronic communication is now so convenient, complaint letters are often submitted directly as images through channels such as WhatsApp or email. This makes it difficult for administrative staff to recapitulate incoming complaints, because they must read each letter and enter its data into the system manually. This research therefore set out to build a system that uses OCR and text feature extraction to enter complaint data automatically. The result is a prototype for letter input and a letter storage database that can later be used for data mining and business intelligence. OCR is implemented with the Tesseract library, and text feature extraction with the Natural Language Toolkit (NLTK) library. In testing, the prototype achieved an accuracy of 66.7% on OCR results and 91.67% on manually typed letters.
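The pipeline described in the abstract (image to text, text to fields, fields to database) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The OCR step assumes the pytesseract wrapper, Pillow, and a Tesseract install with the Indonesian language data (`ind`). The extraction step swaps in plain regular expressions where the paper uses NLTK, so that the example runs with the standard library alone. The field names, letter labels (`Nomor`, `Perihal`), table schema, and sample letter are all hypothetical, and the accuracy helper is only one plausible way to score fields, since the abstract does not say how accuracy was computed.

```python
import re
import sqlite3


def ocr_image(path, lang="ind"):
    """OCR step (assumption: pytesseract, Pillow and Tesseract are installed)."""
    from PIL import Image
    import pytesseract
    return pytesseract.image_to_string(Image.open(path), lang=lang)


# Hypothetical rules; the paper does not list the fields it extracts.
FIELD_PATTERNS = {
    "letter_number": re.compile(
        r"^[ \t]*(?:Nomor|No)[ \t]*[.:][ \t]*(.+?)[ \t]*$", re.I | re.M),
    "subject": re.compile(
        r"^[ \t]*(?:Perihal|Hal)[ \t]*[.:][ \t]*(.+?)[ \t]*$", re.I | re.M),
    # Company names are assumed to start with "PT" followed by capitalized words.
    "company": re.compile(r"\b(PT\.?[ ]+[A-Z][\w.]*(?:[ ]+[A-Z][\w.]*)*)"),
}


def extract_fields(text):
    """Return one value per field, or None when no rule matches."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


def save_complaint(conn, fields, raw_text):
    """Store the extracted fields and the full letter text in SQLite."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS complaints ("
        "letter_number TEXT, subject TEXT, company TEXT, raw_text TEXT)"
    )
    conn.execute(
        "INSERT INTO complaints VALUES (?, ?, ?, ?)",
        (fields["letter_number"], fields["subject"], fields["company"], raw_text),
    )
    conn.commit()


def field_accuracy(predicted, expected):
    """Fraction of expected fields whose extracted value matches exactly."""
    correct = sum(predicted.get(key) == value for key, value in expected.items())
    return correct / len(expected)


# Invented sample letter, standing in for OCR output or a typed letter.
SAMPLE_LETTER = """SERIKAT PEKERJA CONTOH
Nomor : 012/SP/VIII/2020
Perihal: Pengaduan pelanggaran upah lembur

Dengan hormat,
Kami melaporkan PT Maju Jaya atas dugaan pelanggaran upah lembur.
"""

if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    extracted = extract_fields(SAMPLE_LETTER)
    save_complaint(connection, extracted, SAMPLE_LETTER)
    print(extracted)
```

For a scanned letter, the text passed to `extract_fields` would come from `ocr_image("letter.png")` (a hypothetical file name) instead of `SAMPLE_LETTER`. Keeping the raw text next to the extracted fields lets staff correct fields that OCR errors caused the rules to miss.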

 

Published
2020-08-17
How to Cite
Yan Puspitarani, & Yenie Syukriyah. (2020). Utilization of Optical Character Recognition and Text Feature Extraction to Build a Workforce Complaint Database. Jurnal RESTI (Rekayasa Sistem Dan Teknologi Informasi), 4(4), 704-710. https://doi.org/10.29207/resti.v4i4.2107
Section
Information Systems Engineering Articles