The Design of a C1 Document Data Extraction Application Using a Tesseract-Optical Character Recognition Engine

  • Ircham Aji Nugroho Politeknik Siber dan Sandi Negara
  • Bety Hayat Susanti Politeknik Siber dan Sandi Negara
  • Mareta Wahyu Ardyani Politeknik Siber dan Sandi Negara
  • Nadia Paramita R.A. Politeknik Siber dan Sandi Negara
Keywords: affine transformation, digital signature, automatic data entry, optical character recognition, RSA-2048, SHA-256, tesseract-OCR

Abstract

The 2019 election process used the Vote Counting Information System, Sistem Informasi Penghitungan Suara (Situng), to provide transparency in the recapitulation process. The data displayed in Situng come from the C1 documents of 813,336 voting stations in Indonesia. Officers of the Municipal General Election Commission (GEC) enter the data collected from each C1 document and upload it into Situng. Because this process is performed by humans, it is prone to error: in the recapitulation of the 2019 election results there were 269 data entry errors, and data entry also fell behind the specified target, resulting in delays. Furthermore, there were cases of C1 document modification, raising concerns about the data's authenticity. Automatic data entry is a plausible way to avoid human error and increase data entry speed. The data to be entered are text in image documents that share the same template format, so optical character recognition (OCR) can be used to read the text, while image quality and alignment are improved so that the OCR reading area is located more accurately. In this study, we developed a C1 document data extraction application following the waterfall software development life cycle (SDLC) method in a systematic and thorough process. The application was built on Tesseract, an open-source OCR engine and command-line program that recognizes text characters within a digital image. The accuracy obtained with this method is not yet sufficient for the application to replace Situng's data entry officers. To guarantee the integrity of the C1 document, we use the RSA-2048 digital signature scheme. Combining the Tesseract-OCR engine for character recognition with digital signature capabilities provides a comprehensive solution for reducing the human error that can lead to miscalculations and inaccurate processes.
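
The abstract does not describe the alignment step in detail, although the keywords mention an affine transformation. As a minimal sketch only, assuming three reference marks of the C1 template can be located in the scanned image, the following pure-Python helpers estimate an affine transform from those three point pairs and map a template reading area into scan coordinates. The function names, the mark positions, and the reading-area coordinates are all hypothetical illustrations, not the authors' implementation.

```python
"""Sketch: map a template reading area into a scanned page with an affine transform."""


def det3(m):
    """Determinant of a 3x3 matrix given as three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def estimate_affine(src, dst):
    """Solve x' = a*x + b*y + c and y' = d*x + e*y + f from three point pairs.

    src: three (x, y) reference marks on the template (assumed known).
    dst: the same three marks as found in the scanned image (assumed detected).
    Returns the coefficients (a, b, c, d, e, f).
    """
    if len(src) != 3 or len(dst) != 3:
        raise ValueError("exactly three point pairs are required")
    base = [[float(x), float(y), 1.0] for x, y in src]
    det = det3(base)
    if abs(det) < 1e-12:
        raise ValueError("reference points are collinear")

    def solve(targets):
        # Cramer's rule: replace one column at a time with the target values.
        coeffs = []
        for col in range(3):
            m = [row[:] for row in base]
            for r in range(3):
                m[r][col] = float(targets[r])
            coeffs.append(det3(m) / det)
        return coeffs

    a, b, c = solve([u for u, _ in dst])
    d, e, f = solve([v for _, v in dst])
    return (a, b, c, d, e, f)


def apply_affine(coeffs, point):
    """Map one (x, y) template point into scanned-image coordinates."""
    a, b, c, d, e, f = coeffs
    x, y = point
    return (a * x + b * y + c, d * x + e * y + f)


if __name__ == "__main__":
    # Hypothetical mark positions: the scan is scaled and shifted.
    template_marks = [(0, 0), (100, 0), (0, 200)]
    scanned_marks = [(10, 20), (210, 20), (10, 420)]
    coeffs = estimate_affine(template_marks, scanned_marks)

    # Hypothetical reading area (corner points) defined on the template.
    reading_area = [(40, 50), (60, 50), (60, 70), (40, 70)]
    print([apply_affine(coeffs, p) for p in reading_area])
```

In practice an image library routine, for example OpenCV's `getAffineTransform` and `warpAffine`, would typically perform this step on the pixels themselves, and the aligned reading area would then be passed to the OCR engine. The sketch only shows the coordinate arithmetic behind that idea.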

Published
2024-02-04
How to Cite
Ircham Aji Nugroho, Susanti, B. H., Mareta Wahyu Ardyani, & Nadia Paramita R.A. (2024). The Design of a C1 Document Data Extraction Application Using a Tesseract-Optical Character Recognition Engine. Jurnal RESTI (Rekayasa Sistem Dan Teknologi Informasi), 8(1), 42 - 53. https://doi.org/10.29207/resti.v8i1.5151
Section
Information Systems Engineering Articles