Folk Games Image Captioning Using Object Attention
Abstract
The performance of a deep learning-based image captioning system built on the encoder-decoder framework depends heavily on the image feature extraction technique and the caption generation model, and the model's accuracy is strongly influenced by the attention mechanism used. A mismatch between the output of the attention model and the input expected by the decoder can cause the decoder to produce incorrect results. In this paper, we propose an object-attention mechanism based on object detection. The object detector outputs a bounding box and an object category label, which are then used as image input to VGG16 for feature extraction and to an LSTM-based caption model. The experimental results show that the system with object attention outperforms the system without it: the BLEU-1, BLEU-2, BLEU-3, BLEU-4, and CIDEr scores of the system with object attention improved by 12.48%, 17.39%, 24.06%, 36.37%, and 43.50%, respectively, over the system without object attention.
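For concreteness, the pipeline described above can be sketched as follows. This is a minimal illustration assuming TensorFlow/Keras: the object detector itself is omitted (only its bounding-box output is consumed), and the vocabulary size, caption length, embedding width, and merge-style decoder layout are placeholder assumptions, not the paper's actual configuration.

import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

VOCAB_SIZE = 5000  # assumed vocabulary size (not reported in the abstract)
MAX_LEN = 20       # assumed maximum caption length

# Encoder: pre-trained VGG16 with its classifier head cut at fc2,
# so each detected object region is encoded as a 4096-d feature vector.
vgg = VGG16(weights="imagenet")
feature_extractor = Model(vgg.input, vgg.get_layer("fc2").output)

def encode_object_crop(image, box):
    # Crop the detector's bounding box (x1, y1, x2, y2), resize to the
    # 224x224 input VGG16 expects, and extract fc2 features.
    x1, y1, x2, y2 = box
    crop = tf.image.resize(image[y1:y2, x1:x2], (224, 224))
    crop = preprocess_input(tf.expand_dims(crop, 0))
    return feature_extractor.predict(crop, verbose=0)  # shape: (1, 4096)

# Decoder: a merge-style LSTM caption model that conditions each
# next-word prediction on the object features.
img_in = layers.Input(shape=(4096,))
img_emb = layers.Dense(256, activation="relu")(img_in)

seq_in = layers.Input(shape=(MAX_LEN,))
seq_emb = layers.Embedding(VOCAB_SIZE, 256, mask_zero=True)(seq_in)
seq_enc = layers.LSTM(256)(seq_emb)

merged = layers.add([img_emb, seq_enc])
hidden = layers.Dense(256, activation="relu")(merged)
next_word = layers.Dense(VOCAB_SIZE, activation="softmax")(hidden)

caption_model = Model(inputs=[img_in, seq_in], outputs=next_word)
caption_model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

At inference time, each detected object crop would be encoded once with encode_object_crop and the caption generated word by word by feeding the growing sequence back into caption_model.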