Folk Games Image Captioning Using Object Attention
Abstract
The performance of a deep learning-based image captioning system built on the encoder-decoder framework depends heavily on the image feature extraction technique and the caption generation model, and the model's accuracy is strongly influenced by the attention mechanism used. A mismatch between the output of the attention model and the input expected by the decoder can cause the decoder to produce incorrect results. In this paper, we propose an object-attention mechanism based on object detection. The object detector outputs a bounding box and an object category label, which are then used as image input to VGG16 for feature extraction and to an LSTM-based caption model. The experimental results show that the system with object attention outperforms the system without it: the BLEU-1, BLEU-2, BLEU-3, BLEU-4, and CIDEr scores of the system with object attention improved by 12.48%, 17.39%, 24.06%, 36.37%, and 43.50%, respectively, over the system without object attention.
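For concreteness, the pipeline described above can be sketched as follows. This is a minimal illustration assuming TensorFlow/Keras: the object detector itself is omitted (only its bounding-box output is consumed), and the vocabulary size, caption length, embedding width, and merge-style decoder layout are placeholder assumptions, not the paper's actual configuration.

import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

VOCAB_SIZE = 5000  # assumed vocabulary size (not reported in the abstract)
MAX_LEN = 20       # assumed maximum caption length

# Encoder: pre-trained VGG16 with its classifier head cut at fc2,
# so each detected object region is encoded as a 4096-d feature vector.
vgg = VGG16(weights="imagenet")
feature_extractor = Model(vgg.input, vgg.get_layer("fc2").output)

def encode_object_crop(image, box):
    # Crop the detector's bounding box (x1, y1, x2, y2), resize to the
    # 224x224 input VGG16 expects, and extract fc2 features.
    x1, y1, x2, y2 = box
    crop = tf.image.resize(image[y1:y2, x1:x2], (224, 224))
    crop = preprocess_input(tf.expand_dims(crop, 0))
    return feature_extractor.predict(crop, verbose=0)  # shape: (1, 4096)

# Decoder: a merge-style LSTM caption model that conditions each
# next-word prediction on the object features.
img_in = layers.Input(shape=(4096,))
img_emb = layers.Dense(256, activation="relu")(img_in)

seq_in = layers.Input(shape=(MAX_LEN,))
seq_emb = layers.Embedding(VOCAB_SIZE, 256, mask_zero=True)(seq_in)
seq_enc = layers.LSTM(256)(seq_emb)

merged = layers.add([img_emb, seq_enc])
hidden = layers.Dense(256, activation="relu")(merged)
next_word = layers.Dense(VOCAB_SIZE, activation="softmax")(hidden)

caption_model = Model(inputs=[img_in, seq_in], outputs=next_word)
caption_model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

At inference time, each detected object crop would be encoded once with encode_object_crop and the caption generated word by word by feeding the growing sequence back into caption_model.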