Comparing Word Representation BERT and RoBERTa in Keyphrase Extraction using TgGAT

  • Novi Yusliani, Universitas Sriwijaya
  • Aini Nabilah, Universitas Sriwijaya
  • Muhammad Raihan Habibullah, Universitas Sriwijaya
  • Annisa Darmawahyuni, Universitas Sriwijaya
  • Ghita Athalina, Universitas Sriwijaya
Keywords: Keyphrase Extraction, BERT, RoBERTa, Pre-Trained Language Models, Topic-Guided Graph Attention Networks

Abstract

In this digital era, accessing vast amounts of information from websites and academic papers has become easier. However, efficiently locating relevant content remains challenging due to the overwhelming volume of data. Keyphrase extraction systems automate the generation of phrases that accurately represent a document's main topics, and they support various natural language processing tasks such as text summarization, information retrieval, and text representation. Manually selecting keyphrases is still common, but it is often inefficient and inconsistent in summarizing the main ideas of a document. This study introduces an approach that integrates pre-trained language models (PLMs), BERT and RoBERTa, with Topic-Guided Graph Attention Networks (TgGAT) to enhance keyphrase extraction. TgGAT strengthens the extraction process by combining topic modelling with graph-based structures, providing a more structured and context-aware representation of a document's key topics. By leveraging the strengths of both graph-based and transformer-based models, this research proposes a framework that improves keyphrase extraction performance and is the first study to apply graph-based and PLM methods to keyphrase extraction in the Indonesian language. The results show that BERT outperformed RoBERTa, with precision, recall, and F1-scores of 0.058, 0.070, and 0.062, respectively, compared to RoBERTa's 0.026, 0.030, and 0.027, indicating that BERT with TgGAT produced more representative keyphrases than RoBERTa with TgGAT. These findings underline the benefits of integrating graph-based approaches with pre-trained models for capturing both semantic relationships and topic relevance.
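The abstract does not include code, so the sketch below is only a rough illustration of the kind of pipeline it describes: candidate phrases are embedded with a pre-trained encoder (mean-pooled contextual embeddings), ranked by cosine similarity to the document embedding, and then scored against gold keyphrases with exact-match precision, recall, and F1. The checkpoint name, the pooling strategy, and the similarity-based ranking are illustrative assumptions, not the authors' method, and the TgGAT graph component of the proposed framework is not reproduced here.

```python
# Illustrative sketch only (not the authors' implementation): rank candidate
# phrases with a pre-trained encoder and evaluate against gold keyphrases.
# The checkpoint name and the cosine-similarity ranking are assumptions;
# the TgGAT graph component described in the paper is not reproduced here.
import torch
from transformers import AutoModel, AutoTokenizer


def embed(texts, model_name="bert-base-multilingual-cased"):
    """Mean-pooled contextual embeddings for a list of texts."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state        # (batch, seq_len, dim)
    mask = enc["attention_mask"].unsqueeze(-1)         # zero out padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)


def rank_candidates(document, candidates, model_name="bert-base-multilingual-cased"):
    """Order candidate phrases by cosine similarity to the document embedding."""
    doc_vec = embed([document], model_name)            # (1, dim)
    cand_vecs = embed(candidates, model_name)          # (n, dim)
    sims = torch.nn.functional.cosine_similarity(cand_vecs, doc_vec)
    return [candidates[i] for i in sims.argsort(descending=True).tolist()]


def prf1(predicted, gold):
    """Exact-match precision, recall, and F1 over keyphrase sets."""
    pred, gold = set(predicted), set(gold)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

As a usage example, prf1(["keyphrase extraction", "graph attention"], ["keyphrase extraction", "topic modelling"]) returns precision, recall, and F1 of 0.5 each; the abstract reports these same metrics for the BERT- and RoBERTa-based variants of the framework.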

Published
2025-03-20
How to Cite
Novi Yusliani, Aini Nabilah, Muhammad Raihan Habibullah, Annisa Darmawahyuni, & Ghita Athalina. (2025). Comparing Word Representation BERT and RoBERTa in Keyphrase Extraction using TgGAT. Jurnal RESTI (Rekayasa Sistem Dan Teknologi Informasi), 9(2), 250 - 257. https://doi.org/10.29207/resti.v9i2.6279
Section
Information Systems Engineering Articles