Overview and Exploratory Analyses of CICIDS 2017 Intrusion Detection Dataset
Abstract
Intrusion detection systems (IDSs) are used to detect attacks on a network. Machine learning (ML) approaches have been widely used to build IDSs because ML models can be highly accurate when trained on a very large and representative dataset. One of the benchmark datasets recently used to build ML-based intrusion detection models is CICIDS2017. The dataset is organized into eight groups and was obtained from the dataset repository of the Canadian Institute for Cybersecurity. It is available in both PCAP and net flow formats. This study used the net flow records in the CICIDS2017 dataset, as they were found to contain newer attacks, to be very large, and to be useful for traffic analysis. Exploratory data analysis (EDA) techniques were used to reveal various characteristics of the dataset. The general objective is to provide more insight into the nature, structure, and issues of the dataset so as to identify the best ways to use it to build improved ML-based IDS models. Furthermore, some of the open problems that can arise when the dataset is used in ML-based intrusion detection systems are highlighted, and possible solutions are briefly discussed. The EDA techniques revealed important relationships between the input variables and the target class. The study concludes that EDA can better inform decisions in future IDS research that uses the dataset.
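The kind of first-pass EDA described above can be sketched in a few lines of pandas. This is a minimal illustration, not the authors' procedure: the function name `summarize_flows`, the column names (`Label`, `Flow Duration`, `Flow Bytes/s`), and the choice of checks (class balance, missing or infinite values, duplicate rows) are assumptions, and the tiny table is synthetic stand-in data rather than real CICIDS2017 records.

```python
import numpy as np
import pandas as pd


def summarize_flows(df: pd.DataFrame, label_col: str = "Label") -> dict:
    """Return a few basic EDA statistics for a labelled flow table.

    Hypothetical helper: the checks below are illustrative choices,
    not the analysis reported in the paper.
    """
    # Header names in exported CSV files may carry stray whitespace.
    df = df.rename(columns=lambda c: str(c).strip())
    numeric = df.select_dtypes(include=[np.number])
    # Relative frequency of each class, to expose class imbalance.
    class_share = df[label_col].value_counts(normalize=True)
    return {
        "n_rows": len(df),
        "n_features": df.shape[1] - 1,  # every column except the label
        "class_share": class_share.to_dict(),
        "missing_cells": int(df.isna().sum().sum()),
        "infinite_cells": int(np.isinf(numeric.to_numpy(dtype=float)).sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Synthetic stand-in for one flow file (assumed column names).
demo = pd.DataFrame({
    " Flow Duration": [10, 20, 20, 5],
    "Flow Bytes/s": [1.5, np.inf, np.inf, np.nan],
    " Label": ["BENIGN", "BENIGN", "BENIGN", "DDoS"],
})
summary = summarize_flows(demo)
print(summary)
```

On a real flow file, the same function could be applied to the result of `pd.read_csv(...)`; a heavily skewed `class_share` or non-zero `infinite_cells` would then point to preprocessing steps to consider before model training.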
Copyright (c) 2023 Journal of Systems Engineering and Information Technology (JOSEIT)
This work is licensed under a Creative Commons Attribution 4.0 International License.