Overview and Exploratory Analyses of CICIDS 2017 Intrusion Detection Dataset

Akinyemi Oyelakin; Ameen A.O; Ogundele T.S; Salau-Ibrahim T; Abdulrauf U.T; Olufadi H.I; Ajiboye I.K; Muhammad-Thani S; Adeniji I. A

doi:10.29207/joseit.v2i2.5411

Akinyemi Oyelakin Al-Hikmah University
Ameen A.O University of Ilorin
Ogundele T.S Al-Hikmah University
Salau-Ibrahim T Al-Hikmah University
Abdulrauf U.T Al-Hikmah University
Olufadi H.I University of Ilorin
Ajiboye I.K Abdulraheem College of Advanced Studies
Muhammad-Thani S University of Ilorin
Adeniji I. A University of Ilorin

Keywords: Intrusion Detection, Data Set Exploration, Machine Learning, Dataset Preprocessing

Abstract

Intrusion detection systems (IDSs) are used to detect attacks on a network. Machine learning (ML) approaches are widely used to build IDSs because models trained on a very large and representative dataset tend to be more accurate. One of the benchmark datasets recently used to build ML-based intrusion detection models is CICIDS2017. The dataset is organized into eight groups and was collected from the dataset repository of the Canadian Institute for Cybersecurity. It is available in both PCAP and net flow formats. This study used the net flow records in the CICIDS2017 dataset because they were found to contain newer attacks, to be very large, and to be useful for traffic analysis. Exploratory data analysis (EDA) techniques were used to reveal various characteristics of the dataset. The general objective is to provide more insight into the nature, structure, and issues of the dataset so as to identify the best ways to use it to build improved ML-based IDS models. Furthermore, some of the open problems that can arise when the dataset is used in ML-based intrusion detection systems are highlighted, and possible solutions are briefly discussed. The EDA techniques revealed important relationships between the input variables and the target class. The study concludes that EDA can better inform decisions about future IDS research using the dataset.
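As an illustration of the kind of exploratory checks described above, the following minimal Python sketch counts flows per target class and flags non-finite feature values in a small flow table. It is not the authors' procedure. The sample rows, the column names ("Flow Duration", "Flow Bytes/s", "Label"), and the helper functions are hypothetical stand-ins for a CICIDS2017 net flow CSV file, chosen only to show the idea.

```python
import csv
import io
import math
from collections import Counter

# Tiny synthetic sample standing in for one CICIDS2017 net flow CSV file.
# Column names and values are illustrative assumptions, not real records.
SAMPLE_CSV = """Flow Duration,Flow Bytes/s,Label
120,1500.0,BENIGN
0,Infinity,BENIGN
98,NaN,DDoS
45,320.5,BENIGN
300,78.2,PortScan
15,910.0,BENIGN
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, stripping stray whitespace."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {key.strip(): value.strip() for key, value in row.items()}
        for row in reader
    ]


def class_distribution(rows, label="Label"):
    """Count how many flows fall into each target class."""
    return Counter(row[label] for row in rows)


def non_finite_counts(rows, label="Label"):
    """Count NaN or infinite entries in each feature column."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if column != label and not math.isfinite(float(value)):
                counts[column] += 1
    return counts


rows = load_rows(SAMPLE_CSV)
distribution = class_distribution(rows)
bad_values = non_finite_counts(rows)

# Ratio of the largest class to the smallest one: a rough imbalance indicator.
imbalance_ratio = max(distribution.values()) / min(distribution.values())

print(dict(distribution))
print(dict(bad_values))
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

On real data the same checks would typically be run with a dataframe library over each of the eight files. The standard library is used here only to keep the sketch self-contained.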
