Overview and Exploratory Analyses of CICIDS 2017 Intrusion Detection Dataset

Intrusion detection systems (IDSs) are used to detect attacks on a network. Machine learning (ML) approaches have been widely used to build such systems because ML-based models are more accurate when built from a very large and representative dataset. One of the benchmark datasets recently used to build ML-based intrusion detection models is the CICIDS2017 dataset. The data set is organized into eight groups (captures) and was collected from the data set repository of the Canadian Institute for Cybersecurity. It is available in both PCAP and net flow formats. This study used the net flow records in the CICIDS2017 dataset, as they were found to contain newer attacks, to be very large, and to be useful for traffic analysis. Exploratory data analysis (EDA) techniques were used to reveal various characteristics of the dataset. The general objective is to provide more insight into the nature, structure, and issues of the data set so as to identify the best ways to use it to achieve improved ML-based IDS models. Furthermore, some of the open problems that can arise from the use of the dataset in any ML-based intrusion detection system are highlighted, and possible solutions are briefly discussed. The EDA techniques used revealed important relationships between the input variables and the target class. The study concluded that EDA can better inform decisions about future IDS research using the dataset. Thus, improved ML-based intrusion detection systems can be built from the data set once it is well understood and pre-processed.


Introduction
An intrusion detection system (IDS) is a protection mechanism for detecting attacks on a network. Machine learning (ML) approaches have become popular for building such systems due to the limitations of signature-based detection schemes. ML is a subfield of artificial intelligence that allows algorithms to learn from data, and its applications have been found promising across many domains [1]. ML-based IDSs are more accurate when built from a very large and representative data set. Several ML-based intrusion detection systems have been proposed in the literature, and these models have been built from different datasets, including KDD CUP-99, NSL-KDD, and Kyoto 2006+.
However, some of these data sets are old and have been used extensively in intrusion detection studies, while others are very small. In recent times, one of the benchmark data sets that is becoming popular for building ML-based intrusion detection models is the CICIDS2017 data set, which consists of eight different captures. Malowidzki, Berezinski, and Mazur (2015) pointed out that a very large dataset with representative attacks is better suited to building intrusion detection models. The limitations observed in some of the older datasets led to the design of the CICIDS2017 dataset, as argued by Sharafaldin et al. (2018). Aside from this, several past works have used these datasets to build ML-based intrusion detection systems.
This study specifically improves on a recent study by Ghurab, Gaphari, Alshami, Alshamy, and Othman (2021) that focused on exploratory analysis of some selected intrusion detection datasets. However, that study did not detail the characteristics of the CICIDS2017 dataset. The approach used in this study is to perform a more detailed analysis of the CICIDS2017 dataset and then point out some of the open problems that researchers may face when using the dataset to build ML-based intrusion detection models. It is believed that this approach will be more comprehensive and can provide useful insights to researchers working in this area.
This paper focuses on reporting an overview of the data set and providing the results of its exploratory analyses. Komorowski, Marshall, and Salciccioli (2016) and Gibson and Freitas (2015) have argued that exploratory analysis is a crucial step in every data analytics study, and this serves as the basis for the approach in this work. In any study based on machine learning, it is essential to identify the patterns that could be present in the chosen data set in order to know the best approach to using it for model building. Therefore, the focus of this study is to perform an overview and exploratory data analysis (EDA) of the IDS data set. Generally, EDA is the process of getting to know data in depth so as to have a better understanding of how to use it in building ML-based models.
Furthermore, exploratory data analysis enables machine learning researchers to remove irregularities, outliers, and unnecessary values from the dataset, thereby promoting the building of improved models in different domains. This paper first provides an overview of the different captures in the dataset and emphasizes the need to address the many features contained in each capture. It then uses different EDA approaches to provide better insight into the data set and discusses some of the challenges of using the data set in IDS studies. The general objective is to provide more information on the data set that can aid in the construction of improved ML-based IDS models.

Related Studies
Ghurab et al. (2021) performed an analysis of some benchmark data sets that are used to build network intrusion detection systems. The study generally discussed old and new datasets for IDS studies. However, the analyses were general, and a detailed report was not made on the more recent CICIDS2017 dataset. Similarly, Panigrahi and Borah (2018) carried out an analysis of the CICIDS2017 data set. Their paper explored general characteristics of the data set and mentioned some of its inherent issues without focusing on exploratory analyses. Aggarwal and Sharma (2015) carried out a class-wise analysis of the attributes of the KDD CUP 99 dataset for intrusion detection. The experimental analysis in that study revealed better insights into the KDD CUP dataset, which is also popular for intrusion detection studies. Apart from this, Protić (2018) conducted a review of three datasets, namely the KDD Cup '99, NSL-KDD, and Kyoto 2006+ datasets, which are popular in intrusion detection research [2].
Sharafaldin, Lashkari, and Ghorbani (2018) argued that some of the major limitations observed in previous IDS datasets brought about the need for the development of the CICIDS2017 dataset. The authors carried out an analysis of the CICIDS2017 data set and discussed some of its key features and components. However, the study did not reveal some issues from the analysis and did not extend to reporting the open problems found in the data set. Specifically, the authors claimed that their evaluation of about 11 previous datasets showed that most of them are out of date and unreliable. Some of them also suffer from a lack of traffic diversity and volume, as they do not cover the variety of known attacks. Similarly, Mashkanova (2019) carried out an exploratory data analysis of a cloud-based data set that can be used for identifying intrusions in cloud computing environments. The focus of that work was only on cloud computing security issues.
Komorowski et al. (2016) listed some tools used to explore a dataset, which is essential to gain a good understanding of the features and potential issues of the dataset. Gibson and Freitas (2015) presented the research contexts, the tools and methods used in the exploratory phases of the analysis, the main findings, and the implications for learning analytics research methods. Santosh, Sahu, Sarangi, and Jena (2014) carried out an analysis of some intrusion detection datasets such as KDD-99 and NSL-KDD. The data sets used in that investigation are ones that have been reported to be very old. Tavallaee et al. (2009) conducted a statistical analysis of the KDD CUP 99 dataset and reported two important issues that strongly affect the performance of intrusion detection systems built with it. The authors therefore proposed a new data set named NSL-KDD, which consists of selected records of the entire KDD 99 data set but improves on the mentioned shortcomings of the old data set.

Methods
The data set used in this study was collected from the data sets repository of the Canadian Institute for Cybersecurity. It is available for download at https://www.unb.ca/cic/datasets/ids-2017.html. The methods used in this study are two-fold. First, an overview of the CICIDS2017 intrusion detection dataset is provided. Thereafter, the focus is on performing detailed exploratory analyses of the eight different captures in the dataset. The data set was chosen because it is very large and contains several attacks and intrusion traces, which makes it well suited to security studies. The exploratory analysis procedure includes the following: dataset description, computation of the statistical summary, identification of the properties of the datasets, and data visualization. Then some of the open problems of the data set identified in the exploratory analyses are discussed. All experiments were carried out in the Python programming language environment.
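The first three steps of this procedure (description, statistical summary, and identification of properties) can be sketched as follows. This is only an illustrative sketch, not the code used in the study: it assumes the pandas library, a hypothetical helper named summarize_capture, and a label column called "Label"; the toy frame and its feature names stand in for a real capture loaded from one of the net flow CSV files.

```python
import numpy as np
import pandas as pd


def summarize_capture(df: pd.DataFrame, label_col: str = "Label") -> dict:
    """Collect basic EDA facts for one capture (hypothetical helper)."""
    features = df.drop(columns=[label_col])
    return {
        "n_rows": len(df),
        "n_features": features.shape[1],
        # Number of feature columns per data type (integer, floating point, ...)
        "dtype_counts": features.dtypes.astype(str).value_counts().to_dict(),
        # Total number of NaN cells across all feature columns
        "n_missing": int(features.isna().sum().sum()),
        # count, mean, std, min, quartiles, and max for every numeric column
        "summary": features.describe(),
    }


# Tiny synthetic stand-in for one capture; the values are illustrative only.
toy_capture = pd.DataFrame(
    {
        "Flow Duration": [120, 3500, 48, 910],
        "Flow Bytes/s": [1500.0, np.nan, 320.5, 87.25],
        "Label": ["BENIGN", "DDoS", "BENIGN", "BENIGN"],
    }
)

report = summarize_capture(toy_capture)
print(report["n_rows"], report["n_features"], report["n_missing"])
print(report["summary"])
```

Applying the same helper to each of the eight captures in turn would give a feature-space and sample-size overview per capture, together with the per-capture summary statistics.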

Data sets for Intrusion Detection Studies
Several data sets have been released for intrusion detection studies. They include KDD CUP 99, NSL-KDD, IoT healthcare security datasets, IoT DoS datasets, IoT DDoS security datasets, Kyoto 2006+, datasets on malware of different types, and many others. This work studies CICIDS2017, one of the most popular IDS datasets in recent times, with a view to revealing some of its issues and how to use it to build better IDS models.

Results and Discussion
The findings of this study are grouped into two parts. The first part gives an overview of the eight captures in the data set. The second part reports the results of the detailed exploratory analyses. Some of the open problems identified in the data set based on the EDA are also discussed.

Overview of the CICIDS2017 data set
From the analysis carried out, it was found that CICIDS2017 is a large and representative data set that is well suited to evaluating intrusion detection systems. The data set was originally developed at the Faculty of Computer Science, University of New Brunswick. It was built and released by Sharafaldin et al. (2018) to advance studies on the building of intrusion detection systems. The data set contains benign traffic and up-to-date common attacks, and it resembles true real-world data (PCAP). It also includes the results of the network traffic analysis using CICFlowMeter, with flows labeled based on the time stamp, source and destination IPs, source and destination ports, protocols, and attacks.
The CICIDS2017 dataset consists of labeled network flows, including full packet payloads in PCAP format, the corresponding profiles, and the labeled flows, all of which are publicly available to researchers (Sharafaldin et al., 2018). Sharafaldin et al. (2018) built the abstract behavior of 25 users based on the HTTP, HTTPS, FTP, SSH, and email protocols in the dataset. The authors pointed out that the data capture period for the CICIDS2017 data set started at 9 a.m. on Monday, 3 July 2017 and ended at 5 p.m. on Friday, 7 July 2017, for a total of 5 days. The attacks available in the dataset include Brute Force FTP, Brute Force SSH, DoS, Heartbleed, Web Attack, Infiltration, Botnet, and DDoS.
During the building of the data set, the attacks were executed in both the morning and the afternoon on Tuesday, Wednesday, Thursday, and Friday. There are eight different captures in the data set, each containing attacks recorded during that period. Based on the day and period of capture, the eight captures were renamed in this study as FriAfternoonPortScan, FriAfternoonDDOS, FriMorning, MonMorningHour, ThurAfternoonInfiltration, ThursdayMornWebAttacks, TueWorking, and WedHour for ease of reference. Table 1 reports the features and instances found experimentally in each capture of the CICIDS2017 data set.

Data Distributions in the Dataset Captures
The data distributions in the data set are as shown in Figure 1. From the summary statistics, it was observed that the distributions are similar across captures, based on the values obtained for each statistic. This further suggests that the intrusion captures in each of the net flow datasets behaved in a similar manner.

Visualization of patterns in the data set
The visualization of each of the captures in the dataset is shown in Figures 17 to 24. They are all basic scatter plots that represent the patterns in the dataset. The features and samples in the eight captures of the data set are summarized in Table 2. Because the captures are very large, the big-data issue has to be handled when using the data set to build intrusion detection models. Furthermore, it was observed that the data set has complex data patterns. Thus, machine learning algorithms that can handle complex distributions have to be chosen when building ML-based models.
Another open problem in the data set that has to be addressed is the high class imbalance.
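One simple way to quantify this imbalance is to tabulate the class counts of a capture and take the ratio of the largest class to the smallest. The sketch below is illustrative only: it assumes pandas, the helper names class_balance and imbalance_ratio are hypothetical, and the labels are synthetic rather than taken from the dataset.

```python
import pandas as pd


def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Return per-class counts and shares, largest class first."""
    counts = labels.value_counts()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


def imbalance_ratio(labels: pd.Series) -> float:
    """Ratio of the largest class size to the smallest class size."""
    counts = labels.value_counts()
    return float(counts.max() / counts.min())


# Synthetic labels, illustrative only: 95 benign flows and 5 attack flows.
toy_labels = pd.Series(["BENIGN"] * 95 + ["DDoS"] * 4 + ["Bot"] * 1)

balance = class_balance(toy_labels)
print(balance)
print(imbalance_ratio(toy_labels))
```

A ratio far above 1 indicates that a classifier could reach high accuracy by predicting mostly the majority class, which is why the imbalance should be examined for each capture before model building.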
The other issue is that the data set has many features, not all of which can be used to build the model. Therefore, as argued by Oyelakin and Jimoh (2021), feature selection will be essential. This approach will allow researchers to build an ML-based intrusion detection model from a reduced set of features in the CICIDS2017 dataset. The resulting models are expected to be less complex, more interpretable, and better performing. Lastly, proper scaling of the dataset features has to be addressed as well, because of the high variation in the scales of some of the features.
This study used a recent and rich intrusion detection data set named CICIDS2017 for experimental analyses. First, an overview of the data set was reported. The focus was then shifted to the use of different exploratory data analysis (EDA) approaches to get a better understanding of the data set. The study first revealed the different data frames in the data set. Subsequently, summary statistics were obtained for each of the captures. The statistical summary provided essential statistical information about the characteristics and samples of the data set. From the EDA, it was also discovered that there are 79 missing (NaN) values in each of the dataset captures.
Aside from this, the analyses revealed that the input features (attributes) in the dataset are of numeric data types (integer and floating point), while the output feature is categorical (benign and non-benign). On the basis of the exploratory analysis of the dataset, it was equally found that the input features have different values and ranges. The data set was also observed to have a high class imbalance. This study observed that the features in the dataset have complex data patterns, which require innovative approaches during the pre-processing stages in order to build more effective intrusion detection models from the dataset. It was equally found that the eight different captures in the data set record various attacks. The exploration revealed the structure of the dataset, some of the problems that need to be addressed, and better approaches to addressing the dataset's shortcomings in a machine learning classification problem.
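Since the input features have different values and ranges, one common pre-processing option is min-max scaling, which maps each feature to the interval [0, 1]. The sketch below is one possible illustration and not necessarily the scaling used in this study; it assumes pandas, the function name min_max_scale is hypothetical, and the feature names and values are synthetic.

```python
import pandas as pd


def min_max_scale(features: pd.DataFrame) -> pd.DataFrame:
    """Rescale every column to [0, 1]; constant columns become 0."""
    col_min = features.min()
    col_range = features.max() - col_min
    # Avoid division by zero for columns that hold a single distinct value.
    safe_range = col_range.replace(0, 1)
    return (features - col_min) / safe_range


# Synthetic features with very different ranges, illustrative only.
toy_features = pd.DataFrame(
    {
        "Flow Duration": [0, 500_000, 1_000_000],
        "Fwd Packets": [1, 2, 3],
        "Constant": [7, 7, 7],
    }
)

scaled = min_max_scale(toy_features)
print(scaled)
```

In a modeling workflow, the minimum and range would normally be computed on the training split only and then reused for the test split, so that no information leaks from the test data.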
Some of the issues identified with the data set are summarized in Table 2. For example, since some ML-based IDSs cannot learn from a data set with missing values, the issue has to be addressed. Common approaches to handling missing values include deleting the columns in which missing values are found and using imputation (mean or mode imputation). For instance, Swamynathan (2017) pointed out that when a data set is very large and the missing values amount to less than 5%, the missing ones can be deleted. This study agrees with this argument, since the CICIDS2017 dataset is very large, running to several gigabytes, and the unknown (missing) values are very few.
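The two options discussed above, deletion and mean imputation, can be combined in a small decision rule. The sketch below is illustrative only and rests on several assumptions: it uses pandas, the function name handle_missing is hypothetical, the 5% threshold is interpreted here as the share of rows that contain at least one missing value, and the data are synthetic.

```python
import numpy as np
import pandas as pd


def handle_missing(df: pd.DataFrame, max_drop_share: float = 0.05) -> pd.DataFrame:
    """Drop incomplete rows if they are rare, otherwise mean-impute numeric columns."""
    cleaned = df.copy()
    # Share of rows that contain at least one missing value
    incomplete_share = cleaned.isna().any(axis=1).mean()
    if incomplete_share < max_drop_share:
        return cleaned.dropna().reset_index(drop=True)
    numeric_cols = cleaned.select_dtypes(include="number").columns
    cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].mean())
    return cleaned


# Large synthetic capture: 1 incomplete row out of 40 (2.5%), so the row is dropped.
toy_large = pd.DataFrame(
    {"Flow Bytes/s": [float(i) for i in range(40)], "Label": ["BENIGN"] * 40}
)
toy_large.loc[3, "Flow Bytes/s"] = np.nan
dropped = handle_missing(toy_large)

# Small synthetic capture: 1 incomplete row out of 4 (25%), so the mean is imputed.
toy_small = pd.DataFrame(
    {"Flow Bytes/s": [10.0, np.nan, 30.0, 20.0], "Label": ["BENIGN"] * 4}
)
imputed = handle_missing(toy_small)

print(len(toy_large), len(dropped))
print(imputed)
```

Comparing the number of rows before and after the call gives the kind of before-and-after record counts that are useful when documenting the effect of removing missing values.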
This study recommends that researchers using the dataset consider imputation techniques to handle the unknown or missing values and then use the preprocessed dataset to build an improved ML-based intrusion detection system. Further analysis showed that there is a need to address the class imbalance in each of the captures using any suitable method from the literature. Some of the suggested solutions are summarized in Table 3 and can be of help to any machine learning researcher who intends to use the data set for IDS studies. Table 4 presents the results of the experimental analysis of the data set with respect to the feature space and sample sizes before and after the removal of missing values. The visualizations of the data set carried out in the study also provided some insight into pattern distributions. It is believed that understanding these distributions can help researchers make better use of the data set in future research.

Conclusions
This study provided a detailed analysis of the benchmark intrusion detection data set named CICIDS2017. The work focused on investigating the basic characteristics of the data set using exploratory data analysis techniques. The data set used in this study was collected from a repository in a Canadian university laboratory. The experimental analyses of the data set are detailed and can provide useful information to researchers using it to build intrusion detection systems. The patterns in the data set were also visualized using simple scatter plots. It is believed that the exploratory data analysis revealed some of the underlying structures and patterns in the data set, which can help build improved ML-based intrusion detection models. Equally important, some of the suggestions made in this study for handling open problems in the data set can serve as information for researchers working in the IDS area. The EDA techniques used in this study may be useful for revealing important relationships between input variables and the target class. The study concluded that EDA can better inform decisions about future IDS research using the dataset. A future study will focus on building efficient ML-based models from the CICIDS2017 dataset, with an emphasis on the impact of innovative data cleaning approaches on the performance of the targeted ML models.

Figure 1. Data frame of the first data capture

Summary statistics in the dataset captures
The statistical summary provides some statistical details about the distributions in the chosen dataset. The summary statistics of the eight captures in the data set are shown in Figure 2.

Figure 2. Summary statistics for the first data capture

Figure 3. Visualization of the first data capture

Statistical summaries and diagrams are used to describe the patterns in the dataset. For example, it can be seen from Figure 1 that different patterns exist across the eight captures of the data set.

Table 1. Dataset feature space and sample size

Table 2. Key summary of the captures in the CICIDS2017 dataset

Table 3. Summary of suggested solutions for tackling the issues in the dataset

Table 4. Records in each data set before and after the deletion of unknown or missing values