Introduction
The rapid evolution of cybersecurity threats, including Distributed Denial of Service (DDoS) attacks, ransomware, and botnet activity, has exposed the limitations of traditional defense mechanisms such as antivirus software and firewalls in detecting sophisticated and emerging malware. As cyber threats grow more diverse, with attackers exploiting cloud computing, IoT, and mobile technologies, security measures based on predefined rules are no longer sufficient. Because these attacks can cause severe financial and operational damage, there is a growing need for adaptive, real-time detection systems capable of responding to both known and unknown threats.
To address these challenges, ML-based cybersecurity solutions increasingly focus on behavior-based detection and automated threat mitigation. Behavior-based detection makes it possible to identify zero-day attacks, i.e., exploits of previously unknown vulnerabilities that rule-based methods cannot detect. By analyzing large volumes of data and identifying patterns, ML models can detect anomalies and adapt to new malware types, improving detection rates and response times. Supervised and unsupervised algorithms such as decision trees, neural networks, and anomaly detection methods are now used to classify threats, enabling more effective and proactive defenses against modern cyberattacks.
The objective of this article is to build an effective malware detection system using advanced machine learning techniques, with the CICIDS2017 dataset as a benchmark. It addresses the challenge of detecting a wide range of threats in network traffic, including DoS/DDoS, Port Scan, Brute Force, and Botnet activity. By preprocessing the data, selecting the most relevant features, and applying several classification models, the work seeks to improve detection accuracy and robustness and to provide a reliable framework for identifying malicious network traffic. Specifically, the article assesses the performance of multiple advanced ML models, including XGBoost, Random Forest, Naïve Bayes, and Multilayer Perceptron (MLP), in identifying these network-based threats.
Methodology
Preprocessing steps were employed to clean the data by removing outliers, handling missing values, and applying Min-Max normalization to ensure consistency across numerical features. Feature selection techniques were applied to retain the most relevant attributes for effective model training. A total of 80% of the data was used for training, while the remaining 20% was held out for testing. Several classification models were applied to detect malware, and their performance was evaluated using metrics such as accuracy, precision, recall, and F1-score to assess each model's effectiveness in identifying cybersecurity threats.
Data Collection
CICIDS2017 satisfies all 11 of the criteria for a trustworthy intrusion and malware detection dataset published in 2016, making it a comprehensive and up-to-date benchmark. The dataset contains over 80% benign traffic and covers several types of attacks; during preprocessing, the original thirteen attack subcategories were consolidated into six main groups. For analysis, the dataset was split into training and testing subsets to enable a sound evaluation of the malware detection models.
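A minimal sketch of loading the dataset with Pandas is shown below. The directory layout and the "Label" column name are assumptions based on the publicly distributed CICIDS2017 CSV files (one per capture day); adjust the paths to the actual files.

```python
# Sketch: load and combine the CICIDS2017 CSV files into one DataFrame.
# The directory name below is illustrative, not the study's actual path.
import glob
import pandas as pd

csv_files = glob.glob("CICIDS2017/*.csv")      # hypothetical location of the per-day CSVs
frames = [pd.read_csv(f) for f in csv_files]
df = pd.concat(frames, ignore_index=True)

df.columns = df.columns.str.strip()            # column names in the CSVs often carry stray spaces
print(df["Label"].value_counts())              # benign vs. attack class distribution
```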
Data Preprocessing
Preprocessing techniques improve the quality of inconsistent (erroneous, outlier-containing) and incomplete data before data mining techniques are applied, which in turn improves accuracy and efficiency. The following preprocessing steps were applied (a code sketch of these steps follows the list):
- Removing outliers: Cleaning consists of identifying erroneous, inaccurate, or superfluous records and then correcting, replacing, or removing them from the dataset.
- Missing Values: Handling missing values is crucial for improving the performance of machine learning models. Common tactics include using sophisticated imputation methods, deleting rows with missing data, or substituting the column's mean, median, or mode.
- Scaling features: Scaling features with the Min-Max scaler helps models learn more effectively and boosts performance. Min-Max normalization transforms all numerical values in a dataset to a standardized range between 0 and 1, preserving the relative differences between values without altering the underlying distribution.
Y = (a − min(a)) / (max(a) − min(a)), where Y represents the normalized value and a represents the original value.
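The following sketch illustrates these cleaning and scaling steps, assuming the DataFrame `df` produced by the loading step above; the exact imputation and outlier-handling choices used in the study may differ.

```python
# Illustrative preprocessing: treat infinite values as missing, drop incomplete rows,
# and apply Min-Max normalization to the numeric features.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

df = df.replace([np.inf, -np.inf], np.nan)     # infinities become missing values
df = df.dropna()                               # alternatively, impute with mean/median/mode

numeric_cols = df.select_dtypes(include=np.number).columns
scaler = MinMaxScaler()                        # rescales each feature to the [0, 1] range
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
```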
Data Splitting
Data splitting is a crucial step in assessing how well ML models perform. The dataset is divided into two separate subsets, with 80% designated for training and the remaining 20% reserved for testing.
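A sketch of the 80/20 split with scikit-learn is shown below; the "Label" column name, the integer label encoding, and the stratified split are assumptions, not necessarily the exact procedure used in the study.

```python
# Sketch: encode the class labels and perform an 80/20 stratified train-test split.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

X = df.drop(columns=["Label"])
y = LabelEncoder().fit_transform(df["Label"])  # attack categories as integers

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)
```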
Classification Models
This study explores the use of XGBoost, Naïve Bayes, MLP, and Random Forest models to classify malware attacks on the CICIDS2017 dataset, with the goal of enhancing the accuracy and effectiveness of cybersecurity threat detection.
Extreme Gradient Boosting (XGBoost): XGBoost is a gradient-boosting approach that combines several decision trees. Its defining characteristics are speed, portability, and efficiency. In each iteration, XGBoost optimizes only the sub-model (tree) added at the current boosting step.
Random Forest (RF): An RF ensemble is a collection of individual decision trees (DTs). The model's prediction is the highest-voted class after collecting the predictions of all trees in the forest. Because the ensemble's prediction is typically more accurate than any individual tree's prediction, low correlation across the individual trees is crucial.
Naïve Bayes (NB): The NB classifier applies Bayes' theorem under the assumption that the features are conditionally independent of one another. Although Naïve Bayes is used in many different fields worldwide, it has also been employed in cloud security services to identify risks and DDoS attacks.
Multilayer Perceptron (MLP): The MLP is a supervised learning model based on a feedforward artificial neural network (ANN). An MLP has at least three layers: input, hidden, and output, with at least one neuron in each layer. Every neuron in one layer is connected to every neuron in the next layer by a weighted connection, and the weights are updated after each training input, which is how the MLP learns.
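The sketch below shows how the four classifiers could be instantiated and trained with scikit-learn and the xgboost package; the hyperparameters are placeholder defaults, not the exact settings used in this study.

```python
# Sketch: train the four classifiers compared in this study on the training split.
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

models = {
    "Naive Bayes": GaussianNB(),
    "MLP": MLPClassifier(hidden_layer_sizes=(100,), max_iter=300, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "XGBoost": XGBClassifier(n_estimators=100, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)                # each model learns from the same training data
```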
Performance Metrics
The confusion matrix and four metrics (accuracy, precision, recall, and F1-score) are used to assess and compare the algorithms' output.
Confusion Matrix: A confusion matrix, also known as an error matrix, is a useful tool for displaying algorithm output. Each row contains instances of an actual class, while each column contains instances of a predicted class (or vice versa). It is used in the computation of accuracy, precision, recall, and F1-score. The following terms are used in the confusion matrix:
True Positive (TP): Both the predicted and actual classes of a data point are one.
True Negative (TN): Both the predicted and actual classes of a data point are zero.
False Positive (FP): The actual class of a data point is zero, but its predicted class is one.
False Negative (FN): The predicted class of a data point is zero, but its actual class is one.
Accuracy: The accuracy may be expressed as the ratio of the number of correctly predicted instances to the total number of examples in the dataset, as below:
Accuracy = (TP+TN)/(TP+FP+FN+TN)
Precision: Precision, the proportion of relevant items among the retrieved items (i.e., the fraction of predicted attacks that are actual attacks), is defined as follows:
Precision = TP/(TP+FP)
Recall: The proportion of actual attack instances that are correctly predicted as attacks. The corresponding equation is:
Recall = TP/(TP+FN)
F1 Score: The harmonic mean of precision and recall (sensitivity). Because it balances both metrics, the F1 score is often considered the most informative single indicator and therefore receives particular attention. It is computed as follows:
F1-score = (2×recall×precision)/(recall+precision)
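These metrics can be computed directly with scikit-learn, as sketched below for the models trained earlier; macro averaging across the attack categories is an assumption, since the article does not state which averaging scheme was used.

```python
# Sketch: evaluate each trained model on the held-out test split.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

for name, model in models.items():
    y_pred = model.predict(X_test)
    print(name)
    print("  Accuracy :", accuracy_score(y_test, y_pred))
    print("  Precision:", precision_score(y_test, y_pred, average="macro"))
    print("  Recall   :", recall_score(y_test, y_pred, average="macro"))
    print("  F1-score :", f1_score(y_test, y_pred, average="macro"))
    print(confusion_matrix(y_test, y_pred))    # rows: actual classes, columns: predicted classes
```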
Result Analysis & Discussion
The results obtained from the various ML models are compared and discussed in this section. The experiments were conducted on a Dell PC equipped with 16GB of RAM, a 500GB SSD, and an NVIDIA RTX 3060 GPU for fast processing. The project is implemented in Python in a Jupyter Notebook environment. Key libraries include NumPy, Pandas, Matplotlib, Seaborn, and scikit-learn.
| Performance Measure (%) | Naïve Bayes | MLP | Random Forest | XGBoost |
|---|---|---|---|---|
| Accuracy | 81.6 | 83.8 | 94 | 96.5 |
| Precision | 82 | 84 | 84.9 | 94.9 |
| Recall | 82 | 84 | 96.9 | 98.4 |
| F1-score | 82 | 84 | 90.9 | 96.5 |
In this comparative analysis of ML models for anomaly detection, as shown in the table above, the XGBoost model demonstrates the highest overall performance, with an accuracy of 96.5%, precision of 94.9%, recall of 98.4%, and an F1-score of 96.5%, making it the most effective model across all performance measures. The Random Forest model follows with an accuracy of 94% and a strong recall of 96.9%, but a lower precision of 84.9%, resulting in an F1-score of 90.9%, indicating a good balance between precision and recall but slightly less effective than XGBoost. The MLP model shows moderate performance, achieving 83.8% accuracy and consistently balanced metrics, with 84% for precision, recall, and F1-score, making it a solid but less optimal choice. In contrast, the Naïve Bayes model delivers the lowest performance, with 81.6% accuracy and 82% for precision, recall, and F1-score, reflecting its limitations in this context. Overall, XGBoost emerges as the superior model for detecting malware attacks, outperforming the other models across all key metrics.
Conclusion & Future Work
This article demonstrates the effectiveness of machine learning-based approaches in enhancing malware detection within network traffic, using the CICIDS2017 dataset as a benchmark. Among the models evaluated, XGBoost emerged as the most effective, achieving the highest accuracy, precision, recall, and F1-score, with all metrics between 94.9% and 98.4%, making it well suited for detecting a variety of cybersecurity threats such as DoS/DDoS, Brute Force attacks, and Port Scans. The comparative study of several models shows the potential of ML to increase the precision and resilience of cybersecurity systems. The article contributes to the field by presenting a scalable and effective detection methodology that can be used to counteract contemporary attacks in real network environments.
Future research could focus on further enhancing the detection system by exploring deep learning (DL) techniques such as Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs) to improve the ability to detect more sophisticated and evolving malware threats. Additionally, incorporating real-time detection capabilities and testing on diverse datasets beyond CICIDS2017 would help validate the system's generalizability across various network environments. Investigating the integration of adaptive learning mechanisms and the ability to handle encrypted traffic are also promising directions for addressing the ever-changing landscape of cybersecurity threats.
Nitin Talreja holds a B.Tech from a premier institute (IIT), is PMP-certified, and has an MBA in Management Information Systems, with more than 16 years of IT experience spanning projects in the healthcare, insurance, media, and publishing industries, with an emphasis on system analysis. He has provided cradle-to-grave management of large-scale IT implementation projects. He has demonstrated strong analytical and critical-thinking skills, earned a reputation for a proactive approach, and is recognized as results-driven and an effective collaborator with exceptional leadership and communication skills. He is well versed in data management policies, procedures, and widely used technical tools. He currently serves as Principal Data Engineer at UHC, Optum. He has successfully implemented cloud data migration in multiple projects, and his research interests include data, decision support systems, knowledge management, deep learning, and neural networks.