ABSTRACT
Advanced Persistent Threats (APTs) are sophisticated, enduring network intrusion campaigns that target sensitive information in specific organizations and operate over a long period. Such threats are much harder to detect with signature-based methods. Anomaly-based methods monitor system activity to determine whether an observed activity is normal or abnormal, using heuristic or statistical analysis, and can therefore detect unknown attacks. Despite significant research efforts, such techniques still suffer from a high number of false positive detections. Detecting APTs is complex because they tend to follow a “low and slow” attack profile that is very difficult to distinguish from normal, legitimate activity, and the volume of data that must be analyzed is overwhelming. One technology that holds promise for detecting this nearly invisible kind of attack is big data analytics. In this work, I propose a data-driven, anomaly-based behavior detection method that leverages big data methods and is capable of processing significant amounts of data from diverse data sources. Big data analytics can significantly enhance detection capabilities, enabling the detection of APT activities that pass under the radar of traditional security solutions.
Keywords: Big data, Advanced Persistent Threats, Big data analytics, Network intrusion, Hadoop
TABLE OF CONTENTS
CERTIFICATION
ABSTRACT
ACKNOWLEDGEMENTS
DEDICATION
LIST OF ABBREVIATIONS
LIST OF FIGURES AND TABLES
CHAPTER ONE
INTRODUCTION
1.1 Background of the study
1.2 Objective of the research
1.3 Research statement
1.4 Structure of the work
CHAPTER TWO
LITERATURE REVIEW
2.1. What is an Advanced Persistent Threat?
2.1.1. What actually differentiates APT from other non-targeted threats?
2.1.2. Terminology
2.1.3. Common Goals of APT Attack [51]
2.1.4. Other Attacks Related to APT [51]
2.1.5. The Relationship between APT, AET and Botnet [51]
2.2. Tools and Methods used by the attackers
2.2.1. Malware
2.2.1.1. Malware capabilities
2.2.1.2. How does malware infiltrate a computer?
2.2.2. Phishing and other e-mail attacks
2.3. Traditional Security solutions
2.3.1. Antivirus software
2.3.1.1. Ways to get rid of viruses [26]
2.3.1.2. Limitations of antivirus software
2.3.2. Firewalls
2.3.3. Intrusion Prevention Systems
2.3.4. Web filters
2.3.5. Spam filters
2.4. APT Life Cycle
2.5. Model of operation of APT malware
2.6. Command & Control Channels (C&C)
2.6.1. Malware C&C Network Protocol Usage
2.6.2. Detection and Reaction
2.6.3. C&C Channel Detection Techniques
2.6.3.1. Blacklisting
2.6.3.2. Signature based
2.6.3.3. DNS protocol based
2.6.3.4. IRC protocol based
2.6.3.5. Peer to peer protocol based
2.6.3.6. HTTP protocol based
2.6.3.7. Temporal-based
2.6.3.8. Anomaly detection
2.6.3.9. Correlation based
2.7. Research Direction
2.8. Related work
CHAPTER THREE
METHODOLOGY
3.1. Big data and Big data analytics
3.1.1. Big Data
3.1.2. Big Data Analytics
3.1.3. Some Big Data Technologies
3.1.3.1. Hadoop
3.1.3.2. MapReduce and Distributed Computing Using Spark
3.1.3.3. Spark Ecosystem
3.1.3.4. What are the benefits of Spark?
3.1.3.5. Resilient Distributed Datasets
3.1.3.6. Predictive Modeling and Analytics
3.1.3.7. Types of Machine Learning Models
3.1.4. Machine Learning and Big Data Analytics
3.1.5. Benefits of Big Data Analytics in APT attack detection
3.2. Methodology
3.2.1. What is Anomaly Detection?
3.2.2. The Components of a Data-driven Anomaly-based Behavior Detection method for Advanced Persistent Threats (APT)
3.2.2.1. Data Collection
3.2.2.2. Data preprocessing
3.2.2.3. Model Creation via classification
3.2.2.4. Model Selection
3.2.2.5. Model Prediction and Evaluation
CHAPTER FOUR
IMPLEMENTATION AND EVALUATION
4.1. Big Data Analytics (Machine learning) based on network traces with full payloads
4.2. Big Data Analytics (Machine Learning) based on HTTP traffic
4.3. Environment for the Implementation
4.4. IMPLEMENTATION STAGES
4.4.1. Data Collection
4.4.2. Data Preprocessing
4.4.2.1. Load and Analyze data
4.4.2.2. Feature Extraction
4.4.2.3. Data Cleaning
4.4.2.4. Feature Engineering and Transformation
4.4.3. Model Creation via classification
4.4.3.1. Create Pipeline
4.4.4. Model Selection
4.4.4.1. Tuning the pipeline using a CrossValidator
4.4.5. Model Prediction and Evaluation
CHAPTER FIVE
CONCLUSIONS
5.1. Summary
5.2. Challenges
5.3. Future Work
REFERENCES
CHAPTER ONE
INTRODUCTION
1.1 Background of the study
With the rapid development of computer networks, new and sophisticated types of attacks have emerged that require novel and more sophisticated defense mechanisms. Advanced Persistent Threats (APTs) are among the fastest-growing cyber security threats that organizations face today [12]. They are carried out by knowledgeable, highly skilled, and well-funded hackers who target sensitive information in specific organizations. The objective of an APT attack is to steal sensitive data from the targeted organization, to gain access to sensitive customer data, or to access strategic business information that could be used for financial gain, blackmail, embarrassment, data poisoning, “illegal insider trading or disrupting an organization’s business” [30]. APT attackers target organizations in sectors with high-value information, such as national defense and the military, manufacturing, and the financial industry.
The technologies and methods employed in APT attacks are stealthy and difficult to detect. For instance, attackers can employ “social engineering which involves tricking people into breaking normal security procedures” [13]. In addition, APT intruders constantly change and refine their methods, and may rely on insiders (people within the organization) who abuse legitimate access rights to manipulate and steal data.
Once the intruder has successfully broken into the targeted network, the intruder installs APT malware on the victim’s system. The attacker is then able to monitor and control the spread of the malware and to remotely control the infected systems. This opens a channel through which sensitive information is stolen from the victim’s system without the owner’s knowledge, over a long period of time, unless the malicious activity is detected. After the information of interest has been found, the attacker issues a command to exfiltrate it. This is usually done through a channel separate from the Command and Control (C&C) channel. To maintain access to the network, the attacker continuously rewrites code and employs sophisticated evasion methods. The frequency of such attacks and breaches highlights the fact that even the best Information Technology (IT) network perimeter defenses and traditional security solutions, including proxies, firewalls, VPNs, antivirus, and anti-malware tools, are unable to prevent the intrusions [Craig Richardson (http://data-informed.com/use-data-analytics-combat-advanced-persistent-threats)]. The Verizon data breach investigation report [14] confirmed that, in 86% of cases, evidence of the data breach was recorded in the organization’s logs, but the traditional security solutions failed to raise security alarms. This signals a need for other forms of security solutions, in addition to the existing ones, that are better able to detect the activities of APTs. Detecting APTs is complex because they tend to follow a low and slow attack profile that is very difficult to differentiate from normal, legitimate activity. Thus, detection of this kind of attack relies heavily on heuristics or human inspection.
The best way to achieve this detection is to examine communication patterns across many nodes over an extended period. This is more effective than the micro-examination of specific packets or protocol patterns for malware, which tends to generate too many false positive detections. Although, as pointed out earlier, differentiating normal, legitimate activity from malicious APT activity is difficult, certain aspects of APT behavior can be detected by observing trends over periods of time (days or weeks) to spot unusual patterns.
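As a rough illustration of observing trends over days to spot unusual patterns, the following Python sketch aggregates events per host and day and flags days that deviate strongly from that host's own recent baseline. It is not the detection method developed in this work: the function names, the seven-day baseline window, and the z-score threshold of 3.0 are all assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev


def daily_counts_per_host(events, num_days):
    """Aggregate (host, day_index) events into {host: [count for each day]}."""
    counts = defaultdict(lambda: [0] * num_days)
    for host, day in events:
        counts[host][day] += 1
    return dict(counts)


def flag_unusual_days(series, baseline_days=7, threshold=3.0):
    """Return indices of days that deviate from the preceding baseline window.

    baseline_days and threshold are illustrative assumptions, not values
    taken from this work.
    """
    flagged = []
    for i in range(baseline_days, len(series)):
        window = series[i - baseline_days:i]
        mu = mean(window)
        sigma = stdev(window)
        if sigma == 0:
            # Perfectly flat baseline: treat any change as a deviation.
            if series[i] != mu:
                flagged.append(i)
            continue
        z_score = (series[i] - mu) / sigma
        if abs(z_score) > threshold:
            flagged.append(i)
    return flagged


# Hypothetical outbound connection counts for one host over eight days.
history = [20, 22, 19, 21, 20, 23, 21, 60]
unusual = flag_unusual_days(history)
```

A per-host baseline is used here because what counts as normal differs between machines. A simple rolling window like this one is also weakened once anomalous days enter the baseline, which is one reason longer observation periods and richer features are needed in practice.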
An approach that can connect different low-level events to each other to form an attack scenario can possibly detect APT attacks [15] [16] and reduce false positives. Correlating recent and historical events in network traffic logs from a large number of diverse data sources can help detect APT malware. According to Jared Dean [31], “Anomaly detection should detect malicious behaviors including segmentation of binary code in a user password, stealthy reconnaissance attempts, backdoor service on a well-known standard port, natural failures in the network, new buffer overflow attacks, HTTP traffic on a non-standard port, intentionally stealthy attacks, variants of existing attacks in new environments, and so on”. Accurate anomaly detection of these malicious behaviors faces several challenges because of the huge volume of data that must be analyzed. Big data storage and analysis techniques can be a solution to this challenge. The advantage of big data tools is that they help handle the large volumes and semi-structured data formats involved in monitoring large networks [32]. Big data helps to collect and analyze terabytes of data from diverse sources, increasing the quantity and scope of data over which correlation can be performed; such correlation in turn helps to lower false positive alerts. Big data analytics significantly enhances detection capabilities, enabling the detection of APT activities that pass under the radar of traditional security solutions. This work presents an intelligent distributed machine learning system that detects APT activities by examining communication patterns registered in network traffic and logs, over multiple nodes and over an extended period. The proposed system leverages big data machine learning methods to identify the features needed to recognize APT commands and command channels, and, with the extracted features, a model is created to detect malicious traffic. A classification method was used to create the models, and the detection accuracy of the created models was evaluated. The evaluation results show that the models are capable of detecting malicious attacks with high accuracy and low false positive rates.
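For reference, the two evaluation measures mentioned above can be computed from a classifier's predictions as in the following minimal Python sketch. It assumes binary labels where 1 means malicious and 0 means benign; the function names and the sample labels are hypothetical and do not reproduce the results of this work.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives (assumes 1 = malicious, 0 = benign)."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Fraction of all samples that were classified correctly."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


def false_positive_rate(fp, tn):
    """Fraction of benign samples that were wrongly flagged as malicious."""
    benign = fp + tn
    return fp / benign if benign else 0.0


# Hypothetical labels: two malicious flows among ten.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, fp, tn, fn)
fpr = false_positive_rate(fp, tn)
```

Because malicious traffic is usually a small share of all traffic, accuracy alone can look high even when attacks are missed, which is why the false positive rate is reported alongside it.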
1.2 Objective of the research
The goal of this research work is to leverage big data methods and explore a new detection algorithm capable of processing significant amounts of data from diverse data sources, in order to detect APT activities that pass under the radar of traditional security solutions.
1.3 Research statement
According to various reports, intrusion detection systems and intrusion prevention systems are not capable of protecting systems against Advanced Persistent Threat (APT) attacks because no signatures exist for them. Therefore, to address APTs, which remain a challenging and persistent problem for the security community, a new model that leverages big data technologies to detect APT attacks is proposed.
1.4 Structure of the work
This work is organized as follows: Chapter One presents the background of the study, the objective of the research, and the research statement. Chapter Two covers the literature review and related work. Chapter Three presents the methodology and the approach used. Chapter Four contains the implementation of the APT detection model proposed in Chapter Three. The work is concluded in Chapter Five.