by Paul Lipman,
CEO at BullGuard
The volume and sophistication of cyber threats continue to escalate. Traditional approaches to threat detection are no longer sufficient to ensure protection: signature databases, which identify malicious files by comparing them against a catalog of “known bad” files, and behavioral heuristics, which infer malicious intent by evaluating behavior against a defined rule set, cannot keep pace.
Machine learning (ML) has proven extremely effective at identifying cyber attacks. The power of ML is the result of three factors: data, compute power and algorithms. The cyber field produces copious amounts of data by its very nature. For example, a corporate network might see billions of daily IP packets; millions of DNS queries, resolved URLs and executed files; and perhaps hundreds of millions of events (processes, connections, I/Os) on its endpoint devices. Extracting, cleaning and processing this data requires vast amounts of computing power, which, fortunately, is available easily, scalably and affordably through a variety of cloud-based platforms. Increasingly powerful open source ML algorithms abstract away the complex underlying math, enabling the development, tuning and training of sophisticated models. Together, these factors provide cybersecurity vendors with capabilities that would have been unthinkable only a few years ago.
Cybersecurity vendors typically train their ML models using live customer data, “honeypots” designed to attract attackers, and data shared within the cyber community. This enables a broader view of the threat landscape, for example, creating model features that might include a file’s recency, prevalence and frequency of usage across the entire customer universe. Vendors also train their models with corpora of known malware types as well as legitimate files. The training not only determines whether a file is malicious, but also often tries to classify the type of malware, which is important in order to determine how to remediate or remove it.
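To make the feature idea concrete, here is a minimal sketch of how reputation signals like recency, prevalence and usage frequency might feed a scoring function. The feature names, thresholds and weights are illustrative assumptions, not any vendor's actual model:

```python
# Minimal sketch of file-reputation features feeding a risk score.
# Feature names, weights and thresholds are illustrative, not a real model.

def extract_features(file_record):
    """Turn raw telemetry into the kinds of features described above."""
    return {
        "recency_days": file_record["first_seen_days_ago"],  # how new the file is
        "prevalence":   file_record["machines_seen_on"],     # spread across customers
        "frequency":    file_record["executions_per_day"],   # how often it runs
    }

def risk_score(features):
    """Toy heuristic: new, rare, rarely-run files score as riskier."""
    score = 0.0
    if features["recency_days"] < 7:    # very new file
        score += 0.5
    if features["prevalence"] < 10:     # seen on almost no machines
        score += 0.3
    if features["frequency"] < 1:       # almost never executed
        score += 0.2
    return score

# A brand-new file seen on two machines scores as high risk...
suspicious = extract_features(
    {"first_seen_days_ago": 1, "machines_seen_on": 2, "executions_per_day": 0}
)
# ...while an old, widely deployed, frequently run file scores low.
common = extract_features(
    {"first_seen_days_ago": 400, "machines_seen_on": 50000, "executions_per_day": 30}
)
print(risk_score(suspicious))  # 1.0
print(risk_score(common))      # 0.0
```

In practice these features would be inputs to a trained classifier rather than hand-set rules, but the intuition is the same: context across the whole customer universe, not just the file's contents, drives the verdict.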
The applications of ML are broad, including anti-malware, bot detection, anti-fraud and privacy protection. There are several interesting emerging challenges in the use of ML within cybersecurity, making it an exciting field with tremendous potential.
Internet of Things (IoT).
Tens of billions of new connected devices come online every year. Many of these IoT devices have limited compute or storage capacity, cannot run endpoint cybersecurity software, and are built on proprietary firmware. Furthermore, these devices tend to be “headless,” with limited ability for users to access or update the software running on them. For these reasons, IoT devices are uniquely vulnerable to cyber attack.
The natural solution to this problem is to run IoT cybersecurity at the network level and/or in the cloud. However, traditional signature-based network security technologies aren’t designed to address the IoT device security problem. Furthermore, most IoT cybersecurity products are currently little more than re-packaged IDS, URL reputation or hardened DNS services. Exciting work is being done, however, in the application of ML to this field. Sophisticated models have been developed that can identify infected devices through inspection of just a few packets of data, enabling pro-active detection and blocking of threats.
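A toy sketch of the idea behind few-packet detection, under assumed features: summarize a short capture and flag scan-like behavior. Real models use far richer packet features and learned decision boundaries; the rule and thresholds below are purely illustrative:

```python
# Sketch of flagging an infected IoT device from its first few packets.
# The features, rule and thresholds are illustrative assumptions.

def packet_features(packets):
    """Summarize a short capture: average size and destination diversity."""
    sizes = [p["size"] for p in packets]
    dests = {p["dst"] for p in packets}
    return {
        "mean_size": sum(sizes) / len(sizes),
        "unique_dests": len(dests),
    }

def looks_infected(packets, max_dests=3, min_mean_size=100):
    """Toy rule: many tiny packets fanned out to many hosts (scan-like)."""
    f = packet_features(packets)
    return f["unique_dests"] > max_dests and f["mean_size"] < min_mean_size

# A device suddenly probing many hosts with tiny packets is flagged...
scan = [{"size": 60, "dst": f"10.0.0.{i}"} for i in range(10)]
# ...while a bulk download from a single server is not.
normal = [{"size": 1400, "dst": "update.vendor.example"} for _ in range(10)]
print(looks_infected(scan))    # True
print(looks_infected(normal))  # False
```

The appeal of this approach for IoT is that it needs nothing on the device itself: inspection happens at the network or cloud layer, where compute is plentiful.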
Adversarial Use of AI.
The democratization of AI, driven by the availability of large data sets, the rapidly falling cost of compute at scale and open-source access to powerful algorithms, has been a boon for the cybersecurity industry. But it has also made ML an increasingly important tool in the cyber adversary’s arsenal.
For example, generative adversarial models are used to develop strategies that minimize the risk of an attack being identified by cybersecurity tools. In much the same way that ML-based behavioral anomaly detection systems learn normal behavior in order to quickly identify unusual and potentially malicious activity, adversaries are developing malware that learns normal user and system behavior in order to mimic it and minimize the risk of detection.
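The defensive side of this arms race, behavioral anomaly detection, can be sketched very simply: learn a statistical baseline of normal activity, then flag large deviations. Production systems use far richer models than the single-feature z-score below, which is purely illustrative:

```python
# Sketch of behavioral anomaly detection: learn a baseline of "normal"
# activity, then flag deviations. Real systems use far richer models;
# this single-feature z-score detector is purely illustrative.
import statistics

class BaselineDetector:
    def __init__(self, threshold=3.0):
        self.threshold = threshold  # flag anything > 3 std devs from normal
        self.mean = None
        self.stdev = None

    def fit(self, normal_samples):
        """Learn normal behavior, e.g. outbound megabytes per hour."""
        self.mean = statistics.mean(normal_samples)
        self.stdev = statistics.stdev(normal_samples)

    def is_anomalous(self, value):
        return abs(value - self.mean) / self.stdev > self.threshold

det = BaselineDetector()
det.fit([100, 110, 95, 105, 98, 102, 107, 99])  # typical hourly traffic
print(det.is_anomalous(104))   # False: within the normal range
print(det.is_anomalous(5000))  # True: huge spike, e.g. data exfiltration
```

The adversarial twist described above is that malware can run the same kind of baseline estimation itself, then keep its own activity inside the band the defender considers normal.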
The efficacy of an ML system can be highly affected by the cleanliness of the data used to train the model. Adversaries can take advantage of this fact through a “poisoning” attack that seeks to inject bad training data to influence the model to learn incorrectly. This can happen in a variety of ways, from generation of fake traffic patterns to poisoning of commercial or open source malware sample datasets.
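A deliberately naive demonstration of the poisoning idea: a classifier that sets its decision threshold between the average "benign" and "malicious" scores seen in training. The classifier, scores and samples are contrived for illustration, but they show how mislabeled training data shifts the decision boundary:

```python
# Sketch of a poisoning attack on a naive learner. The classifier puts its
# decision threshold midway between the mean "benign" and "malicious"
# suspicion scores it sees in training; poisoned samples drag it upward.

def train_threshold(benign, malicious):
    """Learn a cut point between the two classes' average scores."""
    mean_b = sum(benign) / len(benign)
    mean_m = sum(malicious) / len(malicious)
    return (mean_b + mean_m) / 2

clean_benign = [1.0, 2.0, 1.5, 2.5]      # low suspicion scores
clean_malicious = [8.0, 9.0, 8.5, 9.5]   # high suspicion scores

honest = train_threshold(clean_benign, clean_malicious)

# Attacker injects mislabeled "benign" samples with high suspicion scores,
# pulling the learned threshold up so real malware slips underneath it.
poisoned_benign = clean_benign + [7.0, 7.5, 8.0]
poisoned = train_threshold(poisoned_benign, clean_malicious)

sample = 6.0  # a genuinely suspicious file
print(sample > honest)    # True: flagged by the cleanly trained model
print(sample > poisoned)  # False: missed by the poisoned model
```

Real poisoning attacks target far more complex models, but the mechanism is the same: the model faithfully learns whatever the training data says, including the lies.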
Adversaries have been able to leverage ML models designed to prevent false positives as a way to avoid detection. For instance, attackers learned that by embedding certain patterns into malware, they could trick a popular anti-malware product into whitelisting the code (flagging the code as legitimate) even though it was malware.
Another interesting adversarial example is the use of ML to model human communication patterns in order to develop more realistic and effective phishing attacks. The state of the art in natural language processing and natural language generation (OpenAI’s GPT-3, for example) means it may soon become extremely challenging to discriminate between real and synthetic communications.
Deep Reinforcement Learning.
Conventional ML techniques have been applied in cybersecurity with great success, especially in detecting unknown attacks (also referred to as zero-day attacks). These techniques work very well in static, linear environments. However, today’s sophisticated adversary scenarios are dynamic, multi-vector and sequentially non-linear in character. Relying on a cybersecurity system to reactively identify one part of the attack sequence is insufficient.
One of the most exciting topics in ML is Deep Reinforcement Learning (DRL), which couples deep learning techniques (such as convolutional neural networks) with reinforcement learning. This is the core approach behind DeepMind’s AlphaZero breakthrough. The application of DRL to cybersecurity is an important step forward in tackling sophisticated threats.
DRL systems learn somewhat as a human does: they explore their environment (in the case of cybersecurity, an event space) and learn from the feedback and rewards they receive for the actions they take. This more autonomous approach has proven well suited to complex adversarial scenarios, offering superior efficacy, generalizability and adaptability.
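The explore-and-reward loop can be sketched with tabular Q-learning on a toy defense scenario. Everything here is a contrived assumption: a real DRL system would replace the lookup table with a deep network and the three-stage "attack" with a vastly richer event space. The sketch only shows the learning mechanic, acting, observing a reward, and updating value estimates:

```python
# Toy Q-learning sketch of the explore/feedback/reward loop behind DRL.
# The states, actions and rewards are contrived; a real DRL system swaps
# the lookup table for a deep network over a rich event space.
import random

random.seed(0)
STATES = ["recon", "exploit", "exfiltrate"]  # simplified attack stages
ACTIONS = ["monitor", "block"]

def step(state, action):
    """Hypothetical environment: blocking early wastes effort for nothing;
    blocking during exfiltration prevents the damage and earns the reward."""
    if action == "block":
        return None, (10 if state == "exfiltrate" else -1)
    nxt = STATES.index(state) + 1
    return (STATES[nxt] if nxt < len(STATES) else None), 0

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
alpha, gamma, eps = 0.5, 0.9, 0.2  # learning rate, discount, exploration

for _ in range(2000):  # learning episodes
    state = "recon"
    while state is not None:
        # epsilon-greedy: mostly exploit what we know, sometimes explore
        if random.random() < eps:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        nxt, reward = step(state, action)
        future = max(Q[(nxt, a)] for a in ACTIONS) if nxt else 0.0
        Q[(state, action)] += alpha * (reward + gamma * future - Q[(state, action)])
        state = nxt

# The agent learns to keep watching early stages and block at exfiltration.
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in STATES}
print(policy)
```

No one told the agent when to block; the policy emerges purely from rewards, which is exactly the property that makes the approach attractive for dynamic, multi-stage attack sequences.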
It is often the case that the most powerful innovations happen at the intersection between adjacent fields of endeavor. It is an exciting time in both the ML and cybersecurity fields. We are seeing the power of ML being harnessed to drive important innovations in the cyber field — innovations that will ultimately help to keep all of us safer.