Today, a significant amount of IT expenditure goes into detection and perimeter tools, including firewalls, anti-virus, SIEM, and IDS/IPS (intrusion detection/prevention systems). Yet we have seen threats get through these perimeter defenses without being detected. Several recent reports and incidents highlight that most breaches are discovered only after a significant lapse of time. Enterprises need to look beyond their perimeter defenses, and that means looking inward.
Despite the growing variety of defense tools in today’s Cyber Security Operation Centers, we are still seeing threats make their way past all our defenses. Many newer types of defenses (like sandboxing) were effective when they first hit the market, but within a year agile adversaries had learned how to circumvent them, and so the trend continues. Defense tools like SIEMs and anti-virus look for known threats, using signatures that are available through proprietary or commercially available threat feeds. SIEMs also attempt to detect unknown threats using a variety of behavioral rules that have proven to be indicators of compromise. However, SIEMs have fallen short of delivering accurate results because they are designed to analyze limited data over short periods of time. More recently, industry experts and practitioners have come to believe that accuracy lies in collecting more and more data and retaining it for longer periods of time, sometimes for years.
Today's information security solutions generally fall into two main categories:
Analyst driven
Machine learning driven
Analyst-driven solutions rely on rules determined by security or other experts. They usually lead to high rates of undetected attacks, as well as delays between detecting an attack and implementing countermeasures against the attack vector. Moreover, bad actors often figure out the rules and design newer attacks that sidestep the detection mechanisms put in place.
Machine learning that detects rare or anomalous patterns can improve detection of newer attacks. However, it may also trigger more false positives (events that are flagged but are not actually threats), which can themselves require substantial investigation before they are dismissed. This leads to alarm fatigue and distrust, and over time teams revert to analyst-driven systems with their attendant weaknesses.
A solution that properly addresses these challenges must ensure analysts' time is used effectively, detect new and evolving threats and attacks in their early stages, reduce the response time between detection and prevention, and have very few false positives.
Advanced Security Analytics operating on a Security Data Lake, the latest line of defense, is an attempt to accelerate the ability to detect and pinpoint whether the network has been compromised. This type of big-data solution offers key capabilities to help meet this challenge:
Collect and store vast amounts of data from different sources - an earlier blog post describes how to build such a data lake.
Advanced statistical algorithms that can find indicators of compromise based on general anomalies, without knowing beforehand what the anomaly is.
Inclusion of a security analyst in the loop, who can analyze the data and continually adjust the algorithms over time, thereby keeping ahead of adversaries.
The proposed solution synthesizes a couple of approaches:
combines the behavior of different entities within a raw big data set
presents the human security engineer with an extremely small set of events, generated by a machine learning outlier detection system (unsupervised)
collects all the feedback or labels generated by the human engineer about those events
learns and reinforces a supervised model using the feedback
uses the learnt and reinforced model in conjunction with the unsupervised models to predict threats
the above process (steps 1-5) is repeated continuously
In the next couple of sections I will briefly highlight some of the work that was done in building an ensemble learning model for outlier detection. Combining several of these methods to compare and rank outliers faces two constraints: first, outlier scores computed using different methods are hard to interpret and compare; second, it is difficult to weigh confidence because there is no ground truth to compare against.
Let me briefly discuss why outlier or anomaly detection is vital for threat detection. Generally, when a threat is understood, one can devise various methods to identify and defend against that threat vector. It is the threats one has not yet seen in the environment that are hard to defend against. For known threats, one can train supervised learning models. For unknown threats, the signal one has to discern is a deviation from normal behavior, and anomaly and outlier detection are a family of methods one can use to identify it.
Replicator Neural Network (RNN)
Not to be confused with Recurrent Neural Networks (also abbreviated RNN), replicator networks are multi-layer networks that have the same number of input and output neurons, representing the same variables, and that form an implicit compressed model of the data during training. The key idea behind replicator networks (auto-encoders) is a compression-reconstruction analysis: a multilayer network is trained to compress and reconstruct the data in such a way that the bulk of the data is reconstructed accurately, while outliers are not.
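To make the compression-reconstruction idea concrete, here is a minimal sketch of a replicator network in NumPy. It is not the network used in the deployment: the single tanh hidden layer, the learning rate, the epoch count, and the synthetic data are all assumptions chosen for illustration, and the helper names `train_replicator` and `reconstruction_errors` are hypothetical.

```python
import numpy as np

def train_replicator(X, n_hidden=1, epochs=2000, lr=0.05, seed=0):
    """Train a one-hidden-layer replicator network with plain gradient descent."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))  # encoder weights
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.1, size=(n_hidden, n_features))  # decoder weights
    b2 = np.zeros(n_features)
    for _ in range(epochs):
        hidden = np.tanh(X @ W1 + b1)      # compress into the bottleneck
        output = hidden @ W2 + b2          # reconstruct the inputs
        # Gradients of the mean squared reconstruction error.
        grad_output = 2.0 * (output - X) / n_samples
        grad_hidden = (grad_output @ W2.T) * (1.0 - hidden ** 2)
        W2 -= lr * (hidden.T @ grad_output)
        b2 -= lr * grad_output.sum(axis=0)
        W1 -= lr * (X.T @ grad_hidden)
        b1 -= lr * grad_hidden.sum(axis=0)
    return W1, b1, W2, b2

def reconstruction_errors(X, params):
    """Squared reconstruction error per row; larger means more anomalous."""
    W1, b1, W2, b2 = params
    output = np.tanh(X @ W1 + b1) @ W2 + b2
    return np.sum((output - X) ** 2, axis=1)

# Synthetic data (assumption): 200 points near a line in 3-D, plus one point far off it.
rng = np.random.default_rng(1)
t = rng.uniform(-1, 1, size=200)
inliers = np.outer(t, [0.6, 0.8, 0.0]) + 0.01 * rng.normal(size=(200, 3))
X = np.vstack([inliers, [0.0, 0.0, 3.0]])

params = train_replicator(X)
errors = reconstruction_errors(X, params)
suspect = int(np.argmax(errors))  # index of the worst-reconstructed row
```

The bottleneck has only one unit, so the network can describe the dominant pattern in the data but has no capacity left for the stray point. That point ends up with the largest reconstruction error.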
t-digest
First proposed by Ted Dunning, the t-digest algorithm generates a digest of quantiles and stores it in a data structure that has a low memory footprint. The quantiles are generated using a 1-dimensional k-means clustering algorithm. Once the digest is generated and stored, it is used to check which quantile each new incoming data point belongs to; if the point falls above a certain quantile or threshold, it is flagged as an outlier.
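The quantile check can be sketched in a few lines of Python. Note that this is a simplified stand-in, not Ted Dunning's algorithm: it keeps every baseline value in a sorted list and computes exact quantiles, whereas a real digest approximates the same answer with a small set of clustered centroids. The class name `QuantileDigest`, the `is_outlier` helper, and the 0.99 threshold are assumptions for illustration.

```python
from bisect import bisect_right

class QuantileDigest:
    """Exact stand-in for a quantile digest: stores sorted values, not centroids."""

    def __init__(self, values):
        self._sorted = sorted(values)

    def cdf(self, x):
        """Fraction of baseline values that are less than or equal to x."""
        return bisect_right(self._sorted, x) / len(self._sorted)

def is_outlier(digest, x, threshold=0.99):
    """Flag x when it falls above the chosen quantile of the baseline."""
    return digest.cdf(x) > threshold

# Hypothetical baseline: a per-entity metric that took the values 1..100.
digest = QuantileDigest(range(1, 101))
typical = is_outlier(digest, 50)    # sits at the median of the baseline
extreme = is_outlier(digest, 500)   # lies beyond every baseline value
```

Swapping the sorted list for a compact digest changes the memory footprint and makes the quantiles approximate, but the thresholding logic stays the same.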
Principal Component Analysis based outlier detection
PCA-based outlier detection relies on the same concept of compression and reconstruction to identify outliers or anomalies. Matrix decomposition is used to find the principal components (the eigenvectors), onto which the data is projected. Next, these projections are mapped back from the principal components into the original space, and the reconstruction error is used to score the outliers.
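The project-and-reconstruct step can be sketched with NumPy as follows. This is an illustrative sketch under assumptions, not the deployed implementation: it uses SVD of the centered data to obtain the principal directions, and the function name `pca_reconstruction_scores`, the number of components, and the synthetic data are all hypothetical.

```python
import numpy as np

def pca_reconstruction_scores(X, n_components):
    """Score each row by its squared error after projecting onto the top components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    projected = centered @ components.T      # compress
    reconstructed = projected @ components   # map back to the original space
    return np.sum((centered - reconstructed) ** 2, axis=1)

# Synthetic data (assumption): 200 points close to the line y = 2x, plus one point off it.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
inliers = np.column_stack([t, 2 * t + 0.01 * rng.normal(size=200)])
X = np.vstack([inliers, [0.0, 5.0]])

scores = pca_reconstruction_scores(X, n_components=1)
suspect = int(np.argmax(scores))  # index of the row with the largest error
```

Points that follow the dominant correlation are reconstructed almost exactly from a single component. The point that breaks the correlation keeps a large residual, and that residual is its outlier score.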
I alluded earlier in this post that the biggest constraint is combining these outlier scores and presenting them to the analyst, who can then analyze them further and provide feedback to the system. What we have implemented for now is to combine ranks instead of scores. We still list the scores generated by each method so that the analyst can check and compare them. This is what I have seen analysts prefer most: they don't want a black box, but rather a system where they can easily focus and zoom in on the top issues identified by the system. Having all the underlying data, such as the scores, further helps them focus on the critical ones.
We begin by providing a sample of the features or behaviors that are extracted from the raw logs. Note that these behaviors are not monitored for users alone; they can be extended to entities like printers, servers, databases, etc. In one of the deployments we used 42 distinct features, which were determined in conjunction with security analysts. A sample is provided here:
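One simple way to combine ranks rather than raw scores is to average each entity's rank across the methods. The sketch below shows that idea. The averaging rule is an assumption, since the post does not say how the ranks were fused, and the function names and score values are made up for illustration.

```python
def ranks_from_scores(scores):
    """Rank 1 goes to the highest score; ties are broken by input order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

def fuse_ranks(score_table):
    """Average per-method ranks; return entity indices, most anomalous first."""
    per_method = [ranks_from_scores(s) for s in score_table.values()]
    n_entities = len(per_method[0])
    mean_rank = [
        sum(ranks[i] for ranks in per_method) / len(per_method)
        for i in range(n_entities)
    ]
    order = sorted(range(n_entities), key=lambda i: mean_rank[i])
    return order, mean_rank

# Hypothetical scores for four entities; each method uses its own scale.
score_table = {
    "replicator": [0.02, 9.1, 0.5, 0.03],
    "quantile": [0.40, 0.999, 0.97, 0.10],
    "pca": [1.0, 40.0, 55.0, 2.0],
}
order, mean_rank = fuse_ranks(score_table)
top_m = order[:2]  # the short list an analyst would review first
```

Because only the ordering within each method matters, scores on very different scales can be fused without normalizing them. The raw scores can still be shown next to the fused rank for the analyst to inspect.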
Big Data Processing System
The ability to load any data source in the IT infrastructure is a fundamental requirement for successful identification of insider threats. Organizations must be at liberty to use any and every obscure source that may reveal abnormal behavior. Many organizations have internal, home-built systems that are central to user activity, and the logs from these systems can be critical for detecting abnormal behavior. A big data processing system with the following characteristics was deployed:
Loading over 7 billion records daily with peak loads over 20 billion records in a day (without interruption to reporting or other activities)
Scanning trillions of records daily
Loading data from 20,000 Windows Servers daily
Loading data from 1,300 MS SQL Servers and other database servers daily
Loading data from 1,500 IIS web sites deployed on 40+ IIS Servers daily
Outlier Detection System
This system learns a descriptive model of the features extracted from the data via unsupervised learning, using the methods outlined above. To achieve confidence and robustness when detecting rare and extreme events, we fuse the ranks produced by these models and also provide the relative scores they generate. Additionally, a graphical representation shows how far these events are from normal.
Feedback mechanism and continuous learning
This component incorporates analyst input through a user interface. It shows the top "m" outlier events or entities and asks the analyst to judge whether or not they are malicious. This feedback is then fed into the supervised learning module. The feedback frequency and "m", the number of outlier events to show, are configurable and decided by the users.
Supervised Learning module
Given the analyst's feedback, the supervised learning module learns a model that predicts whether a new incoming event is normal or malicious. This constant feedback helps the system fine-tune and improve its prediction model. This is sometimes loosely referred to as reinforcement learning or semi-supervised learning, though it is closest in spirit to active learning.
What I have attempted to showcase in this post is what I have found to be the most effective manner of deploying AI/ML modules and applications. Rather than a black-box approach, I favor a synergistic approach in which AI/ML sifts through large amounts of data, detects patterns, and presents the user with a refined list of items to focus on, while at the same time providing enough insight into why and how it arrived at its conclusions or observations. This is critical, as it improves users' confidence and their ability to trust the outcomes.
Again, I couldn't have done this without the encouragement and insights I gathered from speaking to and watching a large number of experts work on this. Additionally, the vast amount of literature and path-breaking prior work helped in building these systems.