Security Intelligence using Big Data
The growth of the Internet and the explosion of new devices have increased both data volume and security risk for the enterprise. As more devices become connected and Internet-ready, new avenues of attack, and potentially sabotage, open up. Security has evolved from merely providing access control for digital assets to confronting high-tech cyber crimes: espionage, computer intrusion and cyber fraud. CSOs and security professionals can no longer limit themselves to monitoring and fencing their network; they have to constantly monitor diverse sources (social networks, web, email, transaction systems and so on) in order to detect and prevent attacks. For example, spear-phishing uses social connections to send targeted emails to individuals, gather personal information and gain access to the enterprise network. This growth and sophistication in attacks has led to an explosion in the volume, velocity and variety of security data. Defending against such attacks requires collecting and processing data from security systems and beyond, including traditional log and event data as well as network flow data, vulnerability and configuration information, identity context, threat intelligence, social networks, blogs, email and others. At the same time, enterprises need to constantly evolve and respond to their customers, and cannot isolate themselves in an environment that prevents such interactions and open access. The twin challenges of maintaining open access while preventing and mitigating security breaches have led enterprises to look for tools and applications that combat security risks and still allow them to respond to the ever-increasing demands of their customers.
Data growth represents a significant challenge for the enterprise: the increase in volume, variety and variability of data requires techniques and technologies that make handling data at extreme scale affordable. In short, this has become a Big Data problem.
Shortcomings in Current Solution Offerings
Traditional SIEM (security information and event management) systems do not scale to these volumes and are not a cost-effective foundation for such solutions. SIEM systems are generally built on an RDBMS, which does not scale to very large volumes of data. In addition, RDBMS support for data variety (unstructured text, audio/video, semi-structured data and other rich formats) is lacking or non-existent. Log management systems evolved to solve the issue of scale; they are built on top of proprietary MPP databases or open source technologies like Elasticsearch. However, they handle only a restricted set of data, and are unable to store and retrieve other forms of context data from transaction systems or unstructured data from social networks, blogs and similar sources. Preventing and thwarting new and sophisticated attacks requires managing all forms of data and the ability to correlate and trace across these wide-ranging sources and types of data. Therefore, systems are needed that can process large data volume and variety and handle extreme data velocity in a cost-effective manner using commodity hardware. In addition, these systems need to straddle both batch and stream processing, the latter for alerting and near real-time action on system-generated events.
Security Intelligence using Big Data
Security intelligence is the continuous real-time collection, normalization and analysis of data generated by users, applications and infrastructure. It integrates functions that have typically been segregated in first-generation security information and event management (SIEM) solutions, including log management, security event correlation and network activity monitoring. Data collection and analysis go well beyond traditional SIEM, with support not only for logs and events, but also for network flows, user identities and activity, asset profiles and configurations, system and application vulnerabilities, and external threat intelligence within a single warehouse, as illustrated in the figure.
Common Data Repository (CDR) for Security
The CDR forms the foundation for building more advanced applications and systems. Security data is often stored in multiple copies across an enterprise, because every security product collects and stores its own copy. For example, tools working with network traffic (IDS/IPS, forensic tools) monitor, process and store their own copies of the traffic. Network anomaly detection, user scoring, correlation engines and others all need a copy of the data to function. A data lake (or data hub) is a central location where all security data is collected and stored. The concept of a data lake includes data storage and possibly some processing. One of the objectives of a data lake is to run on commodity hardware and storage that is cheaper than special-purpose storage arrays or SANs. Big Data technologies like Flume, Kafka, HDFS and HBase provide a cost-effective way to build one. For example, Flume and Kafka can be used to ingest raw data from all the various sources and write it to HDFS. Later, the raw data is processed and stored in HBase. It is important to highlight that Big Data technologies augment and complement pre-existing SIEM applications where these are already present in the enterprise. There are alternative architectures for integrating a SIEM with a data lake.
The CDR fulfills the need to scale and to store a large variety of security data. These systems are built using open source systems like Hadoop, which address the scale and storage issues but generally lack the ability to respond in real time; they are more batch-oriented. Addressing the real-time, streaming needs, also called “fast data”, requires the ability to respond at the moment a data item or event is generated or consumed.
The figure illustrates one such architecture: a pipeline streams data from its point of origin, the data is first consumed and inspected by the real-time (fast data) systems, and is later collected and stored in the data lake. Machine learning and advanced algorithms are then applied to this data, along with other contextual information, to build models. These models are fed back into the fast data component, which applies them as events are generated, creating a closed-loop system.
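As a rough illustration of the "raw data is processed and stored" step, the sketch below normalizes records from two feeds into one common shape and derives a row key of the kind a wide-column store such as HBase could use. The two input formats, the field names and the `row_key` layout are all hypothetical, chosen only for the example; a real deployment would read from Kafka or HDFS rather than from in-memory strings.

```python
import json

def normalize(source, raw):
    """Map a raw record from a known feed onto one common record shape."""
    if source == "firewall":
        # Hypothetical text format: "<epoch> <src_ip> <dst_ip> <action>"
        ts, src_ip, dst_ip, action = raw.split()
        return {"ts": int(ts), "source": source,
                "src_ip": src_ip, "dst_ip": dst_ip, "detail": action}
    if source == "netflow":
        # Hypothetical JSON format with keys start, sa, da, bytes
        rec = json.loads(raw)
        return {"ts": int(rec["start"]), "source": source,
                "src_ip": rec["sa"], "dst_ip": rec["da"],
                "detail": f'{rec["bytes"]} bytes'}
    raise ValueError(f"unknown source: {source}")

def row_key(record):
    """Assumed key layout: source IP, then zero-padded timestamp,
    so that all events for one host sort together in time order."""
    return f'{record["src_ip"]}|{record["ts"]:010d}'

fw = normalize("firewall", "1700000000 10.0.0.5 8.8.8.8 DENY")
nf = normalize("netflow",
               '{"start": 1700000042, "sa": "10.0.0.5", "da": "10.0.0.9", "bytes": 512}')
store = {row_key(r): r for r in (fw, nf)}  # stands in for the processed table
```

Because both feeds end up under keys that start with the same source IP, a later query for one host can retrieve firewall and netflow events together, which is the point of keeping a single copy in a common repository.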
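The closed loop described here can be sketched in a few lines. This is a minimal illustration, not the architecture in the figure: the "model" is simply the set of resources each user has been seen to access, the data lake is an in-memory list, and the names `build_model`, `FastPath` and `run_pipeline` are invented for the example.

```python
from collections import defaultdict

def build_model(history):
    """Batch step: learn each user's known resources from stored events."""
    seen = defaultdict(set)
    for event in history:
        seen[event["user"]].add(event["resource"])
    return dict(seen)

class FastPath:
    """Streaming step: judge each event against the latest model."""
    def __init__(self):
        self.model = {}

    def update_model(self, model):
        self.model = model

    def inspect(self, event):
        known = self.model.get(event["user"], set())
        return "ok" if event["resource"] in known else "alert"

def run_pipeline(stream, retrain_every):
    data_lake = []          # stands in for HDFS/HBase storage
    fast = FastPath()
    verdicts = []
    for i, event in enumerate(stream, start=1):
        verdicts.append(fast.inspect(event))   # fast path sees the event first
        data_lake.append(event)                # then it lands in the lake
        if i % retrain_every == 0:
            # Batch model is rebuilt and fed back into the fast path.
            fast.update_model(build_model(data_lake))
    return verdicts

events = [
    {"user": "alice", "resource": "db1"},
    {"user": "alice", "resource": "db1"},
    {"user": "alice", "resource": "db1"},
    {"user": "alice", "resource": "hr_records"},
]
print(run_pipeline(events, retrain_every=2))
```

With an empty initial model every early event raises an alert; a real system would start from a model trained on historical data and would retrain on a schedule rather than every few events.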
The inability of traditional solutions to offer low cost and to scale with large volumes of data has left enterprises vulnerable to an ever-changing landscape of cyber attacks. Signature-based malware detection is unable to detect these threats as attackers craft malware to suit a particular web site or enterprise. Enterprises and organizations are also vulnerable to security breaches that emanate from within their network boundary, whether through errant behavior, lost devices or poorly protected endpoint devices. Security professionals are coming around to the idea that preventing every breach is extremely hard, and that a compromise must be acted on in the shortest possible time after it is detected. Effective solutions must be flexible and adapt to the changing threat landscape. They must allow new data sources to be added, offer agility in querying, and refine vast amounts of event data into a manageable set of priority offenses. They must also provide a single view of log data, network traffic and other security telemetry across thousands of systems and resources so that the source and impact of a breach can be assessed rapidly.
Real-time Correlation and Analysis
The data fire hoses generated by user activities and infrastructure need to be monitored, and massive data sets correlated in real time, for earlier and more accurate detection of advanced attacks. Organizations need to continuously monitor their user applications and infrastructure and watch for abnormal behavior. For example, when unlikely application paths are reported in application logs along with access to certain databases or systems, they can be compared with normal access patterns, and an alert can be raised on the deviations observed. This closes the data loop: patterns discovered offline or in batch mode are operationalized so that decisions can be made in real time. Big Data technologies like Storm and Spark enable building such streaming applications that can scale and handle high-velocity data fire hoses.
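One simple way to flag an "unlikely application path" is to count the step-to-step transitions seen in normal sessions and report any transition in a new session that falls below a minimum count. The sketch below assumes a session is a list of page or action names taken from application logs; the function names, the sample sessions and the `min_count` threshold are illustrative only.

```python
from collections import Counter

def learn_transitions(sessions):
    """Count how often each step-to-step transition occurs in normal sessions."""
    counts = Counter()
    for steps in sessions:
        counts.update(zip(steps, steps[1:]))
    return counts

def rare_transitions(session, counts, min_count=1):
    """Return the transitions in a session seen fewer than min_count times."""
    return [t for t in zip(session, session[1:]) if counts[t] < min_count]

normal_sessions = [
    ["login", "search", "checkout"],
    ["login", "search", "search", "checkout"],
]
baseline = learn_transitions(normal_sessions)

suspect = ["login", "admin", "export_db"]
print(rare_transitions(suspect, baseline))
```

In a streaming deployment the baseline would be computed in batch over historical logs and the `rare_transitions` check applied to each session as it arrives.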
Given the large and ever-growing volume of commerce on the Internet, detecting fraud has gained critical importance. Organizations lose substantial revenue to fraudulent claims, account takeovers and invalid transactions. The current systems that monitor for these activities create a large number of false-positive alerts or miss an ongoing fraudulent transaction. Many companies are turning to Big Data to find better solutions. One approach is to gather and store historical data and use it to build better statistical models and baselines against which deviations can be detected. Security teams and fraud investigators need deep access to information, the ability to parse unstructured text to understand discrepancies in customer transactions, claims and other behavior in order to build baselines and user profiles, and the ability to perform linguistic analysis on email and social communications to identify suspicious activities. To support these solutions, systems use advanced machine learning models that continuously learn in order to refine and improve their accuracy. Mahout, Spark MLlib, WEKA and other open source machine learning libraries provide building blocks for deploying such applications. More recent advances in deep learning, which employ multi-layer neural networks to compress and reconstruct the data so that the bulk of the data is reconstructed accurately but outliers are not, have improved fraud and outlier detection accuracy.
Network Security/Netflow Monitoring
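The idea of a statistical baseline can be shown with a deliberately simple model: the mean and standard deviation of an account's past transaction amounts, with a new amount flagged when it lies more than a chosen number of standard deviations from the mean. This is a plain z-score check, not one of the machine learning or deep learning models mentioned above, and the threshold of 3.0 and the sample amounts are assumptions for the example.

```python
import statistics

def build_baseline(amounts):
    """Summarize historical amounts as (mean, sample standard deviation)."""
    return statistics.mean(amounts), statistics.stdev(amounts)

def is_deviation(amount, baseline, threshold=3.0):
    """Flag an amount whose z-score against the baseline exceeds the threshold."""
    mean, std = baseline
    if std == 0:
        return amount != mean   # no spread in history: any change is a deviation
    return abs(amount - mean) / std > threshold

history = [20, 25, 22, 18, 30, 24, 21]      # past amounts for one account
profile = build_baseline(history)

print(is_deviation(26, profile))    # close to past behavior
print(is_deviation(500, profile))   # far outside past behavior
```

Keeping one such profile per account, and recomputing it as new history accumulates, gives the per-user baselines the text describes; richer models replace the z-score but keep the same compare-against-baseline structure.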
DDoS (Distributed Denial of Service) attacks force networks to shut down, leading to economic losses. A typical attack might begin by flooding DNS with requests from zombie hosts; when DNS servers are paralyzed, a very wide range of network services will be blocked or broken. Detecting such attacks requires traffic models: the better the model, the more certain one can be in spotting these attacks. False positives are counterproductive since they penalize legitimate traffic. The traditional method for countering such attacks is to sample the netflow, with sampling performed at a frequency that does not overwhelm the network. Sampling, however, misses low-intensity attacks, which can nevertheless do serious damage. A further challenge is that low-intensity attacks resemble legitimate access requests, so even increased sampling rates fail to detect them. Big Data makes it possible to first capture whole packets and all the traffic needed to build accurate models. This is done without overwhelming the network by copying the network packets into a persistent store. Models are later built that trace the flow of these packets and correlate them to the various service endpoints within the protected network, building a baseline. These models are then deployed with deep packet inspection, which compares live traffic against the baseline to see whether it stays within the various thresholds. The baseline models are updated and refined on an ongoing basis, thereby improving their accuracy.
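A baseline that is "updated and refined on an ongoing basis" can be approximated with an exponentially weighted moving average (EWMA) of a traffic measure, for example DNS requests per interval for one service endpoint. The sketch below is one possible formulation, not the modeling approach of any particular product: `alpha`, `tolerance` and the floor of 1.0 on the standard deviation are assumed values, and anomalous samples are deliberately kept out of the baseline.

```python
class EwmaBaseline:
    """Running mean/variance of a traffic measure with a threshold check."""

    def __init__(self, alpha=0.2, tolerance=3.0):
        self.alpha = alpha          # weight given to the newest sample
        self.tolerance = tolerance  # allowed deviation, in standard deviations
        self.mean = None
        self.var = 0.0

    def observe(self, value):
        """Return True if value breaches the threshold; otherwise fold it
        into the baseline and return False."""
        if self.mean is None:
            self.mean = float(value)    # first sample seeds the baseline
            return False
        deviation = value - self.mean
        limit = self.tolerance * max(self.var ** 0.5, 1.0)
        anomalous = abs(deviation) > limit
        if not anomalous:
            # Standard incremental EWMA update of mean and variance.
            self.mean += self.alpha * deviation
            self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        return anomalous

dns_requests_per_interval = [100, 102, 98, 101, 99, 400, 100]
monitor = EwmaBaseline()
flags = [monitor.observe(n) for n in dns_requests_per_interval]
print(flags)
```

One such monitor per service endpoint gives a baseline per endpoint. Tuning `tolerance` is the trade-off the text points to: too low and legitimate traffic is penalized, too high and low-intensity attacks pass unnoticed.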
Enterprise Events Analytics
Organizations routinely collect terabytes of data for several reasons, including the need for regulatory compliance and forensic analysis. Given the scale and size, traditional systems can barely store this data, much less do anything useful with it. Big Data provides a means of storing this data and scaling linearly, and enables building analytical algorithms that can harness petabytes of data. For example, identifying botnets requires building large graphs that model the netflow between various hosts. Advanced algorithms like PageRank and clustering can then be applied to these graphs to identify command-and-control channels and hosts. Another example is identifying malicious hosts, which can be done by seeding the graphs with ground truth from blacklists and whitelists of IP addresses. This trust and non-trust information is then propagated through the network using a specific version of the PageRank algorithm called TrustRank.
A big data security analytics solution must ingest data from a large number and variety of security data feeds within the enterprise, as well as unstructured and structured data from inside and outside the enterprise. It must also adapt to the changing threat landscape, provide a holistic view of the enterprise environment and drive actionable intelligence to protect against both known and unknown threats. As with any new technology, it is important to understand what Big Data can and cannot do, to have experience in using it effectively, and to understand how related Big Data technologies can complement one another. More recently, there have been efforts to publish reference architectures and open source implementations, such as Apache Metron and Apache Spot, that offer prepackaged components and give enterprises a starting point around which to build their security applications. These are community-driven projects that enable cross-pollination of experience and improve the implementations.
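The trust propagation step can be sketched as a personalized PageRank in which the random jump always returns to the whitelisted seed hosts, which is the core of TrustRank. The example covers only propagation of trust from a whitelist; propagation of distrust from a blacklist is not shown. The host names, the graph and the damping factor of 0.85 are assumptions, and `trusted` must be non-empty.

```python
def trust_rank(graph, trusted, damping=0.85, iterations=50):
    """Propagate trust from seed hosts along directed flow edges.

    graph maps each host to the list of hosts it sends traffic to.
    """
    nodes = set(graph) | {dst for targets in graph.values() for dst in targets}
    # Teleport distribution: all jump probability goes to the trusted seeds.
    seed = {n: (1.0 / len(trusted) if n in trusted else 0.0) for n in nodes}
    score = dict(seed)
    for _ in range(iterations):
        nxt = {n: (1 - damping) * seed[n] for n in nodes}
        for src in nodes:
            targets = graph.get(src, [])
            if targets:
                share = damping * score[src] / len(targets)
                for dst in targets:
                    nxt[dst] += share
            else:
                # Host with no outgoing flows: hand its trust back to the seeds.
                for n in trusted:
                    nxt[n] += damping * score[src] / len(trusted)
        score = nxt
    return score

flows = {
    "dns1": ["web1", "mail1"],
    "web1": ["dns1"],
    "mail1": ["dns1"],
    "bot1": ["c2"],      # isolated pair with no flow from any trusted host
    "c2": ["bot1"],
}
scores = trust_rank(flows, trusted={"dns1"})
for host in sorted(scores, key=scores.get, reverse=True):
    print(host, round(scores[host], 3))
```

Hosts that no trusted host ever talks to, directly or indirectly, end with a score of zero and become candidates for closer inspection. At enterprise scale the same iteration would run on a distributed graph engine rather than in a single Python process.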