Reference Architecture for implementing big data systems
Many enterprises have begun to realize the potential and benefits of big data. A high-level view of the reference architecture provides good background on how big data complements existing analytics, storage systems, databases and other IT systems. The architecture is not a one-size-fits-all approach. Each component in this architecture could be realized using several alternative technologies, each with its own advantages and disadvantages depending on the workload and other factors. The attempt here is to provide a reference architecture that captures many of the common functions and patterns that have been published, and also those observed during my past implementations. Many large enterprises begin with a subset of these functions and components and expand to implement others as they realize benefit and value.
The reference architecture attempts to organize and map software technologies and components according to functions generally reported, and observed, in implementing "big data" applications. Furthermore, the architecture serves as a guideline for thinking and communicating about big data applications, and offers some guidance while designing such applications. In its principal structure and its idea of a data pipeline, the reference architecture may look similar to a typical data warehousing architecture, where functionality is also broken down into different phases or stages through which data sequentially passes. But there are very important differences. In the big data reference architecture, the processing pipeline is loosely coupled. The sequence in traditional data warehouse architectures is strict, and data is typically persisted after each step; in big data applications several of these steps can be skipped, and given the data volume and velocity it does not always make sense to frequently persist data. Moreover, traditional data warehouse architectures do not include stream processing or the low-latency analytics that are critical for big data applications.

There are also differences within individual components across the two architectures. The data extraction and storage components, for example, need to consider unstructured data, which is not the case in traditional data warehouse implementations. The information extraction component is new: it specifically aims at handling unstructured data and is not commonly found in data warehouses. In designing this architecture, however, an evolutionary approach is taken rather than a revolutionary one – leverage processes and techniques that have been perfected over the years and augment them for the new needs and technological innovations.
Batch Data Acquisition
Batch data refers to data that is a snapshot and of finite size, although the size could be very large. Data residing in databases, data dumps on file systems, and data feeds delivered at a recurring frequency are some examples that characterize this mode of acquisition. Most of the data in enterprises resides in relational databases. Commercial ETL vendors now offer connectors that move data from databases and other sources into the big data environment, and there are open source tools like Sqoop (native to the Hadoop environment) that leverage the big data cluster's capacity to parallelize data extraction and transportation. Commercial vendors are also moving in that direction, generating code or harnessing cluster capacity by deploying runtime components that partition and parallelize work for efficient usage of the cluster. The downside of Sqoop is that it cannot apply transformations while acquiring data; in scenarios where such transformations are required, they need to be performed as a pre-processing step within the source database, or as a post-processing step at the destination.
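The parallel extraction just described works by splitting the range of a key column across concurrent tasks (the idea behind Sqoop's split-by column). A minimal sketch, assuming an integer primary key:

```python
def split_key_range(min_key, max_key, num_splits):
    """Partition the inclusive key range [min_key, max_key] into
    contiguous sub-ranges, one per parallel extraction task.
    Each (lo, hi) pair would become a WHERE id BETWEEN lo AND hi query."""
    total = max_key - min_key + 1
    base, rem = divmod(total, num_splits)
    splits, lo = [], min_key
    for i in range(num_splits):
        size = base + (1 if i < rem else 0)  # spread the remainder evenly
        hi = lo + size - 1
        splits.append((lo, hi))
        lo = hi + 1
    return splits
```

In practice the engine first queries `SELECT MIN(id), MAX(id)` from the source table, then assigns one sub-range per mapper.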
Stream Data Acquisition
Data in motion that creates an ongoing stream of data is commonly referred to as a "data stream". These streams do not have a finite end. Once emitted or transmitted, they cannot be regenerated from their source. Typically, they are time sensitive and temporal in nature: navigation clicks from a visitor on a web page, commands from millions of players in a massively multiplayer internet game, and sensory data from patient monitoring systems all exemplify data in motion. Analysis and actions need to be performed in short order, as the data is streaming in, close to event time, for them to be effective. Data velocity is a function of these streams – the number of simultaneous streams and their rate of arrival characterize speed and volume. Acquisition and timeliness are the two critical factors to satisfy while architecting such systems. Typically, data streams have been unidirectional – most deployments acquire data from the source and analyze it without communicating changes or decisions back in a closed loop. Agents sit on the perimeter of the collection data center, enabling ease of management, configuration and security enforcement. Communication between sources and agents at the edge servers uses secure transport protocols. The advent of IoT and intelligent agents at the edge requires bi-directional streaming capabilities. Intelligent systems and IoT generally have an agent deployed on source machines or in very close proximity to them. These agents need to acquire data from the source and also communicate back to effect change in a closed loop. In these deployments, the agent and the collection infrastructure must provide bi-directional communication and the ability to manage agents remotely, separated by long geographical distances.
Data acquisition pipelines in practice are complex; they include branching and joining of different pipelines depending on data conditions and types – data filtering provides such functionality. During data acquisition it is often the case that only parts of the data from a source are needed or valuable. In these cases it is reasonable to have a mechanism that filters out unnecessary data – rules and patterns are one way of identifying such data.
These functions can be implemented as part of the acquisition process. For example, in the case of a massively multiplayer internet game, not every click or command needs to be recorded, only those that cause a state change. In the case of detecting fraud in a credit card transaction, if an initial inspection reveals fraud, the transaction is flagged and routed to a different handler – filtering provides the mechanism to shepherd and route data.
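The filter-and-route behavior above can be sketched as a small rule table; the pipeline names, event fields and predicates here are hypothetical:

```python
def route_event(event, rules):
    """Return the name of the first pipeline whose predicate matches,
    or None to drop the event entirely (filtering)."""
    for pipeline, predicate in rules:
        if predicate(event):
            return pipeline
    return None

# Illustrative rules: flagged fraud goes to a dedicated handler;
# game commands are only kept when they change state.
rules = [
    ("fraud_handler", lambda e: e.get("flagged_fraud")),
    ("state_changes", lambda e: e.get("state") != e.get("prev_state")),
]
```

A real acquisition layer would evaluate such rules inline on the stream, before any data is persisted.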
Extract Transform & Load (ETL/ELT)
ETL is a process that subsumes several other sub-processes. Note that the data warehouse, with which ETL has traditionally been associated, includes data acquisition (the Extract in ETL refers to that process); we have separated acquisition and treat it as an independent function, so it will not be discussed further in this section. A "data lake" – a single repository of all data assets in an enterprise, and a common use case for big data applications – requires only a subset of the functions of a traditional ETL process. But the distinction between a data lake and a data warehouse is blurring, and we see them converging in the manner they use ETL processes. For example, data lake applications impose a structure on the data while reading/consuming it – this implies the application is performing an ETL on demand and on the fly, including data cleaning and some form of data quality checks.
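The schema-on-read pattern described above can be sketched as applying a field map at consumption time; the field names and type casters below are purely illustrative:

```python
import json

def read_with_schema(raw_lines, schema):
    """Impose structure at read time: parse each raw JSON line from the
    lake and coerce only the fields the consuming application needs."""
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # on-the-fly quality check: skip unparseable rows
        yield {field: caster(record.get(field)) for field, caster in schema.items()}

# Hypothetical schema the consumer imposes while reading
schema = {"user_id": str, "amount": lambda v: float(v or 0)}
```

The raw lines stay untouched in storage; a different consumer can impose a different schema on the same data.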
Information extraction aims at all functionality to extract information from semi-structured and unstructured content and thereby impose structure on it, that is, to store the extracted information in structured form. This can include simple parsing and structure extraction from semi-structured sources, e.g. XML or HTML pages. It also refers to using more complex techniques like text analytics and natural language processing for classification, entity recognition and relation extraction from unstructured data. Using ontologies, e.g. expressed in OWL, can also be helpful, for example for identifying and classifying entities, while the other way round, information extraction can be used to populate ontologies.
Data variety is a critical characteristic of big data applications. Variety refers to different data types – structured, semi-structured, unstructured, audio, video, images – from heterogeneous sources: flat files, legacy databases, relational databases, CMS (content management systems), sensors and others. Typically, there is a need to manage and apply different adapters for extracting relevant information and metadata. Both commercial and open source ETL engines are mature and come with several adapters for data extraction. It is worth spending a few lines explaining what extraction means: typically, it is used as a means of "pulling" specific column values from a structured table. But now, with a large percentage of data also being unstructured (tweets, posts, documents, and others), extraction also refers to identifying entities, resolving them, and establishing relationships between them.
Data is integrated from different sources so that it can be queried and analyzed together, to create overarching insights that are not available from analyzing individual sources in isolation. This refers to integrating multi-structured data after the necessary data extraction has been done as a preprocessing step. The process includes schema integration – defining a general, overarching schema and mapping the different data sources to it – and mappings and transformations at the field level, e.g. splitting a 'name' field into two fields 'first name' and 'surname', or calculating a 'salary after taxes' field from 'salary before taxes'.
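The field-level mappings above can be sketched as a simple record transformation; the flat 30% tax rate is illustrative only:

```python
def transform(record):
    """Map a source record onto the integrated schema: split 'name'
    into first name and surname, and derive salary after taxes
    (the flat 30% rate is a placeholder, not a real rule)."""
    first, _, surname = record["name"].partition(" ")
    return {
        "first_name": first,
        "surname": surname,
        "salary_after_taxes": round(record["salary_before_taxes"] * 0.7, 2),
    }
```

In an ETL engine, such mappings are usually declared once per source and applied to every incoming record.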
There is also dependency between schema integration and metadata management. Schema integration requires high quality metadata and information about the schemas of the different data sources to finally map them onto an integrated one.
It is important to note that it may be necessary to allow data analysts to access the raw source data, or raw data may need to be stored in the system in the same form as it was initially extracted from the sources (the data lake). The decision is then whether to build a virtual or a persisted integrated schema. The former can be queried directly: the query is transparently transformed into sub-queries against the different data sources based on predefined mapping and transformation rules, or applied to the local "data lake". This is a trade-off between reduced storage needs (virtual) and a performance gain (persisted) for all analysis tasks that can run on the integrated schema.
Entity recognition and extraction of the relationships between entities are generally performed together. Entity recognition is used to identify unique entities in both structured and unstructured data, although it is most often cited in the context of unstructured data processing. This includes entity matching to merge identical entities, which needs to consider synonyms – different names referring to the same entity, e.g. due to term evolution – and homonyms – different entities sharing the same name. An entity recognition component further needs to categorize entities, e.g. as persons, products, corporations, profit amounts, etc. This is easier if domain knowledge can be used and the categorization is based on a set of known, named entity types. It is harder if general entities need to be extracted whose possible types are not known in advance. Entity recognition is the first step; it is then also possible to extract facts or relations between entities. These facts can either be attributes and attribute values, e.g. the age of a person or the profit of an enterprise, or general relationships between entities, e.g. a person 'is born' in a city, where both the person and the city are extracted entities. Relation extraction is often limited to contexts where possible entity types are named and known in advance, as the extraction of attributes necessitates knowledge about which attributes can be present. For the task of extracting general relationships it is especially useful to use ontologies that specify possible relationships between entity types. There is interdependence between extracting attributes for an entity and the entity-matching task during entity recognition, as identical or similar attribute values are a strong indication that the entities are identical as well.
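A minimal sketch of the entity-matching idea above, combining a synonym table with attribute agreement (the synonym table, record shapes and example names are all assumptions for illustration):

```python
def match_entities(a, b, synonyms):
    """Treat two records as the same entity when their names match
    directly or via a synonym table, and every shared attribute agrees
    (similar attribute values reinforce a name match)."""
    name_a, name_b = a["name"].lower(), b["name"].lower()
    names_match = (name_a == name_b
                   or synonyms.get(name_a) == name_b
                   or synonyms.get(name_b) == name_a)
    shared = (set(a) & set(b)) - {"name"}
    attrs_agree = all(a[k] == b[k] for k in shared)
    return names_match and attrs_agree

# Hypothetical synonym table handling term evolution / alternate names
synonyms = {"ibm": "international business machines"}
```

Real entity matchers use fuzzy string similarity and weighted attribute scores rather than exact equality, but the structure is the same.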
Data cleaning normalizes common data elements to the standards instituted by the business, applies data rules and constraints, validates, handles missing values, and removes duplicates, among other tasks. It partially refers to correcting errors in the data, completing empty attribute values and optionally identifying and eliminating noise and outliers. It therefore addresses data quality problems that originate from a single data set, in contrast to data integration, which refers to harmonizing different data sources and handling quality issues and inconsistencies that occur when combining data from different sources. Data cleaning techniques can generally be classified as data scrubbing, that is, using domain knowledge and rules to identify and correct errors (e.g. defining a threshold for an attribute value), and data auditing, that is, using statistical, data mining and machine learning techniques to do the same. A simple example of the latter is to use the average of an attribute value and identify tuples that exceed a tolerance interval around that average as outliers. One can also distinguish several sub-functions: value completion, outlier detection & smoothing, duplicate filtering and inconsistency correction. Note, however, that data can only be cleaned to a certain extent, and the resulting data quality is still largely dependent on the quality of the source data.
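The data-auditing example above – flagging tuples outside a tolerance interval around the average – can be sketched as:

```python
def flag_outliers(values, tolerance=1.5):
    """Flag values lying more than `tolerance` standard deviations from
    the mean. The 1.5 default is arbitrary; real pipelines tune it per
    attribute, often with robust statistics like the median instead."""
    n = len(values)
    mean = sum(values) / n
    stdev = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [v for v in values if abs(v - mean) > tolerance * stdev]
```

Flagged tuples can then be smoothed, corrected via domain rules, or routed for manual review.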
Another part of managing data quality is giving users an estimate of the credibility of the data and of the confidence in the results. This is related to metadata management (see below), mainly provenance tracking: presenting users with the necessary information and calculating a trustworthiness score. This cannot happen as a separate function within the data processing pipeline, however – at that point the processing is simply not complete, and information from processing steps further down the pipeline also needs to be incorporated into the score. It therefore needs to be integrated into other functions, namely metadata management (provenance tracking) and data analysis, as well as the user interface (calculating and presenting the trustworthiness score).
Value created from data mainly originates from the analysis tasks conducted over the data. Data analysis can broadly be classified into three categories. The first category comprises classical business reporting functionality. This is very similar to traditional data warehousing and refers to predefined, pre-calculated and often static reports and dashboards, as well as classical OLAP. The second category is ad-hoc or interactive processing for power users to dig into the available data and conduct deeper analysis. The third category is performing ad-hoc analytics or querying based on a declarative language, scripts, or an analytics package, e.g. R or SAS. This analysis is conducted ad hoc and interactively, and therefore requires low-latency response times.
Over and above those described, deep analytics – complex analysis tasks over large amounts of, or all of, the available data – forms a separate category. Typical examples include machine learning model generation and other data science tasks. These typically take a lot of time and are iterative: produce, inspect, tweak and validate cycles.
While the data analysis requirements mentioned above mainly describe direct functionality, several usability requirements can also be considered that allow users to interact with the system more effectively.
First, the use of visualization techniques makes it easier for users to interpret results; these are typically prominent in dashboards and reports. They directly support business reporting, but they can also be applied to ad-hoc analytics results, e.g. visualization functionality embedded in analytics packages, and even to deep analytics results. Second, developing and providing appropriate programming models and frameworks enhances the productivity of programming analysis tasks within the system. This directly supports the deep analytics tasks described earlier, as those are typically developed on demand and need to be changed when analysis requirements change.
Another approach, probably the most widespread, is the use of declarative languages to formulate ad-hoc queries. The obvious example is SQL, but there are also new languages developed within the big data ecosystem, e.g. Pig. Declarative languages are in most cases also highly optimized, providing a level of optimization that requires a lot of effort to achieve in non-declarative languages. Finally, integration and use of analytics packages or libraries often provides useful functionality, e.g. machine learning algorithms. An example is Apache Mahout. This supports programmer productivity, and these packages can often also be used to quickly formulate ad-hoc analyses.
Another requirement, tightly connected with ad-hoc analytics, is support for experimentation. Working with data and identifying or optimizing appropriate analysis techniques, and the parameterization of those techniques, requires experimentation by the data analysts. Data from new sources might need to be experimented with before deciding whether to acquire it on a regular basis. A sandbox approach, which allows analysts and programmers to 'play' with data in an isolated area, should be supported.
A common pattern observed in many big data applications is counting of certain events (e.g. count of unique users, number of failed logins per day by a user), finding averages, and so on; pre-computing these across petabytes of data improves performance, as these computations are performed most frequently. Frequent access patterns or periodic reports can be materialized and persisted – either as part of an offline process or as part of the ETL process – which allows the distributed query engines to work with a relatively smaller set of data.
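Pre-computing a frequent aggregate, such as failed logins per user per day, can be sketched as an offline rollup; the event field names and timestamp format are assumptions:

```python
from collections import Counter

def precompute_daily_failed_logins(events):
    """Materialize a (user, day) -> count rollup offline, so query
    engines scan the small aggregate instead of the raw event log."""
    counts = Counter()
    for e in events:
        if e["type"] == "login_failed":
            counts[(e["user"], e["ts"][:10])] += 1  # ts assumed ISO-8601
    return counts
```

The resulting table is tiny relative to the raw events and can be refreshed on each ETL run.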
Features represent attributes within data sets and streams. Many of these provide valuable information with regard to activities and are useful in building machine learning models – but some are also noise. The process of feature engineering is to separate the two and find the features that are useful for building models. This process is also termed feature selection.
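One minimal illustration of feature selection is dropping near-constant columns via a variance threshold (one of many possible selection criteria; the threshold and column names are illustrative):

```python
def select_features(rows, min_variance=0.01):
    """Keep only columns whose variance across rows exceeds a threshold;
    near-constant features carry little signal for model building."""
    if not rows:
        return []
    keep = []
    for col in rows[0]:
        vals = [r[col] for r in rows]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        if var > min_variance:
            keep.append(col)
    return keep
```

Production feature selection also weighs correlation with the target, mutual information, and model-based importance scores.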
Guided Ad-hoc analysis
Visual systems guide non-technical users to interact with data, define queries and generate results. Guided ad-hoc analysis (e.g. traditional OLAP approaches) and data discovery & search lie somewhere between fixed reporting and free ad-hoc analysis, as both allow different levels of free navigation while typically limiting the definition of one's own computations.
Note that these distinctions can be fuzzy, with dashboards gaining capabilities for navigation and filtering, moving them closer to guided ad-hoc analysis. The distinction between free and guided ad-hoc analysis is itself fuzzy, depending on where the line is drawn with respect to freedom and query flexibility. Furthermore, free ad-hoc analysis also allows for statistical and predictive analysis tools, sharing similar techniques with deep analytics. There the differences depend very much on the available hardware performance, scalability and amount of data involved, which dictate the timeliness of the analysis – either requiring batch computation or allowing interactive analysis. One characteristic is that users during ad-hoc analysis often need to rely on sampling the available data, while deep analytics can incorporate all available data, with the possible size of the sample and input data depending on the complexity of the computation (as well as on the availability of hardware and the performance and scalability of the underlying system).
Free Ad-hoc analysis
Free Ad-hoc Analysis differentiates itself from guided ad-hoc analysis by the level of flexibility it provides to its users. While they are granted a higher level of freedom, this also means, these tools require a higher level of skill and knowledge about underlying data structure, models and a larger time effort to sift through the data. In exchange for this larger skill and time requirement, users can freely query available data sources usually based on directly formulating queries in some declarative language (e.g. SQL, PigLatin or HiveQL). Tools for advanced or predictive analytics using statistical and data mining techniques (e.g. SAS or SPSS) also fall into this category. The user communicates with these tools interactively, formulating a query and getting rapid responses, allowing experimentation and analysis.
The ad-hoc query function enables end-users to query the system, either from a BI application or from any web browser. This component provides uniform access to data resident in any persisted storage. It exposes a set of standard APIs that can be consumed by commercially available BI tools such as Tableau, a BI tool that provides rich visualization and a GUI for users to query and analyze data.
A very important and critical requirement is to enable low-latency querying over large data and to scale with increases in data volume. Distributed query engines – PrestoDB, Hive and Impala – along with the Parquet file format and Elasticsearch are enablers of low-latency response. For example, PrestoDB, an open source distributed query engine, distributes the processing of data across the nodes in the cluster and performs operations in memory, unlike typical MapReduce jobs, which are better suited to batch-oriented processing. Additionally, when PrestoDB reads data from Parquet files, they are optimized for fast scans, and it reads only those columns that are required, as the physical files are organized in columnar form rather than row-oriented as is typical in an RDBMS. Moreover, joins across tables are performed in memory and distributed across different data partitions on the nodes, thereby harnessing each node's I/O, memory and computing power. Finally, a salient feature of the engine is that it is based on a shared-nothing architecture, which eliminates expensive concurrency issues typically encountered in such systems. Hive and Impala have fundamentally similar internal architectures – in practice, implementations often choose multiple query engines and use them depending on the type of workload.
Information retrieval systems like Elasticsearch and Apache SolrCloud store data in a form called an "inverted index", which provides the capability to perform free-text search (no SQL). For example, typical doctor's notes are in free-form text; the ability to search on certain keywords or phrases enables an analyst to narrow down and explore relevant data. Or a marketing specialist could be searching text from tweets and web-page comments before building content for their next campaign.
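The inverted index underlying such engines maps each keyword to the documents containing it. A toy construction (real engines add tokenization, stemming and relevance scoring):

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each keyword to the set of document ids it appears in –
    the core structure behind free-text search."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

# Hypothetical doctor's-notes corpus
docs = {1: "patient reports chest pain", 2: "no chest abnormality noted"}
```

A keyword query then becomes a set lookup (and multi-term queries, set intersections) instead of a scan over every document.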
Data Discovery and Search
Data discovery & search aims less at analyzing data for new insights and more at discovering and sifting through available data (structured and unstructured) and possible data sources. It is largely based on metadata, to provide users with an overview of what data is available and to help them decide which data to use, e.g. for free ad-hoc analysis, and how to sample it. It also includes functionality to search for specific data items or documents utilizing free-text search.
Reporting And Dashboard
Reporting & Dashboard describes the definition of fixed reports and dashboards, which largely limit the navigation and flexibility of end-users. Dashboards additionally incorporate visualization techniques, e.g. traffic lights and diagrams, to allow easier and faster grasping of the presented data and key figures. They are often used by executive personnel or general employees who have little time to spend and often lack the lower-level skills for sifting through data, but need a fast and comprehensive overview of available information, e.g. the company's performance in different areas. Reporting & Dashboard often relies on pre-calculated values or incorporates deep analytics results, but can also compute simple key figures on the fly, especially if those key figures are only needed for a single report.
Beyond ad-hoc and interactive queries, rich historical data is available to be mined and analyzed. These tasks require data pre-processing, building statistical and machine learning models, and validation and verification. Deep analytics tasks typically compute and store insights, rules, models, or extracted relationships between data items, which are later utilized and navigated using end-user-facing analysis functions, e.g. reporting and ad-hoc analysis. Results can also be incorporated into other tasks in the processing pipeline, e.g. by incorporating mined classification models into value correction during data cleaning.
Stream data analytics
Stream data refers to activity that is in progress or has just occurred within the enterprise. Analytics performed on a short time window of such streaming data are generally referred to as streaming analytics. There is a spectrum of analytics that could be performed, from simple (identifying a specific piece of metadata or datum in a stream, for example the source type of a log message) to complex, like detecting anomalies or fraud in a real-time transaction. It is important to note that these analytics often work in conjunction with the artifacts generated by deep analytics processes: a statistical model created during the offline process is applied to generate a result or an action. Response requirements for streaming systems are demanding – they are in the sub-second range – so these systems are optimized for such activity.
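A simple streaming analytic – flagging values far above a sliding-window average – can be sketched as below. The window size and factor are illustrative; in a real deployment the thresholds or model would typically come from the offline deep-analytics process described above:

```python
from collections import deque

class WindowedAnomalyDetector:
    """Keep the last `size` observations and flag a new value as
    anomalous when it exceeds `factor` times the window average."""
    def __init__(self, size=100, factor=3.0):
        self.window = deque(maxlen=size)
        self.factor = factor

    def observe(self, value):
        # Compare against the current window before adding the new value
        anomalous = bool(self.window) and \
            value > self.factor * (sum(self.window) / len(self.window))
        self.window.append(value)
        return anomalous
```

Because state is bounded to the window, each event is scored in constant time, which is what keeps latency in the sub-second range.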
The best practice is to place the component as close as possible to, and ideally within, the operational system that processes the data in the first place and is responsible for handling the transactions, e.g. business transactions. The need to communicate with another system over a network is detrimental to applying analysis results in real time and reacting to them, that is, adjusting the operational processing of the respective transaction or just giving feedback to the user. Therefore, the stream analysis component should be placed within the operational application and integrated with it where possible, rendering it an external component from the viewpoint of the 'big data' system. However, this is not always possible, as there are use cases where data from several sources needs to be streamed in, the data streams need to be joined and analyzed in combination, and results need to be streamed back to different operational applications.
Note that these can be identical to the streaming data sources, but need not be. In these cases it is necessary to have a central stream analysis component, which can be placed within the 'big data' system. Streaming data then gets acquired as before, and the stream acquisition components directly forward the acquired data streams into the stream analysis component.
In both cases – placing the stream analysis component within the data source, or placing it within the 'big data' system with several data streams flowing in – it has several interfacing points with other components of the 'big data' system. First, models, rules and intermediate results created by deep analytics tasks can be made available and sent to the stream analysis component as required, where they can be joined with and applied to the data streams; again, this follows the principle of pre-computing intermediate results. Second, results from the stream analysis component can be written to a staging area or temporary cache to be joined back with the actual streaming data from the stream acquisition component. Third, stream analysis can lead to alerts for human actors; it can make sense to send these alerts and integrate them into reports and dashboards. Fourth, streaming analysis results can be made available through the data access API, to which applications outside the 'big data' system can subscribe to receive the result stream.
Edge Services (Services Layer)
The data distribution component contains all functionality that aims at distributing and making data and analysis results available either to human users through a user interface or to other applications that further work with them through an application interface.
One or several user interfaces are used for visualization and presentation of data and results to end-users. This can include, for example, portals or distributing reports as PDFs by e-mail. It decouples presentation, which is typically done at a client, from analytical functionality, which is done closer to the data in an application server layer or even in-database.
Application Programming Interface (API)
Data stored in the big data system, and analyses thereof, are often exposed via APIs that other applications can leverage and integrate. This includes technical interfaces to access data, such as well-defined APIs, or message passing via message brokers. Additionally, upstream data warehouses and data marts periodically pull data, or data is pushed to them. Note that the data-consuming applications can very well be identical to data sources, in which case the communication represents a feedback or closed loop.
Big data applications exist as part of an operational enterprise IT infrastructure. Legacy and operational systems are often consumers of the analytics and data generated by big data applications. The enterprise data warehouse, which often serves operational BI, offloads its data processing tasks to the big data clusters; the offline- and batch-processed data is then pushed into upstream data warehouses and data marts. Similarly, enterprise security infrastructure like LDAP and AD (Active Directory) is integrated by configuring the security services to sync groups and permissions, enabling enforcement of a unified and centralized security policy.
Data Storage & Workflow Manager
Functionally, data storage needs to provide the ability to store all forms of data reliably. Furthermore, it needs to expose functions to perform CRUD operations, archive, and restore, among others. Big data applications require multiple data stores depending on data type and workload characteristics. Traditional databases, which provide great performance for OLTP-type applications, are generally not suitable when there are varied data types and large volumes to contend with. Moreover, they do not scale as data volume increases. The requirements for storing large amounts of historical data, with a progressive increase in volume, call for systems that can scale and can also query this increasing volume of data efficiently. HBase and Kudu are NoSQL technology choices based on columnar storage, which can efficiently store and retrieve very large sets of data and handle data growth. HBase provides high-throughput write performance and is suitable for the types of transactions encountered within big data applications – for example, keeping a count of unique visitors to a web site, or of unique IP addresses seen in a given time window.
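The unique-visitor example can be sketched as per-time-bucket distinct sets; at scale, an HBase row per bucket (or a space-saving sketch such as HyperLogLog) would replace the in-memory sets used here for illustration:

```python
from collections import defaultdict

class UniqueCounter:
    """Count distinct IPs per time bucket. Exact sets are shown for
    clarity; production systems substitute probabilistic sketches."""
    def __init__(self):
        self.buckets = defaultdict(set)

    def record(self, ip, bucket):
        self.buckets[bucket].add(ip)  # duplicate IPs are absorbed by the set

    def unique_count(self, bucket):
        return len(self.buckets[bucket])
```

The write path is a cheap idempotent insert, which matches the high-throughput write profile HBase is chosen for.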
A common access pattern observed in big data applications is free-form search by business users and others who are not tech-savvy enough to craft SQL statements, or who want to explore the data sets in the repository. To provide such capabilities, a storage engine that stores data as inverted indices (indexes of keywords pointing to the records or documents they appear in) enables users to issue free-form textual queries.
A large number of applications and processes in the big data ecosystem are batch in nature. They require coordination, sequencing and inter-dependencies between them. A workflow manager provides these capabilities: linking of jobs/processes, triggers, scheduling and error handling.
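At its core, a workflow manager treats jobs as a dependency graph and executes them in topological order, stopping (or retrying/alerting) on failure. A minimal sketch of that sequencing, using Python's standard-library `graphlib` (the job names and callables are illustrative):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+


def run_workflow(jobs: dict, actions: dict) -> list:
    """jobs: job name -> set of prerequisite job names.
    actions: job name -> zero-argument callable to execute.
    Runs each job only after all of its dependencies, the way a
    batch workflow manager sequences inter-dependent jobs."""
    executed = []
    for job in TopologicalSorter(jobs).static_order():
        try:
            actions[job]()
            executed.append(job)
        except Exception:
            # A real workflow manager would retry, alert, or branch
            # to an error-handling path here.
            break
    return executed
```

A chain like load → clean → report is then just a dict of prerequisites; the sorter guarantees no job runs before its inputs exist.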
Metadata management refers to the extraction, creation, storage and management of structural, process and operational metadata. The first type describes the structure, schema and formats of the stored data and its containers, e.g. tables. Process metadata describes the different data processing and transformation steps, data cleaning, integration and analysis techniques. Operational metadata describes the provenance of each data item: when and from which sources it was extracted, which processing steps have been conducted and which are still to follow. Structural and process metadata need to be provided for all data structures and steps along the processing pipeline, and operational metadata needs to be collected accordingly. Metadata management is therefore a vertical function consisting of metadata extraction, metadata storage, provenance tracking and metadata access. Cataloguing all the activities where data is used or consumed gives users the confidence to rely on this data when making decisions.
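The three metadata types can be pictured as simple records in a catalog keyed by dataset. This is a hypothetical sketch of such a store (the class and field names are assumptions chosen to mirror the text, not a real catalog product's schema):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StructuralMetadata:
    dataset: str
    schema: dict   # column name -> type
    fmt: str       # storage format, e.g. "parquet"


@dataclass
class OperationalMetadata:
    dataset: str
    source: str
    extracted_at: str
    steps_done: List[str] = field(default_factory=list)
    steps_pending: List[str] = field(default_factory=list)


class MetadataCatalog:
    """Minimal vertical metadata store: register and look up entries
    by (metadata kind, dataset name)."""

    def __init__(self):
        self._entries = {}

    def register(self, meta) -> None:
        self._entries[(type(meta).__name__, meta.dataset)] = meta

    def lookup(self, kind: str, dataset: str):
        return self._entries.get((kind, dataset))
```

Every pipeline step would register what it read, produced, and still owes, so consumers can inspect both structure and provenance before trusting the data.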
Data Life Cycle Management and Provenance
Data life cycle management primarily revolves around data archiving – when, where and how should data be archived? Typical big data applications require a large amount of historical data to be available for model building and other analytical purposes. That said, it is not feasible to keep large historical content online indefinitely; the common practice is to summarize this data into a format that can be consumed, and move the original raw data to deep archives. Moving raw data has cascading implications, since derived data that depends on it may exist elsewhere in the system – appropriate tagging in a metadata management system is required to reflect the new state.
The data lifecycle management component includes all activity aimed at managing data across its lifecycle, from creation to disposal. It is typically based on a rules engine (rule-based data and policy tracking), which determines the value of data items and accordingly triggers data compression, data archiving or discarding of stale data automatically. It is also used to move data between different storage solutions and to trigger processing steps, either according to direct rules (e.g. scheduled jobs) or according to rule-derived data value (e.g. moving data from an in-memory database to a disk-based database if it is less frequently used). Data lifecycle management needs to be conducted along the processing pipeline without directly influencing analysis results and is therefore a vertical function.
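A rules engine of this kind can be reduced to an ordered list of (predicate, action) pairs evaluated first-match-wins. The thresholds and action names below are illustrative assumptions, not policy recommendations:

```python
from dataclasses import dataclass


@dataclass
class DataItem:
    name: str
    age_days: int
    reads_last_30d: int


# Rule-based policy: each entry is (predicate, action).
# Evaluated in order; the first matching rule decides the action.
RULES = [
    (lambda d: d.age_days > 365, "discard"),        # stale data
    (lambda d: d.age_days > 90, "archive"),         # cold data
    (lambda d: d.reads_last_30d < 5, "move_to_disk"),  # rarely read
    (lambda d: True, "keep"),                       # default
]


def apply_lifecycle(item: DataItem) -> str:
    for predicate, action in RULES:
        if predicate(item):
            return action
    return "keep"
```

A scheduler would sweep the catalog periodically, apply these rules to every item, and carry out the resulting moves between in-memory, disk and archive tiers.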
The provenance tracking function describes the extraction and collection of administrative and operational metadata during the different data processing steps. This includes job logging of processing steps, run time of components, and information about the volume and timeliness of loaded data. It furthermore includes user statistics, data access and reporting logs. In particular, logging the processing steps data items went through, the probabilities and confidence intervals for statistical and machine learning techniques, and the timeliness of loaded data allows end users to trace analysis results back to the data sources and make educated estimates about the reliability and relevance of data and analysis results. It is also the basis for calculating a trustworthiness score in analytical applications.
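Concretely, provenance tracking amounts to an append-only log of (item, step, source, confidence) records that can later be replayed as a lineage. A minimal sketch under those assumptions (the field names are illustrative):

```python
import time


class ProvenanceTracker:
    """Append-only log of the processing steps each data item went
    through, so analysis results can be traced back to sources."""

    def __init__(self):
        self.log = []

    def record(self, item_id, step, source=None, confidence=None):
        self.log.append({
            "item": item_id,
            "step": step,
            "source": source,          # origin system, if a load step
            "confidence": confidence,  # e.g. a model's score, if any
            "at": time.time(),         # when the step ran
        })

    def lineage(self, item_id):
        """Ordered list of steps this item has been through."""
        return [e["step"] for e in self.log if e["item"] == item_id]
```

A trustworthiness score could then be derived from the lineage, e.g. by combining the recorded confidences and the staleness of the original load.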
Data Governance, Privacy, Authorization and Authentication
Enterprises are building large repositories to serve as data hubs. Enforcing security in terms of data privacy – using encryption both at rest and in motion for PII and other sensitive data – is essential. Defining access rules for both data manipulation and accessibility is a table-stakes requirement. Keeping bad actors away from the data is critical, given the economic impact breaches can have on the business. These functions require a centralized command center where one can easily configure, manage and administer them. Privacy includes all methods and techniques used to ensure the privacy of the stored data, namely authentication and authorization, access tracking and data anonymization. To be effective, these techniques need to be applied during all steps and in all data stores along the processing pipeline, with special emphasis on the data distribution component, which is typically the point of access to data and functions within the system.
Sensitive personal, financial and health information is governed by a variety of industry and governmental data privacy regulations. If they fail to maintain data privacy, enterprises could face a substantial loss of consumer and market confidence. It is essential to have a security infrastructure that protects all data, applications and databases. Big data systems support two different modes of operation to determine a user's identity. In simple authentication, the identity of a client process is determined by the host operating system. In Kerberos-based operation, the identity of a client process is determined by its Kerberos credentials, where a Kerberos principal is mapped to a username.
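The principal-to-username mapping mentioned above typically strips the instance and realm parts of the principal, so that "hdfs/namenode.example.com@EXAMPLE.COM" maps to the local user "hdfs" (this mirrors the default behaviour of Hadoop's auth_to_local rules; the function below is an illustrative sketch, not the actual rule engine):

```python
import re


def principal_to_username(principal: str) -> str:
    """Map a Kerberos principal of the form user[/instance]@REALM to a
    local username by keeping only the primary component."""
    m = re.match(r"^([^/@]+)(?:/[^@]+)?@[\w.-]+$", principal)
    if not m:
        raise ValueError(f"not a valid Kerberos principal: {principal}")
    return m.group(1)
```

Production deployments usually add site-specific rules on top of this default, e.g. mapping service principals from trusted realms to dedicated local accounts.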
Perimeter security is critical and supported in big data systems - e.g., in a Hadoop cluster via the Apache Knox gateway. The gateway runs as a server (or cluster of servers) that provides centralized access to one or more Hadoop clusters by providing authentication and token verification at the perimeter. This enables authentication integration with enterprise and cloud identity management systems and provides service level authorization at the perimeter. A single URL hierarchy is exposed at the perimeter that aggregates REST APIs of a Hadoop cluster to limit the network endpoints (and therefore firewall holes) required to access a Hadoop cluster and to also hide the internal Hadoop cluster topology from potential attackers.
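The essence of that perimeter pattern is a single routing table: external clients see one URL hierarchy, the gateway authenticates at the edge and forwards to internal endpoints that are never exposed. A minimal sketch of the routing logic (the paths, host names and ports are hypothetical, and this models only the idea, not Knox's actual implementation):

```python
# External path prefix -> hidden internal cluster endpoint.
# All names below are illustrative placeholders.
ROUTES = {
    "/gateway/prod/webhdfs": "http://namenode.internal:9870/webhdfs",
    "/gateway/prod/hbase":   "http://hbase-rest.internal:8080",
}


def route(request_path: str, authenticated: bool) -> str:
    """Authenticate at the perimeter, then map the single external
    URL hierarchy onto the internal cluster topology."""
    if not authenticated:
        raise PermissionError("authentication required at the perimeter")
    for prefix, target in ROUTES.items():
        if request_path.startswith(prefix):
            return target + request_path[len(prefix):]
    raise LookupError("no such service exposed at the perimeter")
```

Because only the gateway's address needs a firewall hole, the internal host names and ports in `ROUTES` stay invisible to external callers.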
The security infrastructure needs include:
Kerberos based security
User Authentication and Authorization
Map/Reduce jobs run using the user/group id of the user who submitted the job – Prevents unauthorized access to data
File level permission control
Similar to Unix/Linux permission mode
Prevents accidental deletion of data or files
Encryption of data at rest
Personal data are encrypted while stored or persisted
Security for data on the move
Wire Protocols that provide encryption
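The Unix/Linux-style file-level permission control listed above boils down to checking the owner, group, or "other" permission bits depending on who is asking. A minimal sketch of that check (the function name and arguments are illustrative; HDFS applies an equivalent model server-side):

```python
import stat


def can_write(mode: int, file_uid: int, file_gid: int,
              uid: int, gids: set) -> bool:
    """Unix-style write check: pick the owner, group, or 'other'
    permission class for the caller, then test its write bit."""
    if uid == file_uid:
        return bool(mode & stat.S_IWUSR)   # owner write bit
    if file_gid in gids:
        return bool(mode & stat.S_IWGRP)   # group write bit
    return bool(mode & stat.S_IWOTH)       # other write bit
```

With a mode of 0o640 (rw-r-----), only the owner may write, which is exactly how accidental deletion by other users is prevented.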