As rightly suggested by Rand Beers, the security advisor to President Barack Obama, the key to using big data applications effectively and to their full potential is how secure those applications are.
After successfully implementing our data ingestion and analysis platform Gazelle™ for one of our clients in the IoT/IIoT space in January 2017, our next project was to implement security measures in this application, so that the customer could make maximum use of it without any external threat while always being in a comfortable position to comply with any regulations.
Based on the customer’s preferences, we have been using the Cloudera distribution of Hadoop to implement big data projects for this client.
Conceptually, there are four prime areas where we focused on beefing up the Hadoop security for this application, viz:
Entry-level authentication
Data access controls (authorization)
Data encryption and masking
Data governance and auditing
While planning this project, we decided to go in phases rather than eat the entire pie in one go. I always advocate a phased approach for big, complex projects: it helps us understand the risks and complexities involved, and it lets us validate whether our approach is the right one to reach the target smoothly, or whether we need to change course and develop a more suitable one.
We planned the following phases for implementing the Hadoop security measures.
Phase 1: In this phase, we start with the basics: setting up authentication checks to prove that users/services accessing the cluster are who they claim to be. This involves setting up users/groups in AD and configuring the access controls.
Phase 2: With authentication configured for the users/groups, in this phase we take care of data at rest and data in motion by introducing data encryption. We also need to ensure that sensitive data is not accessible to end users, which we address with data masking.
Phase 3: In this phase, we plan to introduce measures to control who views what, based on the authorization levels set up for the users/groups. This needs to be done at the individual Hadoop component level.
Phase 4: For more robust security, data governance aspects need to be considered, such as auditing and data lineage. Governance includes auditing access to data residing in metastores, reviewing and updating metadata, and discovering the lineage of data objects.
Let’s start with phase 1, wherein we work on securing the gates for our castle.
Phase 1 of Hadoop Security Measures – User Authentication
The purpose of authentication is to ensure that the person or system accessing the application is who they claim to be. In enterprises, authentication is typically managed through a single distributed system, such as an LDAP directory, using username/password mechanisms.
We decided to use Kerberos authentication, which is a common and secure enterprise-grade authentication system. Kerberos provides strong security benefits including capabilities that render intercepted authentication packets unusable by an attacker. It virtually eliminates the threat of impersonation by never sending a user’s credentials in cleartext over the network.
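The reason intercepted packets are useless to an attacker is that the client never sends the password itself; it proves knowledge of a key derived from the password, for example by computing a MAC over a current timestamp (Kerberos pre-authentication encrypts a timestamp with the password-derived key). The Python sketch below is a simplified, hypothetical illustration of that idea, not Kerberos itself; the realm/principal salt and iteration count are made up.

```python
import hashlib
import hmac
import time

def derive_key(password: str, salt: str) -> bytes:
    """Derive a secret key from the password (Kerberos uses string2key; PBKDF2 is a stand-in)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt.encode(), 10_000)

def make_authenticator(password: str, salt: str, timestamp: float) -> bytes:
    """Client side: prove knowledge of the password without ever sending it."""
    key = derive_key(password, salt)
    return hmac.new(key, str(int(timestamp)).encode(), hashlib.sha256).digest()

def verify_authenticator(stored_key: bytes, timestamp: float,
                         authenticator: bytes, max_skew: int = 300) -> bool:
    """KDC side: recompute the MAC from the stored key; reject stale timestamps (replay defence)."""
    if abs(time.time() - timestamp) > max_skew:
        return False
    expected = hmac.new(stored_key, str(int(timestamp)).encode(), hashlib.sha256).digest()
    return hmac.compare_digest(expected, authenticator)

# Only the MAC over a timestamp crosses the network -- never the password.
now = time.time()
auth = make_authenticator("s3cret", "EXAMPLE.COM/alice", now)
kdc_key = derive_key("s3cret", "EXAMPLE.COM/alice")
print(verify_authenticator(kdc_key, now, auth))  # True
```

An attacker who captures `auth` learns nothing reusable: the timestamp check defeats replay, and recovering the key from the MAC would require breaking the hash.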
The user credentials were stored in AD, which used Kerberos authentication for security.
We used Cloudera Manager’s Kerberos wizard to automate Kerberos configuration on the cluster. Cloudera Manager was configured to use its internal database along with the external AD to authenticate users. A couple of groups had already been created in AD, and we granted login access based on which groups were supposed to use this application. An IT group was given full administrator access to Cloudera Manager.
Phase 2 of Hadoop Security Measures – Data Encryption
In this phase, all data on the cluster, at rest and in motion, must be encrypted, and sensitive data must be masked. A completely secure enterprise data hub can stand up to the audits required for compliance with PCI, HIPAA, and other common industry standards.
Our strategy with respect to data encryption was to encrypt the data on HDFS and use the Cloudera Navigator Key Trustee Server for storing the keys. The reason for going with HDFS encryption is that, unlike OS- and network-level encryption, HDFS transparent encryption is end-to-end: it protects data at rest and in motion, and it is more efficient than implementing a combination of OS-level and network-level encryption.
HDFS encryption implements transparent, end-to-end encryption of data read from and written to HDFS, without requiring changes to application code. Because the encryption is end-to-end, data can be encrypted and decrypted only by an authorized client; HDFS itself never stores or has access to unencrypted data or encryption keys. This covers both at-rest encryption (data on persistent media, such as a disk) and in-motion encryption (data traveling over a network).
Enabling HDFS encryption is done through the Cloudera Manager wizard. Here, we also configured the Cloudera Navigator Key Trustee Server for storing the keys; this is the recommended setup for production systems. Enabling HDFS encryption involves many steps, but it is well documented by Cloudera and easy to follow. We also performed a few steps to secure data transport for HDFS and HBase: in HDFS, data is transported between DataNodes and clients as well as among DataNodes, and in HBase, between HBase Masters and RegionServers.
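The end-to-end property rests on a two-level key hierarchy: each encryption zone has a zone key held by the key server (here, the Key Trustee Server behind the Hadoop KMS), and every file gets its own data encryption key (DEK), which the NameNode stores only in encrypted form (the EDEK). The toy Python sketch below illustrates that hierarchy; the XOR stream cipher is a stand-in for the real AES-CTR cipher, and the whole flow is a conceptual sketch, not Hadoop code.

```python
import hashlib
import os

def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy stream cipher (stand-in for AES-CTR): XOR data with a SHA-256-derived keystream."""
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ k for b, k in zip(data, out))

# The key server (e.g. Key Trustee) holds the per-zone key; HDFS never sees it.
zone_key = os.urandom(32)

# File creation: the KMS generates a fresh DEK and hands HDFS only the wrapped EDEK.
dek = os.urandom(32)
edek = keystream_xor(zone_key, dek)          # stored in NameNode metadata

# Client write path: the client encrypts file contents with the plaintext DEK.
ciphertext = keystream_xor(dek, b"sensor reading: 42.7")

# Client read path: an authorized client asks the KMS to unwrap the EDEK...
unwrapped_dek = keystream_xor(zone_key, edek)
# ...then decrypts locally, so the data stays encrypted both on disk and on the wire.
print(keystream_xor(unwrapped_dek, ciphertext))
```

Because encryption and decryption happen only at the client, neither the DataNodes nor the NameNode ever handle plaintext, which is exactly why a single mechanism covers both at-rest and in-motion protection.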
Hadoop Security Measures – Data Masking
Securing our data at rest and in motion was a pleasant feeling, as this was a very important task for us. But our work with the data was not over yet.
The data was encrypted, but a user with administrator rights could still decrypt it using the keys and view it. So, the next step was to mask the sensitive data stored on Hadoop. In our customer’s case, this included domain knowledge such as formulae and expressions for various processes, which was critical for the client; they could not afford, at any cost, to have it leaked out of the organization. These formulae were the client’s core intellectual property, driving their business and, ultimately, their revenue.
Apart from this, critical end-customer information was sensitive and had to be masked. This is called data redaction, and redaction can be enabled or disabled for the whole cluster with a simple HDFS service-wide configuration change.
Using Cloudera Manager, we set up quite a few redaction rules, as the data to be masked was not standard credit card or SSN information, but formulae and expressions as well as customer credentials. A redaction rule consists of the following components:
Search regular expression: We had to search for formulae/expressions, as well as customer emails and phone numbers, among other things, and mask them. A regular expression is built for each pattern; if it matches any part of the data, the match is replaced by the contents of the replace string.
Replace string: The string used to replace the sensitive data.
Trigger: This component proved very important for us. It specifies a simple string to search for in the data; the redactor evaluates the search regular expression only if this string is found. The trigger is optional: specifying no value means the search regular expression is always evaluated. From a performance standpoint, the trigger improves throughput, since simple string matching is faster than regular expression matching. Our data-masking regular expressions proved to be complex, so this component was very helpful.
Once the rules were identified, it was time for a simple configuration in Cloudera Manager: we enabled log and query redaction and added the rules identified earlier. I appreciate the documentation provided by Cloudera; it is crystal clear, and it makes the various configuration tasks easy to perform.
Architecting and implementing Hadoop security for our customer was a very interesting exercise, and it is still ongoing. In my next article, I’ll cover the remaining two phases, including data governance and some other aspects of Hadoop security that we took care of in this program.