This is a continuation of my previous article, 'Hadoop Security: Prime Areas To Focus On.' If you haven't read it yet, please go through it first for continuity.
In my previous article, I shared our experiences implementing Hadoop security in areas like user authentication, data encryption, and masking. In this article, I will cover interesting and useful information on other areas of Hadoop security, namely user authorization and data governance, drawn from implementing them for our customers.
Phase 3 of the Hadoop Security Measures
The borders have been secured and authenticated to ensure only the right people can enter the application. Data has been secured and masked wherever required to ensure that nothing is in the open. Now it's time to manage the authorization level for various users so that they can view and access only what they are authorized to, and nothing else.
User authorization concerns who or what has access to, or control over, a given resource or service. Since Hadoop merges the capabilities of multiple varied and previously separate IT systems/components into an enterprise data hub that stores and works on all data within an organization, it requires multiple authorization controls with varying granularities.
We performed the following steps for user authorization:
Tying all users to groups we had already created in the AD directories.
Providing role-based access control for data access and ingestion, like batch and interactive SQL queries.
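As a quick sanity check for the first step, the group membership that Hadoop resolves for a user can be verified from the command line (the user name below is illustrative):

```shell
# Shows the groups Hadoop resolves for a user; with AD-backed
# group mapping, this should match the user's AD group membership
hdfs groups alice
```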
With respect to user authorization, we performed the following activities:
Existing files and directories were assigned to the concerned groups, and an owner was assigned to each. Each assignment carries a basic set of permissions: file permissions are simply read, write, and execute, while on directories the execute permission additionally determines access to child directories.
Extended Access Control Lists (ACLs) were also set for HDFS to provide fine-grained control of permissions for HDFS files, so we could set different permissions for specific users or groups.
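The ownership, permission, and extended-ACL steps above can be sketched with standard HDFS commands; the paths, user, and group names below are illustrative:

```shell
# Assign an owner and group to an existing directory tree
hdfs dfs -chown -R etl_user:finance /data/finance

# Basic permissions: owner rwx, group r-x, others none
# (on directories, 'x' controls access to child entries)
hdfs dfs -chmod -R 750 /data/finance

# Extended ACL: grant a second group read access without
# changing the owning group
hdfs dfs -setfacl -m group:auditors:r-x /data/finance

# Verify the resulting ACL entries
hdfs dfs -getfacl /data/finance
```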
We extensively use Apache HBase in our application, so there was a high requirement to control who can query data using HBase. We set up various authorizations for various operations (READ, WRITE, CREATE, ADMIN) based on column family. These authorizations are set up at the group level.
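In the HBase shell, such group-level grants look roughly like this (the '@' prefix denotes a group; the table, column family, and group names are illustrative):

```shell
hbase shell
# Read-only access to one column family for an analyst group
grant '@analysts', 'R', 'customer_data', 'cf_profile'
# Read and write for the ingestion group
grant '@ingestion', 'RW', 'customer_data', 'cf_profile'
# Read, write, create, and admin on the whole table for admins
grant '@platform_admins', 'RWCA', 'customer_data'
# Verify the grants on the table
user_permission 'customer_data'
```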
We decided to use Apache Sentry to configure role-based access control, so we would have a centralized system to manage the various roles and permissions.
Apache Sentry is a granular, role-based authorization module for Hadoop. It allows you to define authorization rules to validate a user or application’s access requests for Hadoop resources. Sentry is highly modular and can support authorization for various components in Hadoop. Sentry relies on underlying authentication systems, such as Kerberos or LDAP, to identify the user. It also uses the group mapping mechanism configured in Hadoop to ensure that Sentry sees the same group mapping as other Hadoop ecosystem components.
We created multiple groups in the AD directory, for example, Management, Technology, Batch, and a couple of admin groups. Then we created various roles in Sentry, like 'Auditor,' 'Read-Only,' 'Cluster Administrator,' etc., based on the required roles.
Suitable role policies were then assigned to these groups; for example, the Auditor role was assigned to the Management group, which consisted of project managers and architects; the Read-Only role was assigned to the Technology team; and the admin roles were assigned to the admin groups.
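Sentry roles and grants are managed with SQL statements issued through Beeline (or impala-shell) by a user with Sentry admin privileges. A minimal sketch, assuming illustrative database, server, principal, and group names:

```shell
beeline -u "jdbc:hive2://hiveserver:10000/default;principal=hive/_HOST@EXAMPLE.COM" -e "
  CREATE ROLE auditor;
  CREATE ROLE read_only;
  CREATE ROLE cluster_admin;

  -- Privileges for each role
  GRANT SELECT ON DATABASE customer_db TO ROLE auditor;
  GRANT SELECT ON DATABASE customer_db TO ROLE read_only;
  GRANT ALL ON SERVER server1 TO ROLE cluster_admin;

  -- Map the AD groups to the roles
  GRANT ROLE auditor TO GROUP management;
  GRANT ROLE read_only TO GROUP technology;
  GRANT ROLE cluster_admin TO GROUP admins;
"
```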
The biggest advantage of role-based access control (RBAC) is that it makes routine administration (adding new users, deleting users, revoking or granting rights, etc.) pretty easy. Another benefit we experienced is that configuring RBAC through Sentry means you do not have to repeat the configuration for each Hadoop component; Sentry takes care of propagating it across them.
Phase 4 of the Hadoop Security Measures
Suddenly, one day, the customer noticed some data discrepancies and started asking questions like:
What happened to my customer details data? Why is it showing some ambiguous data?
Which tables store customer data?
What sources feed into the customer data?
Which users accessed the files that feed the customer data?
Which users queried/modified the tables containing customer data?
What did the user ‘cust_rep1’ do on Saturday, July 29?
What operations did the user perform? And many other questions.
We started wondering: why is the customer asking us these questions? We are not employed by that organization, nor are we supposed to know the answers.
Well, it is true that we are not the right people to answer these questions, and the customer should know all this about their data. But the most important thing to ponder is this: the customer is using the application we built for data storage, processing, and access. Have we empowered the application to answer all these questions for the customer?
The simple answer was 'NO,' and we had found a major gap: data governance needed to be included in the application. Fortunately, this was discovered well before the application was rolled out to all the organization's users.
So, we decided to use Cloudera Navigator and, based on the customer's requirements, introduced the following data governance features in our application:
Because of the sensitive data, the customer wanted to know exactly who was accessing their data: Tom, John, or Lucy. What data are they accessing, and how are they using it? The idea is to ensure the correct governance measures are in place to protect sensitive data, to enable proactive and reactive measures based on data usage patterns, and to trace any malicious data modification or access down to its root.
We didn't have to do much configuration, as Cloudera Navigator already captures audit data for various Hadoop components like HDFS, Hive, Impala, HBase, etc. We did, of course, have to create new roles specifically for data auditing purposes; these were granted to users at the highest level.
Data lineage would help answer the question, 'Which sources feed the customer's table?' We decided to use Cloudera Navigator to enable data lineage, which provides inbuilt functionality for various Hadoop components like HDFS, Sqoop, Spark, Impala, etc. Moreover, both forward and backward data lineage is available down to the column level, which is amazing. We could view data lineage for various tables and queries without any coding or configuration. This was a huge value add for the customer, as no technical expertise or knowledge is needed.
To answer questions like 'Which tables store customer data?', you would normally have to open the design document or go through the individual tables in Hive. The simplest option, if it were possible, would be to just ask Google.
You might be wondering where Google comes into the picture here. It's not exactly Google, but Navigator provides a search feature based on tags, which lets us find entities by keyword. We scheduled multiple meetings with the business users to understand the business language and some of the functionality, and we took help from the technical teams who managed the various source systems. Based on all this information, we created a simple tagging document and used it to tag various entities on the big data platform with meaningful business acronyms. This enabled a simple yet effective metadata search in the application using Navigator.
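Navigator also exposes tag-based search and tagging through its REST API. The sketch below is hedged heavily: the host, port, API version, credentials, tag names, and entity identity are all assumptions from a typical deployment and may differ in yours.

```shell
# Search metadata entities by tag (Solr-style query syntax;
# endpoint and version depend on your Navigator deployment)
curl -s -u admin:password \
  "http://navigator-host:7187/api/v9/entities/?query=tags:customer_profile"

# Attach business tags to a specific entity by its Navigator identity
curl -s -u admin:password -X PUT \
  -H "Content-Type: application/json" \
  -d '{"tags": ["customer_profile", "pii"]}' \
  "http://navigator-host:7187/api/v9/entities/<entity-identity>"
```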
It was a wonderful experience. There were lots of learnings from implementing this project and, most importantly, a happy and satisfied customer.
I hope that our Hadoop security and governance experience will be helpful to customers, as well as to individuals looking to implement Hadoop security on big data platforms. If you have any questions or queries, please feel free to reach out to me.