Dataguise developed these practices and procedures from significant experience securing large and diverse environments, including Hadoop security deployments among the Fortune 200.
The explosion in information technology tools and capabilities has enabled advanced analytics using Big Data. However, the benefits of this new technology area are often coupled with data privacy issues. These large information repositories may contain personally identifiable information (PII), such as names, addresses and Social Security numbers.
Financial data such as credit card and account numbers might also be found in large volumes across these environments and pose serious concerns related to access. Through careful planning, testing, pre-production preparation and the appropriate use of technology, many of these concerns can be alleviated.
The following Hadoop security best practices provide guidance throughout Hadoop project implementations, but are especially important in the early planning stages:
1. Start Early! Determine the data privacy protection strategy during the planning phase of a deployment, preferably before moving any data into Hadoop. This helps prevent damaging compliance exposure for the company and avoids unpredictability in the rollout schedule.
2. Identify what data elements are defined as sensitive within your organization. Consider company privacy policies, pertinent industry regulations and governmental regulations.
3. Discover whether sensitive data is already embedded in the environment, or has been or will be assembled in Hadoop.
4. Determine the compliance exposure risk based on the information collected.
5. Determine whether business analytics needs require access to real data or whether desensitized data can be used. Then, choose the right remediation technique (masking or encryption). If in doubt, remember that masking provides the most secure remediation, while encryption provides the most flexibility should future needs evolve.
6. Ensure the data protection solutions under consideration support both masking and encryption remediation techniques, especially if the goal is to keep both masked and unmasked versions of sensitive data in separate Hadoop directories.
7. Ensure the data protection technology used implements consistent masking across all data files (Joe becomes Dave in all files) to preserve the accuracy of data analysis across every data aggregation dimension.
8. Determine whether a tailored protection for specific data sets is required and consider dividing Hadoop directories into smaller groups where security can be managed as a unit.
9. Ensure the selected encryption solution interoperates with the company’s access control technology and that both allow users with different credentials to have the appropriate, selective access to data in the Hadoop cluster.
10. Ensure that when encryption is required, the proper technology (Java, Pig, etc.) is deployed to allow for seamless decryption and ensure expedited access to data.
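The consistent masking described in practice 7 (Joe becomes Dave in all files) can be illustrated with a minimal Python sketch. It assumes a keyed hash (HMAC-SHA256) chooses the replacement, so the same input always maps to the same masked value wherever it appears. The names `mask_name`, `mask_records`, `REPLACEMENT_NAMES` and the sample key are hypothetical and chosen for illustration only; this is not a description of how any particular product implements masking.

```python
import hashlib
import hmac

# Hypothetical replacement pool; a real deployment would need a far larger
# pool (or a format-preserving scheme) to avoid two inputs sharing a mask.
REPLACEMENT_NAMES = ["Dave", "Maria", "Chen", "Aisha", "Lukas", "Priya"]


def mask_name(name: str, secret_key: bytes) -> str:
    """Deterministically map a name to a replacement, keyed by secret_key."""
    normalized = name.strip().lower().encode("utf-8")
    digest = hmac.new(secret_key, normalized, hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(REPLACEMENT_NAMES)
    return REPLACEMENT_NAMES[index]


def mask_records(records, field, secret_key):
    """Return copies of the records with the given field masked."""
    return [{**rec, field: mask_name(rec[field], secret_key)} for rec in records]


# Assumed sample data standing in for two separate data files.
KEY = b"example-key-not-for-production"
orders = [{"customer": "Joe", "total": 40}, {"customer": "Ann", "total": 15}]
support = [{"customer": "Joe", "ticket": 7}]

masked_orders = mask_records(orders, "customer", KEY)
masked_support = mask_records(support, "customer", KEY)

# The same customer receives the same masked name in both data sets,
# so joins and aggregations on this field still line up.
print(masked_orders[0]["customer"] == masked_support[0]["customer"])
```

Because the mapping depends only on the input value and the key, files masked at different times remain joinable on the masked field, provided the key is kept secret and unchanged.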
By starting early and establishing processes that define sensitive data, detect that data in the Hadoop environment, analyze the risk exposure and assign the proper data protection using either masking or encryption, enterprises can remain confident their data is protected from unauthorized access. In following these guidelines, data management, security and compliance officers cognizant of the sensitive information in Hadoop can not only lower exposure risks, but increase performance for a greater return on Big Data initiatives.
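As a rough illustration of the detection and risk-analysis steps, the following Python sketch scans lines of text for two kinds of sensitive data and counts the findings. It assumes US-style Social Security numbers in `NNN-NN-NNNN` form and card numbers of 13 to 19 digits validated with the Luhn checksum. The patterns, the function names and the sample lines are assumptions for illustration; a real scan would run against files in the cluster and cover many more data types.

```python
import re
from collections import Counter

# Assumed patterns: US-style SSNs and 13-19 digit card numbers that may
# contain single spaces or hyphens between digits.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:  # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def scan_text(text: str):
    """Return a list of (kind, value) findings for one line of text."""
    findings = [("ssn", m.group()) for m in SSN_PATTERN.finditer(text)]
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if luhn_valid(digits):  # discard digit runs that fail the checksum
            findings.append(("card", digits))
    return findings


def summarize(lines):
    """Count findings per kind as a simple exposure summary."""
    counts = Counter()
    for line in lines:
        for kind, _ in scan_text(line):
            counts[kind] += 1
    return dict(counts)


# Assumed sample input standing in for records in a data file.
sample_lines = [
    "Joe,123-45-6789,4111 1111 1111 1111",
    "Ann,no ssn,1234 5678 1234 5678",
    "order 42 shipped",
]
summary = summarize(sample_lines)
print(summary)
```

The checksum step matters because long digit runs are common in operational data; in the sample, the second line's 16-digit number fails the Luhn check and is not counted as a card number.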