Dale Kim is the Director of Industry Solutions at MapR.
If you’ve followed the big data buzz in the last few years, then you are probably familiar with Apache Hadoop and its growing popularity. You might know it as a great system for cost-effectively running large-scale analytics and processing. Hadoop has evolved significantly from its early days, when it was used for internet search indexing, into a framework that is valuable for many different applications and a wide range of enterprises.
Some of the more popular uses for Hadoop today include data warehouse optimization, fraud and anomaly detection, recommendation engines, and clickstream analysis, all of which can involve personally identifiable information (PII) such as social security numbers and credit card numbers. How does a framework built for search indexing become applicable to such vastly different use cases? Hadoop’s versatility has been enhanced by the many open-source projects added over the years, including Apache HBase, Apache Hive, Apache Mahout, and Apache Pig. Those help on a functional level, but what about security?
As Hadoop deployments proliferate, its security capabilities have come under scrutiny. Questions have been raised about whether Hadoop is secure enough for production use; the suggestion that it is not is an unfortunate mischaracterization. If organizations new to Hadoop hear that there is significant risk of exposure, they will likely delay their adoption. In the meantime, those enterprises continue to face serious big data challenges while resorting to other solutions that might not adequately address their issues.
The question is not whether Hadoop is ready for secure environments. It already runs in some of the most security-conscious organizations in the world, including those in financial services, healthcare, and government, and numerous case studies are publicly available. The real issue is identifying the right approach for your specific environment.
In some deployment models, organizations fence off a Hadoop cluster with firewalls and other network protection schemes and allow only trusted users to access it. This is the most basic type of implementation, and it does not necessarily depend on specific security capabilities in Hadoop. As an extension of this model, direct login to the cluster servers can be prohibited, with users given data access via edge nodes combined with basic Hadoop security controls. In a more sophisticated approach, native Hadoop security controls are implemented to open the cluster to more users while ensuring that all data access is performed by authorized users. In still more advanced environments, Hadoop security capabilities are fully deployed in conjunction with monitoring and analytical tools on the Hadoop clusters to detect and prevent intrusion and other rogue activities.
The fact that organizations are using Hadoop on sensitive data today strongly supports Hadoop’s legitimacy. It is therefore worthwhile to pursue a deeper understanding of its security capabilities by talking to Hadoop vendors and third-party security providers, just as organizations should do for any new deployment. Document your priorities first, and then look for specific features that support them. Those priorities should largely mirror the requirements you currently have for your other enterprise systems.
What capabilities are available in Hadoop? First, authentication is always required for secure data. Kerberos integration is the starting point, and some Hadoop and third-party vendors offer alternate and enhanced authentication capabilities. Second, authorization, or access control, is available in Hadoop to grant and deny permissions for accessing specific data. Third, auditing can be done in a variety of ways in Hadoop to handle business requirements such as analyzing user behavior and achieving regulatory compliance. Finally, encryption is supported, though it is an often misunderstood capability because it is sometimes misused as a substitute for access controls. Rather, it should be used to protect data-in-motion (data sent over the network) and data-at-rest, so that data stays protected even if physical storage devices are stolen. A specific area of encryption for data-at-rest is obfuscating sensitive elements in files, essentially making the data non-sensitive while retaining its analytical value. This type of encryption is handled by a variety of third-party vendors for Hadoop.
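To make the idea of obfuscating sensitive elements more concrete, here is a minimal sketch in Python. It is not how any particular Hadoop vendor implements this capability, and it uses a keyed one-way hash (HMAC) rather than reversible encryption. The key, the function names `pseudonymize` and `mask_record`, and the comma-separated record layout are all assumptions made for illustration. The point is only that the same social security number always maps to the same token, so records can still be grouped and counted without exposing the original value.

```python
import hashlib
import hmac
import re

# Hypothetical key for illustration only; a real deployment would
# fetch this from a key management system, not hard-code it.
SECRET_KEY = b"example-key-not-for-production"

# Matches US social security numbers written as 123-45-6789.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable, keyed token for a sensitive value (one-way)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "tok_" + digest[:16]


def mask_record(line: str, key: bytes = SECRET_KEY) -> str:
    """Replace every SSN-shaped substring in a line with its token."""
    return SSN_PATTERN.sub(lambda m: pseudonymize(m.group(0), key), line)


if __name__ == "__main__":
    # Made-up sample records; the SSNs are placeholders, not real data.
    records = [
        "alice,123-45-6789,purchase,42.50",
        "bob,987-65-4321,purchase,13.00",
        "alice,123-45-6789,refund,42.50",
    ]
    for record in records:
        print(mask_record(record))
```

In this sketch the first and third records receive the same token, so an analysis that counts transactions per customer still works on the masked data.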
There are several options for handling access control within Hadoop, and one challenge today is that no universal standard exists. This means you must do a bit more investigation to determine which option is right for you. Some technologies take a “build-it-as-you-go” or “follow-the-data” approach, and some take a data-centric approach. Fortunately, this lack of standards should not deter you, because the various approaches simply mean that different levels of people and processes need to be applied to data security. That’s no different from the practices we’ve applied to other enterprise systems.
More than anything, organizations should be comfortable that production environments with sensitive data are already running on Hadoop, and the security capabilities are only getting better.
Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena.