Big Data Security: How to secure your Hadoop cluster

Hi all,

Big Data is for sure one of the biggest trends of the last few years. Leaving aside the conceptual discussions about what big data actually is and how much data qualifies as such, any technology that becomes widespread this quickly is also relevant from a security point of view.

A bit of History

When talking about big data in practical terms, most of the time we are talking about Hadoop, which is probably the most widely adopted platform for dealing with huge amounts of data.

With the advent of cloud computing, Hadoop has caught the attention of cloud vendors and providers, which have started offering big data processing as a service or on a pay-as-you-go basis. Many companies have also deployed their own Hadoop clusters in the cloud or on-premises.

Security Concerns

However, despite being a great tool for processing big data, Hadoop was originally designed mainly for internal use, i.e. for local clusters running within the security perimeter of an organization. Earlier versions of Hadoop (before 2.6) were therefore not designed to withstand external threats and were highly insecure once an attacker got past the perimeter. For instance, out of the box there was no strong authentication (a client-supplied user name was simply trusted), data was not encrypted in transit or at rest, and the web interfaces were open to anyone able to reach them.

How to secure Hadoop

Current versions of Hadoop are far more secure than earlier ones. However, they are not secure out of the box: you need to get your hands on the configuration files and make sure the security-relevant options are actually enabled.

It is also important to keep in mind that the only way to make your cluster secure is to protect it at several layers, from the lowest (OS-level security) up to application-level and network-level security.


Here is a minimum set of options we strongly recommend enabling in order to secure your Hadoop clusters:

  1. Enable HDFS extended ACLs by adding the following property to hdfs-site.xml (a quick ACL check is sketched after this list):

     <property>
       <name>dfs.namenode.acls.enabled</name>
       <value>true</value>
     </property>

  2. Enable the Hadoop security module and strong authentication (Kerberos) by adding the following properties to core-site.xml:

     <property>
       <name>hadoop.security.authentication</name>
       <value>kerberos</value>
     </property>
     <property>
       <name>hadoop.security.authorization</name>
       <value>true</value>
     </property>

  3. Secure HDFS by adding the following properties to hdfs-site.xml (in particular, enable HTTPS and advanced authorization). The keytab paths and the YOUR-REALM.COM realm are examples to adapt to your environment; a kinit-based check is sketched after this list:

     <!-- Block access tokens -->
     <property>
       <name>dfs.block.access.token.enable</name>
       <value>true</value>
     </property>

     <!-- NameNode -->
     <property>
       <name>dfs.namenode.keytab.file</name>
       <value>/etc/hadoop/conf/hdfs.keytab</value>
     </property>
     <property>
       <name>dfs.namenode.kerberos.principal</name>
       <value>hdfs/_HOST@YOUR-REALM.COM</value>
     </property>
     <property>
       <name>dfs.namenode.kerberos.internal.spnego.principal</name>
       <value>HTTP/_HOST@YOUR-REALM.COM</value>
     </property>

     <!-- Secondary NameNode -->
     <property>
       <name>dfs.secondary.namenode.keytab.file</name>
       <value>/etc/hadoop/conf/hdfs.keytab</value>
     </property>
     <property>
       <name>dfs.secondary.namenode.kerberos.principal</name>
       <value>hdfs/_HOST@YOUR-REALM.COM</value>
     </property>
     <property>
       <name>dfs.secondary.namenode.kerberos.internal.spnego.principal</name>
       <value>HTTP/_HOST@YOUR-REALM.COM</value>
     </property>

     <!-- DataNode -->
     <property>
       <name>dfs.datanode.data.dir.perm</name>
       <value>700</value>
     </property>
     <property>
       <name>dfs.datanode.address</name>
       <value>0.0.0.0:1004</value>
     </property>
     <property>
       <name>dfs.datanode.http.address</name>
       <value>0.0.0.0:1006</value>
     </property>
     <property>
       <name>dfs.datanode.keytab.file</name>
       <value>/etc/hadoop/conf/hdfs.keytab</value>
     </property>
     <property>
       <name>dfs.datanode.kerberos.principal</name>
       <value>hdfs/_HOST@YOUR-REALM.COM</value>
     </property>

     <!-- Web authentication and HTTPS -->
     <property>
       <name>dfs.web.authentication.kerberos.principal</name>
       <value>HTTP/_HOST@YOUR-REALM.COM</value>
     </property>
     <property>
       <name>dfs.http.policy</name>
       <value>HTTPS_ONLY</value>
     </property>

  4. If you use WebHDFS (the REST API for HDFS), make sure Kerberos authentication is on by adding the following properties to hdfs-site.xml (a curl-based check is sketched after this list):

     <property>
       <name>dfs.web.authentication.kerberos.principal</name>
       <value>HTTP/_HOST@YOUR-REALM.COM</value>
     </property>
     <property>
       <name>dfs.web.authentication.kerberos.keytab</name>
       <value>/etc/hadoop/conf/HTTP.keytab</value>
     </property>

  5. Enable transparent data encryption and configure the key provider (which will take care of generating and providing the encryption keys). You can use the hdfs crypto command to test your configuration (a short sketch follows this list); the documentation is available at https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/TransparentEncryption.html#Configuration
  6. Finally, as a general security measure, make sure your firewall rules are correctly set and limit access from the Internet to the necessary services only (an example rule set follows this list). By default Hadoop exposes two web interfaces, one for the ResourceManager (on port 8088) and one for the NameNode (on port 50070), which are not protected by authentication and can therefore leak sensitive and critical information.
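
Below are a few quick command-line sketches to sanity-check the settings above. Starting with step 1, the following grants an extra user read access to a directory and prints the resulting ACL; the path /data/reports and the user name analyst are purely illustrative.

    # Add a named-user ACL entry and inspect the result
    hdfs dfs -setfacl -m user:analyst:r-x /data/reports
    hdfs dfs -getfacl /data/reports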
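
To verify the Kerberos setup from steps 2 and 3, you can obtain a ticket directly from the HDFS keytab and run a simple HDFS command. The keytab path and realm below match the configuration examples above; adjust them to your environment.

    # Get a ticket from the keytab and confirm it is valid
    kinit -kt /etc/hadoop/conf/hdfs.keytab hdfs/$(hostname -f)@YOUR-REALM.COM
    klist
    # With security enabled, HDFS operations should only succeed with a valid ticket
    hdfs dfs -ls /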
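
For step 4, a SPNEGO-authenticated WebHDFS request can be issued with curl (built with GSS support). The host name and the 50470 HTTPS port below are assumptions for a typical Hadoop 2.x NameNode; adapt them to your cluster.

    # List the HDFS root over WebHDFS, authenticating with the current Kerberos ticket
    # (add -k if the NameNode uses a self-signed certificate)
    curl --negotiate -u : "https://namenode.example.com:50470/webhdfs/v1/?op=LISTSTATUS"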
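
For step 5, once the key provider is configured, creating and inspecting an encryption zone looks roughly like this; the key name mykey and the /secure directory are placeholders.

    # Create a key in the configured KMS
    hadoop key create mykey
    # Create an empty directory and turn it into an encryption zone backed by that key
    hdfs dfs -mkdir /secure
    hdfs crypto -createZone -keyName mykey -path /secure
    # Confirm the zone exists
    hdfs crypto -listZones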
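
Finally, for step 6, here is a minimal iptables sketch that restricts the ResourceManager and NameNode web interfaces to an internal network; the 10.0.0.0/8 range is a placeholder for your admin network, and in the cloud you may prefer to express the same policy with security groups.

    # Allow the web UIs (8088, 50070, 50470) only from the internal network...
    iptables -A INPUT -p tcp -m multiport --dports 8088,50070,50470 -s 10.0.0.0/8 -j ACCEPT
    # ...and drop everything else reaching those ports
    iptables -A INPUT -p tcp -m multiport --dports 8088,50070,50470 -j DROP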

[Screenshots: the unauthenticated ResourceManager and NameNode web interfaces]

Continuous Security and Vulnerability Assessment

Of course, monitoring a Hadoop cluster (or even several of them) can quickly become a headache for system administrators and devops teams, so an automated tool can be of great help. That's why we embedded these Hadoop security checks (plus a few others) into our product Elastic Workload Protector, to help you continuously monitor the security of your Hadoop cluster(s) and get notified as soon as a potential issue or misconfiguration appears.

