In my previous contributions I covered authentication and authorization in Hadoop. This time I will cover audit, the third "A" in the AAA (authentication, authorization, audit) of information security. Audit and monitoring are critical to data security. Through audit, we can verify that the security controls in place are working correctly and identify attempts to circumvent them.
Logs are a common way to record the actions of an application, and they allow administrators and auditors to go “Back in Time” to review a user’s actions. Much like your credit card or bank statement, these logs provide evidence of the transactions performed. In the absence of a time machine, they may be the only means of reconstructing what took place in a Hadoop cluster at a given moment in time.
As you all know by now, Hadoop has many different components, and each has its own type of audit log. I will cover the auditing capabilities of several of these components in this article.
HDFS Audit Logs
HDFS is at the core of Hadoop, providing the distributed file system that makes Hadoop so successful. HDFS has two different audit logs: hdfs-audit.log for user activity and SecurityAuth-hdfs.audit for service activity. Both logs are implemented with Apache Log4j, a common and well-known logging mechanism for Java, and are configured in the log4j.properties file through the log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit and log4j.category.SecurityLogger properties.
Below is an example log for user Marty McFly after a listing of files/directories and an attempted move (the rename command) into directory /user/doc, which was denied.
2015-07-01 12:15:10,123 INFO FSNamesystem.audit: allowed=true ugi=martymcfly@HILLVALLEY.COM (auth:KERBEROS) ip=/192.168.2.10 cmd=getfileinfo src=/user/martymcfly dst=null perm=null
2015-07-01 12:15:10,125 INFO FSNamesystem.audit: allowed=true ugi=martymcfly@HILLVALLEY.COM (auth:KERBEROS) ip=/192.168.2.10 cmd=listStatus src=/user/martymcfly dst=null perm=null
2015-07-01 12:15:46,167 INFO FSNamesystem.audit: allowed=false ugi=martymcfly@HILLVALLEY.COM (auth:KERBEROS) ip=/192.168.2.10 cmd=rename src=/user/martymcfly/delorean dst=/user/doc perm=null
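To show how an entry like this can be reviewed programmatically, here is a minimal Python sketch that splits each audit line into its key=value fields and keeps only the denied operations. It assumes each event sits on a single line, as in the example above; the function name parse_audit_line and the regular expression are my own illustration, not part of Hadoop.

```python
import re

# key=value pairs; a value runs until the next " key=" or the end of the line,
# so "ugi=user (auth:KERBEROS)" is kept together as one value.
PAIR_RE = re.compile(r"(\w+)=(.*?)(?=\s+\w+=|$)")


def parse_audit_line(line):
    """Turn one HDFS audit line into a dict of its fields (hypothetical helper)."""
    header, _, body = line.partition("FSNamesystem.audit:")
    fields = dict(PAIR_RE.findall(body.strip()))
    # The first two whitespace-separated tokens are the date and the time.
    fields["timestamp"] = " ".join(header.split()[:2])
    fields["allowed"] = fields.get("allowed") == "true"
    return fields


# The three events from the example above, one event per line.
SAMPLE = [
    "2015-07-01 12:15:10,123 INFO FSNamesystem.audit: allowed=true ugi=martymcfly@HILLVALLEY.COM (auth:KERBEROS) ip=/192.168.2.10 cmd=getfileinfo src=/user/martymcfly dst=null perm=null",
    "2015-07-01 12:15:10,125 INFO FSNamesystem.audit: allowed=true ugi=martymcfly@HILLVALLEY.COM (auth:KERBEROS) ip=/192.168.2.10 cmd=listStatus src=/user/martymcfly dst=null perm=null",
    "2015-07-01 12:15:46,167 INFO FSNamesystem.audit: allowed=false ugi=martymcfly@HILLVALLEY.COM (auth:KERBEROS) ip=/192.168.2.10 cmd=rename src=/user/martymcfly/delorean dst=/user/doc perm=null",
]

events = [parse_audit_line(line) for line in SAMPLE]
denied = [e for e in events if not e["allowed"]]

for e in denied:
    print(e["timestamp"], e["ugi"], e["cmd"], e["src"], "->", e["dst"])
```

Only the rename attempt is reported, since it is the only event with allowed=false.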
MapReduce Audit Logs
Like HDFS, MapReduce has two logs: mapred-audit.log for user activity and SecurityAuth-mapred.audit for service activity. The log4j configuration can be found in the log4j.properties file, under the log4j.logger.org.apache.hadoop.mapred.AuditLogger and log4j.category.SecurityLogger properties.
YARN Audit Logs
For YARN, the user audit events are not written to a separate file but are mixed into the daemon log files. As with HDFS and MapReduce, service logging is enabled through the log4j.category.SecurityLogger property in the log4j.properties file.
Hive Audit Logs
Hive is a bit different and uses the Hive Metastore for service logging. To identify the Hive audit events amongst the other logged events you can filter lines containing org.apache.hadoop.hive.metastore.HiveMetaStore.audit. Hive log events will also contain information to identify which database or table is being operated on.
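As a rough sketch of that filtering step, the Python snippet below keeps only the lines that contain the audit logger name. The sample lines are hypothetical and only approximate what a metastore log might contain; the filter relies solely on the logger name given above.

```python
HIVE_AUDIT_MARKER = "org.apache.hadoop.hive.metastore.HiveMetaStore.audit"


def hive_audit_lines(lines):
    """Yield only the lines written by the metastore audit logger."""
    for line in lines:
        if HIVE_AUDIT_MARKER in line:
            yield line.rstrip("\n")


# Hypothetical metastore log excerpt: one ordinary line and two audit lines.
SAMPLE_LOG = [
    "2015-07-01 12:20:00,950 INFO org.apache.hadoop.hive.metastore.HiveMetaStore: some non-audit message",
    "2015-07-01 12:20:01,002 INFO org.apache.hadoop.hive.metastore.HiveMetaStore.audit: ugi=martymcfly ip=/192.168.2.10 cmd=get_table : db=default tbl=timeline",
    "2015-07-01 12:20:05,417 INFO org.apache.hadoop.hive.metastore.HiveMetaStore.audit: ugi=martymcfly ip=/192.168.2.10 cmd=get_database: default",
]

audit_only = list(hive_audit_lines(SAMPLE_LOG))
for line in audit_only:
    print(line)
```

In practice the same function can be fed an open file object, since it only iterates over lines.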
HBase Audit Logs
HBase has a separate file for audit logs, though playing back the activity for a user is a bit trickier, as the events can be spread across the HBase nodes. The events contain information about the column family, column, table and action performed. The log4j configuration can be found in the log4j.properties file, under the SecurityLogger entries.
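One way to play back a user's activity across nodes is to merge the per-node logs by timestamp. The sketch below is only an illustration under several assumptions: each node's log is already in chronological order, each line starts with a "YYYY-MM-DD HH:MM:SS,mmm" timestamp, and the user can be found by a simple substring match. The line contents and node names are hypothetical and do not reproduce the real HBase audit format.

```python
import heapq


def timestamp_key(line):
    # The first 23 characters hold "YYYY-MM-DD HH:MM:SS,mmm",
    # which sorts lexicographically in chronological order.
    return line[:23]


def user_timeline(per_node_lines, user):
    """Merge already-sorted per-node logs, keeping one user's events."""
    tagged = (
        [f"{line} [node={node}]" for line in lines if user in line]
        for node, lines in per_node_lines.items()
    )
    return list(heapq.merge(*tagged, key=timestamp_key))


# Hypothetical, simplified audit lines from two nodes.
NODE_LOGS = {
    "regionserver1": [
        "2015-07-01 12:30:00,100 user=martymcfly action=get table=timeline family=y1985",
        "2015-07-01 12:31:10,450 user=docbrown action=put table=timeline family=y1955",
        "2015-07-01 12:32:45,900 user=martymcfly action=put table=timeline family=y1955",
    ],
    "regionserver2": [
        "2015-07-01 12:30:30,250 user=martymcfly action=scan table=timeline family=y1985",
    ],
}

timeline = user_timeline(NODE_LOGS, "martymcfly")
for line in timeline:
    print(line)
```

heapq.merge is used here because it merges inputs that are already sorted without loading and re-sorting everything, which suits large log files.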
Sentry Audit Logs
While logging user operations is important, logging admin operations and changes to user permissions is extremely important. Apache Sentry also uses log4j and writes these events to a dedicated file that is configured through its log4j properties.
Cloudera Impala Audit Logs
Each Cloudera Impala daemon has its own audit log file. The format is a bit different: events are written as JSON for easier parsing. Like Hive, Impala logs information about the database, the table and even the SQL statement executed.
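Because the events are JSON, they can be loaded with a standard parser. The following sketch assumes one JSON object per line; the field names (user, sql_statement, authorization_failure and so on) and the flat record layout are assumptions for illustration, and the actual schema depends on your Impala version.

```python
import json


def parse_impala_events(lines):
    """Parse one JSON event per line, skipping blank or unreadable lines."""
    events = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            events.append(json.loads(line))
        except json.JSONDecodeError:
            # A partially written or corrupted line should not stop the review.
            continue
    return events


# Hypothetical audit lines; the last one is truncated on purpose.
SAMPLE_LINES = [
    '{"timestamp": "2015-07-01 12:40:00", "user": "martymcfly", "database": "default", "table": "timeline", "sql_statement": "SELECT * FROM timeline", "authorization_failure": false}',
    '{"timestamp": "2015-07-01 12:41:30", "user": "biff", "database": "default", "table": "timeline", "sql_statement": "DROP TABLE timeline", "authorization_failure": true}',
    '{"timestamp": "2015-07-01 12:4',
]

impala_events = parse_impala_events(SAMPLE_LINES)
failures = [e for e in impala_events if e.get("authorization_failure")]

for e in failures:
    print(e["timestamp"], e["user"], e["sql_statement"])
```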
Monitoring and Log Analysis for the added benefit of Event Analysis and Alerts
Once you have set up all the Hadoop logging, an equally important step is to monitor the cluster proactively for security events, breaches and suspicious activity. And what better place to do this than Hadoop itself!
Among the many great use cases for big data, one is to use Hadoop for log ingestion and security analytics. In the past, important information contained in log files was discarded during log rotations, but now with Hadoop, smart organizations are storing all log data for active archiving. Organizations then take advantage of the large ecosystem of tools built on Hadoop for advanced persistent threat (APT) analytics, security forensics, cyber intelligence and user behavior machine learning.
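As a very small illustration of the kind of alerting such analytics can start from, the sketch below counts denied operations per user and flags anyone at or above a threshold. The event records, the field names and the threshold value are all hypothetical; a real deployment would run this kind of logic over ingested audit logs at scale rather than over an in-memory list.

```python
from collections import Counter


def users_over_threshold(events, threshold):
    """Return {user: denied_count} for users with at least `threshold` denials."""
    denied = Counter(e["user"] for e in events if not e["allowed"])
    return {user: count for user, count in denied.items() if count >= threshold}


# Hypothetical parsed audit events from any of the components above.
PARSED_EVENTS = [
    {"user": "martymcfly", "allowed": True, "cmd": "listStatus"},
    {"user": "martymcfly", "allowed": False, "cmd": "rename"},
    {"user": "biff", "allowed": False, "cmd": "delete"},
    {"user": "biff", "allowed": False, "cmd": "rename"},
    {"user": "biff", "allowed": False, "cmd": "setPermission"},
]

alerts = users_over_threshold(PARSED_EVENTS, threshold=3)
for user, count in alerts.items():
    print(f"ALERT: {user} had {count} denied operations")
```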
Stay tuned for upcoming articles on new methods and approaches to capture network, packet and DNS data on Apache Hadoop to detect potential threats using machine learning.
It is always a good idea to make sure you have enabled logging correctly even on existing clusters or after performing upgrades. And if you are not currently storing logs in Hadoop you should definitely start now.