Big Data Security Assessment

What is Big Data?

Big Data refers to datasets whose size and/or structure is beyond the ability of traditional software tools or database systems to store, process, and analyze within reasonable timeframes.

Hadoop is one of the main computing environments for such workloads. It is built on top of a distributed, clustered file system (HDFS) that was designed specifically for large-scale data operations, and it has been widely embraced by enterprises.

Benefits of Big Data and Data Analytics

  • Big data makes it possible to gain more complete answers because you have more information.
  • More complete answers mean more confidence in the data, which in turn enables a fundamentally different approach to tackling problems.

Security Issues

  • The primary goal of an attacker is to obtain the sensitive data that sits in a Big Data cluster. Organizations collect and process large amounts of sensitive information about customers, employees, intellectual property (IP), and finances. Such confidential information is aggregated and centralized in one place for analysis in order to increase its value. This centralization makes the data a valuable target for attackers, and the confidential information it contains might be exposed.
  • Attacks may also attempt to destroy or modify data, or to disrupt the availability of the platform.

Security Strategy

Understanding the architecture and cluster composition of the ecosystem in place is the first step in putting together a security strategy. It is important to understand each component interface as a potential attack target.

Each component offers an attacker a specific set of potential exploits, while defenders have a corresponding set of options for attack detection and prevention.

Big Data Security Issues

Threats to Big Data Platforms

Data access & ownership
Relational and quasi-relational platforms include roles, groups, schemas, label security, and various other facilities for limiting user access to subsets of the available data. Authentication and authorization requirements shall be assessed when managing the cluster, to limit access to sensitive data.
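
As a minimal illustration of role-based limits, the sketch below checks whether any of a user's roles grants access to a dataset; the roles and dataset names are hypothetical, not taken from any particular platform.

    # Minimal role-based access check (illustrative only; roles and
    # dataset names are hypothetical, not tied to a specific platform).
    ROLE_GRANTS = {
        "analyst": {"sales_aggregates"},
        "finance": {"sales_aggregates", "payroll"},
        "admin":   {"sales_aggregates", "payroll", "customer_pii"},
    }

    def can_read(user_roles, dataset):
        """Return True if any of the user's roles grants read access."""
        return any(dataset in ROLE_GRANTS.get(role, set()) for role in user_roles)

    print(can_read(["analyst"], "payroll"))             # False
    print(can_read(["analyst", "finance"], "payroll"))  # True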

Logging

Logging capabilities in the big data ecosystem, both open source and commercial, shall be assessed for proper implementation. We need to verify that logs are configured to capture both the correct event types and sufficient detail to reconstruct user actions, including the queries executed.
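
As an example of what to verify, HDFS NameNode audit events are typically written as tab-separated key=value pairs; a small parser like the sketch below (the allowed/ugi/ip/cmd/src field layout is an assumption to check against your distribution's audit format) confirms that user identity and executed commands are actually being captured.

    import re

    # Parse HDFS NameNode audit lines of the (assumed) common form:
    #   ... FSNamesystem.audit: allowed=true  ugi=alice (auth:KERBEROS)
    #   ip=/10.0.0.5  cmd=open  src=/data/secret.csv  dst=null  perm=null
    AUDIT_FIELD = re.compile(r"(\w+)=([^\t]+)")

    def parse_audit_line(line):
        fields = dict(AUDIT_FIELD.findall(line))
        return {k: fields.get(k) for k in ("allowed", "ugi", "ip", "cmd", "src")}

    line = ("2024-01-01 12:00:00 INFO FSNamesystem.audit: allowed=true\t"
            "ugi=alice (auth:KERBEROS)\tip=/10.0.0.5\tcmd=open\t"
            "src=/data/secret.csv\tdst=null\tperm=null")
    print(parse_audit_line(line))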

Security Monitoring
Built-in monitoring tools that detect misuse or block malicious queries shall be validated and assessed. Database activity monitoring technologies can help flag, or even block, misuse.
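
As a toy illustration of such monitoring, the sketch below applies one detection rule over parsed audit events like those in the previous example; the threshold and sensitive path prefix are hypothetical and would need tuning per cluster.

    from collections import Counter

    # One toy detection rule over parsed audit events (threshold and
    # path prefix are hypothetical): flag users reading unusually many
    # files under a sensitive location.
    SENSITIVE_PREFIX = "/data/pii/"
    READ_THRESHOLD = 100  # reads per window; tune per cluster

    def flag_bulk_readers(events):
        reads = Counter(
            e["ugi"] for e in events
            if e.get("cmd") == "open"
            and (e.get("src") or "").startswith(SENSITIVE_PREFIX)
        )
        return [user for user, count in reads.items() if count > READ_THRESHOLD]

    sample = [{"ugi": "mallory", "cmd": "open", "src": f"/data/pii/f{i}"}
              for i in range(150)]
    print(flag_bulk_readers(sample))  # ['mallory']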

Data at rest protection
Encryption can help protect against attempts to access data outside established application interfaces. Unauthorized copying of archives or direct reading of files from disk can be mitigated using encryption at the file or HDFS layer. This ensures files are protected against direct access, as only the file services hold the encryption keys. Third-party products can provide advanced transparent encryption options for both HDFS and non-HDFS file formats. Transport Layer Security (TLS) provides confidentiality of data in transit, authentication via certificates, and data integrity verification.
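
As a sketch of HDFS-layer encryption, the commands below drive Hadoop's transparent-encryption CLI from Python; this assumes a configured Hadoop KMS, and the key name and zone path are examples.

    import subprocess

    # Sketch: create an HDFS encryption zone with the standard
    # transparent-encryption CLI (requires a configured Hadoop KMS).
    # Key name and path are examples, not conventions.
    key, zone = "pii-key", "/data/pii"

    subprocess.run(["hadoop", "key", "create", key], check=True)
    # The zone directory must exist and be empty before conversion.
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", zone], check=True)
    subprocess.run(["hdfs", "crypto", "-createZone",
                    "-keyName", key, "-path", zone], check=True)
    subprocess.run(["hdfs", "crypto", "-listZones"], check=True)  # verify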

Inter-node communication
Data in transit, along with application queries, may be open to inspection and tampering when unencrypted RPC over TCP/IP is used. Ensure the TLS and SSL capabilities bundled with big data distributions are enabled.
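
One way to verify these settings is to read a daemon's effective configuration over HTTP from its /conf endpoint. In the sketch below, the host, port, and JSON response shape are assumptions to confirm against your distribution; the two properties checked are the standard Hadoop switches for RPC and block-transfer encryption.

    import json
    from urllib.request import urlopen

    # Check a daemon's effective settings via its /conf endpoint
    # (host, port, and JSON shape are assumptions to verify locally).
    NAMENODE = "http://namenode.example.com:9870"
    EXPECTED = {
        "hadoop.rpc.protection": "privacy",   # encrypt RPC traffic
        "dfs.encrypt.data.transfer": "true",  # encrypt block transfers
    }

    with urlopen(f"{NAMENODE}/conf?format=json") as resp:
        props = {p["key"]: p["value"] for p in json.load(resp)["properties"]}

    for key, want in EXPECTED.items():
        got = props.get(key)
        status = "OK" if got == want else "MISCONFIGURED"
        print(f"{status}: {key} = {got!r} (want {want!r})")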

Multi-tenancy

When multiple applications and 'tenants' are served in the ecosystem, we need to ensure one tenant cannot read another's data. Encryption zones, which are built into native HDFS, support this separation, and additional security controls such as Access Control Entries (ACEs) or Access Control Lists (ACLs) shall be implemented to ensure privacy.
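
A minimal sketch of per-tenant isolation using HDFS ACLs follows; the tenant group names and directory layout are examples, and ACL support must be enabled on the NameNode.

    import subprocess

    # Sketch: give each tenant group access to its own directory only,
    # using HDFS ACLs (requires dfs.namenode.acls.enabled=true).
    # Tenant names and paths are examples.
    tenants = ["tenant_a", "tenant_b"]

    for t in tenants:
        path = f"/data/{t}"
        subprocess.run(["hdfs", "dfs", "-mkdir", "-p", path], check=True)
        # Owner and group get rwx; everyone else gets nothing.
        subprocess.run(["hdfs", "dfs", "-chmod", "770", path], check=True)
        subprocess.run(["hdfs", "dfs", "-setfacl", "-m",
                        f"group:{t}:rwx", path], check=True)
        subprocess.run(["hdfs", "dfs", "-getfacl", path], check=True)  # verify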

Client interaction
Gateway services shall be created to load data, rather than having clients communicate directly with both resource managers and individual data nodes, since compromised clients may send malicious data or links to services. Apache Knox, discussed under Security Solutions below, is one such gateway.

API security
Ensure the big data cluster's APIs are protected from code injection, command injection, and buffer overflow attacks.
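
Input validation is the first line of defense here. The sketch below shows a generic allow-list check for an API parameter; the parameter name and pattern are illustrative, not tied to any particular big data API.

    import re

    # Allow-list validation for an API path parameter (name and pattern
    # are illustrative). Rejecting anything outside a strict, bounded
    # pattern blocks most injection and traversal payloads outright,
    # and the length cap also guards against oversized inputs.
    SAFE_NAME = re.compile(r"[A-Za-z0-9_-]{1,64}")

    def validate_dataset_name(name: str) -> str:
        if not SAFE_NAME.fullmatch(name):
            raise ValueError(f"rejected unsafe dataset name: {name!r}")
        return name

    print(validate_dataset_name("sales_2024"))  # passes
    try:
        validate_dataset_name("../etc/passwd; rm -rf /")
    except ValueError as err:
        print(err)  # rejected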

Holistic Approach for Big Data Security Operations

  • Segregate administrative roles and restrict unwanted access to a minimum.
  • Direct access to files or data shall be addressed through a combination of role-based authorization, access control lists, file permissions, and segregation of administrative roles.

Authentication and perimeter security

  • Ensure nodes are authenticated before they join a cluster. If an attacker can add a node they control to the cluster, they can exfiltrate data. Certificate-based identity options can provide strong authentication and improve security; a connection-level certificate check is sketched below.
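
A minimal sketch of such a check follows: before trusting a node, verify that it presents a certificate chaining to the cluster's own CA. The hostname, port, and CA bundle path are assumptions.

    import socket, ssl

    # Verify a node presents a certificate signed by our cluster CA
    # before trusting it (host, port, and CA bundle path are examples;
    # the DataNode HTTPS port varies by distribution).
    CLUSTER_CA = "/etc/cluster/ca.pem"

    def node_cert_is_valid(host: str, port: int) -> bool:
        ctx = ssl.create_default_context(cafile=CLUSTER_CA)
        try:
            with socket.create_connection((host, port), timeout=5) as sock:
                with ctx.wrap_socket(sock, server_hostname=host):
                    return True  # handshake succeeded: cert chains to our CA
        except (ssl.SSLError, OSError):
            return False

    print(node_cert_is_valid("datanode01.example.com", 9865))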

Data protection

  • Tokenization, masking, and data-element encryption tools help support a data-centric security implementation when the systems that process data cannot be fully trusted, or when we do not want to share raw data with users. Both techniques are sketched below.
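
The sketch below illustrates both techniques in miniature: deterministic tokenization via HMAC (so equal values remain joinable without exposing the original) and partial masking. The secret and field formats are examples only.

    import hashlib, hmac

    # Sketch of two data-centric protections (secret and field formats
    # are examples): deterministic tokenization and partial masking.
    SECRET = b"rotate-me-and-store-in-a-vault"  # never hard-code in production

    def tokenize(value: str) -> str:
        """Stable, non-reversible token; equal inputs give equal tokens."""
        return hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()[:16]

    def mask(card_number: str) -> str:
        """Show only the last four digits."""
        return "*" * (len(card_number) - 4) + card_number[-4:]

    print(tokenize("4111111111111111"))  # joinable surrogate token
    print(mask("4111111111111111"))      # ************1111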

Configuration and patch management

  • Keep encryption keys, certificates, and open-source libraries up to date; it is common for hundreds of nodes to unintentionally run different configurations and remain unpatched.
  • Use configuration management tools, recommended configurations, and pre-deployment checklists. A simple drift-detection sketch follows this list.
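
A simple way to spot drift is to fingerprint each node's effective configuration, as sketched below; the hostnames, port, and /conf JSON shape are assumptions to adapt to your cluster.

    import hashlib, json
    from urllib.request import urlopen

    # Detect configuration drift by hashing each node's effective
    # configuration from its /conf endpoint (hostnames, port, and
    # JSON shape are assumptions to adapt per cluster).
    NODES = ["dn01.example.com", "dn02.example.com", "dn03.example.com"]
    PORT = 9864  # DataNode HTTP port; adjust per distribution

    def conf_fingerprint(node):
        with urlopen(f"http://{node}:{PORT}/conf?format=json") as resp:
            props = sorted((p["key"], p["value"])
                           for p in json.load(resp)["properties"])
        return hashlib.sha256(json.dumps(props).encode()).hexdigest()[:12]

    fingerprints = {node: conf_fingerprint(node) for node in NODES}
    if len(set(fingerprints.values())) > 1:
        print("DRIFT DETECTED:", fingerprints)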

Security Solutions

Apache Ranger — Ranger is a policy administration tool for Hadoop clusters. It includes a broad set of management functions, including auditing, key management, and fine-grained data access policies across HDFS, Hive, YARN, Solr, Kafka, and other modules.
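
As a sketch of policy administration through Ranger's public v2 REST API, the request below creates an HDFS path policy; the Ranger host, service name, group, path, and credentials are examples, and the payload shape should be checked against your Ranger version.

    import base64, json
    from urllib.request import Request, urlopen

    # Sketch: create an HDFS path policy via Ranger's public v2 REST
    # API (host, service name, group, path, and credentials are
    # examples; verify the payload against your Ranger version).
    RANGER = "http://ranger.example.com:6080"
    auth = base64.b64encode(b"admin:changeme").decode()

    policy = {
        "service": "cluster_hadoop",  # Ranger service name (example)
        "name": "finance-read-pii",
        "resources": {"path": {"values": ["/data/pii"], "isRecursive": True}},
        "policyItems": [{
            "groups": ["finance"],
            "accesses": [{"type": "read", "isAllowed": True}],
        }],
    }

    req = Request(f"{RANGER}/service/public/v2/api/policy",
                  data=json.dumps(policy).encode(),
                  headers={"Content-Type": "application/json",
                           "Authorization": f"Basic {auth}"})
    with urlopen(req) as resp:
        print(resp.status, resp.read()[:200])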

Apache Ambari — Ambari is a facility for provisioning and managing Hadoop clusters. It helps administrators set configurations and propagate changes to the entire cluster.
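
For example, Ambari exposes cluster state over a REST API; the sketch below lists the hosts of a cluster, with the Ambari host, cluster name, and credentials as placeholder examples.

    import base64, json
    from urllib.request import Request, urlopen

    # Sketch: list cluster hosts through Ambari's REST API (host,
    # cluster name, and credentials are placeholder examples).
    AMBARI = "http://ambari.example.com:8080"
    auth = base64.b64encode(b"admin:changeme").decode()

    req = Request(f"{AMBARI}/api/v1/clusters/prod/hosts",
                  headers={"Authorization": f"Basic {auth}"})
    with urlopen(req) as resp:
        for item in json.load(resp)["items"]:
            print(item["Hosts"]["host_name"])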

Apache Knox — You can think of Knox as a Hadoop firewall. More precisely, it is an API gateway. It handles HTTP and RESTful requests, enforcing authentication and usage policies on inbound requests and blocking everything else.
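
The sketch below routes a WebHDFS directory listing through Knox instead of contacting the NameNode directly; the gateway host, topology name in the URL, and credentials are examples, and a CA-trusted gateway certificate is assumed.

    import base64, json
    from urllib.request import Request, urlopen

    # Sketch: list an HDFS directory through the Knox gateway rather
    # than the NameNode directly (gateway host, 'default' topology,
    # and credentials are examples; assumes a CA-trusted certificate).
    KNOX = "https://knox.example.com:8443"
    auth = base64.b64encode(b"alice:secret").decode()

    req = Request(f"{KNOX}/gateway/default/webhdfs/v1/data?op=LISTSTATUS",
                  headers={"Authorization": f"Basic {auth}"})
    with urlopen(req) as resp:
        for f in json.load(resp)["FileStatuses"]["FileStatus"]:
            print(f["pathSuffix"], f["type"])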

Monitoring — Hive, PIQL, Impala, Spark SQL, and similar modules offer SQL or pseudo-SQL syntax. This enables you to leverage activity monitoring, dynamic masking, redaction, and tokenization technologies originally developed for relational platforms.
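
As a toy example of redaction applied to query output, the sketch below masks anything matching a US SSN pattern before rows reach the caller; the pattern and row format are illustrative.

    import re

    # Toy redaction pass over query results (pattern and row format
    # are illustrative): mask anything that looks like a US SSN
    # before it reaches the caller.
    SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

    def redact(row: str) -> str:
        return SSN.sub("***-**-****", row)

    print(redact("alice,123-45-6789,approved"))
    # -> alice,***-**-****,approved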