What’s New in Hadoop 3? Top Features You Should Know

Hadoop 3 brings significant improvements over Hadoop 2.x, focusing on scalability, efficiency, and fault tolerance. Below are the standout features:

1. Java 8 Requirement

Hadoop 3 requires Java 8 as the minimum runtime. All Hadoop JARs are compiled against Java 8, so clusters still running Java 7 must upgrade before moving to Hadoop 3.

2. Erasure Coding in HDFS

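As an alternative to the default 3x replication, Hadoop 3 adds erasure coding to HDFS. With the default Reed-Solomon RS(6,3) policy, storage overhead drops from 200% to about 50%, and a block group still survives the loss of up to three blocks (3x replication survives the loss of two copies). Because reconstructing lost data costs extra CPU and network bandwidth, erasure coding is best suited to infrequently accessed data.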
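The overhead figures follow from simple arithmetic. The sketch below is only an illustration, not Hadoop code; the function names are made up for this example.

```python
def storage_overhead(data_units: int, redundant_units: int) -> float:
    """Extra storage needed, as a fraction of the original data size."""
    return redundant_units / data_units


def raw_storage(logical_size: float, data_units: int, redundant_units: int) -> float:
    """Raw capacity consumed when storing `logical_size` units of user data."""
    return logical_size * (1 + storage_overhead(data_units, redundant_units))


# 3x replication: one original copy plus two extra copies.
replication = storage_overhead(data_units=1, redundant_units=2)
# RS(6,3): every 6 data blocks are protected by 3 parity blocks.
rs_6_3 = storage_overhead(data_units=6, redundant_units=3)

print(f"3x replication overhead: {replication:.0%}")
print(f"RS(6,3) overhead: {rs_6_3:.0%}")
print(f"Raw TB for 100 TB of data, 3x replication: {raw_storage(100, 1, 2)}")
print(f"Raw TB for 100 TB of data, RS(6,3): {raw_storage(100, 6, 3)}")
```

For 100 TB of user data, this works out to 300 TB of raw capacity under 3x replication versus 150 TB under RS(6,3).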

3. YARN Timeline Service v.2

Timeline Service v.2 improves scalability and reliability by separating data collection (writes) from data serving (reads). It uses distributed collectors, essentially one per YARN application, and Apache HBase as its scalable backing store.

4. Opportunistic Containers & Distributed Scheduling

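Hadoop 3 introduces opportunistic containers, which can be dispatched to a NodeManager even when it has no free resources at that moment. They queue at the node until resources become available, which improves cluster utilization. Opportunistic containers have lower priority than guaranteed containers and are preempted when guaranteed containers need the room. By default they are allocated by the central ResourceManager, but they can also be allocated by a distributed scheduler.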
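The queue-and-preempt behaviour can be pictured with a toy model. This is a simplified sketch, not YARN's actual implementation: the ToyNode class, its method names, the memory-only accounting, and the choice of which container to preempt are all assumptions made for illustration.

```python
from collections import deque


class ToyNode:
    """Toy model of one worker node; memory (MB) is the only resource tracked."""

    def __init__(self, capacity_mb: int):
        self.capacity_mb = capacity_mb
        self.guaranteed = {}      # container id -> MB, currently running
        self.opportunistic = {}   # container id -> MB, currently running
        self.queue = deque()      # (container id, MB) of waiting opportunistic containers

    def free_mb(self) -> int:
        used = sum(self.guaranteed.values()) + sum(self.opportunistic.values())
        return self.capacity_mb - used

    def start_opportunistic(self, cid: str, mb: int) -> str:
        """Run immediately if there is room, otherwise wait in the node's queue."""
        if mb <= self.free_mb():
            self.opportunistic[cid] = mb
            return "running"
        self.queue.append((cid, mb))
        return "queued"

    def start_guaranteed(self, cid: str, mb: int) -> list:
        """Start a guaranteed container and return the ids it preempted."""
        if sum(self.guaranteed.values()) + mb > self.capacity_mb:
            raise RuntimeError("guaranteed containers alone would exceed capacity")
        preempted = []
        while mb > self.free_mb():
            # Assumption: preempt the most recently started opportunistic container.
            victim, _ = self.opportunistic.popitem()
            preempted.append(victim)
        self.guaranteed[cid] = mb
        return preempted

    def finish(self, cid: str) -> None:
        """Release a container, then start queued containers that now fit (FIFO)."""
        self.guaranteed.pop(cid, None)
        self.opportunistic.pop(cid, None)
        while self.queue and self.queue[0][1] <= self.free_mb():
            queued_id, mb = self.queue.popleft()
            self.opportunistic[queued_id] = mb


node = ToyNode(capacity_mb=8192)
node.start_guaranteed("g1", 4096)
print(node.start_opportunistic("o1", 2048))  # fits in the remaining 4096 MB
print(node.start_opportunistic("o2", 4096))  # only 2048 MB free, so it waits
print(node.start_guaranteed("g2", 4096))     # needs room, so o1 is preempted
node.finish("g1")                            # frees 4096 MB; o2 leaves the queue
print(sorted(node.opportunistic))
```

In this run, o1 starts immediately, o2 is queued, g2 preempts o1, and o2 finally starts once g1 finishes.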

5. Support for Multiple NameNodes

Hadoop 2 limited HDFS high availability to one active and one standby NameNode. Hadoop 3 allows multiple standby NameNodes, which raises fault tolerance. For instance, a cluster configured with three NameNodes and five JournalNodes can tolerate the failure of two nodes rather than just one.
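The numbers in that example come from two rules: edit log writes must reach a majority of JournalNodes, and at least one NameNode must survive to act as the active one. A minimal sketch of the arithmetic (the function names are invented for this illustration):

```python
def journalnode_failures_tolerated(journal_nodes: int) -> int:
    """Edits must be written to a majority, so only a minority may fail."""
    return (journal_nodes - 1) // 2


def namenode_failures_tolerated(name_nodes: int) -> int:
    """One NameNode has to remain to serve as the active NameNode."""
    return name_nodes - 1


# Hadoop 2 style: 2 NameNodes, 3 JournalNodes.
print(namenode_failures_tolerated(2), journalnode_failures_tolerated(3))
# Hadoop 3 example from the text: 3 NameNodes, 5 JournalNodes.
print(namenode_failures_tolerated(3), journalnode_failures_tolerated(5))
```

Note that an even number of JournalNodes adds no tolerance over the next lower odd number (four tolerate one failure, the same as three), which is why odd counts are the usual choice.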

6. Updated Default Ports

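Several default service ports in Hadoop 2 fell inside the Linux ephemeral port range (32768-61000), so a service could fail to start when another process had already taken its port. Hadoop 3 moves these defaults out of that range: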

NameNode web UI: 9870 (was 50070)

DataNode data transfer: 9866 (was 50010)

Secondary NameNode web UI: 9868 (was 50090)
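When upgrading, scripts and bookmarks that hard-code the old defaults need updating. The helper below is a hypothetical sketch covering only the three ports listed above; the names OLD_TO_NEW_PORTS and migrate_address, and the example hostname, are assumptions for illustration.

```python
# Old Hadoop 2 default port -> new Hadoop 3 default port (only the ports listed above).
OLD_TO_NEW_PORTS = {
    50070: 9870,  # NameNode
    50010: 9866,  # DataNode
    50090: 9868,  # Secondary NameNode
}


def migrate_address(address: str) -> str:
    """Rewrite a 'host:port' string if it uses one of the old default ports."""
    host, sep, port = address.rpartition(":")
    if not sep or not port.isdigit():
        return address  # no port present, nothing to change
    new_port = OLD_TO_NEW_PORTS.get(int(port))
    return f"{host}:{new_port}" if new_port else address


print(migrate_address("nn1.example.com:50070"))
print(migrate_address("nn1.example.com:8020"))
```

Addresses that use a port outside the table, or no port at all, are returned unchanged.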

7. Intra-DataNode Balancer

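The existing HDFS balancer evens out data between DataNodes, but not between the disks inside a single DataNode. The new hdfs diskbalancer command handles that intra-node skew, which typically appears after disks are added or replaced, and moves data so that storage utilization is balanced across the node's disks.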

8. Enhanced Heap Management

Heap size configuration is now more flexible. The new HADOOP_HEAPSIZE_MAX and HADOOP_HEAPSIZE_MIN variables replace the deprecated HADOOP_HEAPSIZE, and the hard-coded default heap size has been removed, so the JVM can auto-tune the heap based on the memory available on the host.

9. Generalized YARN Resource Model

YARN's resource model is extended to support custom resources like GPUs and software licenses, enabling more versatile resource scheduling beyond just CPU and memory.
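Conceptually, every request and every node becomes a vector of named resources, and a request fits only if each dimension fits. The sketch below illustrates the idea and is not YARN's scheduler code: the fits function is invented for this example, "memory-mb", "vcores", and "yarn.io/gpu" follow YARN's resource naming, and "licenses" is a made-up custom resource.

```python
def fits(request: dict, available: dict) -> bool:
    """A request fits if every requested resource is available in sufficient quantity."""
    return all(available.get(name, 0) >= amount for name, amount in request.items())


# Resources one node currently has free.
node_free = {"memory-mb": 16384, "vcores": 8, "yarn.io/gpu": 2, "licenses": 1}

gpu_job = {"memory-mb": 8192, "vcores": 4, "yarn.io/gpu": 2}
big_gpu_job = {"memory-mb": 8192, "vcores": 4, "yarn.io/gpu": 4}
licensed_job = {"memory-mb": 2048, "vcores": 1, "licenses": 1}

print(fits(gpu_job, node_free))       # enough of every resource
print(fits(big_gpu_job, node_free))   # asks for more GPUs than the node has
print(fits(licensed_job, node_free))  # the custom resource is treated like any other
```

A resource the node does not advertise is treated as zero, so a request for it never fits on that node.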

10. S3Guard for S3A Client

To improve consistency and performance when working with Amazon S3, Hadoop 3 introduces S3Guard, an optional feature of the S3A client that stores file and directory metadata in an Amazon DynamoDB table. This gives the client consistent listings and faster metadata operations.

📝 Summary

Hadoop 3 represents a significant evolution from its predecessor, offering enhanced storage efficiency, improved fault tolerance, and greater scalability. These advancements make it a robust choice for modern big data applications.

