
Commit ab24570
Author: Felix Hennig
Update rack awareness page
1 parent ee010b6

File tree: 1 file changed, 16 additions (+), 11 deletions (-)
@@ -1,18 +1,17 @@
 = HDFS Rack Awareness
+:rack-awareness-docs: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/RackAwareness.html
+:hdfs-topology-provider: https://github.com/stackabletech/hdfs-topology-provider
 
-Apache Hadoop supports a feature called Rack Awareness, which allows users to define a topology for the nodes making up a cluster.
-Hadoop then uses that topology to spread out replicas of blocks in a fashion that maximizes fault tolerance.
+{rack-awareness-docs}[Rack awareness] is a feature in Apache Hadoop that allows users to define a cluster's node topology.
+Hadoop uses that topology to distribute block replicas in a way that maximizes fault tolerance.
 
-The default write path, for example, is to put replicas of a newly created block first on a different node, but within the same rack, and the second copy on a node in a remote rack.
-In order for this to work properly, Hadoop needs to have access to the information about the underlying infrastructure it runs on. In a Kubernetes environment, this means obtaining information from the pods or nodes of the cluster.
+For example, when a new block is created, the default behavior is to place one replica on a different node within the same rack, and another on a node in a remote rack.
+To do this effectively, Hadoop must access information about the underlying infrastructure.
+In a Kubernetes environment, this involves retrieving data from Pods or Nodes in the cluster.
 
-In order to enable gathering this information the Hadoop images contain https://github.com/stackabletech/hdfs-topology-provider on the classpath, which can be configured to read labels from Kubernetes objects.
+== Configuring rack awareness
 
-In the current version of the SDP this is now exposed as fully integrated functionality in the operator, and no longer needs to be configured via config overrides.
-
-NOTE: Prior to SDP release 24.3, it was necessary to manually deploy RBAC objects to allow the Hadoop pods access to the necessary Kubernetes objects. This ClusterRole allows the reading of pods and nodes and needs to be bound to the individual ServiceAccounts that are deployed per Hadoop cluster: this is now performed by the operator itself.
-
-Configuration of the tool is done by using the field `rackAwareness` under the cluster configuration:
+To configure rack awareness, use the `rackAwareness` field in the cluster configuration:
 
 [source,yaml]
 ----
@@ -29,4 +28,10 @@ spec:
 ...
 ----
 
-Internally this is used to create a topology label consisting of the value of the node label `topology.kubernetes.io/zone` and the pod label `app.kubernetes.io/role-group`, e.g. `/eu-central-1/rg1`.
+This creates an internal topology label by combining the values of the `topology.kubernetes.io/zone` Node label and the `app.kubernetes.io/role-group` Pod label (e.g. `/eu-central-1/rg1`).
+
+== How it works
+
+To gather this information, the Hadoop images include the {hdfs-topology-provider}[hdfs-topology-provider] on the classpath, which can be configured to read labels from Kubernetes objects.
+
+The operator deploys ClusterRoles and ServiceAccounts with the relevant RBAC rules to allow the Hadoop Pods to access the necessary Kubernetes objects.
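
The `[source,yaml]` block in the diff is elided (its body is unchanged and shown only as `...`). As context, a hypothetical sketch of what a `rackAwareness` configuration could look like follows; the `clusterConfig` placement and the `nodeLabel`/`podLabel` keys are assumptions inferred from the two labels named in the diff, not taken from the elided block:

[source,yaml]
----
# Hypothetical sketch only; the exact schema may differ from the elided example.
apiVersion: hdfs.stackable.tech/v1alpha1
kind: HdfsCluster
metadata:
  name: simple-hdfs
spec:
  clusterConfig:
    rackAwareness:
      # Read the zone from a Node label and the role group from a Pod label;
      # their values are joined into a topology label such as /eu-central-1/rg1.
      - nodeLabel: topology.kubernetes.io/zone
      - podLabel: app.kubernetes.io/role-group
----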
