docs: improve install section, improve style, add pod override info #587

Merged · 8 commits · Sep 27, 2024
8 changes: 4 additions & 4 deletions docs/modules/hdfs/pages/getting_started/first_steps.adoc
@@ -1,7 +1,7 @@
= First steps
:description: Deploy and verify an HDFS cluster with Stackable by setting up Zookeeper and HDFS components, then test file operations using WebHDFS API.

Once you have followed the steps in the xref:getting_started/installation.adoc[] section to install the operator and its dependencies, you will now deploy an HDFS cluster and its dependencies.
Once you have followed the steps in the xref:getting_started/installation.adoc[] section to install the operator and its dependencies, now deploy an HDFS cluster and its dependencies.
Afterward, you can <<_verify_that_it_works, verify that it works>> by creating, verifying and deleting a test file in HDFS.

== Setup
@@ -13,7 +13,7 @@ To deploy a Zookeeper cluster create one file called `zk.yaml`:
[source,yaml]
include::example$getting_started/zk.yaml[]
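
For orientation, a minimal `zk.yaml` follows roughly this shape (a sketch only; the resource name and product version are assumptions, the included example file is authoritative):

[source,yaml]
----
apiVersion: zookeeper.stackable.tech/v1alpha1
kind: ZookeeperCluster
metadata:
  name: simple-zk  # hypothetical name, referenced later by the ZNode and the HDFS cluster
spec:
  image:
    productVersion: "3.8.4"  # assumed version, use the one from the included example
  servers:
    roleGroups:
      default:
        replicas: 3
----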

We also need to define a ZNode that will be used by the HDFS cluster to reference Zookeeper.
Define a ZNode that is used by the HDFS cluster to reference Zookeeper.
Create another file called `znode.yaml`:

[source,yaml]
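
The ZNode resource itself is small; a sketch of what `znode.yaml` typically contains (names are assumptions, the example file in the repository is authoritative):

[source,yaml]
----
apiVersion: zookeeper.stackable.tech/v1alpha1
kind: ZookeeperZnode
metadata:
  name: simple-hdfs-znode  # hypothetical name, referenced by the HDFS cluster
spec:
  clusterRef:
    name: simple-zk  # must match the name of the ZookeeperCluster above
----
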
@@ -94,7 +94,7 @@ Then use `curl` to issue a `PUT` command:
[source]
include::example$getting_started/getting_started.sh[tag=create-file]

This will return a location that will look something like this:
This returns a location that looks similar to this:

[source]
http://simple-hdfs-datanode-default-0.simple-hdfs-datanode-default.default.svc.cluster.local:9864/webhdfs/v1/testdata.txt?op=CREATE&user.name=stackable&namenoderpcaddress=simple-hdfs&createflag=&createparent=true&overwrite=false
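
For reference, creating a file over WebHDFS is a two-step exchange; a sketch with an assumed NameNode host name (the exact commands live in `getting_started.sh`):

[source,bash]
----
# 1. Ask the NameNode where to write; the response redirects to a DataNode location.
curl -s -i -X PUT \
  "http://simple-hdfs-namenode-default-0:9870/webhdfs/v1/testdata.txt?op=CREATE&user.name=stackable"

# 2. Upload the file content to the location returned in step 1.
curl -s -X PUT -T testdata.txt "<location from step 1>"
----
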
@@ -109,7 +109,7 @@ Rechecking the status again with:
[source]
include::example$getting_started/getting_started.sh[tag=file-status]

will now display some metadata about the file that was created in the HDFS cluster:
now displays some metadata about the file that was created in the HDFS cluster:

[source,json]
{
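
The status check itself is a plain WebHDFS `GETFILESTATUS` call; a minimal sketch with an assumed NameNode host (the real command is part of `getting_started.sh`):

[source,bash]
----
curl -s "http://simple-hdfs-namenode-default-0:9870/webhdfs/v1/testdata.txt?op=GETFILESTATUS&user.name=stackable"
----
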
8 changes: 4 additions & 4 deletions docs/modules/hdfs/pages/getting_started/index.adoc
@@ -1,18 +1,18 @@
= Getting started
:description: Start with HDFS using the Stackable Operator. Install the Operator, set up your HDFS cluster, and verify its operation with this guide.

This guide will get you started with HDFS using the Stackable Operator.
It will guide you through the installation of the Operator and its dependencies, setting up your first HDFS cluster and verifying its operation.
This guide gets you started with HDFS using the Stackable operator.
It guides you through the installation of the operator and its dependencies, setting up your first HDFS cluster and verifying its operation.

== Prerequisites

You will need:
You need:

* a Kubernetes cluster
* kubectl
* optional: Helm

Resource sizing depends on cluster type(s), usage and scope, but as a starting point we recommend a minimum of the following resources for this operator:
Resource sizing depends on cluster type(s), usage and scope, but as a starting point the following resources are recommended as a minimum requirement for this operator:

* 0.2 cores (e.g. i5 or similar)
* 256MB RAM
39 changes: 21 additions & 18 deletions docs/modules/hdfs/pages/getting_started/installation.adoc
@@ -1,39 +1,41 @@
= Installation
:description: Install the Stackable HDFS operator and dependencies using stackablectl or Helm. Follow steps for setup and verification in Kubernetes.
:kind: https://kind.sigs.k8s.io/

On this page you will install the Stackable HDFS operator and its dependency, the Zookeeper operator, as well as the
Install the Stackable HDFS operator and its dependency, the Zookeeper operator, as well as the
commons, secret and listener operators which are required by all Stackable operators.

== Stackable Operators

There are 2 ways to run Stackable Operators

. Using xref:management:stackablectl:index.adoc[]
. Using Helm

=== stackablectl
There are multiple ways to install the Stackable operators.
xref:management:stackablectl:index.adoc[] is the preferred way, but Helm is also supported.
OpenShift users may prefer installing the operator from the Red Hat Certified Operator catalog using the OpenShift web console.

[tabs]
====
stackablectl::
+
--
`stackablectl` is the command line tool to interact with Stackable operators and our recommended way to install
operators. Follow the xref:management:stackablectl:installation.adoc[installation steps] for your platform.

After you have installed `stackablectl`, run the following command to install all operators necessary for the HDFS
cluster:
After you have installed `stackablectl`, run the following command to install all operators necessary for the HDFS cluster:

[source,bash]
----
include::example$getting_started/getting_started.sh[tag=stackablectl-install-operators]
----
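
The included script boils down to a single `stackablectl` call of roughly this shape (a sketch; the operator list and any pinned versions in the real script may differ):

[source,bash]
----
stackablectl operator install commons secret listener zookeeper hdfs
----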

The tool will show
The tool prints

[source]
include::example$getting_started/install_output.txt[]

TIP: Consult the xref:management:stackablectl:quickstart.adoc[] to learn more about how to use `stackablectl`. For
example, you can use the `--cluster kind` flag to create a Kubernetes cluster with link:https://kind.sigs.k8s.io/[kind].

=== Helm
TIP: Consult the xref:management:stackablectl:quickstart.adoc[] to learn more about how to use `stackablectl`.
For example, you can use the `--cluster kind` flag to create a Kubernetes cluster with {kind}[kind].
--

Helm::
+
--
You can also use Helm to install the operators. Add the Stackable Helm repository:
[source,bash]
----
@@ -46,8 +48,9 @@ Then install the Stackable Operators:
include::example$getting_started/getting_started.sh[tag=helm-install-operators]
----
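
Roughly, the Helm route looks like this (a sketch; repository name, operator list and versions are assumptions, the included script is authoritative):

[source,bash]
----
helm repo add stackable-stable https://repo.stackable.tech/repository/helm-stable/
helm repo update
helm install commons-operator stackable-stable/commons-operator
helm install secret-operator stackable-stable/secret-operator
helm install listener-operator stackable-stable/listener-operator
helm install zookeeper-operator stackable-stable/zookeeper-operator
helm install hdfs-operator stackable-stable/hdfs-operator
----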

Helm will deploy the operators in a Kubernetes Deployment and apply the CRDs for the HDFS cluster (as well as the CRDs
for the required operators). You are now ready to deploy HDFS in Kubernetes.
Helm deploys the operators in a Kubernetes Deployment and applies the CRDs for the HDFS cluster (as well as the CRDs for the required operators).
--
====

== What's next

4 changes: 1 addition & 3 deletions docs/modules/hdfs/pages/index.adoc
@@ -18,9 +18,7 @@ The operator depends on the xref:zookeeper:index.adoc[] to operate a ZooKeeper c

== Getting started

Follow the xref:getting_started/index.adoc[Getting started guide] which will guide you through installing the Stackable
HDFS and ZooKeeper operators, setting up ZooKeeper and HDFS and writing a file to HDFS to verify that everything is set
up correctly.
Follow the xref:getting_started/index.adoc[Getting started guide] which guides you through installing the Stackable HDFS and ZooKeeper operators, setting up ZooKeeper and HDFS and writing a file to HDFS to verify that everything is set up correctly.

Afterwards you can consult the xref:usage-guide/index.adoc[] to learn more about tailoring your HDFS configuration to
your needs, or have a look at the <<demos, demos>> for some example setups.
@@ -23,7 +23,7 @@ stackable-hdfs-operator run --product-config /foo/bar/properties.yaml

*Multiple values:* false

The operator will **only** watch for resources in the provided namespace `test`:
The operator **only** watches for resources in the provided namespace `test`:

[source]
----
@@ -36,7 +36,7 @@ docker run \

*Multiple values:* false

The operator will **only** watch for resources in the provided namespace `test`:
The operator **only** watches for resources in the provided namespace `test`:

[source]
----
@@ -50,9 +50,10 @@ nameNodes:
replicas: 2
----

All override property values must be strings. The properties will be formatted and escaped correctly into the XML file.
All override property values must be strings.
The properties are formatted and escaped correctly into the XML file.
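
As an illustration, an override on the nameNodes role could look roughly like this (the property and its value are only an example, not a recommendation):

[source,yaml]
----
nameNodes:
  configOverrides:
    hdfs-site.xml:
      dfs.namenode.handler.count: "42"  # note the quotes, values must be strings
  roleGroups:
    default:
      replicas: 2
----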

For a full list of configuration options we refer to the Apache Hdfs documentation for https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml[hdfs-site.xml] and https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/core-default.xml[core-site.xml].
For a full list of configuration options refer to the Apache HDFS documentation for https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml[hdfs-site.xml] and https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/core-default.xml[core-site.xml].

=== The security.properties file

@@ -117,4 +118,10 @@ nameNodes:
replicas: 1
----

IMPORTANT: Some environment variables will be overriden by the operator and cannot be set manually by the user. These are `HADOOP_HOME`, `HADOOP_CONF_DIR`, `POD_NAME` and `ZOOKEEPER`.
IMPORTANT: Some environment variables are overridden by the operator and cannot be set manually by the user.
These are `HADOOP_HOME`, `HADOOP_CONF_DIR`, `POD_NAME` and `ZOOKEEPER`.
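
For everything else, an environment variable override might be sketched like this (variable name and value are purely illustrative):

[source,yaml]
----
nameNodes:
  envOverrides:
    MY_ENV_VAR: "some-value"  # hypothetical variable; the reserved ones listed above cannot be set
  roleGroups:
    default:
      replicas: 1
----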

== Pod overrides

The HDFS operator also supports Pod overrides, allowing you to override any property that you can set on a Kubernetes Pod.
Read the xref:concepts:overrides.adoc#pod-overrides[Pod overrides documentation] to learn more about this feature.
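
As a rough sketch, a Pod override on the dataNodes role could for example add a toleration (the values below are assumptions, not defaults):

[source,yaml]
----
dataNodes:
  podOverrides:
    spec:
      tolerations:
        - key: dedicated
          operator: Equal
          value: hdfs
          effect: NoSchedule
  roleGroups:
    default:
      replicas: 3
----
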
23 changes: 11 additions & 12 deletions docs/modules/hdfs/pages/usage-guide/fuse.adoc
@@ -1,15 +1,13 @@
= FUSE
:description: Use HDFS FUSE driver to mount HDFS filesystems into Linux environments via a Kubernetes Pod with necessary privileges and configurations.

Our images of Apache Hadoop do contain the necessary binaries and libraries to use the HDFS FUSE driver.

FUSE is short for _Filesystem in Userspace_ and allows a user to export a filesystem into the Linux kernel, which can then be mounted.
HDFS contains a native FUSE driver/application, which means that an existing HDFS filesystem can be mounted into a Linux environment.
HDFS contains a native FUSE driver/application, enabling you to mount an existing HDFS filesystem into a Linux environment.

Stackable images of Apache Hadoop contain the necessary binaries and libraries to use the HDFS FUSE driver.
To use the FUSE driver you can either copy the required files out of the image and run it on a host outside of Kubernetes or you can run it in a Pod.
This Pod, however, will need some extra capabilities.
This Pod, however, needs some extra capabilities.

This is an example Pod that will work _as long as the host system that is running the kubelet does support FUSE_:
This is an example Pod that works _as long as the host system running the kubelet supports FUSE_:

[source,yaml]
----
@@ -39,8 +37,9 @@ spec:
configMap:
name: <your hdfs here> <2>
----
<1> Ideally use the same version your HDFS is using. FUSE is baked in to our images as of SDP 23.11.
<2> This needs to be a reference to a discovery ConfigMap as written by our HDFS operator.
<1> Ideally use the same version your HDFS is using.
Stackable HDFS images contain the FUSE driver since 23.11.
<2> This needs to be a reference to a discovery ConfigMap as written by the HDFS operator.

[TIP]
.Privileged Pods
@@ -57,7 +56,7 @@ securityContext:
----

Unfortunately, there is no way around some extra privileges.
In Kubernetes the Pods usually share the Kernel with the host running the Kubelet, which means a Pod wanting to use FUSE will need access to the underlying Kernel modules.
In Kubernetes the Pods usually share the Kernel with the host running the Kubelet, which means a Pod wanting to use FUSE needs access to the underlying Kernel modules.
====

Inside this Pod you can get a shell (e.g. using `kubectl exec --stdin --tty hdfs-fuse -- /bin/bash`) to get access to a script called `fuse_dfs_wrapper` (it is in the `PATH` of our Hadoop images).
@@ -70,14 +69,14 @@ To mount HDFS call the script like this:
----
fuse_dfs_wrapper dfs://<your hdfs> <target> <1> <2>

# This will run in debug mode and stay in the foreground
# This runs in debug mode and stays in the foreground
fuse_dfs_wrapper -odebug dfs://<your hdfs> <target>

# Example:
mkdir simple-hdfs
fuse_dfs_wrapper dfs://simple-hdfs simple-hdfs
cd simple-hdfs
# Any operations in this directory will now happen in HDFS
# Any operations in this directory now happen in HDFS
----
<1> Again, use the name of the HDFS service as above
<2> `target` is the directory in which HDFS will be mounted, it must exist otherwise this command will fail
<2> `target` is the directory in which HDFS is mounted; it must exist, otherwise this command fails
2 changes: 1 addition & 1 deletion docs/modules/hdfs/pages/usage-guide/index.adoc
@@ -2,6 +2,6 @@
:description: Learn to configure and use the Stackable Operator for Apache HDFS. Ensure basic setup knowledge from the Getting Started guide before proceeding.
:page-aliases: ROOT:usage.adoc

This Section will help you to use and configure the Stackable Operator for Apache HDFS in various ways.
This section helps you use and configure the Stackable operator for Apache HDFS in various ways.
You should already be familiar with how to set up a basic instance.
Follow the xref:getting_started/index.adoc[] guide to learn how to set up a basic instance with all the required dependencies (for example ZooKeeper).
2 changes: 1 addition & 1 deletion docs/modules/hdfs/pages/usage-guide/listenerclass.adoc
@@ -19,4 +19,4 @@ spec:
listenerClass: external-stable # <2>
----
<1> DataNode listeners should prioritize having a direct connection, to minimize network transfer overhead.
<2> NameNode listeners should prioritize having a stable address, since they will be baked into the client configuration.
<2> NameNode listeners should prioritize having a stable address, since they are baked into the client configuration.
@@ -22,5 +22,4 @@
enableVectorAgent: true
----

Further information on how to configure logging, can be found in
xref:concepts:logging.adoc[].
Further information on how to configure logging can be found in xref:concepts:logging.adoc[].
@@ -6,9 +6,9 @@ You can configure the graceful shutdown as described in xref:concepts:operations

By default, JournalNodes have `15 minutes` to shut down gracefully.
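
Changing that timeout is a one-field change on the role configuration; a sketch (the value is only an example):

[source,yaml]
----
journalNodes:
  config:
    gracefulShutdownTimeout: 30m  # example value, the default corresponds to 15m
----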

The JournalNode process will receive a `SIGTERM` signal when Kubernetes wants to terminate the Pod.
It will log the received signal as shown in the log below and initiate a graceful shutdown.
After the graceful shutdown timeout runs out, and the process still didn't exit, Kubernetes will issue a `SIGKILL` signal.
The JournalNode process receives a `SIGTERM` signal when Kubernetes wants to terminate the Pod.
It logs the received signal as shown in the log below and initiates a graceful shutdown.
If the process is still running after the graceful shutdown timeout, Kubernetes issues a `SIGKILL` signal.

https://github.com/apache/hadoop/blob/a585a73c3e02ac62350c136643a5e7f6095a3dbb/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournalNode.java#L272[This] is the relevant code that gets executed in the JournalNodes as of HDFS version `3.3.4`.

24 changes: 13 additions & 11 deletions docs/modules/hdfs/pages/usage-guide/operations/pod-disruptions.adoc
@@ -3,24 +3,24 @@

You can configure the permitted Pod disruptions for HDFS nodes as described in xref:concepts:operations/pod_disruptions.adoc[].

Unless you configure something else or disable our PodDisruptionBudgets (PDBs), we write the following PDBs:
Unless you configure something else or disable the default PodDisruptionBudgets (PDBs), the operator writes the following PDBs:

== JournalNodes
We only allow a single JournalNode to be offline at any given time, regardless of the number of replicas or `roleGroups`.
Only a single JournalNode is allowed to be offline at any given time, regardless of the number of replicas or `roleGroups`.

== NameNodes
We only allow a single NameNode to be offline at any given time, regardless of the number of replicas or `roleGroups`.
Only a single NameNode is allowed to be offline at any given time, regardless of the number of replicas or `roleGroups`.

== DataNodes
For DataNodes the question of how many instances can be unavailable at the same time is a bit harder:
HDFS stores your blocks on the DataNodes.
Every block can be replicated multiple times (to multiple DataNodes) to ensure maximum availability.
The default replication factor is `3` - which can be configured using `spec.clusterConfig.dfsReplication`. However, it is also possible to change the replication factor for a specific file or directory to something other than the cluster default.

When you have a replication of `3`, you can safely take down 2 DataNodes, as there will always be a third DataNode holding a copy of each block currently assigned to one of the unavailable DataNodes.
However, you need to be aware that you are now down to a single point of failure - the last of three replicas!
With a replication of `3`, at most 2 DataNodes may be down, as the third one is holding a copy of each block currently assigned to the unavailable DataNodes.
However, the last DataNode running is now a single point of failure -- the last of three replicas!

Taking this into consideration, our operator uses the following algorithm to determine the maximum number of DataNodes allowed to be unavailable at the same time:
Taking this into consideration, the operator uses the following algorithm to determine the maximum number of DataNodes allowed to be unavailable at the same time:

`num_datanodes` is the number of DataNodes in the HDFS cluster, summed over all `roleGroups`.

@@ -93,13 +93,15 @@ This results e.g. in the following numbers:
|===

== Reduce rolling redeployment durations
The default PDBs we write out are pessimistic and will cause the rolling redeployment to take a considerable amount of time.
As an example, when you have 100 DataNodes and a replication factor of `3`, we can safely only take a single DataNode down at a time. Assuming a DataNode takes 1 minute to properly restart, the whole re-deployment would take 100 minutes.
The default PDBs written out are pessimistic and cause the rolling redeployment to take a considerable amount of time.
As an example, when you have 100 DataNodes and a replication factor of `3`, only a single DataNode can be taken offline at a time.
Assuming a DataNode takes 1 minute to properly restart, the whole re-deployment would take 100 minutes.

You can use the following measures to speed this up:

1. Increase the replication factor, e.g. from `3` to `5`. In this case the number of allowed disruptions triples from `1` to `3` (assuming >= 5 DataNodes), reducing the time it takes by 66%.
2. Increase `maxUnavailable` using the `spec.dataNodes.roleConfig.podDisruptionBudget.maxUnavailable` field as described in xref:concepts:operations/pod_disruptions.adoc[].
3. Write your own PDBs as described in xref:concepts:operations/pod_disruptions.adoc#_using_you_own_custom_pdbs[Using you own custom PDBs].
* Increase the replication factor, e.g. from `3` to `5`.
In this case the number of allowed disruptions triples from `1` to `3` (assuming >= 5 DataNodes), reducing the time it takes by 66%.
* Increase `maxUnavailable` using the `spec.dataNodes.roleConfig.podDisruptionBudget.maxUnavailable` field as described in xref:concepts:operations/pod_disruptions.adoc[] (see the sketch after this list).
* Write your own PDBs as described in xref:concepts:operations/pod_disruptions.adoc#_using_you_own_custom_pdbs[Using your own custom PDBs].
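
A sketch of the second measure, using the field path quoted above (the number is only an example, pick one that matches your replication factor):

[source,yaml]
----
dataNodes:
  roleConfig:
    podDisruptionBudget:
      maxUnavailable: 5
----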

WARNING: If you modify or disable the default PDBs, it's your responsibility to either make sure there are enough DataNodes available or accept the risk of blocks not being available!