Commit 6e3eef5

Authored by razvan, NickLarsenNZ, sbernauer, siegfriedweber
feat(regionserver): add graceful shutdown configuration (#570)
* feat(regionserver): add graceful shutdown configuration
  Extract region configuration into its own structure. Refactor the lib and controller modules to work with the new structure in a slightly generic way.
* Make UnifiedRoleConfiguration a sub-trait of Send
* Replace trait with enum.
* implement region mover command
* fix: crd field names
* unit tests and shell escaping
* update docs
* spelling
* cargo update
* added shutdown test & hbase-entrypoint.sh
* cleanup and set region mover opts env var
* first successful integration test
* fix image pull policy for the kerberos tests
* add RUN_REGION_MOVER env var
* Update docs/modules/hbase/pages/usage-guide/operations/graceful-shutdown.adoc
  Co-authored-by: Nick <10092581+NickLarsenNZ@users.noreply.github.com>
* remove trailing whitespace in docs
* rust : remove unused dep
* fix shellcheck lint
* update shutdown test and run it successfully
* update docs
* Update rust/crd/src/lib.rs
  Co-authored-by: Nick <10092581+NickLarsenNZ@users.noreply.github.com>
* fix const arithmetic
* switch to LazyLock
* configure gracefulShutdownTimeout in (almost) all tests
* region mover args
* Update CHANGELOG.md
  Co-authored-by: Sebastian Bernauer <sebastian.bernauer@stackable.de>
* Update rust/crd/src/lib.rs
  Co-authored-by: Siegfried Weber <mail@siegfriedweber.net>
* Update rust/crd/src/lib.rs
  Co-authored-by: Siegfried Weber <mail@siegfriedweber.net>
* Update rust/crd/src/lib.rs
  Co-authored-by: Siegfried Weber <mail@siegfriedweber.net>
* Update docs/modules/hbase/pages/usage-guide/operations/graceful-shutdown.adoc
  Co-authored-by: Siegfried Weber <mail@siegfriedweber.net>
* Update docs/modules/hbase/pages/usage-guide/operations/graceful-shutdown.adoc
  Co-authored-by: Siegfried Weber <mail@siegfriedweber.net>
* Update docs/modules/hbase/pages/usage-guide/operations/graceful-shutdown.adoc
  Co-authored-by: Siegfried Weber <mail@siegfriedweber.net>
* Update docs/modules/hbase/pages/usage-guide/operations/graceful-shutdown.adoc
  Co-authored-by: Siegfried Weber <mail@siegfriedweber.net>
* Update rust/crd/src/lib.rs
  Co-authored-by: Siegfried Weber <mail@siegfriedweber.net>
* note on constant paths and the entrypoint script
* remove unnecessary configOverrides
* wip: use Fragment for the RegionMover
  The crd generation panics
* fix crd generation
* test: fail if the regionmover fails (only with 2.6)
* refactor to reduce (some) duplication
* tests: use dev images
* feat: remove hard-coded cluster.local from the domain name
* fix: RegionMover fields should not be Optional
* add STACKABLE_LOG_DIR env var
* ref introduce const CONTAINERDEBUG_LOG_DIRECTORY
* make shutdown test more resilient
* tmp test def
* update rustfmt
* Update tests/templates/kuttl/shutdown/30-install-hbase.yaml.j2
  Co-authored-by: Siegfried Weber <mail@siegfriedweber.net>
* revert test definition
* update changelog

---------

Co-authored-by: Nick <10092581+NickLarsenNZ@users.noreply.github.com>
Co-authored-by: Sebastian Bernauer <sebastian.bernauer@stackable.de>
Co-authored-by: Siegfried Weber <mail@siegfriedweber.net>
1 parent 2fcecd1 commit 6e3eef5

69 files changed: +1503 -360 lines changed

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -16,8 +16,10 @@
 
 ### Changed
 
+- Support moving regions to other Pods during graceful shutdown of region servers ([#570]).
 - Default to OCI for image metadata and product image selection ([#611]).
 
+[#570]: https://github.com/stackabletech/hbase-operator/pull/570
 [#598]: https://github.com/stackabletech/hbase-operator/pull/598
 [#605]: https://github.com/stackabletech/hbase-operator/pull/605
 [#611]: https://github.com/stackabletech/hbase-operator/pull/611

Cargo.lock

Lines changed: 25 additions & 19 deletions
Some generated files are not rendered by default.

Cargo.nix

Lines changed: 19 additions & 4 deletions
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 1 addition & 0 deletions
@@ -21,6 +21,7 @@ rstest = "0.24"
 serde = { version = "1.0", features = ["derive"] }
 serde_json = "1.0"
 serde_yaml = "0.9"
+shell-escape = "0.1"
 snafu = "0.8"
 stackable-operator = { git = "https://github.com/stackabletech/operator-rs.git", tag = "stackable-operator-0.85.0" }
 product-config = { git = "https://github.com/stackabletech/product-config.git", tag = "0.7.0" }

deploy/helm/hbase-operator/crds/crds.yaml

Lines changed: 62 additions & 0 deletions
@@ -688,6 +688,9 @@ spec:
   description: Time period Pods have to gracefully shut down, e.g. `30m`, `1h` or `2d`. Consult the operator documentation for details.
   nullable: true
   type: string
+hbaseOpts:
+  nullable: true
+  type: string
 hbaseRootdir:
   nullable: true
   type: string
@@ -775,6 +778,34 @@ spec:
       nullable: true
       type: boolean
   type: object
+regionMover:
+  default:
+    ack: null
+    maxThreads: null
+    runBeforeShutdown: null
+  description: Before terminating a region server pod, the RegionMover tool can be invoked to transfer local regions to other servers. This may cause a lot of network traffic in the Kubernetes cluster if the entire HBase stacklet is being restarted. The operator will compute a timeout period for the region move that will not exceed the graceful shutdown timeout.
+  properties:
+    ack:
+      description: If enabled (default), the region mover will confirm that regions are available on the source as well as the target pods before and after the move.
+      nullable: true
+      type: boolean
+    additionalMoverOptions:
+      default: []
+      description: Additional options to pass to the region mover.
+      items:
+        type: string
+      type: array
+    maxThreads:
+      description: Maximum number of threads to use for moving regions.
+      format: uint16
+      minimum: 0.0
+      nullable: true
+      type: integer
+    runBeforeShutdown:
+      description: Move local regions to other servers before terminating a region server's pod.
+      nullable: true
+      type: boolean
+  type: object
 requestedSecretLifetime:
   description: Request secret (currently only autoTls certificates) lifetime from the secret operator, e.g. `7d`, or `30d`. Please note that this can be shortened by the `maxCertificateLifetime` setting on the SecretClass issuing the TLS certificate.
   nullable: true
@@ -938,6 +969,9 @@ spec:
   description: Time period Pods have to gracefully shut down, e.g. `30m`, `1h` or `2d`. Consult the operator documentation for details.
   nullable: true
   type: string
+hbaseOpts:
+  nullable: true
+  type: string
 hbaseRootdir:
   nullable: true
   type: string
@@ -1025,6 +1059,34 @@ spec:
       nullable: true
       type: boolean
   type: object
+regionMover:
+  default:
+    ack: null
+    maxThreads: null
+    runBeforeShutdown: null
+  description: Before terminating a region server pod, the RegionMover tool can be invoked to transfer local regions to other servers. This may cause a lot of network traffic in the Kubernetes cluster if the entire HBase stacklet is being restarted. The operator will compute a timeout period for the region move that will not exceed the graceful shutdown timeout.
+  properties:
+    ack:
+      description: If enabled (default), the region mover will confirm that regions are available on the source as well as the target pods before and after the move.
+      nullable: true
+      type: boolean
+    additionalMoverOptions:
+      default: []
+      description: Additional options to pass to the region mover.
+      items:
+        type: string
+      type: array
+    maxThreads:
+      description: Maximum number of threads to use for moving regions.
+      format: uint16
+      minimum: 0.0
+      nullable: true
+      type: integer
+    runBeforeShutdown:
+      description: Move local regions to other servers before terminating a region server's pod.
+      nullable: true
+      type: boolean
+  type: object
 requestedSecretLifetime:
   description: Request secret (currently only autoTls certificates) lifetime from the secret operator, e.g. `7d`, or `30d`. Please note that this can be shortened by the `maxCertificateLifetime` setting on the SecretClass issuing the TLS certificate.
   nullable: true
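
The `regionMover` description states that the operator computes a region-move timeout that does not exceed the graceful shutdown timeout. The diff for that logic is not shown on this page, so the following is only a minimal sketch of one way such a bound could be derived. The function name `region_mover_timeout` and the 30-second `SHUTDOWN_MARGIN` are assumptions made for illustration, not the operator's actual code or values.

```rust
use std::time::Duration;

/// Hypothetical safety margin reserved for the region server process to stop
/// after the region mover has finished. The value is an assumption.
const SHUTDOWN_MARGIN: Duration = Duration::from_secs(30);

/// Returns a region mover timeout that never exceeds the graceful shutdown
/// timeout. If the graceful shutdown timeout is shorter than the margin,
/// the result saturates at zero instead of underflowing.
fn region_mover_timeout(graceful_shutdown_timeout: Duration) -> Duration {
    graceful_shutdown_timeout.saturating_sub(SHUTDOWN_MARGIN)
}
```

With the documented RegionServer default of 60 minutes, this sketch would leave 3570 seconds for the region move. Saturating subtraction is used so that very short graceful shutdown timeouts yield a zero timeout rather than a panic.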

docs/modules/hbase/pages/usage-guide/operations/graceful-shutdown.adoc

Lines changed: 57 additions & 2 deletions
@@ -1,6 +1,6 @@
 = Graceful shutdown
 
-You can configure the graceful shutdown as described in xref:concepts:operations/graceful_shutdown.adoc[].
+You can configure the graceful shutdown grace period as described in xref:concepts:operations/graceful_shutdown.adoc[].
 
 == Masters
 
@@ -15,7 +15,7 @@ However, there is no message in the log acknowledging the graceful shutdown.
 
 == RegionServers
 
-As a default, RegionServers have `60 minutes` to shut down gracefully.
+By default, RegionServers have `60 minutes` to shut down gracefully.
 
 They use the same mechanism described above.
 In contrast to the Master servers, they will, however, acknowledge the graceful shutdown with a message in the logs:
@@ -26,6 +26,61 @@ In contrast to the Master servers, they will, however, acknowledge the graceful
 2023-10-11 12:38:05,060 INFO [shutdown-hook-0] regionserver.HRegionServer: ***** STOPPING region server 'test-hbase-regionserver-default-0.test-hbase-regionserver-default.kuttl-test-topical-parakeet.svc.cluster.local,16020,1697027870348' *****
 ----
 
+The operator allows for finer control over the shutdown process of region servers.
+For each region server pod, the region mover tool may be invoked before terminating the region server's pod.
+The affected regions are transferred to other pods thus ensuring that the data is still available.
+
+Here is an example:
+
+[source,yaml]
+----
+spec:
+  regionServers:
+    config:
+      regionMover:
+        runBeforeShutdown: true # <1>
+        maxThreads: 5 # <2>
+        ack: false # <3>
+        additionalMoverOptions: ["--designatedFile", "/path/to/designatedFile"] # <4>
+----
+<1>: Run the region mover tool before shutting down the region server. Default is `false`.
+<2>: Maximum number of threads to use for moving regions. Default is 1.
+<3>: Enable or disable region confirmation on the present and target servers. Default is `true`.
+<4>: Extra options to pass to the region mover tool.
+
+For a list of additional options accepted by the region mover use the `--help` option first:
+
+[source]
+----
+$ /stackable/hbase/bin/hbase org.apache.hadoop.hbase.util.RegionMover --help
+usage: hbase org.apache.hadoop.hbase.util.RegionMover <options>
+Options:
+ -r,--regionserverhost <arg>   region server <hostname>|<hostname:port>
+ -o,--operation <arg>          Expected: load/unload/unload_from_rack/isolate_regions
+ -m,--maxthreads <arg>         Define the maximum number of threads to use to unload and reload the regions
+ -i,--isolateRegionIds <arg>   Comma separated list of Region IDs hash to isolate on a RegionServer and put region
+                               server in draining mode. This option should only be used with '-o isolate_regions'. By
+                               putting region server in decommission/draining mode, master can't assign any new region
+                               on this server. If one or more regions are not found OR failed to isolate successfully,
+                               utility will exist without putting RS in draining/decommission mode. Ex.
+                               --isolateRegionIds id1,id2,id3 OR -i id1,id2,id3
+ -x,--excludefile <arg>        File with <hostname:port> per line to exclude as unload targets; default excludes only
+                               target host; useful for rack decommisioning.
+ -d,--designatedfile <arg>     File with <hostname:port> per line as unload targets;default is all online hosts
+ -f,--filename <arg>           File to save regions list into unloading, or read from loading; default
+                               /tmp/<usernamehostname:port>
+ -n,--noack                    Turn on No-Ack mode(default: false) which won't check if region is online on target
+                               RegionServer, hence best effort. This is more performant in unloading and loading but
+                               might lead to region being unavailable for some time till master reassigns it in case the
+                               move failed
+ -t,--timeout <arg>            timeout in seconds after which the tool will exit irrespective of whether it finished or
+                               not;default Integer.MAX_VALUE
+----
+
+NOTE: There is no need to explicitly specify a timeout for the region movement. The operator will compute an appropriate timeout that cannot exceed the `gracefulShutdownTimeout` for region servers.
+
+IMPORTANT: The ZooKeeper connection must be available during the time the region mover is running for the graceful shutdown process to succeed.
+
 == RestServers
 
 As a default, RestServers have `5 minutes` to shut down gracefully.
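
The documented `regionMover` settings and their defaults (`runBeforeShutdown: false`, `maxThreads: 1`, `ack: true`) map naturally onto the RegionMover options listed in the help output. The sketch below shows one possible mapping. The struct `RegionMoverConfig`, the function `region_mover_args`, and the choice of the `unload` operation are illustrative assumptions and are not taken from the operator's source. Only the long option names come from the help text above, which lists the designated-file option in lowercase as `--designatedfile`.

```rust
/// Hypothetical mirror of the `regionMover` settings.
/// `None` means the field was not set in the custom resource.
#[derive(Debug, Default, Clone)]
struct RegionMoverConfig {
    run_before_shutdown: Option<bool>,
    max_threads: Option<u16>,
    ack: Option<bool>,
    additional_mover_options: Vec<String>,
}

/// Builds the RegionMover argument list, applying the documented defaults.
/// Returns `None` when the region mover should not run at all.
fn region_mover_args(
    cfg: &RegionMoverConfig,
    host: &str,
    timeout_secs: u64,
) -> Option<Vec<String>> {
    // Documented default: do not run the region mover.
    if !cfg.run_before_shutdown.unwrap_or(false) {
        return None;
    }
    let mut args = vec![
        "--regionserverhost".to_string(),
        host.to_string(),
        "--operation".to_string(),
        "unload".to_string(),
        "--maxthreads".to_string(),
        // Documented default: a single thread.
        cfg.max_threads.unwrap_or(1).to_string(),
        "--timeout".to_string(),
        timeout_secs.to_string(),
    ];
    // Documented default: acknowledgement enabled, so `--noack` is only
    // added when `ack` is explicitly set to false.
    if !cfg.ack.unwrap_or(true) {
        args.push("--noack".to_string());
    }
    // User-supplied options are appended last, unchanged.
    args.extend(cfg.additional_mover_options.iter().cloned());
    Some(args)
}
```

Calling `region_mover_args` with a default `RegionMoverConfig` returns `None`, which matches the documented behaviour that the region mover is off unless `runBeforeShutdown` is set to `true`.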

rust/crd/Cargo.toml

Lines changed: 1 addition & 0 deletions
@@ -12,6 +12,7 @@ publish = false
 product-config.workspace = true
 serde.workspace = true
 serde_json.workspace = true
+shell-escape.workspace = true
 snafu.workspace = true
 stackable-operator.workspace = true
 strum.workspace = true
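
The commit adds the `shell-escape` crate, and the commit message mentions "unit tests and shell escaping", presumably so that user-supplied values such as `additionalMoverOptions` can be placed safely into a shell command line. The snippet below is a standard-library-only sketch of POSIX single-quote escaping to illustrate the idea. It is not the `shell-escape` crate's implementation, and the helper names `sh_quote` and `sh_join` are hypothetical.

```rust
/// Quotes `arg` for a POSIX shell. Arguments made only of a conservative set
/// of safe characters are returned unchanged. Everything else is wrapped in
/// single quotes, with each embedded single quote rewritten as `'\''`.
fn sh_quote(arg: &str) -> String {
    let safe = !arg.is_empty()
        && arg
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || "-_./:=,".contains(c));
    if safe {
        arg.to_string()
    } else {
        format!("'{}'", arg.replace('\'', r"'\''"))
    }
}

/// Joins already separated arguments into one shell-safe command fragment.
fn sh_join(args: &[&str]) -> String {
    args.iter()
        .map(|a| sh_quote(a))
        .collect::<Vec<_>>()
        .join(" ")
}
```

Quoting each argument separately before joining keeps an option value that contains spaces or quotes as a single shell word, rather than letting the shell split or interpret it.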
