Commit de5a699

Merge pull request #13605 from kalexand-rh/machine-health-check
machine health check draft
2 parents 7dc5726 + 765d150 commit de5a699

File tree

5 files changed

+135
-0
lines changed

_topic_map.yml

Lines changed: 2 additions & 0 deletions
@@ -215,6 +215,8 @@ Name: Control Plane management
 Dir: control-plane-management
 Distros: openshift-origin, openshift-enterprise
 Topics:
+- Name: Deploying machine health checks
+  File: deploying-machine-health-checks
 - Name: Applying autoscaling to a cluster
   File: applying-autoscaling
 ---
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+[id='deploying-machine-health-checks']
+= Deploying machine health checks
+include::modules/common-attributes.adoc[]
+:context: deploying-machine-health-checks
+toc::[]
+
+You can configure and deploy a machine health check to automatically repair
+damaged machines in a machine pool.
+
+include::modules/machine-health-checks-about.adoc[leveloffset=+1]
+
+include::modules/machine-health-checks-resource.adoc[leveloffset=+1]
+
+include::modules/machine-health-checks-creating.adoc[leveloffset=+1]
Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+// Module included in the following assemblies:
+//
+// * master/deploying-machine-health-checks.adoc
+
+[id='machine-health-checks-about-{context}']
+= About MachineHealthChecks
+
+MachineHealthChecks automatically repair unhealthy Machines in a particular
+MachinePool.
+
+To monitor machine health, you create a resource to define the
+configuration for a controller. You set a condition to check for, such as
+staying in the `NotReady` status for 15 minutes or displaying a permanent condition
+in the node-problem-detector, and a label for the set of machines to monitor.
+
+[NOTE]
+====
+You cannot apply a MachineHealthCheck to a machine with the master role.
+====
+
+The controller that observes a MachineHealthCheck resource checks for the status
+that you defined. If a machine fails the health check, it is automatically deleted
+and a new one is created to take its place. When a machine is deleted, you
+see a `machine deleted` event. To limit the disruptive impact of the machine
+deletion, the controller drains and deletes only one node at a time.
+
+
+To stop the check, remove the MachineHealthCheck resource.
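
The decision described in this module can be illustrated with a small shell sketch. It is not the controller's implementation. The function names, the input format (one `name status seconds` triple per line), and the fixed 900-second threshold are assumptions made for illustration, based only on the "`NotReady` for 15 minutes" example and the one-node-at-a-time behavior described above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch only: the real controller logic is not shown in this commit.

# 15 minutes, matching the example condition in the module text.
NOT_READY_TIMEOUT_SECONDS=900

# Succeeds when a machine has been NotReady for at least the timeout.
fails_health_check() {
  local status="$1" seconds_in_status="$2"
  [ "$status" = "NotReady" ] && [ "$seconds_in_status" -ge "$NOT_READY_TIMEOUT_SECONDS" ]
}

# Reads "name status seconds" lines on stdin and prints only the first
# machine that fails the check, so at most one machine is remediated per pass.
first_unhealthy() {
  local name status seconds
  while read -r name status seconds; do
    if fails_health_check "$status" "$seconds"; then
      echo "$name"
      return 0
    fi
  done
  return 1
}
```

Given several unhealthy machines in the input, `first_unhealthy` prints a single name, which mirrors the way the controller limits disruption by handling one node at a time.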
Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+// Module included in the following assemblies:
+//
+// * master/deploying-machine-health-checks.adoc
+
+[id='machine-health-checks-creating-{context}']
+= Creating a MachineHealthCheck resource
+
+You can create a MachineHealthCheck resource for all `MachinePools` in your
+cluster except the `master` pool.
+
+.Prerequisites
+
+* Install the `oc` command-line interface (CLI) and `kubectl`.
+
+.Procedure
+
+. Create a *_healthcheck.yml_* file that contains the definition of your
+MachineHealthCheck.
+
+. Apply the *_healthcheck.yml_* file to your cluster:
+[source,bash]
+----
+$ kubectl apply -f healthcheck.yml
+----
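
As a rough illustration of the first procedure step, the following shell sketch writes a *_healthcheck.yml_* file from a few variables. The values `prod`, `node`, and `us-east-1a` are taken from the `prod-node-us-east-1a` example in the sample resource. The variable names and the resource name `node-healthcheck` are hypothetical, and the lowercase `spec` and `selector` keys are an assumption based on the usual Kubernetes resource conventions.

```shell
#!/usr/bin/env bash
# Hypothetical values; replace them with the ones for your cluster.
cluster_name="prod"
label="node"
aws_zone="us-east-1a"

# Write the MachineHealthCheck definition to healthcheck.yml.
cat > healthcheck.yml <<EOF
apiVersion: healthchecking.openshift.io/v1alpha1
kind: MachineHealthCheck
metadata:
  name: ${label}-healthcheck
  namespace: example
spec:
  selector:
    matchLabels:
      sigs.k8s.io/cluster-api-cluster: ${cluster_name}
      sigs.k8s.io/cluster-api-machine-role: ${label}
      sigs.k8s.io/cluster-api-machine-type: ${label}
      sigs.k8s.io/cluster-api-machineset: ${cluster_name}-${label}-${aws_zone}
EOF

# With a cluster configured, apply the file as in the procedure:
# kubectl apply -f healthcheck.yml
```

Generating the file from variables keeps the MachineSet label consistent with the cluster name and role label, which is the part most likely to be mistyped by hand.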
Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
+// Module included in the following assemblies:
+//
+// * master/deploying-machine-health-checks.adoc
+
+[id='machine-health-checks-resource-{context}']
+= Sample MachineHealthCheck resource
+
+The MachineHealthCheck resource resembles the following YAML file:
+
+.MachineHealthCheck
+[source,yaml]
+----
+apiVersion: healthchecking.openshift.io/v1alpha1
+kind: MachineHealthCheck
+metadata:
+  name: example <1>
+  namespace: example <2>
+spec:
+  selector:
+    matchLabels:
+      sigs.k8s.io/cluster-api-cluster: <cluster_name> <3>
+      sigs.k8s.io/cluster-api-machine-role: <label> <4>
+      sigs.k8s.io/cluster-api-machine-type: <label> <4>
+      sigs.k8s.io/cluster-api-machineset: <cluster_name>-<label>-<AWS-zone> <5>
+----
+<1> Specify the name of the MachineHealthCheck to deploy. Include the name of the
+MachinePool to track.
+<2> Specify the namespace to deploy the MachineHealthCheck to.
+<3> Specify the name of your cluster.
+<4> Specify a label for the MachinePool that you want to check.
+<5> Specify the MachineSet to track in `<cluster_name>-<label>-<AWS-zone>`
+format. For example, `prod-node-us-east-1a`.
+
+
+
+////
+
+.MachinePoolHealthCheck
+[source,yaml]
+----
+apiVersion: healthchecking.machineapi.openshift.io/v1alpha1
+kind: MachinePoolHealthCheck
+metadata:
+  name: worker-pool-healthcheck
+  namespace: openshift-cluster-api
+  annotations:
+Spec:
+  MachineSelector: metav1.LabelSelector
+----
+
+.MachineRemediation
+[source,yaml]
+----
+apiVersion: healthchecking.machineapi.openshift.io/v1alpha1
+kind: MachineRemediation
+metadata:
+  name: worker-pool-healthcheck-machineName
+  namespace: openshift-cluster-api
+  annotations:
+Spec:
+  machineName: "machineName"
+  remediationStrategy: "default"
+Status:
+  Phase: "healthy"
+  Reason: "no unhealthy conditions detected"
+  StartTime: "metav1.now()"
+////
