
Commit 5b3ec8f

Merge pull request #13943 from mburke5678/nodes-node-problem-detector
Document Node Problem Detector for 4.0
2 parents 74c7ea0 + 70b1162 commit 5b3ec8f

3 files changed (+259, -18 lines)
modules/nodes-nodes-problem-detector-customizing.adoc

Lines changed: 74 additions & 12 deletions
Original file line number | Diff line number | Diff line change
@@ -5,27 +5,57 @@
55
[id='nodes-nodes-problem-detector-customizing_{context}']
66
= Customizing Node Problem Detector conditions
77

8-
You can configure the Node Problem Detector to watch for any log string by editing the Node Problem Detector configuration map.
8+
You can configure the Node Problem Detector to watch for any log string by editing the Node Problem Detector custom resource (CR).
9+
10+
.Prerequisites
11+
12+
* The Node Problem Detector Operator must be installed.
13+
14+
* If needed, get the name of the Node Problem Detector CR:
15+
+
16+
----
17+
$ oc get NodeProblemDetector
18+
NAME AGE
19+
node-problem-detector 6m6s
20+
----
21+
22+
* Set the Node Problem Detector to the unmanaged state. In the managed state, the Node Problem Detector Operator reverts changes made to the Node Problem Detector configuration map.
923

1024
.Procedure
1125

12-
To configure the Node Problem Detector, add or remove problem conditions and events.
26+
To modify the Node Problem Detector:
1327

14-
. Edit the Node Problem Detector configuration map with a text editor.
28+
. Open the Node Problem Detector CR for editing.
1529
+
1630
----
17-
$ oc edit configmap /openshift-node-problem-detector
31+
$ oc edit NodeProblemDetector <name>
1832
----
1933
+
20-
.Sample Node Problem Detector Configuration Map
21-
[source,yaml]
34+
For example:
35+
+
2236
----
23-
apiVersion: v1
24-
kind: ConfigMap
37+
$ oc edit NodeProblemDetector node-problem-detector
38+
39+
apiVersion: node-problem-detector.operator.k8s.io/v1alpha1
40+
kind: NodeProblemDetector
2541
metadata:
42+
creationTimestamp: 2019-03-04T00:18:48Z
43+
generation: 1
2644
name: node-problem-detector
27-
data:
28-
kernel-monitor.json: | <1>
45+
namespace: default
46+
resourceVersion: "47179"
47+
selfLink: /apis/node-problem-detector.operator.k8s.io/v1alpha1/namespaces/default/nodeproblemdetectors/node-problem-detector
48+
uid: 14acef47-3e13-11e9-a640-0a4ad769663a
49+
namespace: openshift-node-problem-detector
50+
----
51+
52+
. Change the parameters and values as needed:
53+
+
54+
.Sample Node Problem Detector CR
55+
[source,yaml]
56+
----
57+
spec:
58+
kernel-monitor.json: | <8>
2959
{
3060
"plugin": "journald", <2>
3161
"pluginConfig": {
@@ -42,7 +72,7 @@ data:
4272
"message": "kernel has no deadlock" <7>
4373
}
4474
],
45-
"rules": [ <8>
75+
"rules": [
4676
{
4777
"type": "temporary",
4878
"reason": "OOMKilling",
@@ -76,6 +106,37 @@ data:
76106
},
77107
]
78108
}
109+
110+
  kubelet-monitor.json: |-
    {
      "plugin": "custom",
      "pluginConfig": {
        "invoke_interval": "120s",
        "timeout": "60s",
        "concurrency": 1
      },
      "source": "kubelet-custom-plugin-monitor",
      "conditions": [{
        "type": "KubeletProblem",
        "reason": "KubeletIsUp",
        "message": "kubelet is up"
      }],
      "rules": [{
        "type": "temporary",
        "reason": "KubeletIsDown",
        "path": "/etc/npd-plugins/kubelet-health.sh",
        "timeout": "30s"
      },
      {
        "type": "permanent",
        "condition": "KubeletProblem",
        "reason": "KubeletIsDown",
        "path": "/etc/npd-plugins/kubelet-health.sh",
        "timeout": "45s"
      }
      ]
    }
139+
79140
----
80141

81142
<1> Rules and conditions that apply to container images.
@@ -92,7 +153,7 @@ https://kubernetes.io/docs/tasks/debug-application-cluster/monitor-node-health/#
92153
The Node Problem Detector supports file-based kernel logging. However, it is easy to extend it to support other log formats.
93154
////
94155

95-
. Remove, add, or edit any node conditions or events as needed.
156+
. Optionally, you can add new node conditions or events:
96157
+
97158
[source,yaml]
98159
----
@@ -134,3 +195,4 @@ spec:
134195
<1> Sends the output to standard output (stdout).
135196
<2> Path to the error log.
136197
<3> Comma-separated path to the plug-in configuration files.
198+
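As a rough illustration of the kind of change this module describes, the fragment below adds one log-matching rule to `kernel-monitor.json` in the CR `spec`. The `plugin`, `pluginConfig`, `rules`, `type`, and `reason` fields appear in the sample in the diff above. The `logPath`, `source`, and `pattern` fields are assumed from the upstream Node Problem Detector log-monitor format, and the `ExampleFilesystemError` reason and its regular expression are hypothetical values chosen only for illustration. Treat this as a sketch under those assumptions, not a tested configuration.

```yaml
# Sketch only. Fields not shown in the diff (logPath, source, pattern) are
# assumed from the upstream Node Problem Detector log-monitor format.
apiVersion: node-problem-detector.operator.k8s.io/v1alpha1
kind: NodeProblemDetector
metadata:
  name: node-problem-detector
  namespace: openshift-node-problem-detector
spec:
  kernel-monitor.json: |
    {
      "plugin": "journald",
      "pluginConfig": {
        "source": "kernel"
      },
      "logPath": "/var/log/journal",
      "source": "kernel-monitor",
      "rules": [
        {
          "type": "temporary",
          "reason": "ExampleFilesystemError",
          "pattern": "EXT4-fs error .*"
        }
      ]
    }
```

A `temporary` rule of this shape would be expected to produce an event when a matching log line is seen. A `permanent` rule additionally names a `condition`, as the `KubeletProblem` rule in the diff does.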

modules/nodes-nodes-problem-detector-installing.adoc

Lines changed: 180 additions & 6 deletions
@@ -5,14 +5,14 @@
55
[id='nodes-nodes-problem-detector-installing_{context}']
66
= Installing the {product-title} Node Problem Detector
77

8-
You can use the {product-title} console to install the Node Problem Detector (NPD), which creates the Node Problem Detector Operator.
8+
You can use the {product-title} console to install the Node Problem Detector Operator.
99

1010
.Prerequisites
1111

1212
. Create a Project for the NPD:
1313
+
1414
----
15-
$ oc create ns openshift-node-problem-detector
15+
$ oc adm new-project openshift-node-problem-detector --node-selector=""
1616
----
1717

1818
. Create an Operator Group:
@@ -32,18 +32,192 @@ EOF
3232

3333
.Procedure
3434

35-
To install the Node Problem Detector:
35+
Installing the Node Problem Detector involves installing the Node Problem Detector Operator and then creating a Node Problem Detector instance.
3636

3737
. In the {product-title} console, click *Catalog* -> *Operator Hub*.
3838

39-
. Choose *node-problem-detector* from the list of available Operators, and click Install.
40-
4139
. On the *Create Operator Subscription* page:
4240

4341
.. Select the `openshift-node-problem-detector` project from the *A specific namespace on the cluster* drop-down list.
4442

4543
.. Click *Subscribe*.
4644

47-
. On the *Catalog* → *Installed Operators* page, verify that the NodeProblemDetector (CSV) eventually shows up and its *Status* ultimately resolves to *InstallSucceeded*.
45+
.. Click *Subscribe*.
4846

47+
. On the *Catalog* → *Installed Operators* page, verify that the NodeProblemDetector (CSV) eventually shows up and its *Status* ultimately resolves to *InstallSucceeded*.
48+
+
4949
If it does not, switch to the *Catalog* → *Operator Management* page and inspect the *Operator Subscriptions* and *Install Plans* tabs for any failure or errors under *Status*. Then, check the logs in any Pods in the openshift-operators project (on the *Workloads* → *Pods* page) that are reporting issues to troubleshoot further.
50+
51+
. Click *Administration* -> *CRD*.
52+
53+
. On the *Custom Resource Definitions* page, click *NodeProblemDetector*.
54+
55+
. On the *Node Problem Detector* page, click *Create Node Problem Detector*.
56+
57+
. Specify a name and enter the *openshift-node-problem-detector* namespace.
58+
+
59+
[source,yaml]
60+
----
61+
apiVersion: node-problem-detector.operator.k8s.io/v1alpha1
kind: NodeProblemDetector
metadata:
  name: example <1>
  namespace: default <2>
spec: {}
----
<1> Specify a name for the Node Problem Detector.
<2> Specify `openshift-node-problem-detector` as the namespace.
70+
+
71+
For example:
72+
+
73+
[source,yaml]
74+
----
75+
apiVersion: node-problem-detector.operator.k8s.io/v1alpha1
kind: NodeProblemDetector
metadata:
  name: node-problem-detector
  namespace: openshift-node-problem-detector
spec: {}
81+
----
82+
83+
//Beta steps https://bugzilla.redhat.com/show_bug.cgi?id=1679467
84+
85+
. Create a Node Problem Detector Custom Resource Definition (CRD):
86+
+
87+
[source,yaml]
88+
----
89+
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: nodeproblemdetectors.node-problem-detector.operator.k8s.io
spec:
  group: node-problem-detector.operator.k8s.io
  names:
    kind: NodeProblemDetector
    listKind: NodeProblemDetectorList
    plural: nodeproblemdetectors
    singular: nodeproblemdetector
  scope: Namespaced
  version: v1alpha1
102+
----
103+
104+
. Create a Node Problem Detector Service Account (SA):
105+
+
106+
[source,yaml]
107+
----
108+
apiVersion: v1
kind: ServiceAccount
metadata:
  name: node-problem-detector-operator
  namespace: openshift-node-problem-detector
113+
----
114+
115+
. Create the Node Problem Detector role-based access control (RBAC) objects:
116+
+
117+
[source,yaml]
118+
----
119+
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: node-problem-detector-operator
  namespace: openshift-node-problem-detector
rules:
- apiGroups:
  - node-problem-detector.operator.k8s.io
  resources:
  - "*"
  verbs:
  - "*"
- apiGroups:
  - ""
  resources:
  - pods
  - events
  - configmaps
  - secrets
  - services
  - endpoints
  - serviceaccounts
  verbs:
  - "*"
- apiGroups:
  - apps
  resources:
  - daemonsets
  verbs:
  - "*"

---

kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: node-problem-detector-operator
  namespace: openshift-node-problem-detector
subjects:
- kind: ServiceAccount
  name: node-problem-detector-operator
roleRef:
  kind: Role
  name: node-problem-detector-operator
  apiGroup: rbac.authorization.k8s.io

---

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: openshift-node-problem-detector-operator
rules:
- apiGroups:
  - rbac.authorization.k8s.io
  resources:
  # the operator needs to be able to bind the cluster role
  # system:node-problem-detector to the node-problem-detector service account
  - clusterrolebindings
  verbs:
  - "*"
- apiGroups:
  - security.openshift.io
  resources:
  # the operator needs to be able to add the node-problem-detector service account
  # to the list of accounts that can use the privileged SCC
  - securitycontextconstraints
  verbs:
  - "*"

---

kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: openshift-node-problem-detector-operator-1
subjects:
- kind: ServiceAccount
  name: node-problem-detector-operator
  namespace: openshift-node-problem-detector
roleRef:
  kind: ClusterRole
  name: openshift-node-problem-detector-operator
  apiGroup: rbac.authorization.k8s.io
----
+
----
$ oc create -f deploy/rbac.yaml
$ oc create -f deploy/operator.yaml
$ oc create -f deploy/cr.yaml
----
209+
210+
211+
. Create a Node Problem Detector custom resource (CR):
212+
+
213+
[source,yaml]
214+
----
215+
apiVersion: node-problem-detector.operator.k8s.io/v1alpha1
kind: NodeProblemDetector
metadata:
  name: node-problem-detector
  namespace: openshift-node-problem-detector
220+
----
221+
222+
. Configure the Node Problem Detector policy as needed and click *Create*.
223+

nodes/nodes/nodes-nodes-problem-detector.adoc

Lines changed: 5 additions & 0 deletions
@@ -27,6 +27,11 @@ https://access.redhat.com/support/offerings/techpreview/.
2727
endif::[]
2828
====
2929

30+
[NOTE]
31+
====
32+
Procedures in this topic require your cluster to be in an unmanaged state.
33+
====
34+
3035
// The following include statements pull in the module files that comprise
3136
// the assembly. Include any combination of concept, procedure, or reference
3237
// modules required to cover the user story. You can also include other
