Commit 5954a16

Merge pull request #12964 from mburke5678/nodes-to-4.0
new assemblies and modules for nodes
2 parents 4e7db29 + 8779d3e

38 files changed: +2205 -0 lines changed

_topic_map.yml

Lines changed: 16 additions & 0 deletions
@@ -212,6 +212,22 @@ Name: Nodes
 Dir: nodes
 Distros: openshift-*
 Topics:
+- Name: Viewing and listing the nodes in your cluster
+  File: nodes-nodes-viewing
+- Name: Working with nodes
+  File: nodes-nodes-working
+- Name: Understanding node rebooting
+  File: nodes-nodes-rebooting
+- Name: Freeing node resources using garbage collection
+  File: nodes-nodes-garbage-collection
+- Name: Allocating resources for nodes
+  File: nodes-nodes-resources-configuring
+- Name: Advertising hidden resources for nodes
+  File: nodes-nodes-opaque-resources
+- Name: Monitoring for problems in your nodes
+  File: nodes-nodes-problem-detector
+- Name: Running tasks in pods using jobs
+  File: nodes-nodes-jobs
 - Name: Using containers
   File: nodes-containers-using
 - Name: Using Init Containers to perform tasks before a pod is deployed

modules/nodes-nodes-garbage-collection-configuring.adoc

Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
// Module included in the following assemblies:
//
// * nodes/nodes-nodes-garbage-collection.adoc

[id='nodes-nodes-garbage-collection-configuring_{context}']
= Configuring garbage collection for containers and images

As an administrator, you can configure how {product-title} performs garbage collection for each type of node group.

.Procedure

. Open the appropriate node configuration map:
+
* node-config-master
* node-config-infra
* node-config-compute
* node-config-all-in-one
* node-config-master-infra

. For containers, you can specify values for these settings in the `*kubeletArguments*` section of
the node configuration map. Add the section if it does not already exist:
+
[source,yaml]
----
kubeletArguments:
  minimum-container-ttl-duration: <1>
    - "10s"
  maximum-dead-containers-per-container: <2>
    - "2"
  maximum-dead-containers: <3>
    - "240"
----
<1> Specify the minimum age at which a container is eligible for garbage collection. The
default is *0*.
<2> Specify the number of instances to retain per pod container. The default is *1*.
<3> Specify the maximum number of total dead containers in the node. The default is *-1*, which means unlimited.

. For images, you can specify values for these settings in the `*kubeletArguments*` section of
the node configuration map:
+
[source,yaml]
----
kubeletArguments:
  image-gc-high-threshold: <1>
    - "85"
  image-gc-low-threshold: <2>
    - "80"
----
<1> Specify the percent of disk usage (expressed as an integer) that triggers image
garbage collection. The default is *85*.
<2> Specify the percent of disk usage (expressed as an integer) to which image garbage
collection attempts to free disk space. The default is *80*.

. Save and close the configuration map. A sync pod on each node of that type picks up and implements the changes.

modules/nodes-nodes-garbage-collection-containers.adoc

Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
// Module included in the following assemblies:
//
// * nodes/nodes-nodes-garbage-collection.adoc

[id='nodes-nodes-garbage-collection-containers_{context}']
= Understanding how terminated containers are removed through garbage collection

Container garbage collection is enabled by default and happens automatically in
response to eviction thresholds being reached. The node tries to keep any
container for any pod accessible from the API. If the pod has been deleted, the
containers will be as well. Containers are preserved as long as the pod is not
deleted and the eviction threshold is not reached. If the node is under disk
pressure, it removes containers, and their logs are no longer accessible
using `oc logs`.

The policy for container garbage collection is based on three conditions:

* The minimum age at which a container is eligible for garbage collection. The
default is *0*.

* The number of instances to retain per pod container. The default is *1*.

* The maximum number of total dead containers in the node. The default is *-1*, which means unlimited.

[IMPORTANT]
====
Garbage collection only removes containers that are not associated with any pod.
====

For container garbage collection, you can modify any of the following variables
in the `*kubeletArguments*` section of the appropriate node configuration map.

.Variables for configuring container garbage collection

[options="header",cols="1,3"]
|===

|Setting |Description

|`*minimum-container-ttl-duration*`
|The minimum age at which a container is eligible for garbage collection. The
default is *0*. Use *0* for no limit. Values for this setting can be
specified using unit suffixes such as *h* for hours, *m* for minutes, and *s* for seconds.

|`*maximum-dead-containers-per-container*`
|The number of instances to retain per pod container. The default is *1*.

|`*maximum-dead-containers*`
|The maximum number of total dead containers in the node. The default is *-1*, which means unlimited.
|===

The `*maximum-dead-containers*` setting takes precedence over the
`*maximum-dead-containers-per-container*` setting when there is a conflict. For
example, if retaining the number of `*maximum-dead-containers-per-container*`
instances would result in a total number of containers that is greater than
`*maximum-dead-containers*`, the oldest containers are removed to satisfy
the `*maximum-dead-containers*` limit.

When the node removes the dead containers, all files inside those containers are
removed as well. Only containers created by the node are removed.
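
For reference, the following `*kubeletArguments*` stanza is a minimal sketch that sets all three container garbage collection values together; the values shown are illustrative, not recommendations:

[source,yaml]
----
kubeletArguments:
  minimum-container-ttl-duration:
    - "1m"
  maximum-dead-containers-per-container:
    - "2"
  maximum-dead-containers:
    - "240"
----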

ifdef::openshift-origin[]
[NOTE]
====
Currently, Docker and rkt are supported. The following only applies to Docker;
rkt has its own garbage collection.
====
endif::[]

Each spin of the garbage collector loop goes through the following steps:

1. Retrieve a list of available containers.
2. Filter out all containers that are running or have not been dead longer than
the `*minimum-container-ttl-duration*` parameter.
3. Classify all remaining containers into equivalence classes based on pod and image name membership.
4. Remove all unidentified containers (containers that are managed by the kubelet but have malformed names).
5. For each class that contains more containers than the
`*maximum-dead-containers-per-container*` parameter allows, sort the containers in the class by
creation time.
6. Start removing containers, oldest first, until the
`*maximum-dead-containers-per-container*` parameter is satisfied.
7. If there are still more containers in the list than the
`*maximum-dead-containers*` parameter allows, the collector starts removing containers
from each class so the number of containers in each one is not greater than the
average number of containers per class, or
`<all_remaining_containers>/<number_of_classes>`.
8. If this is still not enough, sort all containers in the list and start
removing containers, oldest first, until the `*maximum-dead-containers*`
criterion is met.

modules/nodes-nodes-garbage-collection-images.adoc

Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
// Module included in the following assemblies:
//
// * nodes/nodes-nodes-garbage-collection.adoc

[id='nodes-nodes-garbage-collection-images_{context}']
= Understanding how images are removed through garbage collection

Image garbage collection relies on disk usage as reported by *cAdvisor* on the
node to decide which images to remove from the node.

The policy for image garbage collection is based on two conditions:

* The percent of disk usage (expressed as an integer) that triggers image
garbage collection. The default is *85*.

* The percent of disk usage (expressed as an integer) to which image garbage
collection attempts to free disk space. The default is *80*.

For image garbage collection, you can modify any of the following variables
in the `*kubeletArguments*` section of the appropriate node configuration map.

.Variables for configuring image garbage collection

[options="header",cols="1,3"]
|===

|Setting |Description

|`*image-gc-high-threshold*`
|The percent of disk usage (expressed as an integer) that triggers image
garbage collection. The default is *85*.

|`*image-gc-low-threshold*`
|The percent of disk usage (expressed as an integer) to which image garbage
collection attempts to free disk space. The default is *80*.
|===
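
For reference, a minimal sketch of the corresponding `*kubeletArguments*` stanza follows. With these illustrative values, on a node whose image filesystem is 100 GB, collection triggers once usage exceeds 85 GB and deletes the oldest images until usage falls to 80 GB or below:

[source,yaml]
----
kubeletArguments:
  image-gc-high-threshold:
    - "85"
  image-gc-low-threshold:
    - "80"
----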

Two lists of images are retrieved in each garbage collector run:

1. A list of images currently running in at least one pod.
2. A list of images available on the host.

As new containers are run, new images appear. All images are marked with a time
stamp. If the image is running (the first list above) or is newly detected (the
second list above), it is marked with the current time. The remaining images are
already marked from the previous spins. All images are then sorted by the time
stamp.

Once the collection starts, the oldest images get deleted first until the
stopping criterion is met.

modules/nodes-nodes-jobs-about.adoc

Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
// Module included in the following assemblies:
//
// * nodes/nodes-nodes-jobs.adoc

[id='nodes-nodes-jobs-about_{context}']
= Understanding jobs and CronJobs in {product-title}

A job tracks the overall progress of a task and updates its status with information
about active, succeeded, and failed pods. Deleting a job cleans up any pods it created.
Jobs are part of the Kubernetes API, which can be managed
with `oc` commands like other object types.

There are two possible resource types that allow creating run-once objects in {product-title}:

Job::
A regular job is a run-once object that creates a task and ensures the job finishes.

CronJob::
A CronJob is a job that can be scheduled to run multiple times. If you need a task to run repeatedly on a schedule, use a CronJob.

A _CronJob_ builds on a regular job by allowing you to specify
how the job should be run. CronJobs are part of the
link:http://kubernetes.io/docs/user-guide/cron-jobs[Kubernetes] API, which
can be managed with `oc` commands like other object types.
CronJobs are useful for creating periodic and recurring tasks, such as running backups or sending emails.
CronJobs can also schedule individual tasks for a specific time, such as if you want to schedule a job for a low-activity period.

ifdef::openshift-online[]
[IMPORTANT]
====
CronJobs are available only in _OpenShift Online Pro_. For more information about the
differences between the Starter and Pro tiers, visit the
link:https://www.openshift.com/pricing/index.html[pricing page].
====
endif::[]

[WARNING]
====
A CronJob creates a job object approximately once per execution time of its
schedule, but there are circumstances in which it fails to create a job or
creates two jobs. Therefore, jobs must be idempotent, and you must
configure history limits.
====

[[jobs-create]]
== Understanding how to create jobs

Both resource types require a job configuration that consists of the following key parts, shown together in the sketch after this list:

- A pod template, which describes the pod that {product-title} creates.
- An optional `parallelism` parameter, which specifies how many pods should run in parallel at any point in time while executing a job. If not specified, this defaults to
the value in the `completions` parameter.
- An optional `completions` parameter, which specifies how many successful pod completions are needed to finish a job. If not specified, this value defaults to one.
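
A minimal job definition along the following lines shows the three parts together. This is a sketch: the `pi` name, image, and command are illustrative (borrowed from the common upstream Kubernetes example), not part of {product-title} itself:

[source,yaml]
----
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  parallelism: 1 <1>
  completions: 1 <2>
  template: <3>
    metadata:
      name: pi
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: OnFailure
----
<1> Optional: the number of pods that run in parallel.
<2> Optional: the number of successful pod completions needed to finish the job.
<3> The pod template that describes the pod {product-title} creates.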

[[jobs-set-max]]
== Understanding how to set a maximum duration for jobs

When defining a job, you can define its maximum duration by setting
the `activeDeadlineSeconds` field. It is specified in seconds and is not
set by default. When not set, no maximum duration is enforced.

The maximum duration is counted from the time the first pod gets scheduled in
the system, and defines how long a job can be active. It tracks the overall time of
an execution. After reaching the specified timeout, the job is terminated by {product-title}.
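
For example, this fragment of a job specification (the value is illustrative) terminates the job if it remains active longer than 10 minutes:

[source,yaml]
----
spec:
  activeDeadlineSeconds: 600
----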

[[jobs-set-backoff]]
== Understanding how to set a job back off policy for pod failure

A job can fail repeatedly due to a logical error in its configuration or other
similar reasons. After a set number of retries, the job is considered failed. Failed pods associated with the job are recreated by the controller with
an exponential back-off delay (`10s`, `20s`, `40s` ...) capped at six minutes. The
limit is reset if no new failed pods appear between controller checks.

Use the `spec.backoffLimit` parameter to set the number of retries for a job.
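
For example, this fragment (the value is illustrative; *6* is also the upstream Kubernetes default) marks the job failed after six retries:

[source,yaml]
----
spec:
  backoffLimit: 6
----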

[[jobs-artifacts]]
== Understanding how to configure a CronJob to remove artifacts

CronJobs can leave behind artifact resources such as jobs or pods. As a user, it is important
to configure history limits so that old jobs and their pods are properly cleaned up. There are two fields in the CronJob specification responsible for this:

* `.spec.successfulJobsHistoryLimit`. The number of successfully finished jobs to retain (defaults to 3).

* `.spec.failedJobsHistoryLimit`. The number of failed finished jobs to retain (defaults to 1).
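
A sketch of where these fields sit in a CronJob specification, assuming the `batch/v1beta1` API and illustrative values:

[source,yaml]
----
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: example-cronjob
spec:
  schedule: "*/1 * * * *"
  successfulJobsHistoryLimit: 3 <1>
  failedJobsHistoryLimit: 1 <2>
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox
            command: ["echo", "hello"]
          restartPolicy: OnFailure
----
<1> Retain the three most recent successful jobs.
<2> Retain the most recent failed job.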

[TIP]
====
* Delete CronJobs that you no longer need:
+
[source,bash]
----
$ oc delete cronjob/<cron_job_name>
----
+
Doing this prevents them from generating unnecessary artifacts.

* You can suspend further executions by setting `spec.suspend` to `true`. All subsequent executions are suspended until you reset it to `false`.
====

[[jobs-limits]]
== Known limitations

The job specification restart policy only applies to the _pods_, and not the _job controller_. However, the job controller is hard-coded to keep retrying jobs to completion.

As such, `restartPolicy: Never` or `--restart=Never` results in the same behavior as `restartPolicy: OnFailure` or `--restart=OnFailure`. That is, when a job fails, it is restarted automatically until it succeeds (or is manually discarded). The policy only sets which subsystem performs the restart.

With the `Never` policy, the _job controller_ performs the restart. With each attempt, the job controller increments the number of failures in the job status and creates new pods. This means that with each failed attempt, the number of pods increases.

With the `OnFailure` policy, the _kubelet_ performs the restart. Each attempt does not increment the number of failures in the job status. In addition, the kubelet retries failed jobs by starting pods on the same node.
