Commit 566bbe4

Merge pull request #95248 from StephenJamesSmith/TELCODOCS-2306
TELCODOCS-2306: KMM 2.4 Release Notes
2 parents bc1af43 + 71f00fb commit 566bbe4

1 file changed (281 additions, 2 deletions)

hardware_enablement/kmm-release-notes.adoc

@@ -14,7 +14,6 @@ toc::[]

// TELCODOCS-2028
* The Kernel Module Management (KMM) Operator images are now based on `rhel-els-minimal` container images instead of the `rhel-els` images. This change results in a greatly reduced image footprint, while still maintaining FIPS compliance.

// TELCODOCS-1994
* In this release, the firmware search path has been updated: KMM copies the contents of the specified firmware path into the path set in `worker.setFirmwareClassPath` (default: `/var/lib/firmware`). For more information, see xref:../hardware_enablement/kmm-kernel-module-management.adoc#kmm-example-cr_kernel-module-management-operator[Example Module CR].

@@ -28,4 +27,284 @@ toc::[]
* In this release, KMM uses version 1.23 of the Golang programming language to ensure test continuity for partners.

// TELCODOCS-2197
* You can now schedule KMM pods by defining taints and tolerations. For more information, see xref:../hardware_enablement/kmm-kernel-module-management.adoc#kmm-using-tolerations-for-kernel-module-scheduling_kernel-module-management-operator[Using tolerations for kernel module scheduling].

[id="kmm-2-4-RN"]
== Release notes for Kernel Module Management Operator 2.4

=== New features and enhancements

// TELCODOCS-2311
* In this release, you can configure a Kernel Module Management (KMM) module to use the in-tree kernel driver instead of loading an out-of-tree kernel driver, and to run only the device plugin. For more information, see xref:../hardware_enablement/kmm-kernel-module-management.adoc#kmm-using-intree-modules_kernel-module-management-operator[Using in-tree modules with the device plugin].

// TELCODOCS-2304
* In this release, KMM configurations persist after cluster upgrades, KMM Operator upgrades, and redeployments of KMM.
+
In earlier releases, a cluster or KMM upgrade, or any other action that redeploys KMM, such as changing a non-default configuration like the firmware path, could require you to reconfigure KMM. In this release, KMM configurations persist regardless of any such action.
+
For more information, see xref:../hardware_enablement/kmm-kernel-module-management.adoc#kmm-configuring-kmmo_kernel-module-management-operator[Configuring the Kernel Module Management Operator].

// MGMT-19735
* KMM has been improved so that GPU Operator vendors no longer need to replicate KMM functionality in their own code and can instead use KMM as is. This change reduces the code size of vendor Operators and improves their testing and reliability.

// MGMT-18966
* In this release, KMM no longer uses direct HTTP(S) requests to check whether a kmod image exists. Instead, KMM uses CRI-O internally to check for the images. This removes the need to access container image registries directly through HTTP(S) requests and to manually handle tasks such as reading `/etc/containers/registries.conf` for mirroring configuration, accessing the image cluster resource for TLS configuration, mounting the CAs from the node, and maintaining a separate cache in the hub and spoke environment.

// MGMT-19383
* The KMM and KMM-hub Operators have been assigned the "Meets Best Practices" label in the https://catalog.redhat.com/search?searchType=software[Red Hat Catalog].

// MGMT20613
* You can now install KMM on compute nodes, if needed. Previously, in clusters where workloads cannot be deployed on the control-plane nodes, the KMM Operator might have required further configuration, because compute nodes do not have the `node-role.kubernetes.io/control-plane` or `node-role.kubernetes.io/master` labels. An internal code change has resolved this issue.

// MGMT-20249
* In this release, the heartbeat filter for the NodeModulesConfig (NMC) reconciler has been updated. The reconciler still filters out node heartbeats, but it now receives events for changes to the following node fields:

** `node.spec`
** `metadata.labels`
** `status.nodeInfo`
** `status.conditions[]` (`NodeReady` only)


=== Notable technical changes

// TELCODOCS-2343
* In this release, the preflight validation resource in the cluster has been modified. You can use preflight validation to verify the kernel modules to be installed on the nodes after cluster upgrades and possible kernel upgrades. Preflight validation also reports the status and progress of each module in the cluster that it attempts, or has attempted, to validate. For more information, see xref:../updating/preparing_for_updates/kmm-preflight-validation.adoc#kmm-validation-kickoff_kmm-preflight-validation[Preflight validation for Kernel Module Management (KMM) Modules].

// TELCODOCS-2344
* When you create a kmod image, the image must include both the `.ko` kernel module files and the `cp` binary, which is required for copying files during the image loading process. For more information, see xref:../hardware_enablement/kmm-kernel-module-management.adoc#kmm-creating-kmod-image_kernel-module-management-operator[Creating a kmod image].

//MGMT-19919
* The `capabilities` field, which refers to the Operator maturity level, has been changed from `Basic Install` to `Seamless Upgrades`. `Basic Install` indicates that the Operator does not have an upgrade option, which is not the case for KMM, because KMM supports seamless upgrades.

=== Bug fixes

// MGMT-19548
* The webhook deployment has been renamed from `webhook-server` to `webhook`.

** *Cause*: Running `controller-gen` to generate files created a service called `webhook-service` that is not configurable. In addition, when KMM is deployed with {olm-first}, OLM deploys a service for the webhook called `-service`.

** *Consequence*: Two services were generated for the same deployment: one generated by `controller-gen` and added to the bundle manifests, and another created by OLM.

** *Fix*: The deployment is now called `webhook`, so OLM finds the already existing service called `webhook-service` in the cluster.

** *Result*: A second service is no longer created.

// MGMT-20892
* Using the `imageRepoSecret` object together with the Driver Toolkit (DTK) image stream results in an `authorization required` error.

** *Cause*: In the Kernel Module Management (KMM) Operator, when you set the `imageRepoSecret` object in the KMM module and the container image produced by the build is stored in the cluster's internal registry, the build fails to push the final image and generates an `authorization required` error.

** *Consequence*: The KMM Operator does not work as expected.

** *Fix*: When the `imageRepoSecret` object is user-defined, the build process uses it as both a pull secret and a push secret. To use the cluster's internal registry, you must add the authorization token for that registry to the `imageRepoSecret` object. You can obtain the token from the `build` service account in the namespace of the KMM module.

** *Result*: The KMM Operator works as expected.

// MGMT-16797
* Creating or deleting the image, or creating an MCM, does not load the module on the spoke cluster.

** *Cause*: In a hub and spoke environment, when you create or delete the image in the registry, or when you create a `ManagedClusterModule` (MCM), the module on the spoke cluster is not loaded.

** *Consequence*: The module on the spoke cluster is not created.

** *Fix*: The cache package and image translation are removed from the hub and spoke environment.

** *Result*: The module on the spoke cluster is created, including when the MCM object is created for the second time.

// MGMT-19859
* KMM cannot pull images from private registries during in-cluster builds.

** *Cause*: The Kernel Module Management (KMM) Operator cannot pull images from private registries during in-cluster builds.

** *Consequence*: Images in private registries that are used in the build process cannot be pulled.

** *Fix*: The `imageRepoSecret` object configuration is now also used in the build process. The specified `imageRepoSecret` object must include credentials for all registries that are used.

** *Result*: You can now use private registries for in-cluster builds.

// MGMT-19897
* The KMM worker pod is orphaned when you delete a module with a container image that cannot be pulled.

** *Cause*: A Kernel Module Management (KMM) Operator worker pod is orphaned when you delete a module with a container image that cannot be pulled.

** *Consequence*: Failing worker pods are left on the cluster and are never garbage collected.

** *Fix*: KMM now garbage collects orphaned failing pods when the module is deleted.

** *Result*: The module is successfully deleted, and all associated orphaned failing pods are also deleted.

// MGMT-19898
* The KMM Operator tries to create a MIC even when the node selector does not match any nodes.

** *Cause*: The Kernel Module Management (KMM) Operator tries to create a `ModuleImagesConfig` (MIC) resource even when the node selector does not match any nodes, and the creation fails.

** *Consequence*: The KMM Operator reports an error when reconciling a module that does not target any node.

** *Fix*: The `Images` field in the MIC resource is now optional.

** *Result*: The KMM Operator can successfully create the MIC resource even when it contains no images.

// MGMT-20247
* KMM does not reload the kernel module if the node reboot sequence is too quick.

** *Cause*: The Kernel Module Management (KMM) Operator does not reload the kernel module if the node reboot sequence is too quick. KMM detects a reboot when the timestamp of the node status condition is later than the timestamp in the NodeModulesConfig (NMC) status.

** *Consequence*: When the reboot completes in less time than the grace period, the node state does not change. After the node reboots, KMM does not load the kernel module again.

** *Fix*: Instead of relying on the condition state, KMM now relies on the `Status.NodeInfo.BootID` field. The kubelet sets this field from the `/proc/sys/kernel/random/boot_id` file of the node, so the value changes after each reboot.

** *Result*: The more accurate reboot detection enables the KMM Operator to reload the kernel module after the node reboot sequence.

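The following minimal Go sketch shows why comparing boot IDs detects fast reboots that a timestamp comparison can miss. The `nodeState` type, its `LoadedWithBootID` field, and the `rebootedSinceLoad` function are hypothetical: the sketch assumes that the boot ID observed when the module was loaded is recorded somewhere, which is not necessarily how KMM stores it.

```go
package main

import "fmt"

// nodeState is a hypothetical snapshot used only for this illustration.
type nodeState struct {
	CurrentBootID    string // value of status.nodeInfo.bootID, set by the kubelet
	LoadedWithBootID string // hypothetical: boot ID recorded when the module was loaded
}

// rebootedSinceLoad reports whether the node has booted since the module was
// loaded. The kernel generates a new random boot ID on every boot, so this
// check works however fast the reboot was, whereas a condition timestamp
// might not change within the grace period.
func rebootedSinceLoad(s nodeState) bool {
	return s.CurrentBootID != s.LoadedWithBootID
}

func main() {
	fmt.Println(rebootedSinceLoad(nodeState{CurrentBootID: "boot-a", LoadedWithBootID: "boot-a"})) // false
	fmt.Println(rebootedSinceLoad(nodeState{CurrentBootID: "boot-b", LoadedWithBootID: "boot-a"})) // true
}
```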
// MGMT-20248
* The NodeModulesConfig (NMC) controller does not filter out node heartbeat events.

** *Cause*: The NMC controller is flooded with events from node heartbeats. Node heartbeats let the Kubernetes API server know that the node is still connected and functional.

** *Consequence*: The flood of events causes constant reconciliation, even when no module, and therefore no NMC, is applied to the cluster.

** *Fix*: The NMC controller now filters node heartbeats out of its reconciliation loop.

** *Result*: The NMC controller only receives real events and filters out node heartbeats.

// MGMT-20286
* The NMC status contains toleration values, even though there are no tolerations in the `NMC.spec` or in the module.

** *Cause*: The NodeModulesConfig (NMC) status contains toleration values, even though there are no tolerations in the `NMC.spec` or in the module.

** *Consequence*: Tolerations other than KMM-specific tolerations can appear in the status.

** *Fix*: The NMC status now gets its tolerations from a dedicated annotation rather than from the worker pod.

** *Result*: The NMC status contains only the module's tolerations.

// MGMT-20725
* The KMM Operator version 2.4 fails to start properly and cannot list the `modulebuildsignconfigs` resource.

** *Cause*: When the Kernel Module Management (KMM) Operator is installed using Red Hat Konflux, it does not start properly, and errors are written to the log files.

** *Consequence*: The KMM Operator does not work as expected.

** *Fix*: The ClusterServiceVersion (CSV) file is updated to list the `modulebuildsignconfigs` and `moduleimagesconfig` resources.

** *Result*: The KMM Operator works as expected.

// MGMT-20752
* The Red{nbsp}Hat Konflux build does not include the version and Git commit ID in the Operator logs.

** *Cause*: When the Kernel Module Management (KMM) Operator was built using Communications Platform as a Service (CPaaS), the build included the Operator version and Git commit ID in the log files. However, Red Hat Konflux builds do not include these details in the log files.

** *Consequence*: Important information is missing from the log files.

** *Fix*: The Konflux build is modified to resolve this issue.

** *Result*: The KMM Operator build now includes the Operator version and Git commit ID in the log files.

// MGMT-20754
* The KMM Operator does not load the module after a node with a taint is rebooted.

** *Cause*: The Kernel Module Management (KMM) Operator does not reload the kernel module if the node reboot sequence is too quick. KMM detects a reboot when the timestamp of the node status condition is later than the timestamp in the NodeModulesConfig (NMC) status.

** *Consequence*: When the reboot completes in less time than the grace period, the node state does not change. After the node reboots, KMM does not load the kernel module again.

** *Fix*: Instead of relying on the condition state, KMM now relies on the `Status.NodeInfo.BootID` field. The kubelet sets this field from the `/proc/sys/kernel/random/boot_id` file of the node, so the value changes after each reboot.

** *Result*: The more accurate reboot detection enables the KMM Operator to reload the kernel module after the node reboot sequence.

// MGMT-20775
* Redeploying a module that uses in-cluster builds fails with an `ImagePullBackOff` error.

** *Cause*: In the Kernel Module Management (KMM) Operator, the image pull policy of the pull pod differs from that of the worker pod.

** *Consequence*: An image can be considered to exist when, in fact, it does not.

** *Fix*: The pull pod now uses the image pull policy defined in the KMM module, which is the same policy that the worker pod uses.

** *Result*: The MIC represents the state of the image in the same way that the worker pod accesses it.

// MGMT-20785
* The MIC controller creates two pull pods when it should create only one.

** *Cause*: In the Kernel Module Management (KMM) Operator, the `ModuleImagesConfig` (MIC) controller might create multiple pull pods for the same image.

** *Consequence*: Resources are wasted on redundant pull pods.

** *Fix*: The `CreateOrPatch` MIC API receives a slice of `ImageSpecs` that is built by iterating over the target nodes and adding their images to the slice. Duplicate `ImageSpecs` are now filtered out of this slice.

** *Result*: The KMM Operator works as expected.

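The following Go sketch shows one way to filter out duplicates while building the slice from the target nodes, as the fix above describes. The `ImageSpec` type here is a simplified, hypothetical stand-in with a single field, and `imageSpecsForNodes` is an illustrative name, not a KMM function.

```go
package main

import "fmt"

// ImageSpec is a simplified, hypothetical stand-in for the ImageSpec type
// passed to the MIC API; the real type has more fields.
type ImageSpec struct {
	Image string
}

// imageSpecsForNodes builds the input slice by iterating over the target
// nodes and adding each node's images, skipping specs that were already added.
func imageSpecsForNodes(imagesPerNode [][]ImageSpec) []ImageSpec {
	seen := make(map[ImageSpec]struct{})
	var specs []ImageSpec
	for _, nodeImages := range imagesPerNode {
		for _, spec := range nodeImages {
			if _, dup := seen[spec]; dup {
				continue // duplicate: skip it so that one pull pod per image suffices
			}
			seen[spec] = struct{}{}
			specs = append(specs, spec)
		}
	}
	return specs
}

func main() {
	// Two nodes that run the same kernel resolve to the same kmod image.
	nodes := [][]ImageSpec{
		{{Image: "registry.example.com/kmod:5.14.0-427"}},
		{{Image: "registry.example.com/kmod:5.14.0-427"}},
		{{Image: "registry.example.com/kmod:5.14.0-503"}},
	}
	fmt.Println(len(imageSpecsForNodes(nodes))) // 2
}
```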
// MGMT-20827
* The `job.gcDelay` example in the documentation should specify `0s` instead of `0`.

** *Cause*: The default value of the Kernel Module Management (KMM) Operator `job.gcDelay` duration field is `0s`, but the documentation shows the value as `0`.

** *Consequence*: Entering a custom value of `60` instead of `60s` or `1m` might result in an error because of the wrong input type.

** *Fix*: The documentation now shows the default value of the `job.gcDelay` field as `0s`.

** *Result*: Users are less likely to be confused about the expected value format.

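Kubernetes duration fields such as `metav1.Duration` are parsed with Go's `time.ParseDuration`, which requires a unit suffix for any non-zero value. Assuming that `job.gcDelay` is parsed the same way, the following Go sketch shows which values are accepted; `validGCDelay` is a hypothetical helper, not part of KMM. Note that `time.ParseDuration` accepts the bare string `0` as a special case, so writing the unit (`0s`) mainly keeps the default consistent with non-zero values such as `60s`.

```go
package main

import (
	"fmt"
	"time"
)

// validGCDelay is a hypothetical helper that reports whether s is accepted
// by time.ParseDuration, the parser used for Kubernetes duration strings.
func validGCDelay(s string) bool {
	_, err := time.ParseDuration(s)
	return err == nil
}

func main() {
	// "60" has no unit, so it is the only value here that is rejected.
	for _, v := range []string{"0s", "60s", "1m", "60"} {
		fmt.Printf("%q valid: %t\n", v, validGCDelay(v))
	}
}
```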
// MGMT-20835
* The KMM Operator hub environment does not work because the MIC and MBSC CRDs are missing.

** *Cause*: The Kernel Module Management (KMM) Operator hub environment generates custom resource definition (CRD) files only from the `api-hub/` directory. As a result, some CRDs that the KMM Operator hub environment requires, such as the `ModuleImagesConfig` (MIC) and `ModuleBuildSignConfig` (MBSC) CRDs, are missing.

** *Consequence*: The KMM Operator hub environment cannot work because it tries to start controllers that reconcile CRDs that do not exist in the cluster.

** *Fix*: All CRD files are now generated into the `config/crd-hub/bases` directory, but only the resources that the hub environment actually needs are applied to the cluster.

** *Result*: The KMM Operator hub environment works as expected.

// MGMT-20852
* The KMM Operator hub environment cannot build images when finalizers are not set on a resource.

** *Cause*: The Kernel Module Management (KMM) Operator reports an error when the `ManagedClusterModule` controller fails to build. This is caused by missing `ModuleImagesConfig` (MIC) resource finalizers and role-based access control (RBAC) permissions for the KMM Operator hub environment.

** *Consequence*: The KMM Operator hub environment cannot build images.

** *Fix*: The RBAC permissions are updated to allow updating finalizers on the MIC resource, and the appropriate rules are created.

** *Result*: The KMM Operator hub environment builds images with the `ManagedClusterModule` controller without errors.

// MGMT-20861
* A `PreflightValidationOCP` custom resource with `kernelVersion: tesdt` causes the KMM Operator to panic.

** *Cause*: Creating a `PreflightValidationOCP` custom resource (CR) with the `kernelVersion` field set to `tesdt` causes the Kernel Module Management (KMM) Operator to panic with a runtime error.

** *Consequence*: Entering an invalid kernel version causes the KMM Operator to panic.

** *Fix*: A webhook that validates the `PreflightValidationOCP` CR is now added, so the CR is checked before it is admitted to the cluster.

** *Result*: A `PreflightValidationOCP` CR with an invalid kernel version can no longer be applied to the cluster, which prevents the Operator from panicking.

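The following Go sketch illustrates the kind of check that such a webhook can perform before a `PreflightValidationOCP` CR is admitted. The regular expression and the `validateKernelVersion` function are hypothetical and deliberately simple: they accept strings that start with a `MAJOR.MINOR.PATCH` version, optionally followed by a release suffix, and they are not the validation rules that KMM implements.

```go
package main

import (
	"fmt"
	"regexp"
)

// kernelVersionRe is a hypothetical pattern for illustration only: a
// MAJOR.MINOR.PATCH prefix, optionally followed by a release suffix such as
// "-427.13.1.el9_4.x86_64".
var kernelVersionRe = regexp.MustCompile(`^[0-9]+\.[0-9]+\.[0-9]+(-[0-9A-Za-z._+~-]+)?$`)

// validateKernelVersion returns an error for values that cannot be a kernel
// version, so that a bad CR is rejected at admission time instead of
// reaching the Operator's reconciliation logic.
func validateKernelVersion(v string) error {
	if !kernelVersionRe.MatchString(v) {
		return fmt.Errorf("invalid kernelVersion %q", v)
	}
	return nil
}

func main() {
	for _, v := range []string{"5.14.0-427.13.1.el9_4.x86_64", "tesdt"} {
		fmt.Println(v, validateKernelVersion(v))
	}
}
```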
// MGMT-20866
* A `PreflightValidationOCP` custom resource with a `kernelVersion` field that differs from the kernel version of the cluster does not work.

** *Cause*: Creating a `PreflightValidationOCP` custom resource (CR) with a `kernelVersion` field that differs from the kernel version of the cluster does not work.

** *Consequence*: The Kernel Module Management (KMM) Operator cannot find the Driver Toolkit (DTK) input image for the new kernel version.

** *Fix*: When you create the `PreflightValidationOCP` CR, you must explicitly set the `dtkImage` field.

** *Result*: By using the `kernelVersion` and `dtkImage` fields, preflight validation can build the installed modules for target {product-title} versions.

// MGMT-20888
* The KMM Operator version 2.4 documentation is updated with `PreflightValidationOCP` information.

** *Cause*: Previously, when creating a `PreflightValidationOCP` CR, you were required to supply the release image. Now, you must set the `kernelVersion` and `dtkImage` fields instead.

** *Consequence*: The documentation was outdated and required an update.

** *Fix*: The documentation is updated with the new support details.

** *Result*: The KMM preflight validation feature is documented correctly.

=== Known issues

// MGMT-20453 changed security settings in jira
* The `ModuleUnloaded` event does not appear when a module is `Unloaded`.

** *Cause*: When a module is `Loaded` (which creates a `ModuleLoad` event) or `Unloaded` (which creates a `ModuleUnloaded` event), the events might not appear. This happens when you load and unload the kernel module in quick succession.

** *Consequence*: The `ModuleLoad` and `ModuleUnloaded` events might not appear in {product-title}.

** *Fix*: Introduce an alerting mechanism for this potential behavior, to raise awareness when working with modules.

** *Result*: Not yet available.
