Commit 14a6c31

Airren authored and waynepeking348 committed
doc(*): add proposal for enhance orm by nri
Signed-off-by: Airren <qiang.ren@intel.com>
1 parent 4edd0e2 commit 14a6c31

5 files changed: 373 additions and 0 deletions
@@ -0,0 +1,373 @@
---
title: Enhance ORM by NRI
authors:
  - "airren"
  - "hle2"
reviewers:
  - "caohe"
creation-date: 2024-03-03
last-updated: 2024-04-24
status: implementable

---

# Enhance ORM by NRI

<!--ts-->
* [Enhance ORM by NRI](#enhance-orm-by-nri)
   * [Summary](#summary)
   * [Motivation](#motivation)
      * [Goals](#goals)
      * [Non-Goals/Future Work](#non-goalsfuture-work)
   * [Proposal](#proposal)
      * [User Stories](#user-stories)
         * [Story1: Use origin kubernetes without intrusive modifications](#story1-use-origin-kubernetes-without-intrusive-modifications)
         * [Story2: Synchronous configuration of QoS policies and injection of environment variables](#story2-synchronous-configuration-of-qos-policies-and-injection-of-environment-variables)
      * [Requirements](#requirements)
         * [Functional Requirements](#functional-requirements)
         * [Non-Functional Requirements](#non-functional-requirements)
      * [Design Details](#design-details)
         * [Detailed working flow](#detailed-working-flow)
         * [Addon](#addon)
         * [Modification](#modification)
         * [Test Plan](#test-plan)
   * [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
      * [Feature Enablement and Rollback](#feature-enablement-and-rollback)
         * [How can this feature be enabled / disabled in a live cluster?](#how-can-this-feature-be-enabled--disabled-in-a-live-cluster)
      * [Troubleshooting](#troubleshooting)
         * [How does this feature react if the NRI not supported?](#how-does-this-feature-react-if-the-nri-not-supported)
         * [How to handle resource allocation failures?](#how-to-handle-resource-allocation-failures)
         * [What happens if the NRI stub times out or if the socket connection fails?](#what-happens-if-the-nri-stub-times-out-or-if-the-socket-connection-fails)
   * [Appendix](#appendix)
   * [Implementation History](#implementation-history)

<!-- Created by https://github.com/ekalinin/github-markdown-toc -->
<!-- Added by: airren, at: Wed Mar 27 14:55:54 CST 2024 -->

<!--te-->

## Summary

Latency-sensitive services need sufficient resource guarantees, especially when
online and offline tasks are colocated on the same nodes. This requires Kubernetes
to provide more granular resource management capabilities, enhance container
isolation, and reduce interference between containers.

As of now, Kubernetes does not offer a fully comprehensive resource management
solution. Many open-source projects in the Kubernetes ecosystem have devised
their own methods to modify the deployment and management processes of pods,
enabling fine-grained resource allocation.

There are various approaches to extending Kubernetes, which we have summarized
as follows.

![kubernetes-enhance-overview](kubernetes-enhance-overview.png)

All the methods listed above can enhance Kubernetes, but except for the standalone
approach, they unavoidably involve intrusive modifications to the upstream Kubernetes
components, making it difficult for users to stay synchronized with upstream.
Although the standalone approach avoids modifying upstream components, its
asynchronous update method also has numerous drawbacks.

NRI emerged to remove the need for intrusive modifications to Kubernetes and
changes to its default process, and to give developers a more unified
implementation approach.

[NRI](https://github.com/containerd/nri) is a plugin-based node resource management approach introduced by
the upstream community. Using NRI, Kubernetes' node resource management capabilities
can be enhanced through plugins without intrusive modifications to the upstream
Kubernetes components.

> NRI allows plugging domain- or vendor-specific custom logic into OCI-compatible
> runtimes. This logic can make controlled changes to containers or perform extra
> actions outside the scope of OCI at certain points in a containers lifecycle.
> This can be used, for instance, for improved allocation and management of devices
> and other container resources.

![nri-architecture](nri-architecture.png)

This proposal describes how to enhance Katalyst using NRI, so that Katalyst can
be deployed on original (upstream) Kubernetes and is easier to maintain and use.

## Motivation

Katalyst enhances Kubernetes resource management policies on a single node through
the QoS Resource Manager (QRM). However, the current QRM mode involves intrusive
modifications to the kubelet, which is inconvenient for users who run original
Kubernetes rather than the KubeWharf distribution. To address this, Katalyst
proposes the ORM architecture, which provides a solution decoupled from the
kubelet as a supplement to the QRM solution.

The ORM architecture has two implementation approaches. The first approach,
named Bypass, polls the kubelet's API for pod events on the current node and
updates pod resources. This approach is asynchronous and cannot inject parameters
such as environment variables. The other approach is based on NRI. NRI (Node
Resource Interface) is a general framework for CRI-compatible container runtime
plugin extensions. It offers a mechanism for extensions to monitor pod/container
states and make limited configuration modifications. Using NRI, Katalyst can
synchronously modify resources and inject other information, such as environment
variables, during pod events.

### Goals

- Expand Katalyst's ORM mode using NRI to enhance the resource management capabilities
  of Kubernetes.
- Support fine-grained resource control when containerd is used as the CRI runtime.

### Non-Goals/Future Work

- Support for other runtimes besides containerd, such as CRI-O and Docker.

## Proposal

Unlike QRM or ORM's Bypass Mode, the Katalyst agent works as an NRI plugin that
subscribes to pod/container lifecycle events from the CRI runtime (containerd in
this proposal). The Katalyst agent then either returns an adjusted container spec
from the hook event or updates the container spec through an active update.

- Get pod/container lifecycle events and pod or container information from NRI.
- Transform the NRI-format information into CRI format to reuse the existing admit
  implementation of the QRM plugins.
- Return the updated NRI-format container spec to the CRI runtime.
- While reconciling, use NRI `UpdateContainer` to reconfigure resources.

**NRI Enhanced ORM (along with kubelet polling)**

![orm-architecture](orm-architecture.png)

### User Stories

#### Story1: Use origin kubernetes without intrusive modifications

Extending and enhancing Kubernetes' resource management capabilities is a common
requirement in many business scenarios. It is equally common to require that all
Kubernetes components remain consistent with the upstream community, with no
intrusive modifications to the original Kubernetes components. With NRI mode
enabled, deploying Katalyst on an existing cluster does not require restarting
the cluster. Enhancements to the original Kubernetes are delivered through a
plugin-based approach.

#### Story2: Synchronous configuration of QoS policies and injection of environment variables

When enhancing QoS policies in Kubernetes, synchronous modification is the most
efficient method. With NRI Mode enabled, Katalyst plugins can synchronously modify
pod resources during pod creation, so that the QoS policy is applied before the
pod starts running. NRI Mode also allows pod resources to be updated dynamically.
During pod creation, adjustments to pod resources, device binding, RDT, and
environment variable injection can all be achieved via NRI Mode.

### Requirements

- containerd must be upgraded to v1.7.0 or later.

#### Functional Requirements

- Support all functionalities of Bypass Mode under the existing ORM architecture.
  This includes adjusting a container's cpuset, CFS quota, and memory QoS.
- Support injecting environment variables into containers.

#### Non-Functional Requirements

- Achieve synchronous configuration of QoS policies, improving the
  responsiveness of QoS policy configuration.
- Remain fully compatible with upstream native Kubernetes components, requiring no
  intrusive modifications.

### Design Details

#### Detailed working flow

![orm-nri-details](orm-nir-details.png)

In this section, the method based on polling the kubelet API is referred to as
**_Bypass_** Mode, while the method based on NRI is referred to as **_NRI_** Mode.

#### Addon

- ORM supports two operational modes: Bypass and NRI. Only one mode can be active
  at any given time. When a new ORM Manager is created, the operational mode is
  determined by reading the configuration; the mode cannot be changed at runtime.

```go
type workMode string

const (
	workModeNri    workMode = "nri"
	workModeBypass workMode = "bypass"
)

type ManagerImpl struct {
	ctx context.Context
	// ....
	// ORM run mode: bypass or nri.
	// Bypass mode is triggered by polling the kubelet API to get pod events.
	// NRI mode requires containerd version >= 1.7.0 with NRI enabled.
	mode workMode
	// ....
}

func NewManager( /* ... */ config *config.Configuration) {
	// init the ORM work mode with essential components
	m.initORMWorkMode(config, metaServer, emitter)
}

func (m *ManagerImpl) initORMWorkMode(config *config.Configuration, metaServer *metaserver.MetaServer, emitter metrics.MetricEmitter) {
	// init the ORM work mode according to the configuration and NRI status
}
```

- The ORM `ManagerImpl` functions as an NRI stub, implementing the processing logic
  within the corresponding hook event functions.

```go
import "github.com/containerd/nri/pkg/stub"

type ManagerImpl struct {
	ctx context.Context
	// ....
	// nriStub is the implementation of the NRI event handlers
	nriStub stub.Stub
	// nriMask stores the specific events that need to be hooked
	nriMask    stub.EventMask
	nriOptions []stub.Option
	nriConf    nriConfig
	// ....
}
```

- The enhanced ORM implementation requires three hook functions:
  `RunPodSandbox()`, `CreateContainer()`, and `RemovePodSandbox()`.

**Step 1**, during `RunPodSandbox()`, the `Admit()` function is triggered.
If `Admit()` succeeds, resources are allocated for the container, and the pod
creation process continues. If `Admit()` fails, pod creation also fails.
```go
func (m *ManagerImpl) RunPodSandbox(pod *api.PodSandbox) error {
	// Admit the pod and allocate its resources; a failure aborts pod creation.
	err := m.processAddPod(pod.Uid)
	if err != nil {
		klog.Errorf("[ORM] RunPodSandbox processAddPod fail, pod: %s/%s/%s, err: %v",
			pod.Namespace, pod.Name, pod.Uid, err)
	}
	return err
}
```

**Step 2**, after a successful `Admit()`, the process proceeds to the
`CreateContainer()` event. At this point, resources have been allocated for the
container by `Admit()`. The corresponding resources are written into the container's
spec, which is returned to the runtime.
```go
func (m *ManagerImpl) CreateContainer(pod *api.PodSandbox, container *api.Container) (*api.ContainerAdjustment, []*api.ContainerUpdate, error) {
	// Update the container spec from the podResources
	adjust, err := m.updateContainer(pod, container)
	return adjust, nil, err
}
```
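
As a rough illustration of Step 2, the sketch below turns a stored allocation result into the values that would be returned as a container adjustment. The `allocation` and `adjustment` types and the `buildAdjustment` helper are hypothetical stand-ins introduced only for this example; they are not Katalyst types and not the `api.ContainerAdjustment` type from the NRI package.

```go
package main

import (
	"fmt"
	"sort"
)

// allocation is a hypothetical stand-in for what Admit() stored for a container.
type allocation struct {
	CPUSet      string            // e.g. "0-3"
	CFSQuota    int64             // microseconds per period; 0 means "leave unchanged"
	MemoryLimit int64             // bytes; 0 means "leave unchanged"
	Env         map[string]string // environment variables to inject
}

// adjustment is a hypothetical stand-in for the adjustment returned from
// CreateContainer(); the real type lives in the NRI api package.
type adjustment struct {
	CPUSetCPUs  string
	CPUQuota    *int64
	MemoryLimit *int64
	Env         []string // "KEY=value" entries, sorted by key
}

// buildAdjustment copies only the fields that were actually allocated, so that
// unset values do not override what the runtime already configured.
func buildAdjustment(alloc allocation) adjustment {
	adj := adjustment{CPUSetCPUs: alloc.CPUSet}
	if alloc.CFSQuota != 0 {
		q := alloc.CFSQuota
		adj.CPUQuota = &q
	}
	if alloc.MemoryLimit != 0 {
		m := alloc.MemoryLimit
		adj.MemoryLimit = &m
	}
	// Sort the keys so the injected environment is deterministic.
	keys := make([]string, 0, len(alloc.Env))
	for k := range alloc.Env {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		adj.Env = append(adj.Env, fmt.Sprintf("%s=%s", k, alloc.Env[k]))
	}
	return adj
}
```

Using pointers for the optional fields keeps "not allocated" distinguishable from an explicit zero, which is one way to avoid overwriting runtime defaults.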

**Step 3**, during `RemovePodSandbox()`, all resource allocations related to
the pod are released.

```go
func (m *ManagerImpl) RemovePodSandbox(pod *api.PodSandbox) error {
	// Release every allocation that was made for this pod.
	err := m.processDeletePod(pod.Uid)
	if err != nil {
		klog.Errorf("[ORM] RemovePodSandbox processDeletePod fail, pod: %s/%s/%s, err: %v",
			pod.Namespace, pod.Name, pod.Uid, err)
	}
	return err
}
```
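
The three steps can be pictured as bookkeeping keyed by pod UID: admit and reserve on sandbox start, look up on container creation, release on sandbox removal. The sketch below is a minimal in-memory model under that assumption; the `tracker` type, its methods, and the single "CPU count" resource are all hypothetical and much simpler than the real ORM state.

```go
package main

import (
	"errors"
	"sync"
)

var (
	errInsufficient = errors.New("insufficient CPUs")
	errNotAdmitted  = errors.New("pod was not admitted")
)

// tracker is a hypothetical stand-in for the ORM allocation state.
type tracker struct {
	mu        sync.Mutex
	freeCPUs  int
	allocated map[string]int // pod UID -> reserved CPUs
}

func newTracker(totalCPUs int) *tracker {
	return &tracker{freeCPUs: totalCPUs, allocated: make(map[string]int)}
}

// runPodSandbox models Step 1: admit the pod or reject it.
func (t *tracker) runPodSandbox(uid string, cpus int) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, ok := t.allocated[uid]; ok {
		return nil // already admitted; keep the call idempotent
	}
	if cpus > t.freeCPUs {
		return errInsufficient
	}
	t.freeCPUs -= cpus
	t.allocated[uid] = cpus
	return nil
}

// createContainer models Step 2: read back what was allocated in Step 1.
func (t *tracker) createContainer(uid string) (int, error) {
	t.mu.Lock()
	defer t.mu.Unlock()
	cpus, ok := t.allocated[uid]
	if !ok {
		return 0, errNotAdmitted
	}
	return cpus, nil
}

// removePodSandbox models Step 3: release everything held by the pod.
func (t *tracker) removePodSandbox(uid string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if cpus, ok := t.allocated[uid]; ok {
		t.freeCPUs += cpus
		delete(t.allocated, uid)
	}
}
```

For example, with `newTracker(4)`, admitting a pod that needs 3 CPUs succeeds, a second pod that needs 2 is rejected, and it can be admitted once the first pod's sandbox has been removed.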

#### Modification

- In NRI Mode, once `Admit()` has completed resource allocation, `Allocate()`
  does not need to execute `syncContainer()`; it simply returns after the
  resources have been allocated.

```go
func (m *ManagerImpl) Allocate(pod *v1.Pod, container *v1.Container) error {
	// ....
	err := m.addContainer(pod, container)
	// return right after resource allocation when running in NRI mode
	if err != nil || m.mode == workModeNri {
		return err
	}
	err = m.syncContainer(pod, container)
	return err
}
```

- In NRI Mode, the executor in `syncContainer()` can be implemented through NRI's
  `updateContainer()`.

```go
if m.mode == workModeNri {
	m.updateContainerByNRI(pod, container)
} else {
	m.syncContainer(pod, &container)
}
```

- The `metaServer` becomes a member variable of the ORM `ManagerImpl` because it is
  used in both Bypass and NRI modes.
- In NRI mode, halt the MetaManager's Reconcile and use NRI to hook the pod/container events.
- In NRI mode, NRI performs the executor's role, so no Executor needs to be created.

#### Test Plan

We will test the enhancement of ORM by NRI in a real cluster by deploying resource
management plugins that simulate task invocations and configure QoS policies. The
tests cover the key points listed below:

- ORM registers with containerd as an NRI plugin and establishes a connection.
- ORM can set the correct `LinuxContainerResources` configuration, based on the
  allocation results, for containers through NRI.
- ORM can add environment variables to containers through NRI.
- `reconcileState()` of ORM updates the cgroup configs of containers with the
  latest resource allocation results.

## Production Readiness Review Questionnaire

### Feature Enablement and Rollback

#### How can this feature be enabled / disabled in a live cluster?

This feature is disabled by default and can be enabled through configuration.
If a failure is detected in the NRI runtime environment while NRI mode is enabled,
ORM falls back to Bypass Mode.
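
The fallback rule, combined with the containerd >= v1.7.0 requirement, could be expressed roughly as follows. This is a sketch under stated assumptions: `selectWorkMode` and `atLeast` are hypothetical helpers, the containerd version string and the "NRI enabled" flag are assumed to be obtained elsewhere, and only the major and minor version numbers are compared.

```go
package main

import (
	"strconv"
	"strings"
)

type workMode string

const (
	workModeNri    workMode = "nri"
	workModeBypass workMode = "bypass"
)

// atLeast reports whether version (for example "v1.7.2") is at least
// major.minor. Patch and pre-release parts are ignored in this sketch.
func atLeast(version string, major, minor int) bool {
	v := strings.TrimPrefix(strings.TrimSpace(version), "v")
	parts := strings.SplitN(v, ".", 3)
	if len(parts) < 2 {
		return false
	}
	maj, err := strconv.Atoi(parts[0])
	if err != nil {
		return false
	}
	mnr, err := strconv.Atoi(parts[1])
	if err != nil {
		return false
	}
	if maj != major {
		return maj > major
	}
	return mnr >= minor
}

// selectWorkMode returns NRI mode only when it is configured and the runtime
// supports it; in every other case it falls back to Bypass mode.
func selectWorkMode(configured workMode, containerdVersion string, nriEnabled bool) workMode {
	if configured != workModeNri {
		return workModeBypass // disabled by default
	}
	if !nriEnabled || !atLeast(containerdVersion, 1, 7) {
		return workModeBypass // NRI unavailable: fall back
	}
	return workModeNri
}
```

Treating an unparsable version as unsupported keeps the decision conservative: when in doubt, the sketch chooses Bypass mode.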

### Troubleshooting

#### How does this feature react if the NRI not supported?

ORM falls back to Bypass Mode.

#### How to handle resource allocation failures?

If `Admit()` fails, the pod enters a retry loop.

#### What happens if the NRI stub times out or if the socket connection fails?

Currently, if the NRI plugin times out, containerd stops invoking the plugin.
To address this, the following strategy needs to be adopted.

On timeout, invoke `stub.Restart` in `OnClose()` to re-create the connection to containerd.

In addition, run `Admit()` with a context that carries a configured timeout; if it times out, try the creation again.
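
A bounded `Admit()` call with retry on timeout might look like the sketch below. The `admitFunc` signature, the `admitWithRetry` helper, and the attempt limit are assumptions made for illustration; the proposal does not specify how many retries are made or how the timeout is configured.

```go
package main

import (
	"context"
	"errors"
	"time"
)

// admitFunc is a hypothetical signature for an admit call that honours ctx.
type admitFunc func(ctx context.Context) error

// admitWithRetry runs admit with a per-attempt timeout and retries only when
// the attempt timed out. Other errors are returned immediately.
func admitWithRetry(parent context.Context, admit admitFunc, timeout time.Duration, attempts int) error {
	if attempts < 1 {
		attempts = 1
	}
	var lastErr error
	for i := 0; i < attempts; i++ {
		ctx, cancel := context.WithTimeout(parent, timeout)
		lastErr = admit(ctx)
		cancel()
		if lastErr == nil {
			return nil
		}
		// Stop on non-timeout errors, or when the parent itself is done.
		if !errors.Is(lastErr, context.DeadlineExceeded) || parent.Err() != nil {
			return lastErr
		}
	}
	return lastErr
}

// exampleAdmit exercises admitWithRetry with an admit call that blocks until
// its deadline on the first two attempts and succeeds on the third.
func exampleAdmit() (calls int, err error) {
	slowTwice := func(ctx context.Context) error {
		calls++
		if calls <= 2 {
			<-ctx.Done()
			return ctx.Err()
		}
		return nil
	}
	err = admitWithRetry(context.Background(), slowTwice, 10*time.Millisecond, 5)
	return calls, err
}
```

Capping the number of attempts keeps a persistently slow plugin from blocking pod creation indefinitely; the caller decides what to do once the last attempt has failed.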

## Appendix

NRI: [https://github.com/containerd/nri](https://github.com/containerd/nri)

ORM PR: [#406](https://github.com/kubewharf/katalyst-core/pull/406) [#430](https://github.com/kubewharf/katalyst-core/issues/430)

## Implementation History
- [x] 01/16/2024 Proposed idea in community meeting
- [x] 03/12/2024 Compile a document following the proposal template
- [x] 03/19/2024 Present proposal at a community meeting
- [x] 04/20/2024 Complete the basic functionalities of NRI as covered in the detailed
  design
- [ ] 05/10/2024 Commence the first round of testing
- [ ] 05/20/2024 Open proposal PR for code