This repository was archived by the owner on Jan 29, 2025. It is now read-only.

Commit 5a4774a

killianmuldoon authored and togashidm committed
Restructure repo to Platform Aware Scheduling
This change moves all files and docs associated with Telemetry Aware Scheduling into a new Telemetry Aware Scheduling directory. It adds a new top-level directory to the repo containing the components that can be shared across Kubernetes scheduling extenders. This change was made to allow clean reuse of the scheduler extender code, and to allow additional implementations of Kubernetes scheduling extenders to be added to this repo in the future.
1 parent 4afaa95 commit 5a4774a

File tree

102 files changed (+599, -314 lines)


README.md

Lines changed: 15 additions & 189 deletions
`````diff
@@ -1,52 +1,24 @@
-# Telemetry Aware Scheduling
-Telemetry Aware Scheduling (TAS) makes telemetry data available to scheduling and descheduling decisions in Kubernetes. Through a user defined policy, TAS enables rule based decisions on pod placement powered by up to date platform metrics. Policies can be applied on a workload by workload basis - allowing the right indicators to be used to place the right pod.
+# Platform Aware Scheduling
+Platform Aware Scheduling (PAS) contains a group of related projects designed to expose platform specific attributes to the Kubernetes scheduler using a modular policy driven approach. The project contains a core library and information for building custom scheduler extensions as well as specific implementations that can be used in a working cluster or leveraged as a reference for creating new Kubernetes scheduler extensions.
 
-For example - a pod that requires certain cache characteristics can be schedule on output from Intel® RDT metrics. Likewise a combination of RDT, RAS and other platform metrics can be used to provide a signal for the overall health of a node and be used to proactively ensure workload resiliency.
+Telemetry Aware Scheduling is the initial reference implementation of Platform Aware Scheduling. It can expose any platform-level metric to the Kubernetes Scheduler for policy driven filtering and prioritization of workloads. You can read more about TAS [here](/telemetry-aware-scheduling).
 
 
-**This software is a pre-production alpha version and should not be deployed to production servers.**
+* [Kubernetes Scheduler Extenders](#kubernetes-scheduler-extenders)
+* [Extenders](#plugins)
+* [Telemetry Aware Scheduling](/telemetry-aware-scheduling)
+* [Communication and contribution](#communication-and-contribution)
 
+## Kubernetes Scheduler Extenders
 
-## Introduction
+Platform Aware Scheduling leverages the power of Kubernetes Scheduling Extenders. These extenders allow the core Kubernetes scheduler to make HTTP calls to an external service which can then modify scheduling decisions. This can be used to provide workload specific scheduling direction based on attributes not normally exposed to the Kubernetes scheduler.
 
-Telemetry Aware Scheduler Extender is contacted by the generic Kubernetes Scheduler every time it needs to make a scheduling decision.
-The extender checks if there is a telemetry policy associated with the workload.
-If so, it inspects the strategies associated with the policy and returns opinions on pod placement to the generic scheduler.
-The scheduler extender has two strategies it acts on - scheduleonmetric and dontschedule.
-This is implemented and configured as a [Kubernetes Scheduler Extender.](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#cluster-level-extended-resources)
+The [extender](/extender) package at the top-level of this repo can be used to quickly create a working scheduler extender.
 
-The Scheduler consumes TAS Policies - a Custom Resource. The extender parses this policy for deschedule, scheduleonmetric and dontschedule strategies and places them in a cache to make them locally available to all TAS components.
-It consumes new Telemetry Policies as they are created, removes them when deleted, and updates them as they are changed.
-The extender also monitors the current state of policies to see if they are violated. For example if it notes that a deschedule policy is violated it labels the node as a violator allowing pods relating to that policy to be descheduled.
+### Enabling a scheduler extender
 
-## Usage
-A worked example for TAS is available [here](docs/health-metric-example.md)
-### Strategies
-There are three strategies that TAS acts on.
-
-**1 scheduleonmetric** has only one rule. It is consumed by the Telemetry Aware Scheduling Extender and prioritizes nodes based on a comparator and an up to date metric value.
-- example: **scheduleonmetric** when **cache_hit_ratio** is **GreaterThan**
-
-**2 dontschedule** strategy has multiple rules, each with a metric name and operator and a target. A pod with this policy will never be scheduled on a node breaking any one of these rules.
-- example: **dontschedule** if **gpu_usage** is **GreaterThan 10**
-
-**3 deschedule** is consumed by the extender. If a pod with this policy is running on a node that violates it can be descheduled with the kubernetes descheduler.
-- example: **deschedule** if **network_bandwidth_percent_free** is **LessThan 10**
+Scheduler extenders are enabled by providing a scheduling policy to the default Kubernetes scheduler. An example policy looks like:
 
-The policy definition section below describes how to actually create these strategies in a kubernetes cluster.
-
-### Quick set up
-The deploy folder has all of the yaml files necessary to get Telemetry Aware Scheduling running in a Kubernetes cluster. Some additional steps are required to configure the generic scheduler and metrics endpoints.
-
-#### Custom Metrics Pipeline
-TAS relies on metrics from the custom metrics pipeline. A guide on setting up the custom metrics pipeline to have it operate with TAS is [here.](./docs/custom-metrics.md)
-If this pipeline isn't set up, and node level metrics aren't exposed through it, TAS will have no metrics on which to make decisions.
-
-#### Extender configuration
-Note: a shell script that shows these steps can be found [here](deploy/extender-configuration). This script should be seen as a guide only, and will not work on most Kubernetes installations.
-
-The extender configuration files can be found under deploy/extender-configuration.
-TAS Scheduler Extender needs to be registered with the Kubernetes Scheduler. In order to do this a configmap should be created like the below:
 ````
 apiVersion: v1
 kind: ConfigMap
@@ -83,158 +55,12 @@ data:
 }
 
 ````
-This file can be found [in the deploy folder](./deploy/extender-configuration/scheduler-extender-configmap.yaml). This configmap can be created with ``kubectl apply -f ./deploy/scheduler-extender-configmap.yaml``
-The scheduler requires flags passed to it in order to know the location of this config map. The flags are:
-````
-- --policy-configmap=scheduler-extender-policy
-- --policy-configmap-namespace=kube-system
-````
-
-If scheduler is running as a service these can be added as flags to the binary. If scheduler is running as a container - as in kubeadm - these args can be passed in the deployment file.
-Note: For Kubeadm set ups some additional steps may be needed.
-1) Add the ability to get configmaps to the kubeadm scheduler config map. (A cluster role binding for this is at deploy/extender-configuration/configmap-getter.yaml)
-2) Add the ``dnsPolicy: ClusterFirstWithHostNet`` in order to access the scheduler extender by service name.
-
-After these steps the scheduler extender should be registered with the Kubernetes Scheduler.
-
-#### Deploy TAS
-Telemetry Aware Scheduling uses go modules. It requires Go 1.13+ with modules enabled in order to build. TAS has been tested with Kubernetes 1.14+. TAS was tested on Intel® Server Board S2600WF-Based Systems (Wolf Pass).
-A yaml file for TAS is contained in the deploy folder along with its service and RBAC roles and permissions.
-
-**Note:** If run without the unsafe flag ([described in the table below](#tas-scheduler-extender)) a secret called extender-secret will need to be created with the cert and key for the TLS endpoint.
-TAS will not deploy if there is no secret available with the given deployment file.
-
-A secret can be created with:
-
-``
-kubectl create secret tls extender-secret --cert /etc/kubernetes/<PATH_TO_CERT> --key /etc/kubernetes/<PATH_TO_KEY>
-``
-In order to build and deploy run:
-
-``make build && make image && kubectl apply -f deploy/``
-
-After this is run TAS should be operable in the cluster and should be visible after running ``kubectl get pods``
-
-#### Descheduling workloads
-Where there is a descheduling strategy in a policy, TAS will label nodes as violators if they break any of the associated rules. In order to deschedule these workloads the [Kubernetes Descheduler](https://github.com/kubernetes-sigs/descheduler) should be used.
-The strategy file for Descheduler should be:
-````
-apiVersion: "descheduler/v1alpha1"
-kind: "DeschedulerPolicy"
-strategies:
-  "RemovePodsViolatingNodeAffinity":
-    enabled: true
-    params:
-      nodeAffinityType:
-      - "requiredDuringSchedulingIgnoredDuringExecution"
-````
-This file is available [here](deploy/health-metric-demo/descheduler-policy.yaml)
-
-### Policy definition
-A Telemetry Policy can be created in Kubernetes using ``kubectl apply -f`` on a valid policy file.
-The structure of a policy file is :
-
-````
-apiVersion: telemetry.intel.com/v1alpha1
-kind: TASPolicy
-metadata:
-  name: scheduling-policy
-  namespace: default
-spec:
-  strategies:
-    deschedule:
-      rules:
-      - metricname: node_metric
-        operator: Equals
-        target: -1
-    dontschedule:
-      rules:
-      - metricname: node_metric
-        operator: LessThan
-        target: 10
-    scheduleonmetric:
-      rules:
-      - metricname: node_metric
-        operator: GreaterThan
-````
-There are three strategy types in a policy file and rules associated with each.
-- **scheduleonmetric** has only one rule. It is consumed by the Telemetry Aware Scheduling Extender and prioritizes nodes based on the rule.
-- **dontschedule** strategy has multiple rules, each with a metric name and operator and a target. A pod with this policy will never be scheduled on a node breaking any one of these rules.
-- **deschedule** is consumed by the extender. If a pod with this policy is running on a node that violates that pod can be descheduled with the kubernetes descheduler.
-
-dontschedule and deschedule - which incorporate multiple rules - function with an OR operator. That is if any single rule is broken the strategy is considered violated.
-Telemetry policies are namespaced, meaning that under normal circumstances a workload can only be associated with a pod in the same namespaces.
-
-### Configuration flags
-The below flags can be passed to the binary at run time.
-
-#### TAS Scheduler Extender
-name |type | description| usage | default|
------|------|-----|-------|-----|
-|kubeConfig| string |location of kubernetes configuration file | -kubeConfig /root/filename|~/.kube/config
-|syncPeriod|duration string| interval between refresh of telemetry data|-syncPeriod 1m| 1s
-|cachePort | string | port number at which the cache server will listen for requests | --cachePort 9999 | 8111
-|syncPeriod|duration string| interval between refresh of telemetry data|-syncPeriod 1m| 1s
-|port| int | port number on which the scheduler extender will listen| -port 32000 | 9001
-|cert| string | location of the cert file for the TLS endpoint | --cert=/root/cert.txt| /etc/kubernetes/pki/ca.crt
-|key| string | location of the key file for the TLS endpoint| --key=/root/key.txt | /etc/kubernetes/pki/ca.key
-|cacert| string | location of the ca certificate for the TLS endpoint| --key=/root/cacert.txt | /etc/kubernetes/pki/ca.crt
-|unsafe| bool | whether or not to listen on a TLS endpoint with the scheduler extender | --unsafe=true| false
-
-## Linking a workload to a policy
-Pods can be linked with policies by adding a label of the form ``telemetry-policy=<POLICY-NAME>``
-This also needs to be done inside higher level workload types i.e. deployments.
-
-For example, in a deployment file:
-```
-apiVersion: apps/v1
-kind: Deployment
-metadata:
-  name: demo-app
-  labels:
-    app: demo
-spec:
-  replicas: 1
-  selector:
-    matchLabels:
-      app: demo
-  template:
-    metadata:
-      labels:
-        app: demo
-        telemetry-policy: scheduling-policy
-    spec:
-      containers:
-      - name: nginx
-        image: nginx:latest
-        imagePullPolicy: IfNotPresent
-        resources:
-          limits:
-            telemetry/scheduling: 1
-      affinity:
-        nodeAffinity:
-          requiredDuringSchedulingIgnoredDuringExecution:
-            nodeSelectorTerms:
-            - matchExpressions:
-              - key: scheduling-policy
-                operator: NotIn
-                values:
-                - violating
-```
 
-Here the policy scheduling-policy will apply to all pods created by this deployment.
-There are three changes to the demo policy here:
-- A label ``telemetry-policy=<POLICYNAME>`` under the pod template which is used by the scheduler to identify the policy.
-- A resources/limits entry requesting the resource telemetry/scheduling. This is used to restrict the use of TAS to only selected pods. If this is not in a pod spec the pod will not be scheduled by TAS.
-- Affinity rules which add a requiredDuringSchedulingIgnoredDuringExecution affinity to nodes which are labelled ``<POLICYNAME>=violating`` This is used by the descheduler to identify pods on nodes which break their TAS telemetry policies.
+There are a number of options available to us under the "extenders" configuration object. Some of these fields - such as urlPrefix, filterVerb and prioritizeVerb - are necessary to point the Kubernetes scheduler to our scheduling service, while other sections deal with the TLS configuration for mutual TLS. The remaining fields tune the behavior of the scheduler: managedResource is used to specify which pods should be scheduled using this service, in this case pods which request the dummy resource telemetry/scheduling, ignorable tells the scheduler what to do if it can't reach our extender, and weight sets the relative influence our extender has on prioritization decisions.
 
-### Security
-TAS Scheduler Extender is set up to use in-Cluster config in order to access the Kubernetes API Server. When deployed inside the cluster this along with RBAC controls configured in the installation guide, will give it access to the required resources.
-If outside the cluster TAS will try to use a kubernetes config file in order to get permission to get resources from the API server. This can be passed with the --kubeconfig flag to the binary.
+With a policy like the above as part of the Kubernetes scheduler configuration the identified webhook becomes part of the scheduling process.
 
-When TAS Scheduler Extender contacts api server an identical flag --kubeConfig can be passed if it's operating outside the cluster.
-Additionally TAS Scheduler Extender listens on a TLS endpoint which requires a cert and a key to be supplied.
-These are passed to the executable using command line flags. In the provided deployment these certs are added in a Kubernetes secret which is mounted in the pod and passed as flags to the executable from there.
+To read more about scheduler extenders see the [official docs](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduling/scheduler_extender.md).
 
 ## Communication and contribution
 
`````

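The new README's claim that the extender package "can be used to quickly create a working scheduler extender" maps onto the renamed files shown below. As a rough sketch only - the import path and the exampleScheduler type are hypothetical, while Server, Scheduler and StartServer come from extender/types.go and extender/scheduler.go in this commit - wiring up a bare extender might look like this:

```go
package main

import (
	"net/http"

	// Import path is assumed for illustration; the commit only shows the
	// package moving to the top-level extender directory.
	"github.com/intel/platform-aware-scheduling/extender"
)

// exampleScheduler is a stand-in implementation of the extender.Scheduler
// interface (the Prioritize and Filter HTTP handlers from extender/types.go).
type exampleScheduler struct{}

// Prioritize would normally score candidate nodes; this stub just returns 200.
func (exampleScheduler) Prioritize(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
}

// Filter would normally remove unsuitable nodes; this stub just returns 200.
func (exampleScheduler) Filter(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
}

func main() {
	// Server embeds a Scheduler implementation (see extender/types.go).
	server := extender.Server{Scheduler: exampleScheduler{}}

	// StartServer registers /extender/filter and /extender/prioritize and
	// listens on the given port; the final argument selects the unsafe
	// (plain HTTP) mode instead of TLS (see extender/scheduler.go).
	server.StartServer("9001", "", "", "", true)
}
```

With unsafe set to false and real cert, key and CA paths supplied, the same call serves the TLS endpoint instead.
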
pkg/scheduler/scheduler.go renamed to extender/scheduler.go

Lines changed: 5 additions & 5 deletions
```diff
@@ -1,5 +1,5 @@
-//Package scheduler extender logic contains code to respond call from the http endpoint.
-package scheduler
+//Package extender extender logic contains code to respond call from the http endpoint.
+package extender
 
 import (
 	"crypto/tls"
@@ -90,13 +90,13 @@ func checkSymLinks(filename string) error {
 	return nil
 }
 
-// StartServer starts the HTTP server needed for scheduler.
+// StartServer starts the HTTP server needed for extender.
 // It registers the handlers and checks for existing telemetry policies.
 func (m Server) StartServer(port string, certFile string, keyFile string, caFile string, unsafe bool) {
 	mx := http.NewServeMux()
 	mx.HandleFunc("/", handlerWithMiddleware(errorHandler))
-	mx.HandleFunc("/scheduler/prioritize", handlerWithMiddleware(m.Prioritize))
-	mx.HandleFunc("/scheduler/filter", handlerWithMiddleware(m.Filter))
+	mx.HandleFunc("/extender/prioritize", handlerWithMiddleware(m.Prioritize))
+	mx.HandleFunc("/extender/filter", handlerWithMiddleware(m.Filter))
 	var err error
 	if unsafe {
 		log.Printf("Extender Listening on HTTP %v", port)
```

pkg/scheduler/types.go renamed to extender/types.go

Lines changed: 10 additions & 10 deletions
```diff
@@ -1,23 +1,23 @@
-package scheduler
+package extender
 
 import (
 	"net/http"
 
 	v1 "k8s.io/api/core/v1"
 )
 
-//ExtenderScheduler has the capabilities needed to prioritize and filter nodes based on http requests.
-type ExtenderScheduler interface {
+//Scheduler has the capabilities needed to prioritize and filter nodes based on http requests.
+type Scheduler interface {
 	Prioritize(w http.ResponseWriter, r *http.Request)
 	Filter(w http.ResponseWriter, r *http.Request)
 }
 
-//Server type wraps the implementation of the scheduler.
+//Server type wraps the implementation of the extender.
 type Server struct {
-	ExtenderScheduler
+	Scheduler
 }
 
-//TODO: These types are in the k8s.io/kubernetes/scheduler/api package
+//TODO: These types are in the k8s.io/kubernetes/extender/api package
 // Some import issue is making them tough to access, so they are reimplemented here pending a solution.
 
 // HostPriority represents the priority of scheduling to a particular host, higher priority is better.
@@ -34,9 +34,9 @@ type HostPriorityList []HostPriority
 // FailedNodesMap is needed by HTTP server response.
 type FailedNodesMap map[string]string
 
-// ExtenderArgs represents the arguments needed by the extender to Filter/Prioritize
+// Args represents the arguments needed by the extender to Filter/Prioritize
 // nodes for a pod.
-type ExtenderArgs struct {
+type Args struct {
 	// Pod being scheduled
 	Pod v1.Pod
 	// List of candidate nodes where the pod can be scheduled; to be populated
@@ -47,8 +47,8 @@ type ExtenderArgs struct {
 	NodeNames *[]string
 }
 
-// ExtenderFilterResult stores the result from extender to be sent as response.
-type ExtenderFilterResult struct {
+// FilterResult stores the result from extender to be sent as response.
+type FilterResult struct {
 	// Filtered set of nodes where the pod can be scheduled; to be populated
 	// only if ExtenderConfig.NodeCacheCapable == false
 	Nodes *v1.NodeList
```

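The renamed Args and FilterResult types above are the request and response payloads for the /extender/filter endpoint registered in extender/scheduler.go. A minimal pass-through Filter handler - illustrative only, not the TAS implementation, and assuming Args carries a Nodes *v1.NodeList field mirroring the upstream ExtenderArgs type these structs reimplement - could look like:

```go
package extender

import (
	"encoding/json"
	"net/http"
)

// exampleFilter decodes the scheduler's Args payload, keeps every candidate
// node, and writes a FilterResult back. A real extender would drop nodes that
// violate its policy and report them, e.g. via a FailedNodesMap.
func exampleFilter(w http.ResponseWriter, r *http.Request) {
	var args Args
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, "unable to decode extender args", http.StatusBadRequest)
		return
	}

	// Pass all candidate nodes through unchanged. args.Nodes is assumed here;
	// only NodeNames (in Args) and Nodes (in FilterResult) are visible in this diff.
	result := FilterResult{Nodes: args.Nodes}

	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(&result); err != nil {
		http.Error(w, "unable to encode filter result", http.StatusInternalServerError)
	}
}
```
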
go.mod

Lines changed: 0 additions & 1 deletion
```diff
@@ -7,5 +7,4 @@ require (
 	k8s.io/apimachinery v0.20.2
 	k8s.io/client-go v0.20.2
 	k8s.io/metrics v0.20.2
-
 )
```

0 commit comments
