// Module included in the following assemblies:
//
// * nodes/nodes-scheduler-default.adoc

[id='nodes-scheduler-default-about_{context}']
= Understanding default scheduling in {product-title}

The existing generic scheduler is the default platform-provided scheduler
_engine_ that selects a node to host the pod in a three-step operation:

Filters the Nodes::
The available nodes are filtered based on the constraints or requirements
specified. This is done by running each node through the list of filter
functions called _predicates_.

Prioritizes the Filtered List of Nodes::
Each filtered node is passed through a series of _priority_ functions that
assign it a score between 0 and 10, with 0 indicating a bad fit and 10
indicating a good fit to host the pod. The scheduler configuration can also take
in a simple _weight_ (positive numeric value) for each priority function. The
node score provided by each priority function is multiplied by the weight
(the default weight for most priorities is 1), and the weighted scores from all
priority functions are added together to produce a total score for each node.
Administrators can use this weight attribute to give higher importance to some
priorities. A worked example follows this list.

Selects the Best Fit Node::
The nodes are sorted based on their scores and the node with the highest score
is selected to host the pod. If multiple nodes have the same high score, then
one of them is selected at random.
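
For example, assume a filtered node receives a score of 8 from
`LeastRequestedPriority` (weight 1), 5 from `BalancedResourceAllocation`
(weight 1), and 6 from a `Zone` priority configured with weight 2. The node and
its scores are hypothetical, but the combined score would be
(8 * 1) + (5 * 1) + (6 * 2) = 25, and the scheduler compares this total against
the totals of the other filtered nodes.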

[[nodes-scheduler-default-about-understanding]]
== Understanding Scheduler Policy

The selection of predicates and priorities defines the policy for the scheduler.

The scheduler configuration file is a JSON file that specifies the predicates and
priorities the scheduler considers.

In the absence of the scheduler policy file, the default configuration file,
*_/etc/origin/master/scheduler.json_*, is applied.

// we are working on how to configure this in 4.0 right now in https://github.com/openshift/api/pull/181

[IMPORTANT]
====
The predicates and priorities defined in
the scheduler configuration file completely override the default scheduler
policy. If any of the default predicates and priorities are required,
you must explicitly specify the functions in the policy configuration.
====

.Default scheduler configuration file
[source,json]
----
{
    "apiVersion": "v1",
    "kind": "Policy",
    "predicates": [
        {
            "name": "NoVolumeZoneConflict"
        },
        {
            "name": "MaxEBSVolumeCount"
        },
        {
            "name": "MaxGCEPDVolumeCount"
        },
        {
            "name": "MaxAzureDiskVolumeCount"
        },
        {
            "name": "MatchInterPodAffinity"
        },
        {
            "name": "NoDiskConflict"
        },
        {
            "name": "GeneralPredicates"
        },
        {
            "name": "PodToleratesNodeTaints"
        },
        {
            "name": "CheckNodeMemoryPressure"
        },
        {
            "name": "CheckNodeDiskPressure"
        },
        {
            "argument": {
                "serviceAffinity": {
                    "labels": [
                        "region"
                    ]
                }
            },
            "name": "Region"
        }
    ],
    "priorities": [
        {
            "name": "SelectorSpreadPriority",
            "weight": 1
        },
        {
            "name": "InterPodAffinityPriority",
            "weight": 1
        },
        {
            "name": "LeastRequestedPriority",
            "weight": 1
        },
        {
            "name": "BalancedResourceAllocation",
            "weight": 1
        },
        {
            "name": "NodePreferAvoidPodsPriority",
            "weight": 10000
        },
        {
            "name": "NodeAffinityPriority",
            "weight": 1
        },
        {
            "name": "TaintTolerationPriority",
            "weight": 1
        },
        {
            "argument": {
                "serviceAntiAffinity": {
                    "label": "zone"
                }
            },
            "name": "Zone",
            "weight": 2
        }
    ]
}
----
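
Because a custom policy completely overrides these defaults, any default
predicate or priority you still want must be listed explicitly. As an
illustrative sketch only, not a recommended configuration, a policy file that
keeps just two of the default predicates and one of the default priorities
would look like this:

[source,json]
----
{
    "apiVersion": "v1",
    "kind": "Policy",
    "predicates": [
        {
            "name": "GeneralPredicates"
        },
        {
            "name": "PodToleratesNodeTaints"
        }
    ],
    "priorities": [
        {
            "name": "LeastRequestedPriority",
            "weight": 1
        }
    ]
}
----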

[[nodes-scheduler-default-about-use-cases]]
== Scheduler Use Cases

One of the important use cases for scheduling within {product-title} is to
support flexible affinity and anti-affinity policies.
ifdef::openshift-enterprise,openshift-origin[]

[[infrastructure-topological-levels]]
=== Infrastructure Topological Levels

Administrators can define multiple topological levels for their infrastructure
(nodes) by specifying labels on nodes. For example: `region=r1`, `zone=z1`, `rack=s1`.

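As an illustration only, a node labeled this way might look like the following.
The node name is hypothetical, and only the relevant fields are shown:

[source,json]
----
{
    "apiVersion": "v1",
    "kind": "Node",
    "metadata": {
        "name": "node1.example.com",
        "labels": {
            "region": "r1",
            "zone": "z1",
            "rack": "s1"
        }
    }
}
----
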
These label names have no particular meaning and
administrators are free to name their infrastructure levels anything, such as
city/building/room. Also, administrators can define any number of levels
for their infrastructure topology; three levels are usually adequate
(for example, `regions` -> `zones` -> `racks`). Administrators can specify affinity
and anti-affinity rules at each of these levels in any combination.
endif::openshift-enterprise,openshift-origin[]

[[affinity]]
=== Affinity

Administrators can configure the scheduler to specify affinity at
any topological level, or even at multiple levels. Affinity at a particular
level indicates that all pods that belong to the same service are scheduled
onto nodes that belong to the same level. This handles any latency requirements
of applications by allowing administrators to ensure that peer pods do not end
up being too geographically separated. If no node is available within the same
affinity group to host the pod, then the pod is not scheduled.

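For example, in the policy format shown earlier, affinity at the `zone` level
could be expressed as a `serviceAffinity` predicate, following the same pattern
as the `Region` predicate in the default configuration file. The predicate name
`ZoneAffinity` here is arbitrary:

[source,json]
----
{
    "argument": {
        "serviceAffinity": {
            "labels": [
                "zone"
            ]
        }
    },
    "name": "ZoneAffinity"
}
----

With such a predicate in place, pods that belong to the same service are placed
only onto nodes that share the same `zone` label value.
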
If you need greater control over where the pods are scheduled, see Using Node Affinity and
Using Pod Affinity and Anti-affinity.

These advanced scheduling features allow administrators
to specify which node a pod can be scheduled on and to force or reject scheduling relative to other pods.

[[anti-affinity]]
=== Anti-Affinity

Administrators can configure the scheduler to specify
anti-affinity at any topological level, or even at multiple levels.
Anti-affinity (or 'spread') at a particular level indicates that all pods that
belong to the same service are spread across nodes that belong to that
level. This ensures that the application is well spread for high availability
purposes. The scheduler tries to balance the service pods across all
applicable nodes as evenly as possible.
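
For example, spread at the `rack` level could be expressed as a
`serviceAntiAffinity` priority, following the same pattern as the `Zone`
priority in the default configuration file. The priority name `Rack` and the
weight shown here are illustrative:

[source,json]
----
{
    "argument": {
        "serviceAntiAffinity": {
            "label": "rack"
        }
    },
    "name": "Rack",
    "weight": 1
}
----

With such a priority in place, the scheduler favors spreading the pods of a
service across nodes with different `rack` label values.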

If you need greater control over where the pods are scheduled, see Using Node Affinity and
Using Pod Affinity and Anti-affinity.

These advanced scheduling features allow administrators
to specify which node a pod can be scheduled on and to force or reject scheduling relative to other pods.