Commit 98075b8

Merge pull request #80197 from ochromy/SRVKS-1115
[SRVKS-1115] Performance and scalability of OpenShift Serverless Serving
2 parents f8ace5b + 436cc54 commit 98075b8

9 files changed: +326 additions, -3 deletions

_topic_maps/_topic_map.yml

Lines changed: 2 additions & 0 deletions

@@ -108,6 +108,8 @@ Topics:
   Topics:
   - Name: Creating OpenShift Serverless applications
     File: serverless-applications
+  - Name: Scalability and Performance
+    File: scalability-and-performance-serving
   - Name: Autoscaling
     Dir: autoscaling
     Topics:

install/preparing-serverless-install.adoc

Lines changed: 8 additions & 3 deletions

@@ -27,12 +27,17 @@ The set of supported features, configurations, and integrations for {ServerlessP
 [id="about-serverless-scalability-performance"]
 == Scalability and performance on {ocp-product-title}
 
-{ServerlessProductName} has been tested with a configuration of 3 main nodes and 3 worker nodes, each of which has 64 CPUs, 457 GB of memory, and 394 GB of storage.
+With a configuration of 3 main nodes and 3 worker nodes, each of which has 64 CPUs, 457 GB of memory, and 394 GB of storage, the following time values have been determined during testing for a simple Quarkus application:
 
-The maximum number of Knative services that can be created using this configuration is 3,000. This corresponds to the link:https://docs.openshift.com/container-platform/latest/scalability_and_performance/planning-your-environment-according-to-object-maximums.html#cluster-maximums-major-releases_object-limits[{ocp-product-title} Kubernetes services limit of 10,000], since 1 Knative service creates 3 Kubernetes services.
+* The average scale-from-zero response time was approximately 3.4 seconds.
+* The maximum response time was 8 seconds.
+* The 99.9th percentile of response times was 4.5 seconds.
 
-The average scale from zero response time was approximately 3.4 seconds, with a maximum response time of 8 seconds, and a 99.9th percentile of 4.5 seconds for a simple Quarkus application. These times might vary depending on the application and the runtime of the application.
+These times might vary depending on the application and its runtime.
 
+The maximum number of Knative services that can be created is 3,000. This corresponds to the link:https://docs.openshift.com/container-platform/latest/scalability_and_performance/planning-your-environment-according-to-object-maximums.html#cluster-maximums-major-releases_object-limits[{ocp-product-title} Kubernetes services limit of 10,000], since 1 Knative service creates 3 Kubernetes services.
+
+Learn more about scaling and performance of {ServerlessProductName} Serving in xref:../knative-serving/scalability-and-performance-serving.adoc#scalability-and-performance-serving[Scalability and performance of {ServerlessProductName} Serving].
 
 // OCP specific docs
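The scale-from-zero response time described above can be observed on your own cluster. As a rough sketch, assuming a Knative service named `showcase` that has already scaled to zero and is reachable at `https://showcase-default.apps.example.com` (both names are placeholders for your own service and route), time the first request:

[source,terminal]
----
$ time curl -s -o /dev/null https://showcase-default.apps.example.com
----

The elapsed time of this first request includes the scale-from-zero latency; subsequent requests served by an already-running pod should be noticeably faster.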

knative-serving/scalability-and-performance-serving.adoc

Lines changed: 33 additions & 0 deletions

:_mod-docs-content-type: ASSEMBLY
include::_attributes/common-attributes.adoc[]
[id="scalability-and-performance-serving"]
= Scalability and performance of {ServerlessProductName} Serving
:context: scalability-and-performance-serving

toc::[]

{ServerlessProductName} consists of several different components that have different resource requirements and scaling behaviors. These components are horizontally and vertically scalable, but their resource requirements and configuration highly depend on the actual use case.

Control-plane components:: These components are responsible for observing and reacting to custom resources and continuously reconfiguring the system, for example, the controller pods.

Data-plane components:: These components are directly involved in request and response handling, for example, the Knative Serving activator component.

The following metrics and findings were recorded using the following test setup:

* A cluster running {ocp-product-title} 4.13
* The cluster running 4 compute nodes in AWS with a machine type of m6.xlarge
* {ServerlessProductName} 1.30

include::modules/serverless-overhead-serving.adoc[leveloffset=+1]

include::modules/serverless-known-limitations-serving.adoc[leveloffset=+1]

include::modules/serverless-scaling-serving.adoc[leveloffset=+1]

include::modules/serverless-minimal-requirements-serving.adoc[leveloffset=+2]

include::modules/serverless-config-minimal-workloads-serving.adoc[leveloffset=+2]

include::modules/serverless-config-high-workloads-serving.adoc[leveloffset=+2]
modules/serverless-config-high-workloads-serving.adoc

Lines changed: 72 additions & 0 deletions

// Module included in the following assemblies:
//
// * /knative-serving/scalability-and-performance-serving.adoc

:_mod-docs-content-type: PROCEDURE
[id="serverless-config-high-workloads-serving_{context}"]
= Configuring Serving for high workloads

You can configure Knative Serving for high workloads by using the `KnativeServing` custom resource (CR).
The following findings are relevant to configuring Knative Serving for a high workload:

[NOTE]
====
These findings have been tested with requests with a payload size of 0-32 KB. The Knative Service backends used in those tests had a startup latency between 0 and 10 seconds and response times between 0 and 5 seconds.
====

* CPU usage of all data-plane components mostly increases in higher request and payload scenarios, so the CPU requests and limits have to be tested and potentially increased.
* The activator component might also need more memory when it has to buffer more or bigger request payloads, so the memory requests and limits might need to be increased as well.
* One activator pod can handle approximately 2500 requests per second before it starts to increase latency and, at some point, leads to errors.
* One `3scale-kourier-gateway` or `istio-ingressgateway` pod can also handle approximately 2500 requests per second before it starts to increase latency and, at some point, leads to errors.
* Each of the data-plane components consumes up to 1 vCPU for handling 2500 requests per second. Note that this highly depends on the payload size and the response times of the Knative Service backend.

[IMPORTANT]
====
Fast startup and fast response times of your Knative Service user workloads are critical for good performance of the overall system. The Knative Serving components buffer incoming requests when the Knative Service user backend is scaling up or when request concurrency has reached its capacity. If your Knative Service user workload introduces long startup or request latency, it either overloads the `activator` component (when the CPU and memory configuration is too low) or leads to errors for the calling clients.
====

.Procedure

* To fine-tune your installation, use the previous findings combined with your own test results to configure the `KnativeServing` custom resource:
+
.A high workload configuration in the KnativeServing CR
[source,yaml]
----
apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  high-availability:
    replicas: 2 <1>
  workloads:
    - name: component-name <2>
      replicas: 2 <3>
      resources:
        - container: container-name
          requests:
            cpu: <4>
            memory:
          limits:
            cpu:
            memory:
  podDisruptionBudgets: <5>
    - name: name-of-pod-disruption-budget
      minAvailable: 1
----
<1> Set this parameter to at least `2` to make sure you always have at least two instances of every component running. You can also use `workloads` to override the replicas for certain components.
<2> Use the `workloads` list to configure specific components. Use the `deployment` name of the component and set the `replicas` field.
<3> For the `activator`, `webhook`, and `3scale-kourier-gateway` components, which use horizontal pod autoscalers (HPAs), the `replicas` field sets the minimum number of replicas. The actual number of replicas depends on the CPU load and scaling done by the HPAs.
<4> Set the requested and limited CPU and memory according to at least the idle consumption while also taking the previous findings and your own test results into consideration.
<5> Adjust the `podDisruptionBudgets` to a value lower than `replicas` to avoid problems during node maintenance. The default `minAvailable` is set to `1`, so if you increase the required replicas, you must also increase `minAvailable`.

[IMPORTANT]
====
As each environment is highly specific, it is essential to test and find your own ideal configuration.
Use the monitoring and alerting functionality of {ocp-product-title} to continuously monitor your actual resource consumption and make adjustments if needed.

If you are using the {ServerlessProductName} and {SMProductShortName} integration, additional CPU processing is added by the `istio-proxy` sidecar containers.
For more information about this, see the {SMProductShortName} documentation.
====
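After applying a configuration like the one above, you can verify that the HPA-managed components picked up the new minimum replica counts. This is an illustrative check rather than part of the documented procedure; the `knative-serving-ingress` namespace applies to Kourier-based installations:

[source,terminal]
----
$ oc get hpa -n knative-serving
$ oc get hpa -n knative-serving-ingress
----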
modules/serverless-config-minimal-workloads-serving.adoc

Lines changed: 69 additions & 0 deletions

// Module included in the following assemblies:
//
// * /knative-serving/scalability-and-performance-serving.adoc

:_mod-docs-content-type: PROCEDURE
[id="serverless-config-minimal-workloads-serving_{context}"]
= Configuring Serving for minimal workloads

.Procedure

* You can configure Knative Serving for minimal workloads by using the `KnativeServing` custom resource (CR):
+
.A minimal workload configuration in the KnativeServing CR
[source,yaml]
----
apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  high-availability:
    replicas: 1 <1>
  workloads:
    - name: activator
      replicas: 2 <2>
      resources:
        - container: activator
          requests:
            cpu: 250m <3>
            memory: 60Mi <4>
          limits:
            cpu: 1000m
            memory: 600Mi
    - name: controller
      replicas: 1 <5>
      resources:
        - container: controller
          requests:
            cpu: 10m
            memory: 100Mi
          limits: <6>
            cpu: 200m
            memory: 300Mi
    - name: webhook
      replicas: 2
      resources:
        - container: webhook
          requests:
            cpu: 100m <7>
            memory: 60Mi
          limits:
            cpu: 200m
            memory: 200Mi
  podDisruptionBudgets: <8>
    - name: activator-pdb
      minAvailable: 1
    - name: webhook-pdb
      minAvailable: 1
----
<1> Setting this to `1` scales all system components to one replica.
<2> The activator should always be scaled to a minimum of `2` instances to avoid downtime.
<3> The activator CPU requests should not be set lower than `250m`, because a `HorizontalPodAutoscaler` uses this value as a reference to scale up and down.
<4> Adjust memory requests to the idle values from the previous table. Also adjust memory limits according to your expected load (this might need custom testing to find the best values).
<5> One webhook and one controller are sufficient for a minimal-workload scenario.
<6> These limits are sufficient for a minimal-workload scenario, but they also might need adjustments depending on your concrete workload.
<7> The webhook CPU requests should not be set lower than `100m`, because a `HorizontalPodAutoscaler` uses this value as a reference to scale up and down.
<8> Adjust the `podDisruptionBudgets` to a value lower than `replicas` to avoid problems during node maintenance.
modules/serverless-known-limitations-serving.adoc

Lines changed: 10 additions & 0 deletions

// Module included in the following assemblies:
//
// * /knative-serving/scalability-and-performance-serving.adoc

:_mod-docs-content-type: CONCEPT
[id="serverless-known-limitations-serving_{context}"]
= Known limitations of {ServerlessProductName} Serving

The maximum number of Knative Services that can be created is 3,000. This corresponds to the {ocp-product-title} Kubernetes services limit of 10,000, since 1 Knative Service creates 3 Kubernetes services.
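The 1:3 ratio can be observed on a running cluster. As an illustrative check, assuming a Knative service named `showcase` in the `default` namespace (both names are placeholders), list the Kubernetes services labeled with the owning Knative service:

[source,terminal]
----
$ oc get services -n default -l serving.knative.dev/service=showcase
----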
modules/serverless-minimal-requirements-serving.adoc

Lines changed: 69 additions & 0 deletions

// Module included in the following assemblies:
//
// * /knative-serving/scalability-and-performance-serving.adoc

:_mod-docs-content-type: CONCEPT
[id="serverless-minimal-requirements-serving_{context}"]
= Minimal requirements of {ServerlessProductName} Serving

While the default setup is suitable for medium-sized workloads, it might be over-sized for smaller setups or under-sized for high-workload scenarios.
To configure {ServerlessProductName} Serving for a minimal workload scenario, you need to know the idle consumption of the system components.

[id="serverless-minimal-requirements-serving-idle-consumption_{context}"]
== Idle consumption

The idle consumption is dependent on the number of Knative Services. The following memory usage has been measured for the components in the `knative-serving` and `knative-serving-ingress` {ocp-product-title} projects:

[cols=5*,options="header"]
|===
|Component
|0 Services
|100 Services
|500 Services
|1000 Services

|`activator`
|55Mi
|86Mi
|300Mi
|450Mi

|`autoscaler`
|52Mi
|102Mi
|225Mi
|350Mi

|`controller`
|100Mi
|135Mi
|310Mi
|500Mi

|`webhook`
|60Mi
|60Mi
|60Mi
|60Mi

|`3scale-kourier-gateway`
|20Mi
|60Mi
|190Mi
|330Mi

|`net-kourier-controller`
|90Mi
|170Mi
|340Mi
|430Mi

|===

[NOTE]
====
Either the `3scale-kourier-gateway` and `net-kourier-controller` components or the `istio-ingressgateway` and `net-istio-controller` components are installed.

The memory consumption of `net-istio` is based on the total number of pods within the mesh.
====
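To compare the table above with the idle consumption of your own installation, you can query current pod usage, assuming cluster metrics are available; the `knative-serving-ingress` project applies to Kourier-based installations:

[source,terminal]
----
$ oc adm top pods -n knative-serving
$ oc adm top pods -n knative-serving-ingress
----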
modules/serverless-overhead-serving.adoc

Lines changed: 19 additions & 0 deletions

// Module included in the following assemblies:
//
// * /knative-serving/scalability-and-performance-serving.adoc

:_mod-docs-content-type: CONCEPT
[id="serverless-overhead-serving_{context}"]
= Overhead of {ServerlessProductName} Serving

As components of {ServerlessProductName} Serving are part of the data plane, requests from clients are routed through:

* The ingress gateway (Kourier or {SMProductShortName})
* The activator component
* The queue-proxy sidecar container in each Knative Service

These components introduce an additional hop in networking and perform additional tasks, for example, adding observability and request queuing. The following are the measured latency overheads:

* Each additional network hop adds 0.5 ms to 1 ms of latency to a request. Depending on the current load of the Knative Service and whether the Knative Service was scaled to zero before the request, the activator component is not always a part of the data plane.
* Depending on the payload size, each of the components consumes up to 1 vCPU for handling 2500 requests per second.
modules/serverless-scaling-serving.adoc

Lines changed: 44 additions & 0 deletions

// Module included in the following assemblies:
//
// * /knative-serving/scalability-and-performance-serving.adoc

:_mod-docs-content-type: CONCEPT
[id="serverless-scaling-serving_{context}"]
= Scaling and performance of {ServerlessProductName} Serving

{ServerlessProductName} Serving has to be scaled and configured based on the following parameters:

* Number of Knative Services
* Number of Revisions
* Number of concurrent requests in the system
* Size of payloads of the requests
* The startup latency and response latency of the Knative Service added by the user's web application
* Number of changes of the KnativeService custom resource (CR) over time

[id="serverless-scaling-serving-defaults_{context}"]
== KnativeServing default configuration

By default, {ServerlessProductName} Serving is configured to run all components with high availability and medium-sized CPU and memory requests and limits. This means that the `high-availability.replicas` value in the `KnativeServing` CR is automatically set to `2` and all system components are scaled to two replicas. This configuration is suitable for medium workload scenarios and has been tested with:

* 170 Knative Services
* 1-2 Revisions per Knative Service
* 89 test scenarios mainly focused on testing the control plane
* 48 re-creating scenarios, where Knative Services are deleted and re-created
* 41 stable scenarios, in which requests are slowly but continuously sent to the system

During these test cases, the system components effectively consumed:

[cols=2*,options="header"]
|===
|Component
|Measured Resources

|Operator in project `openshift-serverless`
|1 GB of memory, 0.2 CPU cores

|Serving components in project `knative-serving`
|5 GB of memory, 2.5 CPU cores

|===
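The default configuration described above is equivalent to explicitly setting the following in the `KnativeServing` CR. This sketch only restates the default and does not need to be applied:

[source,yaml]
----
apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  high-availability:
    replicas: 2
----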
