
Optimize build cluster performance #3890

@howardjohn

Description


This issue aims to track all things related to optimizing our build cluster performance.

We have done a lot of work to reduce test flakes, but we still see them relatively often. In a large number of cases, these appear to occur when things that should always succeed fail for reasons outside of poorly written tests or buggy Istio code. For example, simple HTTP requests time out after many seconds.

We have had two similar issues in the past:

  • Prow cluster resource leak #1988 was caused by not properly cleaning up resources, leading to a ton of stale resources accumulating in the cluster over time. This was fixed by ensuring we clean up (through many different mechanisms).
  • Test stability regression istio#32985: jobs suddenly started hanging a lot, with echo taking over 60s in some cases. This was triggered by a node upgrade in GKE. We switched from Ubuntu to COS to mitigate it; the root cause is still unknown.

Current state:

  • Tests often fail for reasons that are likely explained by node performance (i.e., a trivial command is throttled heavily for N seconds, and the test is not robust against this). While we expect our tests to be robust against this to some degree, it appears N is sometimes extremely large. For example, we have a lot of tests that send 5 requests and expect all 5 to succeed, with many retries and a 30s timeout. These fail relatively often.
  • We have a metric that captures the time it takes to run echo. On a healthy machine, this should, of course, take near 0ms. We often see it spike, correlated with increased CPU usage.
    [Screenshots, 2022-02-28: echo runtime graphs]

The top graph shows echo latency grouped by node type; the bottom shows all nodes. You can see spikes up to 2.5s. Note: the node-type graph is likely misleading; we have a small fixed number of n2/t2d nodes but a large dynamic number of e2 nodes. This means there are more samples for e2 AND it has more cache misses.

Things to try:

  • Setting CPU limits: 9dadd37. No tangible improvement in any metric.
  • Guaranteed QoS test pods (a superset of CPU limits); see the pod spec sketch after this list.
  • kubelet static CPU policy (a superset of Guaranteed QoS); see the kubelet config sketch after this list.
  • Running other node types (n2, t2d). Currently trialing this. No conclusive data.
  • Using local SSDs. Currently we run 512/256 GB pd-ssd disks. There is evidence we are IO bound in some portion of tests - graphs show our bandwidth is often at the cap, and we see up to 8 MB/s of write throttling. However, there is no evidence that removing the bottleneck would change test results; most of our tests are not IO bound. kind etcd runs in tmpfs and should be unaffected. Local SSDs are actually cheaper and far faster, but they require n2 nodes.
  • Increasing CPU requests on some jobs. d28ae63 and 3a0765c put the most expensive ones at 15 CPUs, ensuring dedicated nodes (see the prow job sketch after this list). Since this change, unit test runtime has dropped substantially, but there is not yet strong evidence that it impacts other tests' flakiness.
  • Build once, test in many places. Currently we build all docker images N times, and some test binaries N times. This is fairly expensive even with a cache. It would be ideal to build once - possibly on some giant nodes - and then just run the tests locally. This is likely a massive effort.
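
A minimal sketch of what a Guaranteed QoS test pod would look like. Kubernetes only assigns the Guaranteed class when every container in the pod sets requests equal to limits for both CPU and memory; the names and values below are illustrative, not taken from our actual job configs:

```yaml
# Illustrative only: requests == limits on every container => Guaranteed QoS.
apiVersion: v1
kind: Pod
metadata:
  name: test-runner                    # hypothetical name
spec:
  containers:
  - name: test
    image: gcr.io/example/build-image  # placeholder image
    resources:
      requests:
        cpu: "8"
        memory: 16Gi
      limits:
        cpu: "8"                       # must exactly match the request
        memory: 16Gi                   # must exactly match the request
```

Note that a single container without matching limits (sidecars included) drops the pod back to Burstable.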
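
The static CPU manager policy builds on top of that: the kubelet pins Guaranteed pods with integer CPU requests to dedicated cores. A sketch of the kubelet configuration it would need (the reserved CPU value is an assumption, not our current setting; on GKE this would likely have to be applied through the node pool's node system config rather than by editing kubelet config on nodes directly):

```yaml
# Illustrative KubeletConfiguration fragment. Changing the CPU manager policy
# generally requires draining/recreating the node pool.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static      # default is "none"
reservedSystemCPUs: "0,1"     # example: keep two cores for system daemons
```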
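
For the CPU request bumps, the change is just the resources stanza on the relevant prow jobs. A simplified sketch (job name, image, and command are placeholders, not the actual entries from d28ae63 / 3a0765c):

```yaml
# Illustrative prow presubmit fragment: the spec is a normal PodSpec, so a
# 15-CPU request on a 16-vCPU node leaves no room for a second job, which
# effectively gives the test a dedicated node.
presubmits:
  istio/istio:
  - name: example_integ-tests          # hypothetical job name
    decorate: true
    spec:
      containers:
      - image: gcr.io/example/build-tools:latest   # placeholder
        command: ["make", "test"]                  # placeholder command
        resources:
          requests:
            cpu: "15"
            memory: 16Gi
```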
