
Optimize build cluster performance #3890

@howardjohn

Description


This issue aims to track all things related to optimizing our build cluster performance.

We have done a lot of work to reduce test flakes, but we still see them relatively often. In a large number of cases, these appear to occur when things that should always succeed fail for reasons outside of poorly written tests or buggy Istio code. For example, simple HTTP requests time out after many seconds.

We have had two similar issues in the past:

  • Prow cluster resource leak #1988 was caused by not properly cleaning up resources, leading to a ton of stale resources accumulating in the cluster over time. This was fixed by ensuring we clean up (through many different mechanisms).
  • Test stability regression istio#32985: jobs suddenly started hanging a lot, with echo taking over 60s in some cases. This was triggered by a node upgrade in GKE. We switched from Ubuntu to COS to mitigate it; the root cause is still unknown.

Current state:

  • Tests often fail for reasons that are likely explained by node performance (i.e., a trivial command is throttled heavily for N seconds, and the test is not robust against this). While we expect our tests to be robust against this to some degree, it appears N is sometimes extremely large. For example, we have a lot of tests that send 5 requests and expect all 5 to succeed, with many retries and a 30s timeout. These fail relatively often.
  • We have a metric that captures the time it takes to run echo. On a healthy machine, this should, of course, take near 0ms. We often see it spike, correlated with increased CPU usage.
    [Screenshots, 2022-02-28: echo runtime graphs]

The top graph shows echo latency grouped by node type; the bottom shows all nodes. You can see spikes up to 2.5s. Note: the node-type graph is likely misleading; we have a small fixed number of n2/t2d nodes but a large dynamic number of e2 nodes. This means there are more samples for e2 AND it has more cache misses.

Things to try:

  • Setting CPU limits: 9dadd37. No tangible improvement in any metric.
  • Guaranteed QoS test pods (a superset of CPU limits); see the pod spec sketch after this list.
  • kubelet static CPU policy (a superset of Guaranteed QoS); see the kubelet config sketch after this list.
  • Running other node types (n2, t2d). Currently trialing this. No conclusive data.
  • Using local SSDs. Currently we run 512/256 GB pd-ssd disks. There is evidence we are IO bound in some portion of tests - graphs show our bandwidth is often at the cap, and we see up to 8 MB/s of write throttling. However, there is no evidence that removing the bottleneck would change test results; most of our tests are not IO bound. kind etcd runs in tmpfs and should be unaffected. Local SSDs are actually cheaper and far faster, but they require n2 nodes.
  • Increasing CPU requests on some jobs. d28ae63 and 3a0765c put the most expensive ones at 15 CPUs, ensuring dedicated nodes (see the prow job sketch after this list). Since this change, unit test runtime has dropped substantially, but there is not yet strong evidence that it impacts other tests' flakiness.
  • Build once, test in many places. Currently we build all docker images N times, and some test binaries N times. This is fairly expensive even with a cache. It would be ideal to build once - possibly on some giant nodes - and then just run the tests locally. This is likely a massive effort.
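
A minimal sketch of what a Guaranteed QoS test pod would look like. Kubernetes only assigns the Guaranteed class when every container in the pod sets requests equal to limits for both CPU and memory; the names and values below are illustrative, not taken from our actual job configs:

```yaml
# Illustrative only: requests == limits on every container => Guaranteed QoS.
apiVersion: v1
kind: Pod
metadata:
  name: test-runner                    # hypothetical name
spec:
  containers:
  - name: test
    image: gcr.io/example/build-image  # placeholder image
    resources:
      requests:
        cpu: "8"
        memory: 16Gi
      limits:
        cpu: "8"                       # must exactly match the request
        memory: 16Gi                   # must exactly match the request
```

Note that a single container without matching limits (sidecars included) drops the pod back to Burstable.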
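
The static CPU manager policy builds on top of that: the kubelet pins Guaranteed pods with integer CPU requests to dedicated cores. A sketch of the kubelet configuration it would need (the reserved CPU value is an assumption, not our current setting; on GKE this would likely have to be applied through the node pool's node system config rather than by editing kubelet config on nodes directly):

```yaml
# Illustrative KubeletConfiguration fragment. Changing the CPU manager policy
# generally requires draining/recreating the node pool.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static      # default is "none"
reservedSystemCPUs: "0,1"     # example: keep two cores for system daemons
```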
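
For the CPU request bumps, the change is just the resources stanza on the relevant prow jobs. A simplified sketch (job name, image, and command are placeholders, not the actual entries from d28ae63 / 3a0765c):

```yaml
# Illustrative prow presubmit fragment: the spec is a normal PodSpec, so a
# 15-CPU request on a 16-vCPU node leaves no room for a second job, which
# effectively gives the test a dedicated node.
presubmits:
  istio/istio:
  - name: example_integ-tests          # hypothetical job name
    decorate: true
    spec:
      containers:
      - image: gcr.io/example/build-tools:latest   # placeholder
        command: ["make", "test"]                  # placeholder command
        resources:
          requests:
            cpu: "15"
            memory: 16Gi
```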
