DT-1342: Increase (revert back) TDR tools pool size from 1000 to 2000 #418
👍
👍 as long as Justin is ok with it
I think what you're seeing is an artifact of how developers typically work. Each usage spike correlates directly with tests being run. It's likely that many developers push changes before they sign off for the day, which causes the spike to extend past 5pm. When the tests run normally, they take about an hour. With retries they can take longer; the worst case is when the RBS pool is exhausted and the tests wait for resources to become ready within the 90-minute test timeout window. You can see the history of the tests here: https://github.com/DataBiosphere/jade-data-repo/actions/workflows/int-and-connected-test-run.yml And the nightly test configuration is
TL;DR: as far as I can tell, this is normal (expected) activity for this pool.
I audited the test runs. It appears that one test run uses roughly 10% of the pool, which is about 100 workspaces (I believe I've heard this number is actually about 150, but it is mitigated by workspace recovery during the test). This occurs over roughly 20 minutes, and Buffer takes about 1 hour to recover those 100 workspaces.
So roughly 7-10 test runs over 1-5 hours would bankrupt the pool, which is what I saw.
I expect that bumping this should allow for 14-20 test runs over 5 hours.
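For anyone checking the numbers, here is a rough sketch of the arithmetic behind those ranges. It assumes each run consumes between 100 and 150 workspaces and deliberately ignores Buffer refilling the pool during the window, so it is a pessimistic back-of-the-envelope estimate and not a model of how Buffer behaves. The function name `runs_to_exhaust` is made up for illustration.

```python
import math


def runs_to_exhaust(pool_size: int, workspaces_per_run: int) -> int:
    """Smallest number of test runs that drains the pool.

    Assumption: no workspaces are refilled while the runs happen,
    which overstates the pressure on the pool.
    """
    return math.ceil(pool_size / workspaces_per_run)


# Per-run usage figures taken from the comment above (roughly 100, possibly 150).
HIGH_USAGE = 150
LOW_USAGE = 100

for pool_size in (1000, 2000):
    fewest = runs_to_exhaust(pool_size, HIGH_USAGE)
    most = runs_to_exhaust(pool_size, LOW_USAGE)
    print(f"pool={pool_size}: exhausted after about {fewest}-{most} test runs")
```

With these inputs the estimate lands on the same 7-10 range for a pool of 1000 and 14-20 for a pool of 2000. Factoring in the roughly 100 workspaces per hour that Buffer recovers would raise both ranges somewhat for runs spread over several hours.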
Increase the datarepo tools pool size to prevent tests from blocking or failing unnecessarily.
This value was changed from 1500 to 2000: #293
And then was changed from 2000 to 1000: #374
The usage of this pool depends heavily on how much work is being done on TDR. If there are more than three developers creating one or two branches each, a pool size of 1000 can be exhausted due to concurrent test execution. In addition to more activity, recent changes to allow integration tests to run in parallel, and to avoid needing to "lock" an integration host to run tests on, may have put a greater burden on this resource.
Here's the pool usage graph of the last 7 days

The exhaustion spike occurred yesterday, and the capacity dipped down to 15% last week as well, so this usage doesn't seem atypical. Grafana's data only goes back 10 days, so there's no easy way to see long-term trends.