Bug Report
We noticed an issue where resource group (RG) goroutines don't get cleaned up properly when the traffic ramps down:
```
goroutine 5042535512 [select, 1540 minutes]:
github.com/tikv/pd/client/resource_group/controller.(*groupCostController).handleTokenBucketUpdateEvent(0xc023ad4000?, {0x679abb0, 0xc002803040}, {0xc1d66f62a0?, 0xc0e3235560?, 0x9662f80?})
	.../resource_group/controller/controller.go:720 +0x1ea
created by github.com/tikv/pd/client/resource_group/controller.(*ResourceGroupsController).Start.func1 in goroutine 1064
	.../resource_group/controller/controller.go:251 +0x906
```
We saw thousands of tikv:8252 errors during that window, and every time this happens the TiDB server's goroutine count goes up by roughly 100K, with all of those goroutines blocked at the code above. This leads to high CPU usage on the TiDB node.
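For context, the parked frames look consistent with a per-resource-group event-handling goroutine that blocks in a select and whose exit path is never taken once the group goes idle. The following is a minimal, hypothetical sketch of that pattern; the type and method names are borrowed from the trace for readability and are not the actual PD implementation:

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

// groupCostController is a stand-in for the per-group controller; only the
// field needed to illustrate the blocking select is modeled here.
type groupCostController struct {
	tokenBucketUpdateChan chan struct{}
}

// handleTokenBucketUpdateEvent mimics the parked frame from the stack trace:
// the goroutine waits in a select for token bucket updates. If the context is
// never cancelled when a group's traffic ramps down, the goroutine is never
// reclaimed, and one such goroutine accumulates per started handler.
func (gc *groupCostController) handleTokenBucketUpdateEvent(ctx context.Context) {
	for {
		select {
		case <-gc.tokenBucketUpdateChan:
			// process the token bucket update ...
		case <-ctx.Done():
			return // exit path that appears to be missing or never reached
		}
	}
}

func main() {
	// A background context that is never cancelled stands in for the missing
	// cleanup: every handler goroutine stays parked in its select forever.
	ctx := context.Background()
	for i := 0; i < 1000; i++ {
		gc := &groupCostController{tokenBucketUpdateChan: make(chan struct{})}
		go gc.handleTokenBucketUpdateEvent(ctx)
	}
	time.Sleep(100 * time.Millisecond)
	fmt.Println("goroutines:", runtime.NumGoroutine()) // stays above 1000
}
```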
What did you do?
- Create a resource group with a minimal quota (RU_PER_SEC=1)
- Send 10K QPS using this resource group (a rough reproduction sketch follows below)
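For reference, this is roughly how the load was driven. It is only a sketch under assumptions: a TiDB endpoint on the default SQL port 4000, the go-sql-driver/mysql driver, and a hypothetical resource group name rg_leak; the real test spread about 10K QPS across many connections rather than a single loop.

```go
package main

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	ctx := context.Background()

	// Assumes a TiDB server reachable on the default SQL port 4000.
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:4000)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Create a resource group with a deliberately tiny quota.
	if _, err := db.ExecContext(ctx, "CREATE RESOURCE GROUP IF NOT EXISTS rg_leak RU_PER_SEC = 1"); err != nil {
		log.Fatal(err)
	}

	// Pin one connection so the session-level resource group binding sticks,
	// then push load far above the quota so requests keep hitting the RU
	// limiter (the tikv:8252 errors mentioned above).
	conn, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	if _, err := conn.ExecContext(ctx, "SET RESOURCE GROUP rg_leak"); err != nil {
		log.Fatal(err)
	}
	for i := 0; i < 100000; i++ {
		if _, err := conn.ExecContext(ctx, "SELECT 1"); err != nil {
			log.Println("query error:", err)
		}
	}
}
```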
What did you expect to see?
The TiDB server's goroutine count should stay stable; when the traffic ramps down, it should recover to a lower level.
What did you see instead?
A lot of goroutines from the RG package stuck around and never got cleaned up.
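One way we checked where the leaked goroutines were parked was to pull a goroutine dump from the TiDB status port and count the handleTokenBucketUpdateEvent frames. A small sketch of that check, assuming the default status address 127.0.0.1:10080:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	// Fetch a full per-goroutine dump (debug=2) from TiDB's status port,
	// which exposes the standard Go pprof handlers.
	resp, err := http.Get("http://127.0.0.1:10080/debug/pprof/goroutine?debug=2")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Each parked goroutine prints its own stack, and the handler frame
	// appears once per stack, so this count approximates the leak size.
	count := 0
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		if strings.Contains(scanner.Text(), "handleTokenBucketUpdateEvent") {
			count++
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
	fmt.Println("goroutines parked in handleTokenBucketUpdateEvent:", count)
}
```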
What version of PD are you using (pd-server -V)?
The trace above is from v7.1.0, but the same issue can be reproduced on our v8.5.2 cluster.