Goroutine leak from controller.(*groupCostController).handleTokenBucketUpdateEvent #9745

@yzhan1

Description

Bug Report

We noticed an issue where resource group (RG) goroutines don't get cleaned up properly when the traffic ramps down:

goroutine 5042535512 [select, 1540 minutes]:

github.com/tikv/pd/client/resource_group/controller.(*groupCostController).handleTokenBucketUpdateEvent(0xc023ad4000?, {0x679abb0, 0xc002803040}, {0xc1d66f62a0?, 0xc0e3235560?, 0x9662f80?})
      .../resource_group/controller/controller.go:720 +0x1ea
created by github.com/tikv/pd/client/resource_group/controller.(*ResourceGroupsController).Start.func1 in goroutine 1064
      .../resource_group/controller/controller.go:251 +0x906

We saw thousands of tikv:8252 errors during that window, and every time this happens the TiDB server's goroutine count goes up by 100K, with all of those goroutines blocked on the code above. This leads to high CPU usage on the TiDB node.
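
For illustration only (this is not the actual controller.go code; everything except the two function names from the trace is made up), the trace is consistent with a per-group worker that selects on an update channel with no path back out once the group goes idle. A minimal sketch of that shape:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// tokenBucketUpdate stands in for the real update event type; all names in
// this sketch besides handleTokenBucketUpdateEvent/Start are hypothetical.
type tokenBucketUpdate struct {
	ruPerSec float64
}

type groupCostController struct {
	updates chan tokenBucketUpdate
}

// handleTokenBucketUpdateEvent mimics a per-group worker that only ever waits
// for the next token-bucket update. Because the select has no <-ctx.Done()
// (or "group removed") case that lets it return, the goroutine blocks forever
// once updates stop arriving -- the "[select, 1540 minutes]" state in the
// trace above.
func (gc *groupCostController) handleTokenBucketUpdateEvent(ctx context.Context) {
	_ = ctx // a ctx is available but never selected on
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for {
		select {
		case upd := <-gc.updates:
			_ = upd.ruPerSec // apply the new bucket settings
		case <-ticker.C:
			// periodic bookkeeping; still no way out of the loop
		}
	}
}

func main() {
	ctx := context.Background()
	// Simulate the controller starting one worker per resource group, as the
	// "created by (*ResourceGroupsController).Start.func1" frame suggests.
	for i := 0; i < 3; i++ {
		gc := &groupCostController{updates: make(chan tokenBucketUpdate, 1)}
		go gc.handleTokenBucketUpdateEvent(ctx)
	}
	time.Sleep(100 * time.Millisecond)
	fmt.Println("workers started; with no exit case they never return")
}
```

If that is indeed what is happening, the usual fix for this shape of leak is a `case <-ctx.Done(): return` (or a per-group cleanup signal) so the workers exit when the controller stops or the group is removed, but we'll leave the exact fix to the maintainers.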

What did you do?

  1. Create a resource group with the minimal quota RU_PER_SEC=1
  2. Send 10K QPS through this resource group (see the sketch below)
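
A minimal Go sketch of one way to generate this load, assuming the TiDB resource-group statements `CREATE RESOURCE GROUP ... RU_PER_SEC = 1` and `SET RESOURCE GROUP ...` and the go-sql-driver/mysql driver; the DSN, group name, worker counts, and query are placeholders:

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"sync"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	ctx := context.Background()

	// Placeholder DSN; point it at any TiDB server in the cluster.
	db, err := sql.Open("mysql", "root:@tcp(127.0.0.1:4000)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Step 1: a resource group with the minimal quota.
	if _, err := db.ExecContext(ctx, "CREATE RESOURCE GROUP IF NOT EXISTS rg_leak RU_PER_SEC = 1"); err != nil {
		log.Fatal(err)
	}

	// Step 2: push far more traffic through the group than 1 RU/s allows;
	// the throttling is what produces the tikv:8252 errors mentioned above.
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ { // scale the worker count up toward ~10K QPS
		wg.Add(1)
		go func() {
			defer wg.Done()
			// SET RESOURCE GROUP applies per session, so pin one connection.
			conn, err := db.Conn(ctx)
			if err != nil {
				log.Print(err)
				return
			}
			defer conn.Close()
			if _, err := conn.ExecContext(ctx, "SET RESOURCE GROUP rg_leak"); err != nil {
				log.Print(err)
				return
			}
			for j := 0; j < 10000; j++ {
				// Any cheap query works; errors are expected once the group
				// runs out of RUs.
				_, _ = conn.ExecContext(ctx, "SELECT 1")
			}
		}()
	}
	wg.Wait()
}
```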

What did you expect to see?

The TiDB server's goroutine count should stay stable, and when the traffic ramps down it should recover to a lower level.

What did you see instead?

A lot of goroutines from the resource group package stuck around and didn't get cleaned up.
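
One way to confirm this on a live cluster is to pull the goroutine dump from the TiDB status port and count the stacks blocked in handleTokenBucketUpdateEvent. A minimal sketch, assuming the default status port 10080 and the standard Go `/debug/pprof/goroutine?debug=2` text output:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"
)

func main() {
	// debug=2 prints every goroutine's full stack once, so each leaked
	// goroutine contributes one occurrence of the frame we count below.
	// 10080 is the default TiDB status port; adjust the host as needed.
	resp, err := http.Get("http://127.0.0.1:10080/debug/pprof/goroutine?debug=2")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	n := strings.Count(string(body), "handleTokenBucketUpdateEvent")
	fmt.Printf("goroutines blocked in handleTokenBucketUpdateEvent: %d\n", n)
}
```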

What version of PD are you using (pd-server -V)?

The trace above is from v7.1.0, but the same issue can be reproduced on our v8.5.2 cluster.
