Skip to content

Commit 281cfcf

Browse files
mknyszekgopherbot
authored andcommitted
runtime: handle system goroutines later in goroutine profiling
Before CL 650697, there was only one system goroutine that could dynamically change between being a user goroutine and a system goroutine, and that was the finalizer/cleanup goroutine. In goroutine profiles, it was handled explicitly. It's status would be checked during the first STW, and its stack would be recorded. This let the goroutine profiler completely ignore system goroutines once the world was started again. CL 650697 added dedicated cleanup goroutines (there may be more than one), and with this, the logic for finalizer goroutines no longer scaled. In that CL, I let the isSystemGoroutine check be dynamic and dropped the special case, but this was based on incorrect assumptions. Namely, it's possible for the scheduler to observe, for example, the finalizer goroutine as a system goroutine and ignore it, but then later the goroutine profiler itself sees it as a user goroutine. At that point it's too late and already running. This violates the invariant of the goroutine profile that all goroutines are handled by the profiler before they start executing. In practice, the result is that the goroutine profiler can crash when it checks this invariant (not checking the invariant means racily reading goroutine stack memory). The root cause of the problem is that these system goroutines do not participate in the goroutine profiler's state machine. Normally, when profiling, goroutines transition from 'absent' to 'in-progress' to 'satisfied'. However with system goroutines, the state machine is ignored entirely. They always stay in the 'absent' state. This means that if a goroutine transitions from system to user, it is eligible for a profile record when it shouldn't be. That transition shouldn't be allowed to occur with respect to the goroutine profiler, because the goroutine profiler is trying to snapshot the state of every goroutine. The fix to this problem is simple: don't ignore system goroutines. Let them participate in the goroutine profile state machine. Instead, decide whether or not to record the stack after the goroutine has been acquired for goroutine profiling. This means if the scheduler observes the finalizer goroutine as a system goroutine, it will get promoted in the goroutine profiler's state machine, and no other part of the goroutine profiler will observe the goroutine again. Simultaneously, the stack record for the goroutine will be correctly skipped. Fixes #74090. Change-Id: Icb9a164a033be22aaa942d19e828e895f700ca74 Reviewed-on: https://go-review.googlesource.com/c/go/+/680477 Reviewed-by: Carlos Amedee <carlos@golang.org> Auto-Submit: Michael Knyszek <mknyszek@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
1 parent 4f86f22 commit 281cfcf

File tree

2 files changed

+50
-4
lines changed

2 files changed

+50
-4
lines changed

src/runtime/mprof.go

Lines changed: 11 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1431,10 +1431,6 @@ func tryRecordGoroutineProfile(gp1 *g, pcbuf []uintptr, yield func()) {
14311431
// so here we check _Gdead first.
14321432
return
14331433
}
1434-
if isSystemGoroutine(gp1, false) {
1435-
// System goroutines should not appear in the profile.
1436-
return
1437-
}
14381434

14391435
for {
14401436
prev := gp1.goroutineProfiled.Load()
@@ -1472,6 +1468,17 @@ func tryRecordGoroutineProfile(gp1 *g, pcbuf []uintptr, yield func()) {
14721468
// stack), or from the scheduler in preparation to execute gp1 (running on the
14731469
// system stack).
14741470
func doRecordGoroutineProfile(gp1 *g, pcbuf []uintptr) {
1471+
if isSystemGoroutine(gp1, false) {
1472+
// System goroutines should not appear in the profile.
1473+
// Check this here and not in tryRecordGoroutineProfile because isSystemGoroutine
1474+
// may change on a goroutine while it is executing, so while the scheduler might
1475+
// see a system goroutine, goroutineProfileWithLabelsConcurrent might not, and
1476+
// this inconsistency could cause invariants to be violated, such as trying to
1477+
// record the stack of a running goroutine below. In short, we still want system
1478+
// goroutines to participate in the same state machine on gp1.goroutineProfiled as
1479+
// everything else, we just don't record the stack in the profile.
1480+
return
1481+
}
14751482
if readgstatus(gp1) == _Grunning {
14761483
print("doRecordGoroutineProfile gp1=", gp1.goid, "\n")
14771484
throw("cannot read stack of running goroutine")

src/runtime/pprof/pprof_test.go

Lines changed: 39 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1808,6 +1808,45 @@ func TestGoroutineProfileCoro(t *testing.T) {
18081808
goroutineProf.WriteTo(io.Discard, 1)
18091809
}
18101810

1811+
// This test tries to provoke a situation wherein the finalizer goroutine is
1812+
// erroneously inspected by the goroutine profiler in such a way that could
1813+
// cause a crash. See go.dev/issue/74090.
1814+
func TestGoroutineProfileIssue74090(t *testing.T) {
1815+
testenv.MustHaveParallelism(t)
1816+
1817+
goroutineProf := Lookup("goroutine")
1818+
1819+
// T is a pointer type so it won't be allocated by the tiny
1820+
// allocator, which can lead to its finalizer not being called
1821+
// during this test.
1822+
type T *byte
1823+
for range 10 {
1824+
// We use finalizers for this test because finalizers transition between
1825+
// system and user goroutine on each call, since there's substantially
1826+
// more work to do to set up a finalizer call. Cleanups, on the other hand,
1827+
// transition once for a whole batch, and so are less likely to trigger
1828+
// the failure. Under stress testing conditions this test fails approximately
1829+
// 5 times every 1000 executions on a 64 core machine without the appropriate
1830+
// fix, which is not ideal but if this test crashes at all, it's a clear
1831+
// signal that something is broken.
1832+
var objs []*T
1833+
for range 10000 {
1834+
obj := new(T)
1835+
runtime.SetFinalizer(obj, func(_ interface{}) {})
1836+
objs = append(objs, obj)
1837+
}
1838+
objs = nil
1839+
1840+
// Queue up all the finalizers.
1841+
runtime.GC()
1842+
1843+
// Try to run a goroutine profile concurrently with finalizer execution
1844+
// to trigger the bug.
1845+
var w strings.Builder
1846+
goroutineProf.WriteTo(&w, 1)
1847+
}
1848+
}
1849+
18111850
func BenchmarkGoroutine(b *testing.B) {
18121851
withIdle := func(n int, fn func(b *testing.B)) func(b *testing.B) {
18131852
return func(b *testing.B) {

0 commit comments

Comments
 (0)