Skip to content

Commit c6ac736

Browse files
mknyszekgopherbot
authored andcommitted
runtime: prevent mutual deadlock between GC stopTheWorld and suspendG
Almost everywhere we stop the world we casGToWaitingForGC to prevent mutual deadlock with the GC trying to scan our stack. This historically was only necessary if we weren't stopping the world to change the GC phase, because what we were worried about was mutual deadlock with mark workers' use of suspendG. And, they were the only users of suspendG. In Go 1.22 this changed. The execution tracer began using suspendG, too. This leads to the possibility of mutual deadlock between the execution tracer and a goroutine trying to start or end the GC mark phase. The fix is simple: make the stop-the-world calls for the GC also call casGToWaitingForGC. This way, suspendG is guaranteed to make progress in this circumstance, and once it completes, the stop-the-world can complete as well. We can take this a step further, though, and move casGToWaitingForGC into stopTheWorldWithSema, since there's no longer really a place we can afford to skip this detail. While we're here, rename casGToWaitingForGC to casGToWaitingForSuspendG, since the GC is now not the only potential source of mutual deadlock. Fixes #72740. Change-Id: I5e3739a463ef3e8173ad33c531e696e46260692f Reviewed-on: https://go-review.googlesource.com/c/go/+/681501 Reviewed-by: Carlos Amedee <carlos@golang.org> Auto-Submit: Michael Knyszek <mknyszek@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Cherry Mui <cherryyz@google.com>
1 parent 53af292 commit c6ac736

File tree

7 files changed

+65
-45
lines changed

7 files changed

+65
-45
lines changed

src/runtime/mgc.go

Lines changed: 4 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1048,7 +1048,7 @@ func gcMarkTermination(stw worldStop) {
10481048
// N.B. The execution tracer is not aware of this status
10491049
// transition and handles it specially based on the
10501050
// wait reason.
1051-
casGToWaitingForGC(curgp, _Grunning, waitReasonGarbageCollection)
1051+
casGToWaitingForSuspendG(curgp, _Grunning, waitReasonGarbageCollection)
10521052

10531053
// Run gc on the g0 stack. We do this so that the g stack
10541054
// we're currently running on will no longer change. Cuts
@@ -1522,7 +1522,8 @@ func gcBgMarkWorker(ready chan struct{}) {
15221522

15231523
systemstack(func() {
15241524
// Mark our goroutine preemptible so its stack
1525-
// can be scanned. This lets two mark workers
1525+
// can be scanned or observed by the execution
1526+
// tracer. This, for example, lets two mark workers
15261527
// scan each other (otherwise, they would
15271528
// deadlock). We must not modify anything on
15281529
// the G stack. However, stack shrinking is
@@ -1532,7 +1533,7 @@ func gcBgMarkWorker(ready chan struct{}) {
15321533
// N.B. The execution tracer is not aware of this status
15331534
// transition and handles it specially based on the
15341535
// wait reason.
1535-
casGToWaitingForGC(gp, _Grunning, waitReasonGCWorkerActive)
1536+
casGToWaitingForSuspendG(gp, _Grunning, waitReasonGCWorkerActive)
15361537
switch pp.gcMarkWorkerMode {
15371538
default:
15381539
throw("gcBgMarkWorker: unexpected gcMarkWorkerMode")

src/runtime/mgcmark.go

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -227,7 +227,7 @@ func markroot(gcw *gcWork, i uint32, flushBgCredit bool) int64 {
227227
userG := getg().m.curg
228228
selfScan := gp == userG && readgstatus(userG) == _Grunning
229229
if selfScan {
230-
casGToWaitingForGC(userG, _Grunning, waitReasonGarbageCollectionScan)
230+
casGToWaitingForSuspendG(userG, _Grunning, waitReasonGarbageCollectionScan)
231231
}
232232

233233
// TODO: suspendG blocks (and spins) until gp
@@ -682,7 +682,7 @@ func gcAssistAlloc1(gp *g, scanWork int64) {
682682
}
683683

684684
// gcDrainN requires the caller to be preemptible.
685-
casGToWaitingForGC(gp, _Grunning, waitReasonGCAssistMarking)
685+
casGToWaitingForSuspendG(gp, _Grunning, waitReasonGCAssistMarking)
686686

687687
// drain own cached work first in the hopes that it
688688
// will be more cache friendly.

src/runtime/proc.go

Lines changed: 44 additions & 26 deletions
Original file line numberDiff line numberDiff line change
@@ -1347,13 +1347,13 @@ func casGToWaiting(gp *g, old uint32, reason waitReason) {
13471347
casgstatus(gp, old, _Gwaiting)
13481348
}
13491349

1350-
// casGToWaitingForGC transitions gp from old to _Gwaiting, and sets the wait reason.
1351-
// The wait reason must be a valid isWaitingForGC wait reason.
1350+
// casGToWaitingForSuspendG transitions gp from old to _Gwaiting, and sets the wait reason.
1351+
// The wait reason must be a valid isWaitingForSuspendG wait reason.
13521352
//
13531353
// Use this over casgstatus when possible to ensure that a waitreason is set.
1354-
func casGToWaitingForGC(gp *g, old uint32, reason waitReason) {
1355-
if !reason.isWaitingForGC() {
1356-
throw("casGToWaitingForGC with non-isWaitingForGC wait reason")
1354+
func casGToWaitingForSuspendG(gp *g, old uint32, reason waitReason) {
1355+
if !reason.isWaitingForSuspendG() {
1356+
throw("casGToWaitingForSuspendG with non-isWaitingForSuspendG wait reason")
13571357
}
13581358
casGToWaiting(gp, old, reason)
13591359
}
@@ -1487,23 +1487,7 @@ func stopTheWorld(reason stwReason) worldStop {
14871487
gp := getg()
14881488
gp.m.preemptoff = reason.String()
14891489
systemstack(func() {
1490-
// Mark the goroutine which called stopTheWorld preemptible so its
1491-
// stack may be scanned.
1492-
// This lets a mark worker scan us while we try to stop the world
1493-
// since otherwise we could get in a mutual preemption deadlock.
1494-
// We must not modify anything on the G stack because a stack shrink
1495-
// may occur. A stack shrink is otherwise OK though because in order
1496-
// to return from this function (and to leave the system stack) we
1497-
// must have preempted all goroutines, including any attempting
1498-
// to scan our stack, in which case, any stack shrinking will
1499-
// have already completed by the time we exit.
1500-
//
1501-
// N.B. The execution tracer is not aware of this status
1502-
// transition and handles it specially based on the
1503-
// wait reason.
1504-
casGToWaitingForGC(gp, _Grunning, waitReasonStoppingTheWorld)
15051490
stopTheWorldContext = stopTheWorldWithSema(reason) // avoid write to stack
1506-
casgstatus(gp, _Gwaiting, _Grunning)
15071491
})
15081492
return stopTheWorldContext
15091493
}
@@ -1592,7 +1576,30 @@ var gcsema uint32 = 1
15921576
//
15931577
// Returns the STW context. When starting the world, this context must be
15941578
// passed to startTheWorldWithSema.
1579+
//
1580+
//go:systemstack
15951581
func stopTheWorldWithSema(reason stwReason) worldStop {
1582+
// Mark the goroutine which called stopTheWorld preemptible so its
1583+
// stack may be scanned by the GC or observed by the execution tracer.
1584+
//
1585+
// This lets a mark worker scan us or the execution tracer take our
1586+
// stack while we try to stop the world since otherwise we could get
1587+
// in a mutual preemption deadlock.
1588+
//
1589+
// We must not modify anything on the G stack because a stack shrink
1590+
// may occur, now that we switched to _Gwaiting, specifically if we're
1591+
// doing this during the mark phase (mark termination excepted, since
1592+
// we know that stack scanning is done by that point). A stack shrink
1593+
// is otherwise OK though because in order to return from this function
1594+
// (and to leave the system stack) we must have preempted all
1595+
// goroutines, including any attempting to scan our stack, in which
1596+
// case, any stack shrinking will have already completed by the time we
1597+
// exit.
1598+
//
1599+
// N.B. The execution tracer is not aware of this status transition and
1600+
// andles it specially based on the wait reason.
1601+
casGToWaitingForSuspendG(getg().m.curg, _Grunning, waitReasonStoppingTheWorld)
1602+
15961603
trace := traceAcquire()
15971604
if trace.ok() {
15981605
trace.STWStart(reason)
@@ -1700,6 +1707,9 @@ func stopTheWorldWithSema(reason stwReason) worldStop {
17001707

17011708
worldStopped()
17021709

1710+
// Switch back to _Grunning, now that the world is stopped.
1711+
casgstatus(getg().m.curg, _Gwaiting, _Grunning)
1712+
17031713
return worldStop{
17041714
reason: reason,
17051715
startedStopping: start,
@@ -2068,15 +2078,23 @@ found:
20682078
func forEachP(reason waitReason, fn func(*p)) {
20692079
systemstack(func() {
20702080
gp := getg().m.curg
2071-
// Mark the user stack as preemptible so that it may be scanned.
2072-
// Otherwise, our attempt to force all P's to a safepoint could
2073-
// result in a deadlock as we attempt to preempt a worker that's
2074-
// trying to preempt us (e.g. for a stack scan).
2081+
// Mark the user stack as preemptible so that it may be scanned
2082+
// by the GC or observed by the execution tracer. Otherwise, our
2083+
// attempt to force all P's to a safepoint could result in a
2084+
// deadlock as we attempt to preempt a goroutine that's trying
2085+
// to preempt us (e.g. for a stack scan).
2086+
//
2087+
// We must not modify anything on the G stack because a stack shrink
2088+
// may occur. A stack shrink is otherwise OK though because in order
2089+
// to return from this function (and to leave the system stack) we
2090+
// must have preempted all goroutines, including any attempting
2091+
// to scan our stack, in which case, any stack shrinking will
2092+
// have already completed by the time we exit.
20752093
//
20762094
// N.B. The execution tracer is not aware of this status
20772095
// transition and handles it specially based on the
20782096
// wait reason.
2079-
casGToWaitingForGC(gp, _Grunning, reason)
2097+
casGToWaitingForSuspendG(gp, _Grunning, reason)
20802098
forEachPInternal(fn)
20812099
casgstatus(gp, _Gwaiting, _Grunning)
20822100
})

src/runtime/runtime2.go

Lines changed: 7 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -1163,17 +1163,17 @@ func (w waitReason) isMutexWait() bool {
11631163
w == waitReasonSyncRWMutexLock
11641164
}
11651165

1166-
func (w waitReason) isWaitingForGC() bool {
1167-
return isWaitingForGC[w]
1166+
func (w waitReason) isWaitingForSuspendG() bool {
1167+
return isWaitingForSuspendG[w]
11681168
}
11691169

1170-
// isWaitingForGC indicates that a goroutine is only entering _Gwaiting and
1171-
// setting a waitReason because it needs to be able to let the GC take ownership
1172-
// of its stack. The G is always actually executing on the system stack, in
1173-
// these cases.
1170+
// isWaitingForSuspendG indicates that a goroutine is only entering _Gwaiting and
1171+
// setting a waitReason because it needs to be able to let the suspendG
1172+
// (used by the GC and the execution tracer) take ownership of its stack.
1173+
// The G is always actually executing on the system stack in these cases.
11741174
//
11751175
// TODO(mknyszek): Consider replacing this with a new dedicated G status.
1176-
var isWaitingForGC = [len(waitReasonStrings)]bool{
1176+
var isWaitingForSuspendG = [len(waitReasonStrings)]bool{
11771177
waitReasonStoppingTheWorld: true,
11781178
waitReasonGCMarkTermination: true,
11791179
waitReasonGarbageCollection: true,

src/runtime/stack.go

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1212,14 +1212,14 @@ func isShrinkStackSafe(gp *g) bool {
12121212
return false
12131213
}
12141214
// We also can't copy the stack while tracing is enabled, and
1215-
// gp is in _Gwaiting solely to make itself available to the GC.
1215+
// gp is in _Gwaiting solely to make itself available to suspendG.
12161216
// In these cases, the G is actually executing on the system
12171217
// stack, and the execution tracer may want to take a stack trace
12181218
// of the G's stack. Note: it's safe to access gp.waitreason here.
12191219
// We're only checking if this is true if we took ownership of the
12201220
// G with the _Gscan bit. This prevents the goroutine from transitioning,
12211221
// which prevents gp.waitreason from changing.
1222-
if traceEnabled() && readgstatus(gp)&^_Gscan == _Gwaiting && gp.waitreason.isWaitingForGC() {
1222+
if traceEnabled() && readgstatus(gp)&^_Gscan == _Gwaiting && gp.waitreason.isWaitingForSuspendG() {
12231223
return false
12241224
}
12251225
return true

src/runtime/trace.go

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -376,7 +376,7 @@ func traceAdvance(stopTrace bool) {
376376
me := getg().m.curg
377377
// We don't have to handle this G status transition because we
378378
// already eliminated ourselves from consideration above.
379-
casGToWaitingForGC(me, _Grunning, waitReasonTraceGoroutineStatus)
379+
casGToWaitingForSuspendG(me, _Grunning, waitReasonTraceGoroutineStatus)
380380
// We need to suspend and take ownership of the G to safely read its
381381
// goid. Note that we can't actually emit the event at this point
382382
// because we might stop the G in a window where it's unsafe to write

src/runtime/tracestatus.go

Lines changed: 5 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -126,11 +126,12 @@ func goStatusToTraceGoStatus(status uint32, wr waitReason) tracev2.GoStatus {
126126
// There are a number of cases where a G might end up in
127127
// _Gwaiting but it's actually running in a non-preemptive
128128
// state but needs to present itself as preempted to the
129-
// garbage collector. In these cases, we're not going to
130-
// emit an event, and we want these goroutines to appear in
131-
// the final trace as if they're running, not blocked.
129+
// garbage collector and traceAdvance (via suspendG). In
130+
// these cases, we're not going to emit an event, and we
131+
// want these goroutines to appear in the final trace as
132+
// if they're running, not blocked.
132133
tgs = tracev2.GoWaiting
133-
if status == _Gwaiting && wr.isWaitingForGC() {
134+
if status == _Gwaiting && wr.isWaitingForSuspendG() {
134135
tgs = tracev2.GoRunning
135136
}
136137
case _Gdead:

0 commit comments

Comments
 (0)