Skip to content

Commit b5e044a

Browse files
authored
Disable sigaltstack overriding in asan builds (bytecodealliance#10024)
This commit is an attempt to fix a number of flaky crashes that we've been seeing on OSS-Fuzz for some time now. These crashes only reproduce under ASAN and even then have been spotty to reproduce. The current thinking is that a test with threads (e.g. only `wast_tests` using some of the threads spec tests) is required to run some wasm which will register a `sigaltstack`. Destruction of this `sigaltstack` happens with TLS destructors which seems to have a bad interaction with ASAN state additionally being destroyed around that time. This whole interaction means that no one test case is enough to reproduce the corruption. Many crashes on OSS-Fuzz are likely due to "some historical test case spawned a thread" which corrupted something to crash later. The test case that I can reproduce with locally requires rerunning it in the same process a few thousand times to get a reproduction. The purpose of the `sigaltstack` is to ensure that we have a big enough stack, primarily in debug mode, for testing if a trap is wasm. The hope is that this extra size of the Rust-standard-library-default's stack size is not necessary in release mode with ASAN. In the end time will tell with OSS-Fuzz to see if we can keep this or if we need to both install a bigger sigaltstack in addition to managing them differently in ASAN builds.
1 parent 66d5ad3 commit b5e044a

File tree

1 file changed

+53
-0
lines changed

1 file changed

+53
-0
lines changed

crates/wasmtime/src/runtime/vm/sys/unix/signals.rs

Lines changed: 53 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -393,8 +393,61 @@ unsafe fn set_pc(cx: *mut libc::c_void, pc: usize, arg1: usize) {
393393
/// always large enough for our signal handling code. Override it by creating
394394
/// and registering our own alternate stack that is large enough and has a guard
395395
/// page.
396+
///
397+
/// Note that one might reasonably ask why do this at all? Why not remove
398+
/// `SA_ONSTACK` from our signal handlers entirely? The basic reason for that is
399+
/// because we want to print a message on stack overflow. The Rust standard
400+
/// library will print this message by default and by us overriding the
401+
/// `SIGSEGV` handler above we're now sharing responsibility for that as well.
402+
/// We must have `SA_ONSTACK` to even attempt to being able to printing this
403+
/// message, and so we leave it turned on. Wasmtime will determine a stack
404+
/// overflow fault isn't caused by wasm and then forward to libstd's signal
405+
/// handler which will actually print-and-abort.
406+
///
407+
/// Another reasonable question might be why we need to increase the size of the
408+
/// sigaltstack at all? This is something which we may want to reconsider in the
409+
/// future. For now it helps keep debug builds working which consume more stack
410+
/// when handling normal wasm out-of-bounds and faults. Perhaps in the future we
411+
/// could optimize this more or maybe even do something clever like lazily
412+
/// allocate the sigaltstack on the fault itself. (e.g. trampoline from a tiny
413+
/// stack to the "big stack" during a wasm fault or something like that)
396414
#[cold]
397415
pub fn lazy_per_thread_init() {
416+
// This is a load-bearing requirement to keep address-sanitizer working and
417+
// prevent crashes during fuzzing. The general idea here is that we skip the
418+
// sigaltstack setup below entirely on asan builds, aka fuzzing. The exact
419+
// reason for this is not entirely known, but the closest guess we have at
420+
// this time is something like:
421+
//
422+
// * ASAN builds intercept mmap/munmap to keep track of what's going on.
423+
// * The sigaltstack below registers a TLS destructor for when the current
424+
// thread exits to deallocate the stack.
425+
// * ASAN looks to also have TLS destructors for its own internal state.
426+
// * The current assumption is that the order of these TLS destructors can
427+
// cause corruption in ASAN state where if we run after asan's destructor
428+
// it may intercept munmap and then asan doesn't know it's been
429+
// de-initialized yet.
430+
//
431+
// The reproduction of this involved a standalone project built with
432+
// `-Zsanitizer=address` where internally it would spawn two threads. Each
433+
// thread would build a "hello world" module and then one of the threads
434+
// would execute a noop exported function. If this was run thousands of
435+
// times in a loop in the same process it would eventually crash under asan.
436+
//
437+
// It's notably not quite so simple as frobbing TLS destructors. There's
438+
// clearly something else going on with ASAN state internally which we don't
439+
// fully understand at this time. An attempt to make a standalone C++
440+
// reproduction, for example, was not successful. In lieu of that the best
441+
// we have for now is to disable our custom and larger sigaltstack in asan
442+
// builds.
443+
//
444+
// The exact source was
445+
// https://gist.github.com/alexcrichton/6815a5d57a3c5ca94a8d816a9fcc91af for
446+
// future reference if necessary.
447+
if cfg!(asan) {
448+
return;
449+
}
450+
398451
// This thread local is purely used to register a `Stack` to get deallocated
399452
// when the thread exists. Otherwise this function is only ever called at
400453
// most once per-thread.

0 commit comments

Comments
 (0)