
Mutex::try_lock on poisoned mutex always fails with WouldBlock #190

Background

When a panic occurs while a MutexGuard is held, the Mutex is marked as poisoned and the lock is released. Future attempts to acquire the lock will fail with PoisonError, which can be overridden by calling PoisonError::into_inner to still get at the inner value. See the standard library documentation for Mutex poisoning for details.
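
For reference, here is a minimal std-only sketch (not Shuttle code) of that behavior: the panic while the guard is held poisons the mutex, later lock calls return PoisonError, and into_inner still recovers the value:

use std::panic::{catch_unwind, AssertUnwindSafe};
use std::sync::{Arc, Mutex};

fn main() {
    let lock = Arc::new(Mutex::new(0));

    // Panic while the guard is held; dropping the guard during unwinding
    // marks the mutex as poisoned.
    let _ = catch_unwind(AssertUnwindSafe(|| {
        let _guard = lock.lock().unwrap();
        panic!("poison the mutex");
    }));

    // Subsequent lock attempts fail with PoisonError, but the inner value
    // is still reachable via into_inner.
    let value = match lock.lock() {
        Ok(guard) => *guard,
        Err(poisoned) => *poisoned.into_inner(),
    };
    assert_eq!(value, 0);
}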

Shuttle's version of Mutex also supports this behavior, as validated by the existing unit test (pasted below): it panics while holding the lock, catches the panic so the test itself doesn't fail, and then asserts that attempts to re-acquire the lock after the panic fail with PoisonError:

#[test]
fn mutex_poison() {
    check_dfs(
        || {
            let lock = Arc::new(Mutex::new(false));

            let thd = {
                let lock = Arc::clone(&lock);
                thread::spawn(move || {
                    let _err = catch_unwind(AssertUnwindSafe(move || {
                        let _lock = lock.lock().unwrap();
                        panic!("expected panic");
                    }))
                    .unwrap_err();
                })
            };

            thd.join().unwrap();

            let result = lock.lock();
            assert!(matches!(result, Err(PoisonError { .. })));
        },
        None,
    )
}

Issue

However, if we modify the test to use Mutex::try_lock instead of Mutex::lock (as below), the test no longer succeeds: we expect try_lock to fail with a TryLockError::Poisoned error, but we get a TryLockError::WouldBlock error instead, even though there is only one thread and the lock is not held:

#[test]
fn mutex_poison_try_lock() {
    check_dfs(
        || {
            let lock = Arc::new(Mutex::new(false));

            let thd = {
                let lock = Arc::clone(&lock);
                thread::spawn(move || {
                    let _err = catch_unwind(AssertUnwindSafe(move || {
                        let _lock = lock.lock().unwrap();
                        panic!("expected panic");
                    }))
                    .unwrap_err();
                })
            };

            thd.join().unwrap();

            match lock.try_lock() {
                Ok(_) => panic!("try_lock succeeded"),
                Err(TryLockError::WouldBlock) => panic!("try_lock failed with WouldBlock"),
                Err(TryLockError::Poisoned(_)) => (), // Expected case
            };
        },
        None,
    )
}

This fails with

thread 'basic::panic::mutex_poison_try_lock' panicked at shuttle/tests/basic/panic.rs:56:50:
try_lock failed with WouldBlock
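
For comparison, std's Mutex behaves the way the test expects; a minimal std-only sketch (not Shuttle code) shows try_lock on a poisoned, uncontended mutex reporting Poisoned rather than WouldBlock:

use std::panic::{catch_unwind, AssertUnwindSafe};
use std::sync::{Arc, Mutex, TryLockError};

fn main() {
    let lock = Arc::new(Mutex::new(false));

    // Poison the mutex by panicking while the guard is held.
    let _ = catch_unwind(AssertUnwindSafe(|| {
        let _guard = lock.lock().unwrap();
        panic!("expected panic");
    }));

    // No other thread holds the lock, so std reports Poisoned, not WouldBlock.
    assert!(matches!(lock.try_lock(), Err(TryLockError::Poisoned(_))));
}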

Cause

The root cause appears to lie in the BatchSemaphore, the basic concurrency primitive Shuttle uses to implement Mutex. In BatchSemaphore::release, the semaphore is marked as "closed" if ExecutionState::should_stop() is true (which is the case if std::thread::panicking() is true):

pub fn release(&self, num_permits: usize) {
    ...
    if ExecutionState::should_stop() {
        ...
        state.closed = true;
        return;
    }
    ...
}

Then, because the semaphore is marked as closed, all calls to BatchSemaphore::try_acquire return Err(TryAcquireError::Closed), which Mutex::try_lock maps to a WouldBlock error: self.semaphore.try_acquire(1).map_err(|_| TryLockError::WouldBlock)?;.
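
To make the collapse concrete, here is a self-contained schematic (the enums and the closed flag are simplified stand-ins, not Shuttle's actual types) of the path described above:

// Simplified stand-ins for Shuttle's types, reduced to the relevant parts.
#[allow(dead_code)]
#[derive(Debug)]
enum TryAcquireError { WouldBlock, Closed }

#[allow(dead_code)]
#[derive(Debug)]
enum TryLockError { WouldBlock, Poisoned }

struct BatchSemaphore { closed: bool }

impl BatchSemaphore {
    fn try_acquire(&self, _num_permits: usize) -> Result<(), TryAcquireError> {
        // After the panic, release() has set `closed`, so every acquire fails
        // with Closed even though permits are available.
        if self.closed { Err(TryAcquireError::Closed) } else { Ok(()) }
    }
}

fn try_lock(sem: &BatchSemaphore) -> Result<(), TryLockError> {
    // The map_err discards which error occurred, so Closed surfaces to the
    // caller as WouldBlock rather than Poisoned.
    sem.try_acquire(1).map_err(|_| TryLockError::WouldBlock)?;
    Ok(())
}

fn main() {
    let sem = BatchSemaphore { closed: true };
    assert!(matches!(try_lock(&sem), Err(TryLockError::WouldBlock)));
}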

Possible direction for a fix

It seems like the root cause of the issue is the assumption that ExecutionState::should_stop() should return true if the thread is panicking. That assumption holds if the code under test doesn't contain any std::panic::catch_unwinds, because the panic will propagate up and cause a test failure anyway. But if the test catches panics, a panicking thread doesn't necessarily mean the test case is going to abort, and the semaphore should still be able to vend permits in that case.

However, the implementation of should_stop mentions that this behavior is required in order to avoid calling into the scheduler during a panic:

    /// We also stop if we are currently panicking (e.g., perhaps we're unwinding the stack for a
    /// panic triggered while someone held a Mutex, and so are executing the Drop handler for
    /// MutexGuard). This avoids calling back into the scheduler during a panic, because the state
    /// may be poisoned or otherwise invalid.

I'm not sure what the implications of removing this assumption would be. What could happen if we removed this check and possibly called into the scheduler while panicking? Is there a way we could signal things differently to distinguish a panic coming from user code from one coming from Shuttle itself?
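
Purely as an illustration of that last question (the names and mechanism below are hypothetical, not Shuttle's API), one could imagine Shuttle setting an explicit thread-local flag just before any panic it raises itself, and having should_stop consult that flag instead of std::thread::panicking() alone, so that a user panic caught by catch_unwind would no longer close the semaphore:

use std::cell::Cell;

thread_local! {
    // Hypothetical flag, set only by Shuttle immediately before it raises an
    // internal panic (e.g., a scheduling failure), never by user code.
    static PANIC_RAISED_BY_SHUTTLE: Cell<bool> = Cell::new(false);
}

// Hypothetical replacement for the std::thread::panicking() check inside
// ExecutionState::should_stop(): only stop (and close semaphores) when the
// unwinding was triggered by Shuttle itself.
fn panicking_inside_shuttle() -> bool {
    std::thread::panicking() && PANIC_RAISED_BY_SHUTTLE.with(|f| f.get())
}

fn main() {
    // A user panic caught by catch_unwind never sets the flag, so this is false.
    assert!(!panicking_inside_shuttle());
}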
