mystenmark
Contributor

Fix crashes in execution_driver due to inability to execute
transactions.


We must hold the lock for the object entry while inserting to the
object_by_id_cache. Otherwise, a surprising bug can occur:

  1. A thread executing TX1 can write object (O,1) to the dirty set and
    then pause.
  2. TX2, which reads (O,1), can begin executing, because
    TransactionManager schedules transactions as soon as their inputs
    are available. It does not matter that TX1 hasn't finished executing
    yet.
  3. TX2 can write (O,2) to both the dirty set and the object_by_id_cache.
  4. The thread executing TX1 can resume and write (O,1) to the
    object_by_id_cache, overwriting the newer entry.

Now, any subsequent attempt to get the latest version of O will return
(O,1) instead of (O,2).
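
The interleaving above can be replayed deterministically in a small model. The sketch below is not the actual cache code: `Caches`, `write_dirty`, `write_cache`, `write_locked`, `has_dirty`, `latest`, and the two replay functions are hypothetical names, and a single `Mutex` around the whole dirty set stands in for the per-object entry lock.

```rust
use std::collections::{BTreeSet, HashMap};
use std::sync::Mutex;

type ObjectId = u32;
type Version = u64;

const O: ObjectId = 0;

/// Hypothetical stand-in for the real caches: a dirty set of written
/// versions plus a latest-version cache, each behind its own lock.
#[derive(Default)]
struct Caches {
    dirty: Mutex<HashMap<ObjectId, BTreeSet<Version>>>,
    object_by_id_cache: Mutex<HashMap<ObjectId, Version>>,
}

impl Caches {
    fn write_dirty(&self, id: ObjectId, v: Version) {
        self.dirty.lock().unwrap().entry(id).or_default().insert(v);
    }

    fn write_cache(&self, id: ObjectId, v: Version) {
        self.object_by_id_cache.lock().unwrap().insert(id, v);
    }

    /// Fixed write path (assumed shape): the dirty-set lock is held until
    /// the cache insert is done, so no reader can observe the new version
    /// in the dirty set before the cache has been updated.
    fn write_locked(&self, id: ObjectId, v: Version) {
        let mut dirty = self.dirty.lock().unwrap();
        dirty.entry(id).or_default().insert(v);
        self.write_cache(id, v); // `dirty` guard is still held here
    }

    fn has_dirty(&self, id: ObjectId, v: Version) -> bool {
        let dirty = self.dirty.lock().unwrap();
        dirty.get(&id).map_or(false, |versions| versions.contains(&v))
    }

    fn latest(&self, id: ObjectId) -> Option<Version> {
        self.object_by_id_cache.lock().unwrap().get(&id).copied()
    }
}

/// Replays steps 1-4 in order on one thread, without holding the lock
/// across the two writes.
fn replay_unlocked_race() -> Option<Version> {
    let c = Caches::default();
    c.write_dirty(O, 1); // step 1: TX1 writes (O,1) to the dirty set, then "pauses"
    assert!(c.has_dirty(O, 1)); // step 2: TX2's input is available, so it runs
    c.write_dirty(O, 2); // step 3: TX2 writes (O,2) to the dirty set...
    c.write_cache(O, 2); //         ...and to the cache
    c.write_cache(O, 1); // step 4: TX1 resumes and overwrites the cache
    c.latest(O)
}

/// With the lock held across both writes, TX2 cannot see (O,1) in the dirty
/// set until TX1's cache insert is done, so TX2's writes always come after.
fn replay_locked() -> Option<Version> {
    let c = Caches::default();
    c.write_locked(O, 1); // TX1
    assert!(c.has_dirty(O, 1));
    c.write_locked(O, 2); // TX2
    c.latest(O)
}
```

In this model, `replay_unlocked_race()` ends with the stale version 1 in the cache, while `replay_locked()` ends with version 2.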

This seems very unlikely, but it may be more likely under the following
circumstances:

  • While a thread is unlikely to pause for so long, moka cache uses
    optimistic lock-free algorithms that have retry loops. Possibly,
    under high contention, this code might spin for a surprisingly long
    time.
  • Additionally, many concurrent re-executions of the same tx could
    happen due to the tx finalizer, plus checkpoint executor, consensus,
    and RPCs from fullnodes.
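
To illustrate what an optimistic retry loop looks like in general, here is a minimal sketch of a compare-and-swap loop on a plain atomic counter. It is not moka's code and says nothing about moka's internals; `optimistic_increment` and `contend` are hypothetical names used only to show that the number of retries depends on how many threads contend.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// Optimistically increments `counter` and returns how many times the
/// compare-and-swap had to be retried before it succeeded.
fn optimistic_increment(counter: &AtomicU64) -> u64 {
    let mut retries = 0;
    let mut seen = counter.load(Ordering::Relaxed);
    loop {
        match counter.compare_exchange_weak(
            seen,
            seen + 1,
            Ordering::AcqRel,
            Ordering::Relaxed,
        ) {
            Ok(_) => return retries,
            Err(actual) => {
                // Another thread won the race (or the weak CAS failed
                // spuriously): reload the observed value and try again.
                seen = actual;
                retries += 1;
            }
        }
    }
}

/// Runs `threads` threads that each increment a shared counter
/// `per_thread` times. Returns (final counter value, total retries).
fn contend(threads: u64, per_thread: u64) -> (u64, u64) {
    let counter = AtomicU64::new(0);
    let total_retries = AtomicU64::new(0);
    thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                for _ in 0..per_thread {
                    let r = optimistic_increment(&counter);
                    total_retries.fetch_add(r, Ordering::Relaxed);
                }
            });
        }
    });
    (
        counter.load(Ordering::Relaxed),
        total_retries.load(Ordering::Relaxed),
    )
}
```

The final counter value is always `threads * per_thread`, but the retry count varies from run to run and tends to grow with contention, which is the kind of unbounded spinning the bullet above speculates about.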

Unfortunately, I have not been able to reproduce this bug, so we cannot
be sure that this fixes the crashes we've seen. But this is certainly a
possible bug.

@mystenmark mystenmark requested a review from ebmifa October 22, 2024 18:27
@ebmifa
Contributor

ebmifa commented Oct 22, 2024

@mystenmark do we also need this in releases/sui-v1.36.0-release branch?

@mystenmark
Contributor Author

> @mystenmark do we also need this in releases/sui-v1.36.0-release branch?

yes, opened #19967

@mystenmark mystenmark enabled auto-merge (squash) October 22, 2024 21:22
@mystenmark mystenmark merged commit d640e32 into releases/sui-v1.35.0-release Oct 22, 2024
44 of 49 checks passed
@mystenmark mystenmark deleted the mlogan-cp-crash-fix branch October 22, 2024 21:54