-
Notifications
You must be signed in to change notification settings - Fork 133
Description
Description
When I terminate zebrad start
using Ctrl-C (SIGINT) on Linux, I get pthread errors.
I can reliably generate these errors during startup and the initial peer connections. But they seem much less frequent after initialization.
This is a serious security risk, because calling pure virtual methods is unexpected code execution. So compilers could optimise it into something dangerous, or something that could be exploited. (That doesn't happen in this case - instead, it prints a warning. But there are no guarantees.)
Initial Analysis
We recently made a few changes to Zebra's startup and shutdown behaviour (#3057, #3071, #3078, #3091).
Some of these errors seem to come from C++ code ("pure virtual method called"), so it's probably a combination of:
- rocksdb, and
- the quick shutdown change from Re-order Zebra startup, so slow services are launched last #3091: https://github.com/ZcashFoundation/zebra/blame/68d7198e9ff269c6329c03ed9fec187660167780/zebrad/src/components/tokio.rs#L62-L63
PR #3091 also stopped zebrad
deleting ephemeral state directories. We should fix that issue as part of this bug fix.
Suggested Design
Precautions:
- upgrade to the latest version of RocksDB - Use the latest bindgen version in zcash_script and rocksdb #2981
Shut down in this order:
- wait for RocksDB to finish, by calling
CancelAllBackgroundWork()
when Zebra wants to exit: https://github.com/facebook/rocksdb/wiki/RocksDB-FAQ , - drop the database: https://github.com/facebook/rocksdb/wiki/Known-Issues ,
- delete the database files on disk, if it is ephemeral,
- exit Zebra
We can do this via a shutdown
method in the state service and finalized state, which is called:
- when the state is dropped,
- when the state reaches the
debug_stop_at_height
, or - using a
Shutdown
request to the state, in thestart
andcopy-state
(both state services) commands.
After we've waited for shutdown
to finish, it is safe to exit Zebra.
Other state requests could be processed while we are exiting Zebra:
- return errors for any requests and readiness checks after
shutdown
is called
Rejected Alternatives
Rust's sys::at_exit
handler has been removed:
rust-lang/rust#84115
There's no guarantee that libc::atexit
runs before RocksDB's static destructors, just like there's no guarantee of the exit order between Rust and C:
https://stackoverflow.com/questions/35980148/why-does-an-atexit-handler-panic-when-it-accesses-stdout
Steps to Reproduce
I tried this:
- Run
zebrad start
- Press
Ctrl-C
(SIGINT)
zebrad start &
kill -SIGINT $!
Expected Behaviour
I expected to see this happen:
zebrad
terminates quickly, without errors.
Actual Behaviour
Instead, this happened:
zebrad
terminates quickly, with errors like:
pthread lock: Invalid argument
Aborted (core dumped)
pure virtual method called
terminate called without an active exception
Aborted (core dumped)
pthread destroy mutex: Device or resource busy
Aborted (core dumped)
Zebra Logs
$ /home/dev/.cargo/bin/zebrad start
Dec 03 07:04:11.132 INFO {zebrad="022808d" net="Main"}: zebrad::commands::start: Starting zebrad
Dec 03 07:04:11.132 INFO {zebrad="022808d" net="Main"}: zebrad::commands::start: config=ZebradConfig { consensus: Config { checkpoint_sync: false }, metrics: MetricsSection { endpoint_addr:
None }, network: Config { listen_addr: 0.0.0.0:8233, network: Mainnet, initial_mainnet_peers: {"dnsseed.z.cash:8233", "dnsseed.str4d.xyz:8233", "mainnet.seeder.zfnd.org:8233", "mainnet.is.yol
o.money:8233"}, initial_testnet_peers: {"testnet.seeder.zfnd.org:18233", "testnet.is.yolo.money:18233", "dnsseed.testnet.z.cash:18233"}, peerset_initial_target_size: 50, crawl_new_peer_interv
al: 60s }, state: Config { cache_dir: "/home/dev/.cache/zebra", ephemeral: false, debug_stop_at_height: None }, tracing: TracingSection { use_color: true, filter: None, endpoint_addr: None, f
lamegraph: None, use_journald: false }, sync: SyncSection { max_concurrent_block_requests: 50, lookahead_limit: 2000 }, mempool: Config { debug_enable_at_height: None, tx_cost_limit: 80000000
, eviction_memory_time: 3600s } }
Dec 03 07:04:11.132 INFO {zebrad="022808d" net="Main"}: zebrad::commands::start: initializing node state
Dec 03 07:04:11.132 INFO {zebrad="022808d" net="Main"}: zebra_state::config: the open file limit is at or above the specified limit new_limit=1024 current_limit=1024 hard_rlimit=Some(524288)
Dec 03 07:04:11.144 INFO {zebrad="022808d" net="Main"}: zebra_state::service::finalized_state: Opened Zebra state cache at /home/dev/.cache/zebra/state/v11/mainnet
Dec 03 07:04:11.144 INFO {zebrad="022808d" net="Main"}: zebra_state::service::finalized_state: reading cached tip height
Dec 03 07:04:11.145 INFO {zebrad="022808d" net="Main"}: zebra_state::service::finalized_state: loaded Zebra state cache tip=None
Dec 03 07:04:11.145 INFO {zebrad="022808d" net="Main"}: zebra_state::service: starting legacy chain check
Dec 03 07:04:11.145 INFO {zebrad="022808d" net="Main"}: zebra_state::service: no legacy chain found
Dec 03 07:04:11.145 INFO {zebrad="022808d" net="Main"}: zebrad::commands::start: initializing network
Dec 03 07:04:11.145 INFO {zebrad="022808d" net="Main"}:open_listener{addr=0.0.0.0:8233}: zebra_network::peer_set::initialize: Trying to open Zcash protocol endpoint at 0.0.0.0:8233...
Dec 03 07:04:11.145 INFO {zebrad="022808d" net="Main"}:open_listener{addr=0.0.0.0:8233}: zebra_network::peer_set::initialize: Opened Zcash protocol endpoint at 0.0.0.0:8233
Dec 03 07:04:11.145 INFO {zebrad="022808d" net="Main"}:add_initial_peers: zebra_network::config: resolved seed peer IP addresses seed="dnsseed.z.cash:8233" remote_ip_count=25
Dec 03 07:04:11.145 INFO {zebrad="022808d" net="Main"}:add_initial_peers: zebra_network::config: resolved seed peer IP addresses seed="mainnet.seeder.zfnd.org:8233" remote_ip_count=25
Dec 03 07:04:11.145 INFO {zebrad="022808d" net="Main"}:add_initial_peers: zebra_network::config: resolved seed peer IP addresses seed="mainnet.is.yolo.money:8233" remote_ip_count=50
^CDec 03 07:04:12.083 INFO {zebrad="022808d" net="Main"}:sig{kind=SignalKind(2) name="SIGINT"}: zebrad::signal: received SIGINT, starting shutdown
pthread destroy mutex: Device or resource busy
Aborted (core dumped)
Environment
Zebra Version
$ zebrad version
zebrad 1.0.0-beta.2
Operating System
$ uname -a
Linux ... 5.4.122 #1-NixOS SMP Wed May 26 10:05:21 UTC 2021 x86_64 GNU/Linux