Avoid persisting ChannelManager in some cases and separate event from persist notifies #2521

Merged: 11 commits into lightningdevkit:main from the 2023-08-one-less-write branch on Sep 13, 2023

Conversation

Collaborator

@TheBlueMatt TheBlueMatt commented Aug 24, 2023

We currently use the updatable-future from ChannelManager and
ChainMonitor to inform the background-processor that it needs
to wake to process events and persist the ChannelManager. These,
however, are two separate concepts, and just because we have some
messages to send to peers or events to process doesn't necessarily
mean that the ChannelManager needs to be persisted.

This is particularly bad during startup, where we persist the ChannelManager twice for each peer connection: once on peer_connected and once on channel_reestablish. There are also excess persists in more general cases, such as update_add_htlc.
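
For intuition, here's a minimal sketch of the intended split (not the code in this PR; names like `ManagerState` and `get_and_clear_needs_persist` are simplified stand-ins): a wake-up always triggers event/message handling, but the ChannelManager is only re-written when its persisted state actually changed.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Simplified stand-in for the state the background processor consults.
struct ManagerState {
    /// Set only when the persisted contents of the ChannelManager change.
    needs_persist: AtomicBool,
}

impl ManagerState {
    /// Returns true (and clears the flag) if a persist is actually required.
    fn get_and_clear_needs_persist(&self) -> bool {
        self.needs_persist.swap(false, Ordering::AcqRel)
    }
}

/// One iteration of a simplified background loop: waking always means
/// "handle events and messages", but persisting is gated on the flag.
fn background_iteration(state: &ManagerState, process_events: impl Fn(), persist: impl Fn()) {
    process_events();
    if state.get_and_clear_needs_persist() {
        persist();
    }
}
```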

Fixes #2502

@TheBlueMatt TheBlueMatt added this to the 0.0.117 milestone Aug 24, 2023
@TheBlueMatt
Collaborator Author

Tagging 117 since we have users suffering substantially from overly-aggressive ChannelManager persistence.

@TheBlueMatt TheBlueMatt linked an issue Aug 24, 2023 that may be closed by this pull request
@codecov-commenter

codecov-commenter commented Aug 28, 2023

Codecov Report

Patch coverage: 93.54% and no project coverage change.

Comparison is base (448b191) 90.45% compared to head (6d99f0a) 90.45%.
Report is 15 commits behind head on main.

❗ Current head 6d99f0a differs from pull request most recent head 32e5903. Consider uploading reports for the commit 32e5903 to get more accurate results


Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2521   +/-   ##
=======================================
  Coverage   90.45%   90.45%           
=======================================
  Files         112      112           
  Lines       58564    58653   +89     
=======================================
+ Hits        52976    53057   +81     
- Misses       5588     5596    +8     
Files Changed Coverage Δ
lightning/src/ln/channelmanager.rs 86.83% <93.28%> (+0.11%) ⬆️
lightning-background-processor/src/lib.rs 81.41% <100.00%> (ø)

... and 6 files with indirect coverage changes


@TheBlueMatt TheBlueMatt force-pushed the 2023-08-one-less-write branch from 1e6a675 to 317c735 on August 28, 2023 01:29
@TheBlueMatt
Collaborator Author

Pushed some additional test coverage via the fuzzer (and rebased).

@TheBlueMatt TheBlueMatt force-pushed the 2023-08-one-less-write branch from 317c735 to 8e00e06 on August 28, 2023 01:35
@dunxen dunxen self-requested a review August 29, 2023 19:43
@TheBlueMatt
Collaborator Author

Rebased.

@TheBlueMatt TheBlueMatt force-pushed the 2023-08-one-less-write branch from 8e00e06 to cd4e89c on September 1, 2023 20:16
@jkczyz jkczyz requested review from jkczyz and wpaulino September 7, 2023 17:15
@jkczyz
Contributor

jkczyz commented Sep 7, 2023

Instead of adding a flag, would using two Notifiers suffice?

@TheBlueMatt
Collaborator Author

TheBlueMatt commented Sep 8, 2023

Maybe. If we move to separate notifiers that's really saying we'll want to be able to capture the case of waking the background processor to persist without waking it to check for events. I don't think we care about that, and even the separation of waking and persisting here sucks - I wouldn't suggest it if we didn't have a good way to test it (yay fuzzers!) and didn't really need it.

That said, if we do want to avoid including the flag in the notifier, we could move to storing the "needs persistence" flag separately in the ChannelManager itself, exposing the (presumably) atomic bool directly. We'd still tie the notifier to it in the sense that we'd guarantee it doesn't flip on unless the notifier also fires, but it would be exposed separately. I'd be okay with this approach too, though disconnecting the two in the API kinda sucks IMO.
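
Roughly, the coupling described here could look like the following sketch (simplified types, not the actual LDK `Notifier`; the method names are made up for illustration):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Very simplified notifier: a real one wakes a future or condvar, here we
/// only record that a wake-up is pending.
#[derive(Default)]
struct Notifier {
    pending_wake: AtomicBool,
}

impl Notifier {
    fn notify(&self) {
        self.pending_wake.store(true, Ordering::Release);
    }
}

#[derive(Default)]
struct ManagerLike {
    event_persist_notifier: Notifier,
    needs_persist: AtomicBool,
}

impl ManagerLike {
    /// Persisted state changed: set the flag *and* notify, so the flag can
    /// never flip on without the notifier also firing.
    fn notify_and_request_persist(&self) {
        self.needs_persist.store(true, Ordering::Release);
        self.event_persist_notifier.notify();
    }

    /// Only events or messages need handling: wake without requesting a persist.
    fn notify_events_only(&self) {
        self.event_persist_notifier.notify();
    }
}
```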

@jkczyz
Contributor

jkczyz commented Sep 8, 2023

Maybe. If we move to separate notifiers that's really saying we'll want to be able to capture the case of waking the background processor to persist without waking it to check for events. I don't think we care about that, and even the separation of waking and persisting here sucks - I wouldn't suggest it if we didn't have a good way to test it (yay fuzzers!) and didn't really need it.

I wasn't thinking about using separate notifiers for persisting and checking for events. Rather simply using one notifier for NotifyOption::DoPersist and another notifier for NotifyOption::SkipPersistHandleEvents.

I suppose currently there's some interaction required between the two with logically OR-ing the flag. Even if we stick with one notifier, it seems having that logic in the notifier is a layer violation, IIUC -- which I may not lol. If we took @wpaulino's approach and parameterized the type (using bool and ()), how would the OR-ing be implemented? Or would simply setting the flag instead of |= be sufficient?

@TheBlueMatt
Collaborator Author

I wasn't thinking about using separate notifiers for persisting and checking for events. Rather simply using one notifier for NotifyOption::DoPersist and another notifier for NotifyOption::SkipPersistHandleEvents.

Right, that's one notifier for handle-events + persist and a separate one for events. Not quite sure I understand your suggestion here.

I suppose currently there's some interaction required between the two with logically OR-ing the flag. Even if we stick with one notifier, it seems having that logic in the notifier is a layer violation

Kinda? I mean yea, the notifier knowing that it's about persisting is def a layer violation, but having an "arbitrary boolean" less so, kinda.

If we took @wpaulino's approach and parameterized the type (using bool and ()), how would the OR-ing be implemented. Or would simply setting the flag instead of |= be sufficient?

I don't think we'd need OR-ing if we had two notifiers - we could just notify the wake-background-processor-for-events and wake-background-processor-for-persist notifiers separately and go from there. (I do think we'd always process events in the BP, just only persist based on one notifier...but then we have to handle "which notifier notified me and make sure that we check if both got woken so we persist then" crap).
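
To make the "which notifier woke me" bookkeeping concrete, here's a rough sketch of what a two-notifier loop would need to track on each wake-up (hypothetical names, not a proposed LDK API):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

struct TwoNotifierState {
    events_pending: AtomicBool,
    persist_pending: AtomicBool,
}

fn two_notifier_iteration(
    state: &TwoNotifierState,
    process_events: impl Fn(),
    persist: impl Fn(),
) {
    // We just woke because *either* notifier fired, so both flags must be
    // checked; otherwise a persist notification racing with an events
    // notification could be missed until some later wake-up.
    let had_events = state.events_pending.swap(false, Ordering::AcqRel);
    let needs_persist = state.persist_pending.swap(false, Ordering::AcqRel);

    if had_events || needs_persist {
        // Re-processing events on any wake-up is cheap and always safe.
        process_events();
    }
    if needs_persist {
        persist();
    }
}
```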

@jkczyz
Contributor

jkczyz commented Sep 8, 2023

Right, that's one notifier for handle-events + persist and a separate one for events. Not quite sure I understand your suggestion here.

Just trying to tease out whether there's a clean way of avoiding the flag in the notifier, but sounds like the OR-ing may not be a concern as you mentioned below. Presumably we wouldn't need the flag at all?

I don't think we'd need OR-ing if we had two notifiers - we could just notify the wake-background-processor-for-events and wake-background-processor-for-persist notifiers separately and go from there. (I do think we'd always process events in the BP, just only persist based on one notifier...but then we have to handle "which notifier notified me and make sure that we check if both got woken so we persist then" crap).

I'm not absolutely wedded to it, but it does seem like this logic belongs in the background processor. Is it much more difficult to set a "needs persist" bool when awaiting on the Selector? Also, why don't we need to do such a check currently when ChainMonitor is the one notifying (i.e., why don't we also need to check if the ChannelManager was notified and wants to persist)?

@wpaulino Any thoughts on this? I think you've been touching / reviewing this code more so than me recently.

@wpaulino
Contributor

wpaulino commented Sep 8, 2023

Also, why don't we need to do such a check currently when ChainMonitor is the one notifying (i.e., why don't we also need to check if the ChannelManager was notified and wants to persist)?

The ChainMonitor mostly only wakes because there was an async monitor update persisted so the ChannelManager has some events to process. Once it processes those events, we'll wake again (via the ChannelManager this time) to persist.

Any thoughts on this?

Yeah I'm also not a fan of the new notifier flag, even if from the notifier's PoV it has nothing to do with persistence. What happens once we need to expose some other value? If we can't make the generic value work (bound on core::ops::BitOr?), then I'd at least like to see a patch for the two notifiers approach so we can properly evaluate.

@TheBlueMatt
Collaborator Author

Just trying to tease out whether there's a clean way of avoiding the flag in the notifier, but sounds like the OR-ing may not be a concern as you mentioned below. Presumably we wouldn't need the flag at all?

I don't think there is, outside of removing the implicit "notify means persist" and instead having a notify from the ChannelManager mean "events or persist, maybe both", with a separate get'er on the ChannelManager that lets you ask if persistence is needed right now.

I'm not absolutely wedded to it, but it does seem like this logic belongs in the background processor. Is it much more difficult to set a "needs persist" bool when awaiting on the Selector?

I think that's what this is? Or are you saying the separate get'er described above?

Also, why don't we need to do such a check currently when ChainMonitor is the one notifying (i.e., why don't we also need to check if the ChannelManager was notified and wants to persist)?

If we get woken by the ChainMonitor, when we next go around the loop we'll immediately get woken by the ChannelManager and persist then.

@TheBlueMatt
Collaborator Author

Yeah I'm also not a fan of the new notifier flag, even if from the notifier's PoV it has nothing to do with persistence. What happens once we need to expose some other value? If we can't make the generic value work (bound on core::ops::BitOr?), then

We could absolutely do a BitOr bound, but that seems strange - like let's say you put in an int, and you notify once with 4 and once with 5, and now you get 4 | 5? But you never put in 4 | 5...

I'd at least like to see a patch for the two notifiers approach so we can properly evaluate.

I'm really not a fan of two separate notifiers - it's annoying to make that work without always going around the BP twice on every notify. Instead, if we want to remove the notifier bool, I think we should move to a notify implying "events and/or persist" with a separate get'er on the ChannelManager to fetch an atomic flag and see if we need persist. I think that'd be a relatively straightforward patch, just awkward that we have two separate get'ers that are tied at the hip but separate in the ChannelManager API.

@wpaulino
Contributor

wpaulino commented Sep 8, 2023

We could absolutely do a BitOr bound, but that seems strange - like let's say you put in an int, and you notify once with 4 and once with 5, and now you get 4 | 5? But you never put in 4 | 5...

Ah, right...

Instead, if we want to remove the notifier bool, I think we should move to a notify implying "events and/or persist" with a separate get'er on the ChannelManager to fetch an atomic flag and see if we need persist. I think that'd be a relatively straightforward patch, just awkward that we have two separate get'ers that are tied at the hip but separate in the ChannelManager API.

Right, this sounds better to me as well. I agree it's a bit awkward, but you could say the same about the ChainMonitor future returning an unused flag -- they're both things you only have to deal with if you're implementing your own background processor. I'll leave it up to @jkczyz; I'm happy moving forward with this suggestion, but can also deal with the current patch.

@jkczyz
Contributor

jkczyz commented Sep 8, 2023

Right, this sounds better to me as well. I agree it's a bit awkward, but you could say the same about the ChainMonitor future returning an unused flag -- they're both things you only have to deal with if you're implementing your own background processor. I'll leave it up to @jkczyz; I'm happy moving forward with this suggestion, but can also deal with the current patch.

Yeah, I agree. Let's go with that suggestion.

@TheBlueMatt
Collaborator Author

Done

@TheBlueMatt TheBlueMatt force-pushed the 2023-08-one-less-write branch 3 times, most recently from 94b420c to f079049 on September 8, 2023 22:18
In the next commit, we separate `ChannelManager`'s concept of
waking a listener to both be persisted and to allow the user to
handle events. Here we rename the future-fetching method in
anticipation of this split.
@TheBlueMatt TheBlueMatt force-pushed the 2023-08-one-less-write branch from f079049 to c1546ff on September 11, 2023 00:08
@TheBlueMatt TheBlueMatt force-pushed the 2023-08-one-less-write branch from ecbfbd6 to 02d9367 on September 11, 2023 16:45
Contributor

@jkczyz jkczyz left a comment


Feel free to squash, IMO.

@TheBlueMatt
Collaborator Author

Squashed the previous fixups but added three new ones.

@TheBlueMatt TheBlueMatt force-pushed the 2023-08-one-less-write branch from 02d9367 to cd93690 on September 11, 2023 18:24
@TheBlueMatt TheBlueMatt force-pushed the 2023-08-one-less-write branch from cd93690 to 7393812 on September 11, 2023 21:45
Contributor

@wpaulino wpaulino left a comment


LGTM

@TheBlueMatt TheBlueMatt force-pushed the 2023-08-one-less-write branch 2 times, most recently from 6d99f0a to 79b4b58 on September 12, 2023 15:29
@jkczyz
Contributor

jkczyz commented Sep 12, 2023

LGTM, feel free to squash

Prior to any actions which may generate a `ChannelMonitorUpdate`,
and in general after startup,
`ChannelManager::process_background_events` must be called. This is
mostly accomplished by doing so on taking the
`total_consistency_lock` via the `PersistenceNotifierGuard`. In
order to skip this call in block connection logic, the
`PersistenceNotifierGuard::optionally_notify` constructor did not
call the `process_background_events` method.

However, this is very easy to misuse - `optionally_notify` does not
convey to the reader that they need to call
`process_background_events` at all.

Here we fix this by adding a separate
`optionally_notify_skipping_background_events` method, making the
requirements much clearer to callers.
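
As a heavily simplified sketch of the naming distinction (the real guard also manages the `total_consistency_lock` and the notification on drop; these constructors are illustrative only):

```rust
struct PersistenceNotifierGuardSketch;

impl PersistenceNotifierGuardSketch {
    /// The default constructor runs `process_background_events` for the
    /// caller, which is what nearly every call site wants.
    fn optionally_notify(process_background_events: impl FnOnce()) -> Self {
        process_background_events();
        PersistenceNotifierGuardSketch
    }

    /// The long, explicit name makes it clear at the call site that the
    /// caller (e.g. block-connection logic) has taken on the responsibility
    /// of calling `process_background_events` itself.
    fn optionally_notify_skipping_background_events() -> Self {
        PersistenceNotifierGuardSketch
    }
}
```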
Currently, when a ChannelManager generates a notification for the
background processor, any pending events are handled and the
ChannelManager is always re-persisted.

Many channel related messages don't actually change the channel
state in a way that changes the persisted channel. For example,
an `update_add_htlc` or `update_fail_htlc` message simply adds the
change to a queue, changing the channel state when we receive a
`commitment_signed` message.

In these cases we shouldn't be re-persisting the ChannelManager as
it hasn't changed (persisted) state at all. In anticipation of
doing so in the next few commits, here we make the public API
handle the two concepts (somewhat) separately. The notification
still goes out via a single waker, however whether or not to
persist is now handled via a separate atomic bool.
As we now signal events-available separately from persistence-needed, the `NotifyOption` enum should include a separate variant for events-but-no-persistence, which we add here.
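
For illustration, the resulting enum has three cases along these lines (`DoPersist` and `SkipPersistHandleEvents` are the names used in the review discussion above; the name of the no-events variant here is an assumption):

```rust
/// What a call site needs from the background processor once it releases
/// the lock: a persist, event handling only, or nothing at all.
enum NotifyOption {
    /// Persisted state changed: wake the background processor and persist.
    DoPersist,
    /// Messages or events were generated, but nothing that changes the
    /// persisted state: wake without persisting.
    SkipPersistHandleEvents,
    /// Nothing observable happened: don't wake the background processor.
    SkipPersistNoEvents,
}
```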
Many channel related messages don't actually change the channel
state in a way that changes the persisted channel. For example,
an `update_add_htlc` or `update_fail_htlc` message simply adds the
change to a queue, changing the channel state when we receive a
`commitment_signed` message.

In these cases there's really no reason to wake the background
processor at all - there's no response message and there's no state
update. However, note that if we close the channel we should
persist the `ChannelManager`. If we send an error message without
closing the channel, we should wake the background processor
without persisting.

Here we move to the appropriate `NotifyOption` on some of the
simpler channel message handlers.
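
A sketch of how a simple handler might pick a variant under that rule (hypothetical outcome type and helper; the `NotifyOption` definition is repeated from the sketch above so the snippet stands alone):

```rust
enum NotifyOption { DoPersist, SkipPersistHandleEvents, SkipPersistNoEvents }

/// Hypothetical summary of what handling an HTLC update message did.
enum HandleOutcome {
    /// The update was queued; channel state only changes on commitment_signed.
    QueuedPendingUpdate,
    /// We must send an error message but the channel stays open.
    SendErrorKeepChannel,
    /// The message forced a channel closure, changing persisted state.
    ChannelClosed,
}

fn notify_option_for(outcome: HandleOutcome) -> NotifyOption {
    match outcome {
        // No response message and no state change: don't wake at all.
        HandleOutcome::QueuedPendingUpdate => NotifyOption::SkipPersistNoEvents,
        // An error message has to go out, so wake the background processor,
        // but the persisted ChannelManager contents are unchanged.
        HandleOutcome::SendErrorKeepChannel => NotifyOption::SkipPersistHandleEvents,
        // A force-close updates persisted state, so persist.
        HandleOutcome::ChannelClosed => NotifyOption::DoPersist,
    }
}
```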
When a peer connects and we send some `channel_reestablish`
messages or create a `per_peer_state` entry there's really no
reason to need to persist the `ChannelManager`. None of the
possible actions we take immediately result in a change to the
persisted contents of a `ChannelManager`, only the peer's later
`channel_reestablish` message does.
If we receive a `ChannelUpdate` message which was invalid, it can
cause us to force-close the channel, which should result in a
`ChannelManager` persistence, though it's not critical to do so.
When we handle an inbound `channel_reestablish` from our peers it
generally doesn't change any state and thus doesn't need a
`ChannelManager` persistence. Here we avoid said persistence where
possible.
When reloading nodes A or C, the chanmon_consistency fuzzer
currently calls `get_and_clear_pending_msg_events` on the node,
potentially causing additional `ChannelMonitor` or `ChannelManager`
updates, just to check that no unexpected messages are generated.

There's not much reason to do so - the fuzzer could always swap in a
different command to call the same method, and the additional
checking requires some weird monitor persistence introspection.

Here we simplify the fuzzer by simply removing this logic.
@TheBlueMatt TheBlueMatt force-pushed the 2023-08-one-less-write branch from 79b4b58 to e6f564a on September 12, 2023 19:06
@TheBlueMatt
Collaborator Author

Squashed without further changes.

wpaulino previously approved these changes Sep 12, 2023
jkczyz previously approved these changes Sep 12, 2023
In the `chanmon_consistency` fuzz, we currently "persist" the
`ChannelManager` on each loop iteration. With the new logic in the
past few commits to reduce the frequency of `ChannelManager`
persistences, this behavior now leaves a gap in our test coverage -
missing persistence notifications.

In order to catch (common-case) persistence misses, we update the
`chanmon_consistency` fuzzer to no longer persist the
`ChannelManager` unless the waker was woken and signaled to
persist, possibly reloading with a previous `ChannelManager` if we
were not signaled.
@TheBlueMatt TheBlueMatt dismissed stale reviews from jkczyz and wpaulino via 32e5903 September 12, 2023 21:28
@TheBlueMatt TheBlueMatt force-pushed the 2023-08-one-less-write branch from e6f564a to 32e5903 on September 12, 2023 21:28
@TheBlueMatt
Collaborator Author

Grrr, fuzz was still broken, force-pushed the last commit:

$ git diff-tree -U1 e6f564a15 32e5903ef
diff --git a/fuzz/src/chanmon_consistency.rs b/fuzz/src/chanmon_consistency.rs
index 77192bf46..8afc2e151 100644
--- a/fuzz/src/chanmon_consistency.rs
+++ b/fuzz/src/chanmon_consistency.rs
@@ -1298,3 +1298,3 @@ pub fn do_test<Out: Output>(data: &[u8], underlying_out: Out) {

-		if nodes[0].get_and_clear_needs_persistence() == Ok(true) {
+		if nodes[0].get_and_clear_needs_persistence() == true {
 			node_a_ser.0.clear();
@@ -1302,3 +1302,3 @@ pub fn do_test<Out: Output>(data: &[u8], underlying_out: Out) {
 		}
-		if nodes[1].get_and_clear_needs_persistence() == Ok(true) {
+		if nodes[1].get_and_clear_needs_persistence() == true {
 			node_b_ser.0.clear();
@@ -1306,3 +1306,3 @@ pub fn do_test<Out: Output>(data: &[u8], underlying_out: Out) {
 		}
-		if nodes[2].get_and_clear_needs_persistence() == Ok(true) {
+		if nodes[2].get_and_clear_needs_persistence() == true {
 			node_c_ser.0.clear();
$ 

@TheBlueMatt TheBlueMatt merged commit 286d1db into lightningdevkit:main Sep 13, 2023