-
Notifications
You must be signed in to change notification settings - Fork 2.2k
multi: prevent duplicates for locally dispatched HTLC attempts #10049
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
multi: prevent duplicates for locally dispatched HTLC attempts #10049
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summary of Changes
Hello @calvinrzachman, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request enhances the reliability of HTLC payments by preventing duplicate attempts, particularly in scenarios involving remote clients and the switchrpc
sub-server. It introduces an AttemptStore
interface and modifies the SendHTLC
method to ensure that only one HTLC is forwarded for a given attempt ID. The changes include persisting the intent to dispatch an HTLC before sending it to the network and introducing a PendingNetworkResult
message type.
Highlights
- Duplicate Payment Prevention: This PR introduces a mechanism to prevent duplicate payment attempts when using the
switchrpc
gRPC sub-server andSendOnion/TrackOnion
RPCs. It addresses potential fund losses due to unintentional retries. - Attempt Store Interface: An
AttemptStore
interface is defined to manage HTLC payment attempt results, supporting both local and remote lifecycle controllers. - Durable Persistence of Intent: The Switch now durably persists the intent to dispatch an HTLC before sending it, adding a checkpoint to prevent duplicates. This involves writing to disk prior to sending the HTLC.
- SendHTLC() Modification: The
SendHTLC()
method is updated to be safe for multiple calls with the same attempt ID. The Switch'snetworkResultStore
is modified to enable this. InitAttempt
Method: A newInitAttempt(attemptID)
method is added to theAttemptStore
interface. This method updates the attempt result store to durably mark the existence of an attempt before sending it to the network.- Pending Network Result: Introduces a
PendingNetworkResult
message type to mark an attempt as initiated but not yet settled/failed.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command>
or @gemini-code-assist <command>
. Below is a summary of the supported commands.
Feature | Command | Description |
---|---|---|
Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/
folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
An excellent pull request that introduces a robust mechanism to prevent duplicate HTLC dispatches. The approach of using a placeholder in the networkResultStore
to mark an attempt as "in-flight" is a clever way to achieve this without significant changes to the on-disk format.
My review has identified one critical issue in the implementation of the duplicate check. Addressing this point will help ensure the stability and correctness of this important feature.
htlcswitch/payment_result.go
Outdated
existingResult, err := store.GetResult(attemptID) | ||
if err != nil && !errors.Is(err, ErrPaymentIDNotFound) { | ||
// If the error is anything other than "not found", return it. | ||
return err | ||
} | ||
|
||
if existingResult != nil { | ||
// If the result is already in-progress, return an error | ||
// indicating that the attempt already exists. | ||
log.Warnf("Already initialized attempt for ID=%v", attemptID) | ||
|
||
return ErrPaymentIDAlreadyExists | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a potential race condition/bug here that could lead to overwriting an existing payment attempt.
The GetResult
function is now designed to return ErrPaymentIDNotFound
for attempts that are in the PendingNetworkResult
state. This means that if InitAttempt
is called for an attemptID
that has already been initialized but not yet settled/failed, GetResult
will return ErrPaymentIDNotFound
. The current logic will then proceed to overwrite the existing pending attempt, which defeats the purpose of this change.
To fix this, InitAttempt
should check for the existence of any result for the given attemptID
, including the pending placeholder. This can be done by using fetchResult
within a kvdb.View
transaction directly, similar to how the old getResult
worked.
var existingResult *networkResult
err := kvdb.View(store.backend, func(tx kvdb.RTx) error {
var err error
existingResult, err = fetchResult(tx, attemptID)
return err
}, func() {
existingResult = nil
})
if err != nil && !errors.Is(err, ErrPaymentIDNotFound) {
// If the error is anything other than "not found", return it.
return err
}
if existingResult != nil {
// If a result already exists (even a pending one), return an
// error indicating that the attempt already exists.
log.Warnf("Already initialized attempt for ID=%v", attemptID)
return ErrPaymentIDAlreadyExists
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quite correct that the initial implementation of InitAttempt
did not fully prevent duplicate use of an attempt ID in all scenarios. The GetResult
call had deliberately been made to preserve historical semantics for read callers like GetAttemptResult
and HasAttemptResult
which only expect to find information in the store when a full result (settle/fail) is back from the network. This could lead to a scenario in which InitAttempt
would allow a second attempt for the same ID if the result was not back yet.
Notably, we do appear to fall back on the protection offered by the CircuitMap
at that point for HTLCs which are on a commitment transaction, but that is not quite what I intended admittedly. We'd like the duplicate prevention to be fully guaranteed during the life-cycle of the attempt.
I added a unit test at both the the attempt store and Switch level which verify that InitAttempt
and SendHTLC
behave as expected with respect to duplicate attempt ID usage.
Thanks for pointing this out 🙏
149221d
to
d46a710
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for exploring this @calvinrzachman! The direction looks good.
I think something is still missing, answering the question: who deletes results and when (including how long would we like duplicate protection to last)? At the moment it is not safe to restart the client or server if we want to allow simultaneous use of SendOnion
(external) and SendPaymentV2
(internal), because there are two routers that call CleanStore
at restart. So for example, a restart of the server would delete all the in-flight attempts initiated by another client. A minimally working version of this would require a flag to turn off the server's router to prevent its calls to CleanStore
, right?
To solve this we probably need to extend the switch rpc to also implement NextID
Line 1163 in fb68f36
NextPaymentID: sequencer.NextID, |
NextID
such that the client's control tower has unique attempt ids).
In order to solve the multi client CleanStore
problem, we would need to extend the switchrpc.SendOnion/TrackOnion
call with a client id (you mentioned this idea elsewhere). This way we can set the first byte of the attempt id used as a key for the server's network results store as the client id, to avoid initial database changes (we could later extend the key and really prefix the client id to have a larger space and to enable more clients) and leave the attempt id used in the switch untouched (no prefixing). That way we can have CleanStore(clientID, keepIDs)
to only clean up the results concerned with the client. Those are just some initial thoughts and they probably require a second thorough analysis, but hopefully gives some more directions.
Preperatory refactor to allow for future alteration of the store backing the Switch.
@bitromortac yes, we can certainly expand our switchrpc server to support some method of attempt information cleanup. Two potential behaviors for deletion I have considered thus far are:
There are some differences to the two approaches, but for now we have an implementation of a CleanStore RPC which follows the approach lnd has taken historically for deletion from the Switch's attempt store here: calvinrzachman#19 You're quite right that it is not safe to have multiple HTLC dispatching entities independently calling CleanStore without coordination. The CleanStore approach to deletion requires ID space separation. It cannot work with multiple clients sharing the same flat attempt ID space. There needs to be an identity-aware mapping of attempt IDs - the Switch must know:
Though I don't have this entirely flushed out, I think it might be possible for multiple routers to share an ID space if:
Yes, exactly! We can disable lnd's typical cleanup behavior which occurs on Router startup via |
Add a new message for pending htlc attempt results in the Switch's network result store. This serves as a place holder to use during initialization of an attempt within the store which will be replaced with either a SETTLE or FAIL message once the HTLC attempt result is received from the network. NOTE: This message is not sent or received externally to channel peers. It was introduced only to avoid having to change the on-disk structure of the network result store.
This can be used to initialize the result store for a given attempt ID prior to sending the HTLC out to the network.
We're seeing what benefit to upstream clients is provided if the underlying SendHTLC implementatation (and by extension the SendOnion RPC) will not forward the same attempt ID twice without the result for a given ID having been cleaned from the result store. We accomplish the duplicate prevention using the InitAttempt method of the Switch Store.
d46a710
to
cd958d3
Compare
Change Description
As mentioned in #9489, we intend to offload path-finding, onion construction and payment life-cycle management to an external entity (with a remotely instantiated ChannelRouter type) capable of orchestrating payments across a set of remote lnd instances. The lnd instances would be relegated to accept onion payments for direct delivery to the network. Ideally lnd's existing types could be made general enough to work in both this and more traditional deployment scenarios.
We'd like to prevent duplicate payment attempts and unintentional loss of funds when using the forthcoming switchrpc gRPC sub-server and SendOnion/TrackOnion RPCs to dispatch and track payments #9489.
One idea is to make SendHTLC duplicate safe for given attempt ID so that callers of this function may retry until they receive an explicit and unambiguous acknowledgement from the Switch regarding the status of the request to dispatch an HTLC attempt. This appears un-interesting in the default case where ChannelRouter and Switch run together in a single lnd binary, but becomes more important for correctness when the Router and Switch run in separate processes with synchronous communication via RPC - as there are different failure domains which must be dealt with.
This PR explores a modification to the Switch's
networkResultStore
to enable this while hopefully minimizing the necessary ChannelRouter/Switch changes. To accomplish this, we durably persist record of intent to dispatch an HTLC prior to sending that HTLC for forwarding to the broader network.The SendHTLC() method is updated such that it is safe to call multiple times with the same attempt ID. With this change, we'd now have two writes to disk involved in the outgoing HTLC dispatch flow from the perspective of the Switch. The "routing node" responsibility of forwarding of HTLCs is not impacted.
s.attemptStore.InitAttempt(attemptID)
: proposed addition. updates attempt result store to checkpoint/durably mark in persistent store some information about the existence of an attempt prior to sending that attempt out to the network.s.circuits.CommitCircuits(circuit)
: exists today. tracks flow of HTLC through Switch. Does not prevent duplicates for attempts which have already received a result from the network (settle/fail).NOTE: The Switch's
CircuitMap
offers some protection against duplicate attempts with the same ID, but the life-cycle of this protection is not tied to the Router's management of payment life-cycle (eg:DeleteCircuits
cleansCircuitMap
potentially prior to results being read from the Switch's attempt result store) and appears insufficient for our use case. Once a result arrives and the circuit is torn down, that attempt ID becomes reusable and thus capable of having it's result overwritten within the Switch's store - which can be tricky for remote clients which may go offline for an arbitrary length of time and need to rely on the ability to safely retry an HTLC dispatch request.It appears as though it is possible to achieve our aim without modifying the on-disk structure of the
networkResult
type used by this store. One idea is to overload the use of thelnwire.Msg
field to allow it to encode a 3rd state besidesSettle/Fail
- namelyInitiated/Pending/In-Flight
. This seems to be the minimal required ability to be able to offer theInitAttempt
duplicate protection which lasts the entire life-cycle of the attempt, so that the ChannelRouter and its use of the PaymentAttemptDispatcher interface can be general enough to work both in the case where Router and Switch run in the same process (as is the case for the large majority of lnd deployments) and in the scenario where they run in different processes.TODO
switchrpc
SendOnion with a retry loop. The implementation of SendOnion/switch.SendHTLC executed server side would prevent any duplicates for the same ID and allow the client to safely retry until it receives a definitive response from the server. The ChannelRouter and its use of the PaymentAttemptDispatcher interface can be general enough to work both in the case where Router and Switch run in the same process (as is the case for the large majority of lnd deployments) and in the scenario where they run in different processes.[skip ci]
in the commit message for small changes.Steps to Test
htlcswitch
package locally and everything passes:make unit-debug log="stdlog trace" pkg=htlcswitch timeout=5m