Skip to content

Conversation

calvinrzachman
Copy link
Contributor

@calvinrzachman calvinrzachman commented Jul 8, 2025

Change Description

As mentioned in #9489, we intend to offload path-finding, onion construction and payment life-cycle management to an external entity (with a remotely instantiated ChannelRouter type) capable of orchestrating payments across a set of remote lnd instances. The lnd instances would be relegated to accept onion payments for direct delivery to the network. Ideally lnd's existing types could be made general enough to work in both this and more traditional deployment scenarios.

We'd like to prevent duplicate payment attempts and unintentional loss of funds when using the forthcoming switchrpc gRPC sub-server and SendOnion/TrackOnion RPCs to dispatch and track payments #9489.

One idea is to make SendHTLC duplicate safe for given attempt ID so that callers of this function may retry until they receive an explicit and unambiguous acknowledgement from the Switch regarding the status of the request to dispatch an HTLC attempt. This appears un-interesting in the default case where ChannelRouter and Switch run together in a single lnd binary, but becomes more important for correctness when the Router and Switch run in separate processes with synchronous communication via RPC - as there are different failure domains which must be dealt with.

This PR explores a modification to the Switch's networkResultStore to enable this while hopefully minimizing the necessary ChannelRouter/Switch changes. To accomplish this, we durably persist record of intent to dispatch an HTLC prior to sending that HTLC for forwarding to the broader network.

The SendHTLC() method is updated such that it is safe to call multiple times with the same attempt ID. With this change, we'd now have two writes to disk involved in the outgoing HTLC dispatch flow from the perspective of the Switch. The "routing node" responsibility of forwarding of HTLCs is not impacted.

  1. s.attemptStore.InitAttempt(attemptID): proposed addition. updates attempt result store to checkpoint/durably mark in persistent store some information about the existence of an attempt prior to sending that attempt out to the network.
  2. s.circuits.CommitCircuits(circuit): exists today. tracks flow of HTLC through Switch. Does not prevent duplicates for attempts which have already received a result from the network (settle/fail).

NOTE: The Switch's CircuitMap offers some protection against duplicate attempts with the same ID, but the life-cycle of this protection is not tied to the Router's management of payment life-cycle (eg: DeleteCircuits cleans CircuitMap potentially prior to results being read from the Switch's attempt result store) and appears insufficient for our use case. Once a result arrives and the circuit is torn down, that attempt ID becomes reusable and thus capable of having it's result overwritten within the Switch's store - which can be tricky for remote clients which may go offline for an arbitrary length of time and need to rely on the ability to safely retry an HTLC dispatch request.

It appears as though it is possible to achieve our aim without modifying the on-disk structure of the networkResult type used by this store. One idea is to overload the use of the lnwire.Msg field to allow it to encode a 3rd state besides Settle/Fail - namely Initiated/Pending/In-Flight. This seems to be the minimal required ability to be able to offer the InitAttempt duplicate protection which lasts the entire life-cycle of the attempt, so that the ChannelRouter and its use of the PaymentAttemptDispatcher interface can be general enough to work both in the case where Router and Switch run in the same process (as is the case for the large majority of lnd deployments) and in the scenario where they run in different processes.

TODO

  • Modify the ChannelRouter to call SendHTLC defensively when resuming payments after a restart.
    • Local/Monolithic Deployment: the usual case where everything runs in a single binary. With the duplicate safety, this call harmlessly errors and we proceed to tracking the result like normal.
    • Remote Router Deployment: instantiate a type with an alternate RPC backed implementation of SendHTLC which would attempt to deliver the onion via switchrpc SendOnion with a retry loop. The implementation of SendOnion/switch.SendHTLC executed server side would prevent any duplicates for the same ID and allow the client to safely retry until it receives a definitive response from the server. The ChannelRouter and its use of the PaymentAttemptDispatcher interface can be general enough to work both in the case where Router and Switch run in the same process (as is the case for the large majority of lnd deployments) and in the scenario where they run in different processes.
// Possibly via an alternate implementation of SendHTLC, but this pattern applies more generally.
// func (p *rpcPaymentDispatcher) SendHTLC(firstHop lnwire.ShortChannelID,
// attemptID uint64, htlcAdd *lnwire.UpdateAddHTLC) error {

       for {
               resp, err := SendOnion(A)
               if err == nil || err == ErrDuplicateAdd {
                       break // HTLC in-flight or just accepted; proceed to track
               }
               continue
       }

       // ... Sometime later possibly from another go routine (eg: Router via
       // GetAttemptResult)
       TrackOnion(A)

Steps to Test

  • Ran the unit tests for htlcswitch package locally and everything passes: make unit-debug log="stdlog trace" pkg=htlcswitch timeout=5m
  • go test -v -timeout 30s -run ^TestNetworkResultStore$ github.com/lightningnetwork/lnd/htlcswitch
  • More tests needed? ...

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary of Changes

Hello @calvinrzachman, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the reliability of HTLC payments by preventing duplicate attempts, particularly in scenarios involving remote clients and the switchrpc sub-server. It introduces an AttemptStore interface and modifies the SendHTLC method to ensure that only one HTLC is forwarded for a given attempt ID. The changes include persisting the intent to dispatch an HTLC before sending it to the network and introducing a PendingNetworkResult message type.

Highlights

  • Duplicate Payment Prevention: This PR introduces a mechanism to prevent duplicate payment attempts when using the switchrpc gRPC sub-server and SendOnion/TrackOnion RPCs. It addresses potential fund losses due to unintentional retries.
  • Attempt Store Interface: An AttemptStore interface is defined to manage HTLC payment attempt results, supporting both local and remote lifecycle controllers.
  • Durable Persistence of Intent: The Switch now durably persists the intent to dispatch an HTLC before sending it, adding a checkpoint to prevent duplicates. This involves writing to disk prior to sending the HTLC.
  • SendHTLC() Modification: The SendHTLC() method is updated to be safe for multiple calls with the same attempt ID. The Switch's networkResultStore is modified to enable this.
  • InitAttempt Method: A new InitAttempt(attemptID) method is added to the AttemptStore interface. This method updates the attempt result store to durably mark the existence of an attempt before sending it to the network.
  • Pending Network Result: Introduces a PendingNetworkResult message type to mark an attempt as initiated but not yet settled/failed.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

An excellent pull request that introduces a robust mechanism to prevent duplicate HTLC dispatches. The approach of using a placeholder in the networkResultStore to mark an attempt as "in-flight" is a clever way to achieve this without significant changes to the on-disk format.

My review has identified one critical issue in the implementation of the duplicate check. Addressing this point will help ensure the stability and correctness of this important feature.

Comment on lines 122 to 146
existingResult, err := store.GetResult(attemptID)
if err != nil && !errors.Is(err, ErrPaymentIDNotFound) {
// If the error is anything other than "not found", return it.
return err
}

if existingResult != nil {
// If the result is already in-progress, return an error
// indicating that the attempt already exists.
log.Warnf("Already initialized attempt for ID=%v", attemptID)

return ErrPaymentIDAlreadyExists
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

There's a potential race condition/bug here that could lead to overwriting an existing payment attempt.

The GetResult function is now designed to return ErrPaymentIDNotFound for attempts that are in the PendingNetworkResult state. This means that if InitAttempt is called for an attemptID that has already been initialized but not yet settled/failed, GetResult will return ErrPaymentIDNotFound. The current logic will then proceed to overwrite the existing pending attempt, which defeats the purpose of this change.

To fix this, InitAttempt should check for the existence of any result for the given attemptID, including the pending placeholder. This can be done by using fetchResult within a kvdb.View transaction directly, similar to how the old getResult worked.

	var existingResult *networkResult
	err := kvdb.View(store.backend, func(tx kvdb.RTx) error {
		var err error
		existingResult, err = fetchResult(tx, attemptID)
		return err
	}, func() {
		existingResult = nil
	})

	if err != nil && !errors.Is(err, ErrPaymentIDNotFound) {
		// If the error is anything other than "not found", return it.
		return err
	}

	if existingResult != nil {
		// If a result already exists (even a pending one), return an
		// error indicating that the attempt already exists.
		log.Warnf("Already initialized attempt for ID=%v", attemptID)

		return ErrPaymentIDAlreadyExists
	}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quite correct that the initial implementation of InitAttempt did not fully prevent duplicate use of an attempt ID in all scenarios. The GetResult call had deliberately been made to preserve historical semantics for read callers like GetAttemptResult and HasAttemptResult which only expect to find information in the store when a full result (settle/fail) is back from the network. This could lead to a scenario in which InitAttempt would allow a second attempt for the same ID if the result was not back yet.

Notably, we do appear to fall back on the protection offered by the CircuitMap at that point for HTLCs which are on a commitment transaction, but that is not quite what I intended admittedly. We'd like the duplicate prevention to be fully guaranteed during the life-cycle of the attempt.

I added a unit test at both the the attempt store and Switch level which verify that InitAttempt and SendHTLC behave as expected with respect to duplicate attempt ID usage.

Thanks for pointing this out 🙏

@saubyk saubyk added this to lnd v0.20 Jul 8, 2025
@saubyk saubyk added this to the v0.20.0 milestone Jul 8, 2025
@saubyk saubyk moved this to In progress in lnd v0.20 Jul 8, 2025
@calvinrzachman calvinrzachman force-pushed the switch-attempt-store branch from 149221d to d46a710 Compare July 8, 2025 21:39
Copy link
Collaborator

@bitromortac bitromortac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for exploring this @calvinrzachman! The direction looks good.

I think something is still missing, answering the question: who deletes results and when (including how long would we like duplicate protection to last)? At the moment it is not safe to restart the client or server if we want to allow simultaneous use of SendOnion (external) and SendPaymentV2 (internal), because there are two routers that call CleanStore at restart. So for example, a restart of the server would delete all the in-flight attempts initiated by another client. A minimally working version of this would require a flag to turn off the server's router to prevent its calls to CleanStore, right?

To solve this we probably need to extend the switch rpc to also implement NextID

lnd/server.go

Line 1163 in fb68f36

NextPaymentID: sequencer.NextID,
such that the client (payment service or internal router) can request a new attempt id. This shifts some burden of internal bookkeeping to the client because it won't receive unique attempt ids across servers (so it needs to perhaps prefix the returned attempt id with a server id in its internal implementation of NextID such that the client's control tower has unique attempt ids).

In order to solve the multi client CleanStore problem, we would need to extend the switchrpc.SendOnion/TrackOnion call with a client id (you mentioned this idea elsewhere). This way we can set the first byte of the attempt id used as a key for the server's network results store as the client id, to avoid initial database changes (we could later extend the key and really prefix the client id to have a larger space and to enable more clients) and leave the attempt id used in the switch untouched (no prefixing). That way we can have CleanStore(clientID, keepIDs) to only clean up the results concerned with the client. Those are just some initial thoughts and they probably require a second thorough analysis, but hopefully gives some more directions.

Preperatory refactor to allow for future alteration
of the store backing the Switch.
@calvinrzachman
Copy link
Contributor Author

calvinrzachman commented Aug 26, 2025

@bitromortac yes, we can certainly expand our switchrpc server to support some method of attempt information cleanup.

Two potential behaviors for deletion I have considered thus far are:

  1. CleanStore(keepSet []ID): deletes everything except that which it's told to keep.
  2. DeleteAttempts(deleteSet []ID): deletes only what is explicitly specified to delete.

There are some differences to the two approaches, but for now we have an implementation of a CleanStore RPC which follows the approach lnd has taken historically for deletion from the Switch's attempt store here: calvinrzachman#19

You're quite right that it is not safe to have multiple HTLC dispatching entities independently calling CleanStore without coordination. The CleanStore approach to deletion requires ID space separation. It cannot work with multiple clients sharing the same flat attempt ID space. There needs to be an identity-aware mapping of attempt IDs - the Switch must know:

  • Which client (or namespace) "owns" each attempt ID.
  • Therefore, it can safely say: “delete all of my IDs except the ones I still care about.”

Though I don't have this entirely flushed out, I think it might be possible for multiple routers to share an ID space if:

  1. The Switch RPC server centrally hands out attempt IDs for all router clients via a NextAttemptID endpoint.
  2. The routers use a DeleteAttempts based approach to deletion.

A minimally working version of this would require a flag to turn off the server's router to prevent its calls to CleanStore, right?

Yes, exactly! We can disable lnd's typical cleanup behavior which occurs on Router startup via routing/htlcswitch configuration or by entirely disabling the routerrpc and the on-board ChannelRouter all together via something like this: #10110 or this #10178.

Add a new message for pending htlc attempt results
in the Switch's network result store. This serves as
a place holder to use during initialization of an
attempt within the store which will be replaced with
either a SETTLE or FAIL message once the HTLC attempt
result is received from the network.

NOTE: This message is not sent or received externally
to channel peers. It was introduced only to avoid having
to change the on-disk structure of the network result
store.
This can be used to initialize the result store for a
given attempt ID prior to sending the HTLC out to the
network.
We're seeing what benefit to upstream clients is
provided if the underlying SendHTLC implementatation
(and by extension the SendOnion RPC) will not forward
the same attempt ID twice without the result for a
given ID having been cleaned from the result store.

We accomplish the duplicate prevention using the
InitAttempt method of the Switch Store.
@saubyk saubyk modified the milestones: v0.20.0, v0.21.0 Sep 24, 2025
@saubyk saubyk removed this from lnd v0.20 Sep 24, 2025
@saubyk saubyk added this to v0.21 Sep 24, 2025
@saubyk saubyk moved this to In progress in v0.21 Sep 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

3 participants