-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
multi: set zero bandwidth hint for channels that do not have any available slots #5355
Conversation
1a7e5fd
to
4a1d6a8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think the first commit should be dropped like you said, o/w this patch looks good!
WDYT about keeping the itest? Was planning on updating to just add the itest in the last commit. |
Yeah I think the itest is good as it exercises previously uncovered behavior |
d290346
to
0013711
Compare
sgtm, squashed all itest changes to last commit where we add the fix. |
0013711
to
15fb00e
Compare
@Roasbeef is this something we could consider for 0.13? When this condition hits it really hammers the db with payment attempts, with 60 second timeout locally I saw 200-300 unnecessary attempts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🏭 , just nits
15fb00e
to
abd2497
Compare
abd2497
to
6aa3706
Compare
@@ -145,6 +145,10 @@ type ChannelLink interface { | |||
// will use this function in forwarding decisions accordingly. | |||
EligibleToForward() bool | |||
|
|||
// MayAddOutgoingHtlc returns an error if we may not add an outgoing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this if we already have CheckHtlcTransit
which calls canSendHTLC
that eventually calls Bandwidth
to ensure that the channel has enough idle bandwidth to carry the HTLC? In this case, one could say that if a channel has no free HTLC slots, then it effectively has zero bandwidth.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this a bit more, I think as is this addresses the case of backing off when we're the sender, but not the case where an outgoing link is fully saturated (in terms of idle HTLCs), and a new forward come across destined to that link. With the current control flow, we'll go all the way through and call AddHTLC
on the link with that eventually failing and returning back a temp chan failure. However if we add this check to CheckHtlcForward
, then we skip sending a message (with channel overhead, etc) to the link and can fail things back earlier at the forwarding site within the switch itself.
If we like this alternative direction, then "mock validation" via the new validateAddHtlc
can be used within the link implementation (we don't need to extend the main interface) to apply this extra step of validation before the multi-hop related checks (fee and timelocks, etc).
One tradeoff with this alternative direction is that we'll allow an HTLC to proceed to the switch (it'll fail at the call to getLocalLink
) vs it failing earlier at the router level (it'll get a nil policy as the bandwidth will show up as being zero). FWIW if it gets all the way here, then we'll emit a link failure event for the subscription RPC which might be useful for alerts to keep an eye on persistently full HTLC slots.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of moving this check into CheckHtlcForward
to get the LinkFailure
event, but it doesn't solve the issue of pathfinding becoming aware of the link failure and excluding the link as a possible route (which is what's making us spin out at present).
This endless storm of htlcs happens in the following case:
- Payment tries to launchShard on a channel that is saturated
- Our sendPaymentAttempt fails to SendHtlc because the link is full
- We fail the htlc attempt on disk and cancel the shard
- We loop back around in our payment logic and try to launch another shard using the same channel because pathfinding has no awareness that this channel has just failed us
The queryBandwidth
check felt like the best way to communicate this unavailability to RequestRoute
? The alternative would be to update handleSendError's error processing to specifically understand this type of switch error and report it to mission control differently, since right now we're just logging it. This is more involved (previously discussed and decided against it iirc), but it does seem like the less hacky way to go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also would just like to point out that Bandwidth()
is only concerned with bandwidth, and doesn't take into account space on the commitment. So the commitment could be full, but not all of its bandwidth consumed, which would be misleading if we relied on canSendHtlc
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This endless storm of htlcs happens in the following case:
Ah yeh this makes more sense, thanks for the breakdown.
The alternative would be to update handleSendError's error processing to specifically understand this type of switch error and report it to mission control differently, since right now we're just logging i
Don't think this is the correct route either as we'd end up "penalizing" our local channels using our existing observation system when we we're actually able to have an up-to-date view of the state since it's a local channel.
The thing that made be to a double take was the fact that we're extending a fundamental interface in the switch with a method that was effectively only put in place to solve a one-off issue. However looking at this addition from another angle, it forces callers that handle a link to be aware of that fact that we actually have two constraints w.r.t being able to forward HTLCs: the set of available bandwidth, and the current flow control constraints.
With that said, how about renaming the newly added method from MayAddOutgoingHtlc
to something like func PaymentSlotAvailable(amt) bool
(just a suggestion)? This reflects that the method is a best-effort attempt at claiming an available payment slots (which encapsulates things like the future dust limits, max htlc, max in flight, etc). It's best-effort as no strong guarantees are made as it's possible that a forwarded HTLC instead claims the slots, and the caller would need to be aware of that. With this interface change, we'd also start to thread through the amount as then we ensure we're able to factor in all the other flow control constraints. Actually without this, I think other infinite loops exist where adding an HTLC violates another constraint (aside from max HTLCs).
Doesn't look like we have the amount available here when we go to query for bandwidth hints, but with a small tweak we could have the bandwidthHintsMap
map to a function closure (or interface) to allow it to accept an amount (and possibly later other parmaters). Looking at this section of the codebase with fresh eyes, it doesn't look like we attempt to adhere to the link level constraints enforced by the remote party w.r.t what type of HTLCs they'll accept (like the link level min HTLC param). Taking this final step to make the bandwidth hints (from the PoV of the router) aware (via direction) of these link level constraints should address this gap and any other lingering ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing that made be to a double take was the fact that we're extending a fundamental interface in the switch with a method that was effectively only put in place to solve a one-off issue.
yeah def, also didn't love this approach
Doesn't look like we have the amount available here when we go to query for bandwidth hints, but with a small tweak we could have the bandwidthHintsMap map to a function closure (or interface) to allow it to accept an amount (and possibly later other parmaters).
Took a look at this approach today and it looks like a good direction to go in - there's a working branch here, although I haven't fully tested it.
One thing I ran into is that there are two cases where we use our hints:
- Local policy: here we want to include htlc amount
- Outgoing balance: iiuc, we don't actually want to check flow constraints here? Since it's just a total balance check
Ended up using *lnwire.Millisatoshi
to infer whether we want to check flow as well, could possibly be a different method in the bandwidthHints
interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Outgoing balance: iiuc, we don't actually want to check flow constraints here? Since it's just a total balance check
Yeah in that case, I'd expect that the map look up is !ok
(we don't define a function for that edge) so we instead fall through to using the existing balance.
closeChannelAndAssert(ctxt, t, net, net.Alice, chanPoint, false) | ||
} | ||
|
||
type holdSubscription struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Thought about this a bit more, and I think it's important we patch this issue in the short-term as it's affecting users in the wild. As I outlined above, we can generalize things a bit and also address other possible vectors for a tight HTLC loop like violating constraints like the max amount allowed in flight. So I'm thinking we move forward with what we have here, package it in 0.13.1, and shoot for the more fundamental fix so we can land it in 0.14. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🥜
This PR updates our bandwidth hints to report a zero balance when we're not allowed to add any more htlcs to the remote party's commitment. This prevents pathfinding from using the channel at all; previously it would endlessly try to send shard on the channel.
The first commit just demonstrates the endless cycle of htlcs by logging (couldn't think of a better way), and should be squashed with the second commit to include a test that asserts that we don't use full channels. Open to other testing options besides an itest, but it's difficult to do because this is a combination of multiple elements - pathfinding + payment lifecycle + link - so mocking this didn't feel like it would meaningfully add much (would be testing the mock rather than the code itself).
Fixes #5297
Fixes #4656