-
Notifications
You must be signed in to change notification settings - Fork 267
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alternate strategies for outdated commitments #1838
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think these two commits should be done in the same PR, I'm not sure the second one makes a lot of sense. In the long term it makes sense to change how we deal with remote errors and not always broadcast our local-commit, but that's probably better addressed by working on lightning/bolts#834 and reviewing in which case we don't have a choice and really need to force-close.
As for the first commit, what do you do when the node stops and you do have your most up-to-date backup? If you restart it will stop again. So you'd need to change your config from "halt-if-just-restarted" to "always-request-remote-close" before restarting, otherwise a single faulty channel would DoS the whole node...
eclair-core/src/main/scala/fr/acinq/eclair/channel/Channel.scala
Outdated
Show resolved
Hide resolved
Agreed, I got a bit carried away and kept experimenting.
IIUC lightning/bolts#834 deals with protocol errors, but here I'm trying to address the case where we have an internal error. Disk full, OOM exception, that sort of things.
Exactly, it requires human intervention, that's why this mode is for more advanced users. But it's less susceptible to blind mass force-close, which is what this PR aims to address. I feel like any other solution in-between would be a trade-off between convenience and security, that's how I ended with this rather basic mechanism. |
So the flow if the node stops because of an outdated commitment would be:
Is that right?
I agree, the other solutions I can think of are just proxies for almost the same behavior, so it's better to choose the simplest one. |
Agree!
It was a left-over of my previous attempt, I flip-flopped around this and thought keeping an explicit strategy name would be easier to understand ... but it is probably not. |
d6f88ef
to
b9308a4
Compare
On 2nd thought, I kept both changes in the same PR, I think it makes sense since those are related topics.
I could add a third strategy, to not force-close even if our peer sends an
Done with 93d07c4. |
eclair-core/src/main/scala/fr/acinq/eclair/channel/Channel.scala
Outdated
Show resolved
Hide resolved
6f5b2e1
to
c00c557
Compare
Codecov Report
@@ Coverage Diff @@
## master #1838 +/- ##
==========================================
- Coverage 87.61% 87.51% -0.11%
==========================================
Files 161 161
Lines 12575 12609 +34
Branches 552 532 -20
==========================================
+ Hits 11018 11035 +17
- Misses 1557 1574 +17
|
Discovered this while working on #1838. In the following scenario, at reconnection: - `localCommit.index = 7` - `nextRemoteRevocationNumber = 6` So when `localCommit.index == nextRemoteRevocationNumber + 1` we must retransmit the revocation. ``` local remote | | | (no pending sig) | commit = 6 | | next rev = 6 |<----- sig 7 ------| commit = 7 | | |-- rev 6 --> ? | | | | (disconnection) | | | ```
So, while testing this PR I realized that in an "outdated commitment" scenario where we are on the up-to-date side, we always react by force-closing the channel immediately, not giving our peer a chance to fix their data and restart. On top of that, we consider this a Addressing this turned out to be rabbit-holey: our sync code is quite complicated and is a bit redundant because we separate between:
Also, discovered a missing corner case when syncing in Obviously this code is very sensitive, because a bug there could trigger a mass force close, which is what we are trying to avoid with this PR after all. If we choose to update our sync code, we could first deploy a version that only compares legacy/new sync results for a few weeks on our prod nodes to get a level of confidence. |
* use a map for feature->channelType resolution Instead of explicitly listing all the combination of features, and risk inconsistency, we may has well build the reverse map using the channel type objects. * better and less spammy logs We can switch the "funding tx already spent" router log from _warn_ to _debug_ because as soon as there are more than 10 of them, the peer's announcements will be ignored and there will be a warning message about that. * timedOutHtlcs -> trimmedOrTimedOutHtlcs Add a precision on trimmed htlcs, which can be failed as soon as the commitment tx has been confirmed. * proper logging of outgoing messages It is also logical to make `Outgoing` a command of `Peer`. It should have been done this way from the start if `Peer` had been a typed actor. * fixed mixed up comments Discovered this while working on #1838. In the following scenario, at reconnection: - `localCommit.index = 7` - `nextRemoteRevocationNumber = 6` So when `localCommit.index == nextRemoteRevocationNumber + 1` we must retransmit the revocation. ``` local remote | | | (no pending sig) | commit = 6 | | next rev = 6 |<----- sig 7 ------| commit = 7 | | |-- rev 6 --> ? | | | | (disconnection) | | | ```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The end result feels much cleaner than previously, it's easier to simply read through the functions in Helpers.Syncing
to check that all cases are properly handled.
The behavior looks good to me, I just added many comments to help clarify the code.
// ... and ask bob to publish its current commitment | ||
|
||
// bob sees that alice is late and stays in stand-by | ||
alice2bob.forward(bob) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the test makes more sense if we change the order of some operations and explicitly name each channel_reestablish
:
val reestablishA = alice2bob.expectMsgType[ChannelReestablish]
val reestablishB = bob2alice.expectMsgType[ChannelReestablish]
// bob sees that alice is late and stays in stand-by
alice2bob.forward(bob, reestablishA)
bob2alice.expectNoMessage(100 millis)
// alice realizes she has an old state when receiving Bob's reestablish
bob2alice.forward(alice, reestablishB)
// alice asks bob to publish its current commitment
val error = alice2bob.expectMsgType[Error]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
eclair-core/src/main/scala/fr/acinq/eclair/channel/Helpers.scala
Outdated
Show resolved
Hide resolved
eclair-core/src/main/scala/fr/acinq/eclair/channel/Helpers.scala
Outdated
Show resolved
Hide resolved
eclair-core/src/main/scala/fr/acinq/eclair/channel/Helpers.scala
Outdated
Show resolved
Hide resolved
eclair-core/src/main/scala/fr/acinq/eclair/channel/Channel.scala
Outdated
Show resolved
Hide resolved
eclair-core/src/main/scala/fr/acinq/eclair/channel/Channel.scala
Outdated
Show resolved
Hide resolved
eclair-core/src/main/scala/fr/acinq/eclair/channel/Channel.scala
Outdated
Show resolved
Hide resolved
eclair-core/src/main/scala/fr/acinq/eclair/channel/Helpers.scala
Outdated
Show resolved
Hide resolved
d8b2cbe
to
875c74b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't forget to also add an entry in the release notes for this, it's really newsworthy ;)
@@ -875,6 +875,22 @@ object Commitments { | |||
(commitTx, htlcTxs) | |||
} | |||
|
|||
/** | |||
* When reconnecting, we drop all outgoing unsigned changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* When reconnecting, we drop all outgoing unsigned changes. | |
* When reconnecting, we drop all unsigned changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done with b4f5548.
case res: SyncResult.LocalLateProven => | ||
val msg = s"counterparty proved that we have an outdated (revoked) local commitment!!! ourLocalCommitmentNumber=${res.ourLocalCommitmentNumber} theirRemoteCommitmentNumber=${res.theirRemoteCommitmentNumber}" | ||
log.error(msg) | ||
context.system.eventStream.publish(NotifyNodeOperator(NotificationsLogger.Error, msg)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For these 4 node operator logs, you must use NotificationsLogger.logFatalError
or create a different direct function call in NotificationsLogger
.
Otherwise if the node is stopped because of the strategy, the message will likely not be logged because of the pub/sub delay.
Also, the message should explicitly include the nodeId
and channelId
since the NotificationsLogger
doesn't have access to the channel's MDC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done with 9696436. I'm not even sure we should include the info about nodeId
/channelId
, if the goal is to give a general guidance to the operator on how to proceed.
Done with 9957a83. I could put the full doc, but I don't think we should inflate the release note too much? On the other hand it would save us from using a relative link, which will break at some point. |
8fda63e
to
10fcea0
Compare
Rebased on top of #2036, the diff is now much smaller. |
This PRs adds strategies to handle outdated commitments. Default strategies may be the best choice for smaller loosely administered nodes, while alternative strategies may avoid unnecessary mass force-close (but are reserved for advanced users who closely monitor the node). Strategies for outdated commitments: - request the counterparty to close the channel (default). - if the node was restarted less than 10 min ago, log an error message and stop the node Default settings maintain the same behavior as before.
10fcea0
to
c585b9c
Compare
Rebased on top of #2037. This PR is not immediately useful due to how other implementations react when they detect that the peer is using an outdated commitment: they immediately force-close. Actually it would be dangerous to not use the default strategy in a non-static-remote-key, non-anchor-output commitment, because the peer will not resend their |
This PRs adds strategies to handle two typical disaster scenarios: outdated commitments and unhandled exceptions.
Default strategies may be the best choice for smaller loosely administered nodes, while alternative strategies may avoid unnecessary mass force-close (but are reserved for advanced users who closely monitor the node).
Strategies for outdated commitments:
Strategies for unhandled exceptions:
Default settings maintain the same behavior as before.