Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternate strategies for outdated commitments #1838

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

pm47
Copy link
Member

@pm47 pm47 commented Jun 7, 2021

This PRs adds strategies to handle two typical disaster scenarios: outdated commitments and unhandled exceptions.

Default strategies may be the best choice for smaller loosely administered nodes, while alternative strategies may avoid unnecessary mass force-close (but are reserved for advanced users who closely monitor the node).

Strategies for outdated commitments:

  • request the counterparty to close the channel (default).
  • if the node was restarted less than 10 min ago, log an error message and stop the node

Strategies for unhandled exceptions:

  • local force close of the channel (default)
  • log an error message and stop the node

Default settings maintain the same behavior as before.

Copy link
Member

@t-bast t-bast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think these two commits should be done in the same PR, I'm not sure the second one makes a lot of sense. In the long term it makes sense to change how we deal with remote errors and not always broadcast our local-commit, but that's probably better addressed by working on lightning/bolts#834 and reviewing in which case we don't have a choice and really need to force-close.

As for the first commit, what do you do when the node stops and you do have your most up-to-date backup? If you restart it will stop again. So you'd need to change your config from "halt-if-just-restarted" to "always-request-remote-close" before restarting, otherwise a single faulty channel would DoS the whole node...

@pm47
Copy link
Member Author

pm47 commented Jun 7, 2021

I don't think these two commits should be done in the same PR

Agreed, I got a bit carried away and kept experimenting.

I'm not sure the second one makes a lot of sense. In the long term it makes sense to change how we deal with remote errors and not always broadcast our local-commit, but that's probably better addressed by working on lightning/bolts#834 and reviewing in which case we don't have a choice and really need to force-close.

IIUC lightning/bolts#834 deals with protocol errors, but here I'm trying to address the case where we have an internal error. Disk full, OOM exception, that sort of things.

As for the first commit, what do you do when the node stops and you do have your most up-to-date backup? If you restart it will stop again. So you'd need to change your config from "halt-if-just-restarted" to "always-request-remote-close" before restarting, otherwise a single faulty channel would DoS the whole node...

Exactly, it requires human intervention, that's why this mode is for more advanced users. But it's less susceptible to blind mass force-close, which is what this PR aims to address. I feel like any other solution in-between would be a trade-off between convenience and security, that's how I ended with this rather basic mechanism.

@t-bast
Copy link
Member

t-bast commented Jun 7, 2021

Exactly, it requires human intervention, that's why this mode is for more advanced users.

So the flow if the node stops because of an outdated commitment would be:

  1. The node stops
  2. Investigate your backups
  3. If you had a DB issue, restart after fixing your DB
  4. Otherwise if you find nothing and think you really are late, change your config to always-request-remote-close and restart, and accept losing channels

Is that right?
I think it should be documented clearly, probably a new document in the docs folder.

I feel like any other solution in-between would be a trade-off between convenience and security, that's how I ended with this rather basic mechanism.

I agree, the other solutions I can think of are just proxies for almost the same behavior, so it's better to choose the simplest one.
I don't think it's an area where we'll want many different strategies though, so maybe the new configuration parameter should just be a boolean stop-node-on-outdated-commitment-after-restart = false or something similar?

@pm47
Copy link
Member Author

pm47 commented Jun 8, 2021

I think it should be documented clearly, probably a new document in the docs folder.

Agree!

maybe the new configuration parameter should just be a boolean stop-node-on-outdated-commitment-after-restart = false or something similar?

It was a left-over of my previous attempt, I flip-flopped around this and thought keeping an explicit strategy name would be easier to understand ... but it is probably not.

@pm47 pm47 force-pushed the optin-auto-recovery branch from d6f88ef to b9308a4 Compare September 9, 2021 14:40
@pm47 pm47 changed the title Add a strategy to deal with outdated commitments Add a strategy to deal with outdated commitments and unhandled exceptions Sep 9, 2021
@pm47 pm47 marked this pull request as ready for review September 9, 2021 14:40
@pm47 pm47 requested a review from t-bast September 9, 2021 14:47
@pm47
Copy link
Member Author

pm47 commented Sep 9, 2021

I don't think these two commits should be done in the same PR,

Agreed, I got a bit carried away and kept experimenting.

On 2nd thought, I kept both changes in the same PR, I think it makes sense since those are related topics.

I'm not sure the second one makes a lot of sense. In the long term it makes sense to change how we deal with remote errors and not always broadcast our local-commit, but that's probably better addressed by working on lightning/bolts#834 and reviewing in which case we don't have a choice and really need to force-close.

I could add a third strategy, to not force-close even if our peer sends an error, like lnd does (used to do?). But that would'nt be spec compliant, especially since we have warning messages now.

I think it should be documented clearly, probably a new document in the docs folder.

Done with 93d07c4.

@pm47 pm47 force-pushed the optin-auto-recovery branch from 6f5b2e1 to c00c557 Compare October 14, 2021 13:01
@codecov-commenter
Copy link

codecov-commenter commented Oct 14, 2021

Codecov Report

Merging #1838 (8fda63e) into master (765a0c5) will decrease coverage by 0.10%.
The diff coverage is 76.41%.

@@            Coverage Diff             @@
##           master    #1838      +/-   ##
==========================================
- Coverage   87.61%   87.51%   -0.11%     
==========================================
  Files         161      161              
  Lines       12575    12609      +34     
  Branches      552      532      -20     
==========================================
+ Hits        11018    11035      +17     
- Misses       1557     1574      +17     
Impacted Files Coverage Δ
...air-core/src/main/scala/fr/acinq/eclair/Logs.scala 81.35% <0.00%> (-1.41%) ⬇️
...c/main/scala/fr/acinq/eclair/channel/Channel.scala 87.56% <67.39%> (-1.14%) ⬇️
...re/src/main/scala/fr/acinq/eclair/NodeParams.scala 91.30% <71.42%> (-0.63%) ⬇️
...c/main/scala/fr/acinq/eclair/channel/Helpers.scala 94.78% <83.72%> (-0.79%) ⬇️
...in/scala/fr/acinq/eclair/channel/Commitments.scala 94.65% <100.00%> (+0.14%) ⬆️
...ir-core/src/main/scala/fr/acinq/eclair/Setup.scala 79.77% <0.00%> (-0.91%) ⬇️
...ore/src/main/scala/fr/acinq/eclair/Timestamp.scala 100.00% <0.00%> (ø)
...chain/bitcoind/rpc/BasicBitcoinJsonRPCClient.scala 100.00% <0.00%> (ø)
...main/scala/fr/acinq/eclair/io/PeerConnection.scala 84.58% <0.00%> (+0.41%) ⬆️
... and 1 more

@pm47 pm47 changed the title Add a strategy to deal with outdated commitments and unhandled exceptions Alternate strategies for outdated commitments and unhandled exceptions Oct 14, 2021
pm47 added a commit that referenced this pull request Oct 15, 2021
Discovered this while working on #1838.

In the following scenario, at reconnection:
- `localCommit.index = 7`
- `nextRemoteRevocationNumber = 6`

So when `localCommit.index == nextRemoteRevocationNumber + 1` we must retransmit the revocation.

```
          local              remote
            |                   |
            |  (no pending sig) |
 commit = 6 |                   | next rev = 6
            |<----- sig 7 ------|
 commit = 7 |                   |
            |-- rev 6 --> ?     |
            |                   |
            |  (disconnection)  |
            |                   |
```
@pm47
Copy link
Member Author

pm47 commented Oct 15, 2021

So, while testing this PR I realized that in an "outdated commitment" scenario where we are on the up-to-date side, we always react by force-closing the channel immediately, not giving our peer a chance to fix their data and restart. On top of that, we consider this a commitment sync error, instead of clearly logging that our counterparty is using outdated data.

Addressing this turned out to be rabbit-holey: our sync code is quite complicated and is a bit redundant because we separate between:

  • checking whether we are late
  • deciding what messages we need to retransmit

Also, discovered a missing corner case when syncing in SHUTDOWN state.

Obviously this code is very sensitive, because a bug there could trigger a mass force close, which is what we are trying to avoid with this PR after all. If we choose to update our sync code, we could first deploy a version that only compares legacy/new sync results for a few weeks on our prod nodes to get a level of confidence.

@pm47 pm47 requested a review from t-bast October 15, 2021 14:25
pm47 added a commit that referenced this pull request Oct 18, 2021
* use a map for feature->channelType resolution

Instead of explicitly listing all the combination of features, and risk
inconsistency, we may has well build the reverse map using the channel
type objects.

* better and less spammy logs

We can switch the "funding tx already spent" router log from _warn_ to 
_debug_ because as soon as there are more than 10 of them, the peer's
announcements will be ignored and there will be a warning message about
that.

* timedOutHtlcs -> trimmedOrTimedOutHtlcs

Add a precision on trimmed htlcs, which can be failed as soon as the
commitment tx has been confirmed.

* proper logging of outgoing messages

It is also logical to make `Outgoing` a command of `Peer`. It should
have been done this way from the start if `Peer` had been a typed actor.

* fixed mixed up comments

Discovered this while working on #1838.

In the following scenario, at reconnection:
- `localCommit.index = 7`
- `nextRemoteRevocationNumber = 6`

So when `localCommit.index == nextRemoteRevocationNumber + 1` we must retransmit the revocation.

```
          local              remote
            |                   |
            |  (no pending sig) |
 commit = 6 |                   | next rev = 6
            |<----- sig 7 ------|
 commit = 7 |                   |
            |-- rev 6 --> ?     |
            |                   |
            |  (disconnection)  |
            |                   |
```
Copy link
Member

@t-bast t-bast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The end result feels much cleaner than previously, it's easier to simply read through the functions in Helpers.Syncing to check that all cases are properly handled.

The behavior looks good to me, I just added many comments to help clarify the code.

docs/Advanced.md Outdated Show resolved Hide resolved
eclair-core/src/main/resources/reference.conf Show resolved Hide resolved
// ... and ask bob to publish its current commitment

// bob sees that alice is late and stays in stand-by
alice2bob.forward(bob)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the test makes more sense if we change the order of some operations and explicitly name each channel_reestablish:

    val reestablishA = alice2bob.expectMsgType[ChannelReestablish]
    val reestablishB = bob2alice.expectMsgType[ChannelReestablish]

    // bob sees that alice is late and stays in stand-by
    alice2bob.forward(bob, reestablishA)
    bob2alice.expectNoMessage(100 millis)

    // alice realizes she has an old state when receiving Bob's reestablish
    bob2alice.forward(alice, reestablishB)
    // alice asks bob to publish its current commitment
    val error = alice2bob.expectMsgType[Error]

Copy link
Member Author

@pm47 pm47 Oct 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with 12a9184 + 8fda63e.

@pm47 pm47 force-pushed the optin-auto-recovery branch from d8b2cbe to 875c74b Compare October 25, 2021 13:41
@pm47 pm47 requested a review from t-bast October 25, 2021 13:42
Copy link
Member

@t-bast t-bast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget to also add an entry in the release notes for this, it's really newsworthy ;)

@@ -875,6 +875,22 @@ object Commitments {
(commitTx, htlcTxs)
}

/**
* When reconnecting, we drop all outgoing unsigned changes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* When reconnecting, we drop all outgoing unsigned changes.
* When reconnecting, we drop all unsigned changes.

Copy link
Member Author

@pm47 pm47 Oct 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with b4f5548.

case res: SyncResult.LocalLateProven =>
val msg = s"counterparty proved that we have an outdated (revoked) local commitment!!! ourLocalCommitmentNumber=${res.ourLocalCommitmentNumber} theirRemoteCommitmentNumber=${res.theirRemoteCommitmentNumber}"
log.error(msg)
context.system.eventStream.publish(NotifyNodeOperator(NotificationsLogger.Error, msg))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For these 4 node operator logs, you must use NotificationsLogger.logFatalError or create a different direct function call in NotificationsLogger.

Otherwise if the node is stopped because of the strategy, the message will likely not be logged because of the pub/sub delay.

Also, the message should explicitly include the nodeId and channelId since the NotificationsLogger doesn't have access to the channel's MDC.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with 9696436. I'm not even sure we should include the info about nodeId/channelId, if the goal is to give a general guidance to the operator on how to proceed.

@pm47
Copy link
Member Author

pm47 commented Oct 26, 2021

Don't forget to also add an entry in the release notes for this, it's really newsworthy ;)

Done with 9957a83. I could put the full doc, but I don't think we should inflate the release note too much? On the other hand it would save us from using a relative link, which will break at some point.

@pm47 pm47 requested a review from t-bast October 26, 2021 13:10
@pm47 pm47 mentioned this pull request Oct 27, 2021
@pm47 pm47 force-pushed the optin-auto-recovery branch from 8fda63e to 10fcea0 Compare October 27, 2021 09:51
@pm47
Copy link
Member Author

pm47 commented Oct 27, 2021

Rebased on top of #2036, the diff is now much smaller.

This PRs adds strategies to handle outdated commitments.

Default strategies may be the best choice for smaller loosely administered nodes, while alternative strategies may avoid unnecessary mass force-close (but are reserved for advanced users who closely monitor the node).

Strategies for outdated commitments:
- request the counterparty to close the channel (default).
- if the node was restarted less than 10 min ago, log an error message and stop the node

Default settings maintain the same behavior as before.
@pm47 pm47 force-pushed the optin-auto-recovery branch from 10fcea0 to c585b9c Compare October 27, 2021 15:56
@pm47 pm47 changed the title Alternate strategies for outdated commitments and unhandled exceptions Alternate strategies for outdated commitments Oct 27, 2021
@pm47
Copy link
Member Author

pm47 commented Oct 27, 2021

Rebased on top of #2037.

This PR is not immediately useful due to how other implementations react when they detect that the peer is using an outdated commitment: they immediately force-close. Actually it would be dangerous to not use the default strategy in a non-static-remote-key, non-anchor-output commitment, because the peer will not resend their channel_reestablish which tells us to to spend our main balance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants