Skip to content

Commit

Permalink
alternate strategy for outdated commitments
Browse files Browse the repository at this point in the history
This PRs adds strategies to handle outdated commitments.

Default strategies may be the best choice for smaller loosely administered nodes, while alternative strategies may avoid unnecessary mass force-close (but are reserved for advanced users who closely monitor the node).

Strategies for outdated commitments:
- request the counterparty to close the channel (default).
- if the node was restarted less than 10 min ago, log an error message and stop the node

Default settings maintain the same behavior as before.
  • Loading branch information
pm47 committed Oct 27, 2021
1 parent 2c0c24e commit c585b9c
Show file tree
Hide file tree
Showing 8 changed files with 161 additions and 6 deletions.
41 changes: 41 additions & 0 deletions docs/Advanced.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
# Advanced usage

## Avoid mass force-close of channels

In order to minimize force-closes of channels (especially for larger nodes), it is possible to customize the way eclair handles certain situations, like outdated commitment and internal errors.

:warning: There is no magic: non-default strategies are a trade-off where it is assumed that the node is closely monitored. Instead of automatically reacting to some events, eclair will stop and await manual intervention. It is therefore reserved for advanced or professional node operators. Default strategies are best suited for smaller loosely administered nodes.

### Outdated commitments

The default behavior, when our peer tells us (or proves to us) that our channel commitment is outdated, is to request a remote force-close of the channel (a.k.a. recovery).

It may happen that due to a misconfiguration, the node was accidentally restarted using e.g. an old backup, and the data wasn't really lost. In that case, simply fixing the configuration and restarting eclair would prevent a mass force-close of channels.

This is why an alternative behavior is to simply log an error and stop the node. However, because our peer may be lying when it tells us that our channel commitment data is outdated, there is a 10 min window after restart when this strategy applies. After that, the node reverts to the default strategy.

During the 10 min window, the operator should closely monitor the node and assess, if the peer stops, whether this is really a case of using outdated data, or a peer is just lying. If it turns out that the data is really outdated due to a misconfiguration, the operator has an opportunity to fix it and restart the node. If the data is really outdated because it was simply lost, then the operator should change the strategy to the default and restart the node: this will cause the force close of outdated channels, but there is no way to avoid that.

Here is a decision tree:
```
if (node stops after restart)
if (false positive)
configure eclair to use default strategy and restart node (will force close channels to malicious peers)
else
if (more up-to-date data available)
configure eclair to point to proper database and restart node
else
configure eclair to use default strategy and restart node (will force close all outdated channels)
```

The alternate strategy can be configured by setting `eclair.outdated-commitment-strategy=stop` (see [`reference.conf`](https://github.com/ACINQ/eclair/blob/master/eclair-core/src/main/resources/reference.conf)).

### Unhandled exceptions

The default behavior, when we encounter an unhandled exception or internal error, is to locally force-close the channel.

Not only is there a delay before the channel balance gets refunded, but if the exception was due to some misconfiguration or bug in eclair that affects all channels, we risk force-closing all channels.

This is why an alternative behavior is to simply log an error and stop the node. Note that if you don't closely monitor your node, there is a risk that your peers take advantage of the downtime to try and cheat by publishing a revoked commitment. Additionally, while there is no known way of triggering an internal error in eclair from the outside, there may very well be a bug that allows just that, which could be used as a way to remotely stop the node (with the default behavior, it would "only" cause a local force-close of the channel).

The alternate strategy can be configured by setting `eclair.unhandled-exception-strategy=stop` (see [`reference.conf`](https://github.com/ACINQ/eclair/blob/master/eclair-core/src/main/resources/reference.conf)).
8 changes: 8 additions & 0 deletions docs/release-notes/eclair-vnext.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,14 @@

## Major changes

### Advanced strategies to avoid mass force-close of channels

In order to minimize force-closes of channels (especially for larger nodes), it is possible to customize the way eclair handles certain situations, like outdated commitment and internal errors.

:warning: There is no magic: non-default strategies are a trade-off where it is assumed that the node is closely monitored. Instead of automatically reacting to some events, eclair will stop and await manual intervention. It is therefore reserved for advanced or professional node operators. Default strategies are best suited for smaller loosely administered nodes.

This feature is documented [here](../Advanced.md).

### Separate log for important notifications

Eclair added a new log file (`notifications.log`) for important notifications that require an action from the node operator.
Expand Down
4 changes: 4 additions & 0 deletions eclair-core/src/main/resources/reference.conf
Original file line number Diff line number Diff line change
Expand Up @@ -95,6 +95,10 @@ eclair {
max-block-processing-delay = 30 seconds // we add a random delay before processing blocks, capped at this value, to prevent herd effect
max-tx-publish-retry-delay = 60 seconds // we add a random delay before retrying failed transaction publication

// see docs/Advanced.md for more information on the strategies
outdated-commitment-strategy = "remote-close" // remote-close or stop (NB: the app will only stop if it was recently restarted)
unhandled-exception-strategy = "local-close" // local-close or stop

relay {
fees {
// Fees for public channels
Expand Down
2 changes: 2 additions & 0 deletions eclair-core/src/main/scala/fr/acinq/eclair/Logs.scala
Original file line number Diff line number Diff line change
Expand Up @@ -167,6 +167,8 @@ object NotificationsLogger {
*/
def logFatalError(message: String, t: Throwable): Unit = log.error(message, t)

def logFatalError(message: String): Unit = log.error(message)

def apply(): Behavior[NotifyNodeOperator] =
Behaviors.setup { context =>
context.system.eventStream ! EventStream.Subscribe(context.self)
Expand Down
18 changes: 18 additions & 0 deletions eclair-core/src/main/scala/fr/acinq/eclair/NodeParams.scala
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,7 @@ import fr.acinq.bitcoin.{Block, ByteVector32, Crypto, Satoshi}
import fr.acinq.eclair.Setup.Seeds
import fr.acinq.eclair.blockchain.fee._
import fr.acinq.eclair.channel.Channel
import fr.acinq.eclair.channel.Channel.{OutdatedCommitmentStrategy, UnhandledExceptionStrategy}
import fr.acinq.eclair.crypto.Noise.KeyPair
import fr.acinq.eclair.crypto.keymanager.{ChannelKeyManager, NodeKeyManager}
import fr.acinq.eclair.db._
Expand Down Expand Up @@ -76,6 +77,8 @@ case class NodeParams(nodeKeyManager: NodeKeyManager,
relayParams: RelayParams,
reserveToFundingRatio: Double,
maxReserveToFundingRatio: Double,
outdatedCommitmentStrategy: OutdatedCommitmentStrategy,
unhandledExceptionStrategy: UnhandledExceptionStrategy,
db: Databases,
revocationTimeout: FiniteDuration,
autoReconnect: Boolean,
Expand All @@ -95,6 +98,9 @@ case class NodeParams(nodeKeyManager: NodeKeyManager,
enableTrampolinePayment: Boolean,
balanceCheckInterval: FiniteDuration,
blockchainWatchdogSources: Seq[String]) {

val startTime: TimestampSecond = TimestampSecond.now()

val privateKey: Crypto.PrivateKey = nodeKeyManager.nodeKey.privateKey

val nodeId: PublicKey = nodeKeyManager.nodeId
Expand Down Expand Up @@ -357,6 +363,16 @@ object NodeParams extends Logging {
PathFindingExperimentConf(experiments.toMap)
}

val outdatedCommitmentStrategy = config.getString("outdated-commitment-strategy") match {
case "remote-close" => OutdatedCommitmentStrategy.RemoteClose
case "stop" => OutdatedCommitmentStrategy.Stop
}

val unhandledExceptionStrategy = config.getString("unhandled-exception-strategy") match {
case "local-close" => UnhandledExceptionStrategy.LocalClose
case "stop" => UnhandledExceptionStrategy.Stop
}

val routerSyncEncodingType = config.getString("router.sync.encoding-type") match {
case "uncompressed" => EncodingType.UNCOMPRESSED
case "zlib" => EncodingType.COMPRESSED_ZLIB
Expand Down Expand Up @@ -423,6 +439,8 @@ object NodeParams extends Logging {
),
reserveToFundingRatio = config.getDouble("reserve-to-funding-ratio"),
maxReserveToFundingRatio = config.getDouble("max-reserve-to-funding-ratio"),
outdatedCommitmentStrategy = outdatedCommitmentStrategy,
unhandledExceptionStrategy = unhandledExceptionStrategy,
db = database,
revocationTimeout = FiniteDuration(config.getDuration("revocation-timeout").getSeconds, TimeUnit.SECONDS),
autoReconnect = config.getBoolean("auto-reconnect"),
Expand Down
87 changes: 82 additions & 5 deletions eclair-core/src/main/scala/fr/acinq/eclair/channel/Channel.scala
Original file line number Diff line number Diff line change
Expand Up @@ -127,6 +127,38 @@ object Channel {
/** We don't immediately process [[CurrentBlockCount]] to avoid herd effects */
case class ProcessCurrentBlockCount(c: CurrentBlockCount)

// @formatter:off
/** What do we do if we detect that our local commitment is outdated. */
sealed trait OutdatedCommitmentStrategy
object OutdatedCommitmentStrategy {
/**
* Ask our counterparty to close the channel, whenever our peer proves to us *or* simply tells us (could be lying)
* that we are using an outdated commitment.
* This may be the best choice for smaller loosely administered nodes.
*/
case object RemoteClose extends OutdatedCommitmentStrategy
/**
* If the node was just restarted, just log an error and stop the app. The goal is to prevent unwanted mass
* force-close of channels if we accidentally restarted the node with an outdated backup. After a few minutes, we
* revert to the default behavior of requesting our peer to force close, otherwise this opens a huge attack vector
* where any peer can remotely stop our node.
* This strategy may be better suited for larger nodes, closely administered.
*/
case object Stop extends OutdatedCommitmentStrategy
}
// @formatter:on

// @formatter:off
/** What do we do if we have a local unhandled exception. */
sealed trait UnhandledExceptionStrategy
object UnhandledExceptionStrategy {
/** Ask our counterparty to close the channel. This may be the best choice for smaller loosely administered nodes.*/
case object LocalClose extends UnhandledExceptionStrategy
/** Just log an error and stop the node. May be better for larger nodes, to prevent unwanted mass force-close.*/
case object Stop extends UnhandledExceptionStrategy
}
// @formatter:on

}

class Channel(val nodeParams: NodeParams, val wallet: OnChainChannelFunder, remoteNodeId: PublicKey, blockchain: typed.ActorRef[ZmqWatcher.Command], relayer: ActorRef, txPublisherFactory: Channel.TxPublisherFactory, origin_opt: Option[ActorRef] = None)(implicit ec: ExecutionContext = ExecutionContext.Implicits.global) extends FSM[ChannelState, ChannelData] with FSMDiagnosticActorLogging[ChannelState, ChannelData] {
Expand Down Expand Up @@ -1669,6 +1701,7 @@ class Channel(val nodeParams: NodeParams, val wallet: OnChainChannelFunder, remo
case syncSuccess: SyncResult.Success =>
var sendQueue = Queue.empty[LightningMessage]
// normal case, our data is up-to-date

if (channelReestablish.nextLocalCommitmentNumber == 1 && d.commitments.localCommit.index == 0) {
// If next_local_commitment_number is 1 in both the channel_reestablish it sent and received, then the node MUST retransmit funding_locked, otherwise it MUST NOT
log.debug("re-sending fundingLocked")
Expand Down Expand Up @@ -2288,7 +2321,8 @@ class Channel(val nodeParams: NodeParams, val wallet: OnChainChannelFunder, remo

private def handleLocalError(cause: Throwable, d: ChannelData, msg: Option[Any]) = {
cause match {
case _: ForcedLocalCommit => log.warning(s"force-closing channel at user request")
case _: ForcedLocalCommit =>
log.warning(s"force-closing channel at user request")
case _ if msg.exists(_.isInstanceOf[OpenChannel]) || msg.exists(_.isInstanceOf[AcceptChannel]) =>
// invalid remote channel parameters are logged as warning
log.warning(s"${cause.getMessage} while processing msg=${msg.getOrElse("n/a").getClass.getSimpleName} in state=$stateName")
Expand All @@ -2308,7 +2342,31 @@ class Channel(val nodeParams: NodeParams, val wallet: OnChainChannelFunder, remo
log.info(s"we have a valid closing tx, publishing it instead of our commitment: closingTxId=${bestUnpublishedClosingTx.tx.txid}")
// if we were in the process of closing and already received a closing sig from the counterparty, it's always better to use that
handleMutualClose(bestUnpublishedClosingTx, Left(negotiating))
case dd: HasCommitments => spendLocalCurrent(dd) sending error // otherwise we use our current commitment
case dd: HasCommitments =>
cause match {
case _: ChannelException =>
// known channel exception: we force close using our current commitment
spendLocalCurrent(dd) sending error
case _ =>
// unhandled exception: we apply the configured strategy
nodeParams.unhandledExceptionStrategy match {
case UnhandledExceptionStrategy.LocalClose =>
spendLocalCurrent(dd) sending error
case UnhandledExceptionStrategy.Stop =>
log.error("unhandled exception: standard procedure would be to force-close the channel, but eclair has been configured to halt instead.")
NotificationsLogger.logFatalError(
s"""stopping node as configured strategy to unhandled exceptions for nodeId=$remoteNodeId channelId=${d.channelId}
|
|Eclair has been configured to shut down when an unhandled exception happens, instead of requesting a
|force-close from the peer. This gives the operator a chance of avoiding an unnecessary mass force-close
|of channels that may be caused by a bug in Eclair, or issues like running out of disk space, etc.
|
|You should get in touch with Eclair developers and provide logs of your node for analysis.
|""".stripMargin, cause)
System.exit(1)
stop(FSM.Shutdown)
}
}
case _ => goto(CLOSED) sending error // when there is no commitment yet, we just send an error to our peer and go to CLOSED state
}
}
Expand Down Expand Up @@ -2561,9 +2619,28 @@ class Channel(val nodeParams: NodeParams, val wallet: OnChainChannelFunder, remo
}

private def handleOutdatedCommitment(channelReestablish: ChannelReestablish, d: HasCommitments) = {
val exc = PleasePublishYourCommitment(d.channelId)
val error = Error(d.channelId, exc.getMessage)
goto(WAIT_FOR_REMOTE_PUBLISH_FUTURE_COMMITMENT) using DATA_WAIT_FOR_REMOTE_PUBLISH_FUTURE_COMMITMENT(d.commitments, channelReestablish) storing() sending error
nodeParams.outdatedCommitmentStrategy match {
case OutdatedCommitmentStrategy.Stop if (TimestampSecond.now() - nodeParams.startTime) < 10.minutes =>
log.error("we just restarted and may have an outdated commitment: standard procedure would be to request our peer to force-close, but eclair has been configured to halt instead. Please ensure your database is up-to-date and restart eclair.")
NotificationsLogger.logFatalError(
s"""stopping node as configured strategy to outdated commitment for nodeId=$remoteNodeId channelId=${d.channelId}
|
|Eclair has been configured to shut down if a sync error is detected at restart, instead of requesting a
|force-close from the peer. This gives the operator a chance of avoiding an unnecessary mass force-close
|of channels.
|
|You should investigate why Eclair appears to be using outdated data. If it turns out that this is due to a
|misconfiguration, just fix it and restart the node. If however the data was really lost, then you should
|change the outdated commitment strategy to the default and restart the node: this will cause a force
|close of outdated channels, but there is no way to avoid that.
|""".stripMargin)
System.exit(1)
stop(FSM.Shutdown)
case _ =>
val exc = PleasePublishYourCommitment(d.channelId)
val error = Error(d.channelId, exc.getMessage)
goto(WAIT_FOR_REMOTE_PUBLISH_FUTURE_COMMITMENT) using DATA_WAIT_FOR_REMOTE_PUBLISH_FUTURE_COMMITMENT(d.commitments, channelReestablish) storing() sending error
}
}

/**
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -324,7 +324,7 @@ object Helpers {
case class LocalLateUnproven(ourRemoteCommitmentNumber: Long, theirLocalCommitmentNumber: Long) extends Failure
case class RemoteLying(ourLocalCommitmentNumber: Long, theirRemoteCommitmentNumber: Long, invalidPerCommitmentSecret: PrivateKey) extends Failure
case object RemoteLate extends Failure
}
}
// @formatter:on

/**
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@ import fr.acinq.bitcoin.{Block, ByteVector32, Satoshi, SatoshiLong, Script}
import fr.acinq.eclair.FeatureSupport.{Mandatory, Optional}
import fr.acinq.eclair.Features._
import fr.acinq.eclair.blockchain.fee._
import fr.acinq.eclair.channel.Channel.{OutdatedCommitmentStrategy, UnhandledExceptionStrategy}
import fr.acinq.eclair.channel.LocalParams
import fr.acinq.eclair.crypto.keymanager.{LocalChannelKeyManager, LocalNodeKeyManager}
import fr.acinq.eclair.io.{Peer, PeerConnection}
Expand Down Expand Up @@ -143,6 +144,8 @@ object TestConstants {
feeProportionalMillionths = 30)),
reserveToFundingRatio = 0.01, // note: not used (overridden below)
maxReserveToFundingRatio = 0.05,
outdatedCommitmentStrategy = OutdatedCommitmentStrategy.RemoteClose,
unhandledExceptionStrategy = UnhandledExceptionStrategy.LocalClose,
db = TestDatabases.inMemoryDb(),
revocationTimeout = 20 seconds,
autoReconnect = false,
Expand Down Expand Up @@ -269,6 +272,8 @@ object TestConstants {
feeProportionalMillionths = 30)),
reserveToFundingRatio = 0.01, // note: not used (overridden below)
maxReserveToFundingRatio = 0.05,
outdatedCommitmentStrategy = OutdatedCommitmentStrategy.RemoteClose,
unhandledExceptionStrategy = UnhandledExceptionStrategy.LocalClose,
db = TestDatabases.inMemoryDb(),
revocationTimeout = 20 seconds,
autoReconnect = false,
Expand Down

0 comments on commit c585b9c

Please sign in to comment.