Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add epoch number to Tendermint light client, alter block height representation #439

Closed
cwgoes opened this issue Jun 25, 2020 · 14 comments · Fixed by #447
Closed

Add epoch number to Tendermint light client, alter block height representation #439

cwgoes opened this issue Jun 25, 2020 · 14 comments · Fixed by #447
Assignees
Labels
brainstorming Open-ended brainstorming. from-review Feedback / alterations from specification review. tao Transport, authentication, & ordering layer.

Comments

@cwgoes
Copy link
Contributor

cwgoes commented Jun 25, 2020

Motivation

At present, upgrades to the Tendermint light client which reset the height to zero will screw up timeouts unless either:

  • No packets are in-flight (hard to ensure)
  • All in-flight packets are automatically timed out (possible, but annoying, will close ordered channels)

Neither of these options are ideal.

Proposal

Change the "block height" used through the specification from a concrete uint64 (unsigned integer) to an abstract ordered set:

type BlockHeight

enum Ord {
  LT
  EQ
  GT
}

function compare(a: BlockHeight, b: BlockHeight): Ord {
  // implementation-specific
}

Add an epoch number to the Tendermint light client height, and represent it as:

interface TendermintHeight {
  epochNumber: uint64
  epochHeight: uint64
}

function compare(a: TendermintHeight, b: TendermintHeight): Ord {
  if a.epochNumber < b.epochNumber {
    return LT
  } else if a.epochNumber == b.epochNumber {
    if a.epochHeight < b.epochHeight {
      return LT
    } else if a.epochHeight == b.epochHeight {
      return EQ
    }
  }
  return GT
}

Upgrades which reset the block height in Tendermint, represented as epochHeight here, can then increment the epochNumber, and timeouts will continue to work - if the upgrade is scheduled in advance, users can set timeouts with a later epoch number than the current epoch if they want the expected time of the timeout to be after the upgrade.

Timestamp-based timeouts should continue to work unaffected as long as the timestamp never decreases between upgrades.

This will require changes wherever a concrete uint64 height is used throughout the spec, but conceptually it's pretty simple.

@cwgoes cwgoes added tao Transport, authentication, & ordering layer. brainstorming Open-ended brainstorming. from-review Feedback / alterations from specification review. labels Jun 25, 2020
@cwgoes cwgoes self-assigned this Jun 25, 2020
@AdityaSripal
Copy link
Member

Hmm this sounds good to me. There's one case where I'd like clarification.

Suppose a chain B will upgrade from epoch e at height u. And a user on chain A still submits a packet destined for chain B with timeout height epoch: e, height: u+1. This can still happen as far as I can tell

Then a proof of absence at height e+1, 1 is sufficient to prove the timeout.

Is the above understanding correct?

@cwgoes
Copy link
Contributor Author

cwgoes commented Jun 26, 2020

Suppose a chain B will upgrade from epoch e at height u. And a user on chain A still submits a packet destined for chain B with timeout height epoch: e, height: u+1. This can still happen as far as I can tell

Yes, this is possible.

Then a proof of absence at height e+1, 1 is sufficient to prove the timeout.

Yes, because < e + 1, 1 > is GT < e, n > for any n (by the pseudocode logic above).

@ethanfrey
Copy link
Contributor

At present, upgrades to the Tendermint light client which reset the height to zero will screw up timeouts unless either:

I am quite against continuing to support this "chop block height to zero and start new 'successor' chain" approach in the ibc era.

Thia made sense for early software, but given that 0.38 includes a live upgrade module tested on various testnets, and I believe there is also an option to "dump state and restart" without resetting the height to zero (changing state, but leaving the block history intact), I see no reason to support "reset height to zero and assume IBC will keep working".

My opinion is that anyone who wants to run an ibc connected blockchain should provide some basic guarantees to other chains, including not rolling back state. I also know of no other blockchains besides the cosmos sdk, where such upgrade paths are considered normal, so I really don't think this would make too much inconvenience.

@cwgoes
Copy link
Contributor Author

cwgoes commented Jun 29, 2020

I am quite against continuing to support this "chop block height to zero and start new 'successor' chain" approach in the ibc era.

I also think we want to discourage this, but I would still rather support it cleanly than make it unnecessarily difficult.

My opinion is that anyone who wants to run an ibc connected blockchain should provide some basic guarantees to other chains, including not rolling back state. I also know of no other blockchains besides the cosmos sdk, where such upgrade paths are considered normal, so I really don't think this would make too much inconvenience.

Yes, however, several (non-Tendermint) consensus algorithms do use epoch numbers as part of normal operations, and it would be helpful to be able to incorporate them more easily into the IBC framework.

@ethanfrey
Copy link
Contributor

Yes, however, several (non-Tendermint) consensus algorithms do use epoch numbers as part of normal operations, and it would be helpful to be able to incorporate them more easily into the IBC framework.

It would be good to add some references to those algorithms (and the epoch/height issues), to ensure this change solves their issues as well. I agree, this should support as many BFT algorithms as possible.

@zmanian
Copy link
Member

zmanian commented Jun 30, 2020

A quick survey of alternative well specified consensus algorithms.

  • LibraBFT heights reset at the epoch boundary.

  • Near protocol. Heights are monotonic but not sequential.

  • Grandpa. Heights are montonic.

@ethanfrey
Copy link
Contributor

@zmanian Thank you for the summary.

Can you link to the description of those algorithms? It would be good to look at the details.

Also curious about Ava / Avalanche consensus which seems to be well-specified BFT

@ebuchman
Copy link
Member

I also think we want to discourage this, but I would still rather support it cleanly than make it unnecessarily difficult.

Is it really unnecessarily difficult or does it just mean all the channels close and need to be reopened? Given that 0-height restarts are something we want to discourage, maybe that's a reasonable price to pay.

I agree in general with having the Height be a more abstract thing (since it might be needed for certain other chains), but I wonder if we should really change the Tendermint client, especially if in the future for most Tendermint chains the epoch will never change ...

@zmanian
Copy link
Member

zmanian commented Jun 30, 2020

During the planned test upgrade of Stargate testnet, all Tendermint light clients will change epoch.

So we will be doing this at least once in a high profile setting.

I also think we want to discourage this, but I would still rather support it cleanly than make it unnecessarily difficult.

Is it really unnecessarily difficult or does it just mean all the channels close and need to be reopened? Given that 0-height restarts are something we want to discourage, maybe that's a reasonable price to pay.

I agree in general with having the Height be a more abstract thing (since it might be needed for certain other chains), but I wonder if we should really change the Tendermint client, especially if in the future for most Tendermint chains the epoch will never change ...

During the planned test upgrade of Stargate testnet, all Tendermint light clients will change epoch.

So we will be doing this at least once in a high profile setting.

@cwgoes
Copy link
Contributor Author

cwgoes commented Jun 30, 2020

Is it really unnecessarily difficult or does it just mean all the channels close and need to be reopened? Given that 0-height restarts are something we want to discourage, maybe that's a reasonable price to pay.

The particular problem is that force-closing channels may lead to isolated state, e.g. stuck tokens in ICS 20, which cannot be recovered without governance intervention (maybe that needs to happen as part of the upgrade). We could decide to accept this cost if we wanted to, though.

@ethanfrey
Copy link
Contributor

ethanfrey commented Jun 30, 2020

During the planned test upgrade of Stargate testnet, all Tendermint light clients will change epoch.

I think it would be awesome if you did the "live" / "in place" upgrade inside the stargate testnet (once you are on stargate you can upgrade to another post-stargate point without the reset). Since @aaronc is leading up the sdk for stargate and was the lead architect of this live upgrade, you should have plenty of support, plus the Regen team (and Chrous One and VitWit) have run this many times in testnets.

I am not saying we shouldn't add such epochs, but asking for the use of best possible design in testnets.

If there are backwards-incompatible breaking changes in tendermint post-stargate, yes, the chains will be forced to use "dump state and restart" or "enter a new epoch".

@cwgoes
Copy link
Contributor Author

cwgoes commented Jul 1, 2020

At the moment, Tendermint does not support non-zero-height restarts, and that support is not expected in the next release, so we will need epoch numbers to pull off such an upgrade.

(as far as I know, maybe @marbar3778 can confirm)

@tac0turtle
Copy link
Member

Tendermint does not support non-zero-height restarts,

this is correct. It is slated as a followup once 0.34 is released. There is refactoring that would need to happen at the same time. Here is the issue with the outline: tendermint/tendermint#4767

@ethanfrey
Copy link
Contributor

@cwgoes the upgrade i discuss with regen was done with backports on the 0.34 line... and definitely should work on stargate. This never shuts down the chain, just coordinates changing the binary (and running a migration) at a specific block.

But good to know that "dump state and restart" still cannot handle non-zero heights (I thought this was working already)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
brainstorming Open-ended brainstorming. from-review Feedback / alterations from specification review. tao Transport, authentication, & ordering layer.
Projects
Status: Backlog
Development

Successfully merging a pull request may close this issue.

6 participants