This issue was moved to a discussion.

You can continue the conversation there.


[BUG] Chain Halt #158

Closed
faddat opened this issue Mar 1, 2023 · 12 comments
Labels
bug Something isn't working discuss Still being debated

Comments

@faddat
Contributor

faddat commented Mar 1, 2023

Describe the bug
Terra Classic halted on this block:

To Reproduce
Unknown

Context & versions
Latest release

(if applicable) suggested solution
This issue should be used to centralize information about the halt. Useful information includes:

  • sha256sums of binaries from validators
  • logs from validators
  • anecdotal information from developers and validators
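As a concrete starting point for the first item, checksums can be collected with sha256sum and compared across validators. A minimal sketch; the /tmp path is a stand-in, and in practice you would point this at the actual terrad binary:

```shell
# Sketch: compute a checksum so binaries can be compared across validators.
# /tmp/terrad-demo is an illustrative stand-in for the real terrad binary.
printf 'example-binary-contents' > /tmp/terrad-demo
sha256sum /tmp/terrad-demo | awk '{print $1}'
```

Two validators running byte-identical binaries will print identical digests; any mismatch narrows the search immediately.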

We should speak to validators in order of their voting power. To bring the chain back up, you'll need more than 2/3 of voting power online. So, start with the highest-ranked validator and check the version of the software they are running. If a whitelisted account was present in the halt block, but different versions were running simultaneously, this would explain the halt.

If a VaaS provider with more than one-third of voting power was running a different version of the software, this could also be responsible for the chain halt.
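To make the 2/3 threshold concrete, here is a minimal sketch of the arithmetic: walk the validator set from highest voting power down and count how many must come online before blocks can be produced again. The voting-power figures below are made up for illustration.

```shell
# Sketch: count how many top validators must be online to exceed the
# 2/3 consensus threshold (hypothetical voting-power percentages).
powers="40 25 15 10 5 5"   # sorted descending; sums to 100
total=0
count=0
for p in $powers; do
  total=$((total + p))
  count=$((count + 1))
  # blocks can be produced once online power > 2/3 of the total (100 here)
  if [ $((3 * total)) -gt 200 ]; then
    break
  fi
done
echo "$count validators needed"
```

With these example numbers, the three largest validators alone clear the threshold, which is why starting from the top of the voting-power list is the fastest path.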

I am concerned with the timing of this PR:

@inon-man may very well be correct: if binaries had already been released to validators, then it is possible that some were running a version of LUNC that did not include #149. I don't know all of the other changes in the upgrade and am looking into it now.

Additional information

  • v1.0.5...v1.1.0
  • It seems that the work on linting the repository that I did has been magically memory-holed. It also seems that the main branch contains the commit that is tagged for v1.1.0.
  • In emergencies, having a fully linted repository makes the code far easier to work with, because it is standardized. So, I will begin by linting.
  • Todd thinks that it could be the golang version, and this is the easiest item to fix. Before linting, I will address that. https://twitter.com/blockpane/status/1630832835899822080

These services, as well as the security incident reporting concerning allnodes.com, are funded by delegations to the Notional validator on LUNC.

@faddat faddat added the bug Something isn't working label Mar 1, 2023
@atomlab

atomlab commented Mar 1, 2023

My node has upgraded to v1.1.0, but it's crashing. This is the error I faced:

INF Starting IndexerService service impl=IndexerService module=txindex
INF ABCI Handshake App Info hash="ϳ�\x03x�G��Fk�hzi`R��\x13|D�V\x1c\x05\x1c{��O�" height=11734001 module=consensus protocol-version=0 software-version=1.1.0
INF ABCI Replay Blocks appHeight=11734001 module=consensus stateHeight=11734001 storeHeight=11734002
INF Replay last block using real app module=consensus

Error: error during handshake: error on replay: wrong Block.Header.LastResultsHash.  Expected 658EC57B453249685F1074BC1F6CE5C56C04730BD850F0F05DFAAD41BF02B3B1, got AFE97BD368F0D6B92206B83EB89A1CBC8DED4B9EE1596F4A819C6817F53F47C9
# terrad version
1.1.0

@faddat
Contributor Author

faddat commented Mar 1, 2023

This issue identified the root cause.

Then l1tf fixed it.

Proof:

(Screenshot attached: Screenshot_20230301-211123)

@faddat
Contributor Author

faddat commented Mar 1, 2023

@atomlab I bet that you're still running the previous tag. You see what they did: it was highly irresponsible, but they changed the contents of the tag. So your local git repository thinks it knows what the tag is, but that's no longer true. Nuke your git repository, check out the tag again, install the chain, and then start your daemon, and it should work.

You could also check out the git commit hash that corresponds to the new tag.
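Because a moved tag is not refreshed by a plain git pull, it helps to resolve the tag to a commit hash yourself and compare it with the re-published one. A sketch in a throwaway repository; against a real remote you would add git fetch --tags --force before the rev-parse:

```shell
# Sketch: a tag is just a local pointer; resolve it to a commit hash so it
# can be compared against the commit the upstream re-tagged.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "init"
git tag v1.1.0
# If this hash differs from the one upstream now publishes for v1.1.0,
# your local tag is stale: refresh it with `git fetch --tags --force`.
git rev-parse --short 'v1.1.0^{commit}'
```

Pinning builds to the commit hash, as suggested above, sidesteps the stale-tag problem entirely.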

Since you're a validator most likely you should be asking very difficult questions at this time, such as:

  • why don't you bother to lint the code?
  • Isn't a billion dollar market cap enough to bother to lint the code?
  • Why did you revert the previous linting work?
  • Why didn't you listen to @inon-man ?

@faddat faddat closed this as completed Mar 1, 2023
@faddat faddat reopened this Mar 1, 2023
@atomlab

atomlab commented Mar 1, 2023

I have compiled terrad from this hash, 70d118b.

Checking out the ref
  /usr/bin/git checkout --progress --force refs/tags/v1.1.0
  Previous HEAD position was 8bb56e9 Merge pull request #44 from classic-terra/v1.0.5-vm-fix
  HEAD is now at 70d118b add 3 binance addresses (#149)
/usr/bin/git log -1 --format='%H'
'70d118b0ab38c5c2b61288a090177fdfa33dfe76'

@faddat
Contributor Author

faddat commented Mar 1, 2023

Yeah, there's kind of an issue there; hold on a second, please.

You should use these commands:

git checkout 70d118b
go install ./...
terrad start

If you're still having that problem, you might want to check the version of Go on your machine.

If neither of those works, then I would strongly suggest that you do the following:

cd ~/
rm -rf core
git clone https://github.com/classic-terra/core
cd core
go install ./...
terrad start

@bobbyd666

Same here (from 70d118b). Go version downgraded. Any tips, guys?
panic: Failed to process committed block (11734002:2567013EB5E4ED5D538672B668B57A276F30129952572150C4FEE00F62E9E727): wrong Block.Header.LastResultsHash. Expected 658EC57B453249685F1074BC1F6CE5C56C04730BD850F0F05DFAAD41BF02B3B1, got AFE97BD368F0D6B92206B83EB89A1CBC8DED4B9EE1596F4A819C6817F53F47C9

terrad version --long
name: terra
server_name: terrad
version: 1.1.0
commit: 70d118b
build_tags: netgo,ledger
go: go version go1.18.1 linux/amd64

@erzqk

erzqk commented Mar 1, 2023

Hello. I'm trying to run a node from a pruned snapshot and got the same error as @atomlab: Error: error during handshake: error on replay: wrong Block.Header.LastResultsHash. Expected 658EC57B453249685F1074BC1F6CE5C56C04730BD850F0F05DFAAD41BF02B3B1, got AFE97BD368F0D6B92206B83EB89A1CBC8DED4B9EE1596F4A819C6817F53F47C9. I'm also trying to run a node without a snapshot, syncing the blockchain from genesis, and I get: panic: Must use v1.0.x for importing the columbus genesis (https://github.com/classic-terra/core/releases/). What can I do about that?

@bobbyd666

Same discussion here https://classic-agora.terra.money/t/v1-1-0-software-upgrade-proposal/50242/21

There's an interesting tip from aeuser999. Can anyone confirm the libwasmvm.so version please (ldd terrad)?
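The check being asked about can be sketched as follows: the dynamic linker reports which libwasmvm.so a binary would load. The terrad path is an assumption; the sketch falls back to /bin/sh just so the command runs on a machine without terrad installed.

```shell
# Sketch: list the shared libraries a binary resolves at load time.
# On a validator box, grep the ldd output of terrad for libwasmvm to see
# exactly which copy of the library it picked up.
bin=$(command -v terrad || command -v sh)
ldd "$bin" 2>/dev/null | grep -i 'libwasmvm\|libc' || true
```

If the resolved path points at a stale libwasmvm.so from a previous build, that mismatch is a likely culprit for the LastResultsHash error.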

@bobbyd666

Daaaamn.

7:51PM INF indexed block height=11734248 module=txindex
7:51PM INF indexed block height=11734249 module=txindex
7:51PM INF indexed block height=11734250 module=txindex

This lib seems to be the problem. It also needs to be updated to:
libwasmvm.so => ~/go/pkg/mod/github.com/!cosm!wasm/wasmvm@v0.16.7/api/libwasmvm.so (0x00007f5fc0e74000)

Credit goes to aeuser999
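One quick way to confirm what a build would use is to check the wasmvm requirement recorded in go.mod. A sketch with a throwaway module; the v0.16.7 pin is simply the version this thread reports as working, and in a real checkout you would just run the final grep inside core/:

```shell
# Sketch: inspect the CosmWasm/wasmvm requirement recorded in go.mod.
dir=$(mktemp -d)
cd "$dir"
cat > go.mod <<'EOF'
module demo

go 1.18

require github.com/CosmWasm/wasmvm v0.16.7
EOF
grep wasmvm go.mod
```

Note that the Go module cache escapes uppercase letters on disk, which is why the path above shows !cosm!wasm rather than CosmWasm.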

@aeuser999

aeuser999 commented Mar 1, 2023

In truth, the person who pointed that out during the upgrade was LordInateur (so all credit to them :) ).

I am glad it helped out though, and that you figured out the issue (and I will keep that in mind to pass along if others have a similar issue too).

@faddat
Contributor Author

faddat commented Mar 2, 2023

Okay, so one clearly positive thing we can get from this is a really clear description of the causes. From what you are saying, it sounds like there were at minimum two things that caused this problem:

  1. overwriting the tag for v1.1.0
  2. version of libwasmvm.so

Great hunting, @aeuser999!

Do you happen to know the percentages here? For example, do we have a clear picture yet of what percentage of the nodes having difficulty hit item one, versus item two?

@aeuser999

aeuser999 commented Mar 3, 2023

Hi @faddat,

Really, the main issue, from my limited participation, was that it took a while before we hit voting-power consensus. During the upgrade, though, there was some really great teamwork around one or more of these issues from everyone involved (it was a privilege to witness and experience, and I am thankful; it was an enjoyable and great experience):

  • The consensus module froze for some validators on upgrade. Re-syncing worked. It is undetermined if those validators' consensus module would have recovered given time, but since it was an upgrade environment, the path to the quickest solution was favored. This appeared to be the more common of the issues (for those who had issues).

  • The overwriting of the tag for v1.1.0: There were about three people who have mentioned (either during the upgrade or after) that they pre-built the code before the official instructions were released, so they had this issue.

    • My own personal reflections: Although the change to the release was a benign late addition, it was not the best from a security perspective, nor a governance perspective, in my humble personal opinion, since it happened during the vote on whether to accept the code in the release (and the release then changed during that vote, for a non-emergency reason, right up against the distribution). I am sure it was most likely changed only because it was a benign change that did not materially affect the code, but according to the PR it did not appear to have the test box checked, so I am not sure whether it was re-tested. Although the reason for the change was acknowledged in the PR, it does not look like it was acknowledged in the proposal discussion (and even then the vote was already in progress; it is not appropriate to add code during a vote about that very code).

      The reason this is a security issue is that the point of voting on the code is that it is formally listed, open to discussion and review, and accepted or rejected via governance vote, as the code stood at the time of the proposal discussion before the vote. To change it at the last minute means that someone could include last-minute malicious code, intended or unintended, just prior to distribution; that is the security concern.

      One person, just prior to the distribution, also brought up the canonical repository: the proposal text for 11367 did not mention an alternative repository, and the proposal discussion, while listing the code from the classic-terra repository, did not acknowledge which repository it would be distributed from, as an oversight (although TR did mention their repository was up to date, so it should have had the appropriate code for anyone looking to pull from the canonical repository).

Those are only mentioned in a spirit that is meant to be constructive, and with appreciation (and thankfulness for the sacrifice and work everyone makes in this community to help others). ek826 did take note, in his response to these two issues, of both the overwriting of the tag and the repository issue. Thankfully he has been sensitive to making sure ideas are discussed with governance and feedback is gathered (and vice versa), that where ideas and proposals move to code there is a discussion period on the code (with answers to honest questions), that the code is displayed and acknowledged in the proposal discussion, and that it is voted on. For that I am truly thankful and personally appreciative (of him as a person as well). I also appreciate the things that he, and others, do in order to honor the governance process.
  • There have been about four people who have acknowledged the solution from LordInateur regarding libwasmvm.so, for Terra v1 core software that was built on another machine and then needed access to that particular library. @bobbyd666 sharing the details above was also really helpful for others who hit the same issue after him, and added to and built upon LordInateur's comments.

  • There have been some issues with loss of peers, although that has been mentioned here and there before this particular release as well, and may or may not be related to this issue (one validator has mentioned it happening more since this release).

If there is anything in there that is helpful from a code analysis perspective, from my very limited purview, then I hope that is helpful.

Thank you too for the contributions of your expertise and to code you have contributed, and Notional's continued contributions as a validator and offering public endpoints via Notional's infrastructure, to the Terra v1 community.

I hope you have a great day today :)

@fragwuerdig fragwuerdig added the discuss Still being debated label Mar 13, 2023
@classic-terra classic-terra locked and limited conversation to collaborators Mar 17, 2023
@ZaradarBH ZaradarBH converted this issue into discussion #181 Mar 17, 2023



6 participants