Investigating large e2e network test failures due to deployment issues #3488

Closed
staheri14 opened this issue May 16, 2024 · 7 comments
Labels
knuu item is directly related to the usage of knuu WS: Big Blonks 🔭 Improving consensus critical gossiping protocols


@staheri14
Collaborator

staheri14 commented May 16, 2024

A large-network e2e test with 100 Knuu instances currently fails partway through. There are two main problems: images cannot be reused, so a new Docker image has to be built for each instance, and deployment fails mid-process for various reasons. The former will be resolved in the new release of knuu (another issue tracks the integration of the new release candidate); the latter still needs investigation.

Below are some of the errors observed when tests failed to complete:

Main Reasons for Failures with Sample Error Logs:

The following errors happen nondeterministically, and usually disappear from one run to another.

  1. Context deadline exceeded
  2. Failed to create service account or create content as the namespace is being terminated
failed to create testnet: cannot handle timeout: cannot start instance: error deploying pod for instance 'timeout-handler-b6be0473': failed to create service account: serviceaccounts "timeout-handler-b6be0473" is forbidden: unable to create new content in namespace test because it is being terminated
failed to get validators GRPC endpoints: error deploying service 'val16-d5aca67e': error deploying service 'val16-d5aca67e': error creating service val16-d5aca67e: services "val16-d5aca67e" is forbidden: unable to create new content in namespace test because it is being terminated
  3. Pushing images fails
error pushing image for instance 'txsim5': failed to push image: failed to run command: exit status 1

Replicate the issue

go run ./test/e2e/benchmark LargeNetwork_BigBlock_8MiB -v
@staheri14 staheri14 added WS: Big Blonks 🔭 Improving consensus critical gossiping protocols knuu item is directly related to the usage of knuu and removed needs:triage labels May 16, 2024
@smuu smuu self-assigned this May 17, 2024
@mojtaba-esk
Contributor

@staheri14 It could be due to knuu's internal timeout handler. Have you tried increasing the timeout here?

Simply set this env var before the following line:

if err := os.Setenv("KNUU_TIMEOUT", "360m"); err != nil {
	return nil, err
}

if err := knuu.InitializeWithScope(identifier); err != nil {

Note: in the upcoming refactor, the env var will be removed, which should hopefully make this easier for users.

@smuu
Member

smuu commented May 22, 2024

I was able to run the following tests on the branch smuu/celestiaorg-celestia-app:smuu/improvements-to-big-block-tests, which is based on the branch celestiaorg/celestia-app:sanaz/big-block-test:

  • TwoNodeSimple
  • TwoNodeBigBlock_8MiB
  • TwoNodeBigBlock_32MiB
  • TwoNodeBigBlock_64MiB
  • LargeNetwork_BigBlock_8MiB
  • LargeNetwork_BigBlock_32MiB
  • LargeNetwork_BigBlock_64MiB

See the PR: #3493

I made some fixes to the tests and some improvements to speed up the process. While doing that, I observed the following.

  1. Observed the following for all tests > 8MiB
INF Timed out dur=8945.650186 height=23 module=consensus round=0 step=1
  2. Observed that for all LargeNetwork tests
ERR rejected valid incoming transaction; mempool is full err="mempool is full: number of txs 268 (max: 5000), total txs bytes 1072647220 (max: 1073741824)" tx=754F73CFD941C4D1AF82BE111B06F6F87E74BC8BC4D54C474472913E02D8C40C
  3. Each block does not have more than 8 txs. Sometimes I even just saw 2 txs per block.
INF finalizing commit of block hash={} height=59 module=consensus num_txs=8 root=E5A106FF9549C30137FFABDCA454741A26645072C5D0E4BB81E4E7A55363D938
  4. Reading the blockchain takes very long -> more than 10m. Is there a way to speed that up?

  5. Creating the port forwards for all the validators takes some time. We are working on an improvement for that.

  6. With that number of instances, we cause client-side throttling against the Kubernetes API when running the tests. This does not make the test fail, but it slows down the test. We are working on an improvement for that.

Currently, the LargeNetwork tests are done with 50 validators. We will do further testing to increase that number.
To support tests with 100 validators, I expect we will need improvements both in knuu and at the test level.

@staheri14
Collaborator Author

Thanks a lot @smuu for your great work on this! I also ran the tests, and they worked for me as well!
Regarding the observed issues:

  1. Observed the following for all tests > 8MiB
    INF Timed out dur=8945.650186 height=23 module=consensus round=0 step=1

I suppose this was flaky behaviour, or was it not?

  2. Observed that for all LargeNetwork tests
    ERR rejected valid incoming transaction; mempool is full err="mempool is full: number of txs 268 (max: 5000), total txs bytes 1072647220 (max: 1073741824)" tx=754F73CFD941C4D1AF82BE111B06F6F87E74BC8BC4D54C474472913E02D8C40C

It is interesting that none of the indicated values exceeds the reported limits, i.e., 268 < 5000 and 1072647220 < 1073741824. Why does it fail?
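One possible explanation, offered as an assumption rather than a confirmed diagnosis: if the mempool's admission check counts the incoming transaction's bytes on top of what is already pooled, a tx can be rejected even though the current totals are below the limits. The sketch below uses the numbers from the log and an assumed incoming tx size of about 1.2 MiB; `mempoolIsFull` is a hypothetical stand-in, not the node's actual function.

```go
package main

import "fmt"

// mempoolIsFull is a hypothetical admission check: the pool counts as full if
// the tx count limit is reached, or if adding the incoming tx would push the
// total bytes past the byte limit.
func mempoolIsFull(numTxs, maxTxs int, txsBytes, maxTxsBytes, txSize int64) bool {
	return numTxs >= maxTxs || txsBytes+txSize > maxTxsBytes
}

func main() {
	const (
		numTxs      = 268
		maxTxs      = 5000
		txsBytes    = int64(1072647220) // from the error log
		maxTxsBytes = int64(1073741824) // from the error log (1 GiB)
		txSize      = int64(1258291)    // assumed: ~1.2 MiB incoming tx
	)
	fmt.Println("bytes left:", maxTxsBytes-txsBytes)                                        // bytes left: 1094604
	fmt.Println("rejected:", mempoolIsFull(numTxs, maxTxs, txsBytes, maxTxsBytes, txSize)) // rejected: true
}
```

With roughly 1.04 MiB of headroom left, a 1.2 MiB tx would not fit, which would produce a "mempool is full" error with both reported totals still under their maxima.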

  3. Each block does not have more than 8 txs. Sometimes I even just saw 2 txs per block.
    INF finalizing commit of block hash={} height=59 module=consensus num_txs=8 root=E5A106FF9549C30137FFABDCA454741A26645072C5D0E4BB81E4E7A55363D938

Well, it depends on the block size. All the submitted txs are 1.2 MiB in size, so an 8 MiB block cannot accommodate more than 8 txs. However, for block sizes of 32 MiB and 64 MiB, it should ideally go up to 26 and 53 txs, respectively. For which test did you see this issue?
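The capacity estimate is the block size divided by the tx size, rounded down. A minimal sketch, taking the stated 1.2 MiB tx size at face value and ignoring any per-block overhead (both are assumptions); `maxTxsPerBlock` is an illustrative helper:

```go
package main

import "fmt"

const mib = int64(1 << 20)

// maxTxsPerBlock returns how many txs of txBytes fit into a block of
// blockBytes, ignoring headers and other per-block overhead.
func maxTxsPerBlock(blockBytes, txBytes int64) int64 {
	return blockBytes / txBytes
}

func main() {
	txBytes := int64(1258291) // assumed: 1.2 MiB, rounded to whole bytes
	for _, sizeMiB := range []int64{8, 32, 64} {
		fmt.Printf("%d MiB block: %d txs\n", sizeMiB, maxTxsPerBlock(sizeMiB*mib, txBytes))
	}
	// Output:
	// 8 MiB block: 6 txs
	// 32 MiB block: 26 txs
	// 64 MiB block: 53 txs
}
```

This reproduces the 26 and 53 figures for 32 MiB and 64 MiB. For 8 MiB it yields 6, so the 8 txs per block seen in the log would suggest the effective tx size is closer to 1 MiB than 1.2 MiB.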

  4. Reading the blockchain takes very long -> more than 10m. Is there a way to speed that up?

Yes, I witnessed that too. We need to investigate. In my tests, I made the block height range smaller to fetch only a portion of the blocks and check that the network is operating. However, in the long term, we might want to speed up the reading process.

@evan-forbes
Member

Observed the following for all tests > 8MiB

I suppose this was flaky behaviour, or was it not?

If this is the timeout propose, then we should not expect to reach consensus over 8MB blocks within that period.

ERR rejected valid incoming transaction; mempool is full err="mempool is full: number of txs 268 (max: 5000), total txs bytes 1072647220 (max: 1073741824)" tx=754F73CFD941C4D1AF82BE111B06F6F87E74BC8BC4D54C474472913E02D8C40C

woah why do we have over a gigabyte of txs in the mempool? how many sequences/txsim instances are we running total?


@smuu
Member

smuu commented Jun 3, 2024

Is there anything else the DevOps team can support to close this issue?

@evan-forbes evan-forbes added needs:discussion item needs to be discussed as a group in the next sync. if marking an item, pls be prepped to talk and removed needs:discussion item needs to be discussed as a group in the next sync. if marking an item, pls be prepped to talk labels Jun 24, 2024
@staheri14
Collaborator Author

staheri14 commented Jun 24, 2024

Thanks @smuu and DevOps for your great help with this issue!

I am trying to run a 100-node test (100 validators + 100 txsims) but haven't been successful. Is it actually possible to run a test at this scale, or should we stick to a 50-node test (50 validators + 50 txsims)?

I also tried running the tests for 50 nodes but couldn't make a successful run. The issue is that the first txclient or validator is unable to start:

2024/06/24 15:27:36 failed to run the benchmark test: failed to start testnet: node val0 failed to start: timeout while waiting for instance 'val0-4f0b03bb' to be running
2024/06/24 15:37:53 failed to run the benchmark test: failed to start testnet: txsim txsim0 failed to start: timeout while waiting for instance 'txsim0-e6158a5a' to be running

When I attempt to access the logs, the nodes seem to have been torn down, so I am not sure what the root cause is.

I also tried running tests for a smaller network, and the same issue occurs, so maybe there is something off with the cluster?

I'll keep you posted if I find something new.

@evan-forbes
Member

closing as completed

Projects
Status: Done
Development

No branches or pull requests

4 participants