Investigating large e2e network test failures due to deployment issues #3488

staheri14 · 2024-05-16T15:14:09Z

A large-network e2e test consisting of 100 knuu instances currently fails halfway through. There are two primary problems: images cannot be reused, so a new Docker image has to be created for each instance, and deployment fails mid-process for various reasons. The former is expected to be resolved in the new release of knuu (a separate issue tracks the integration of the new release candidate); the latter still needs investigation.

Below are some of the errors observed when tests failed to complete:

Main Reasons for Failures with Sample Error Logs:

The following errors occur nondeterministically and usually disappear from one run to the next.

Context deadline exceeded
Failed to create service account or create content as the namespace is being terminated

failed to create testnet: cannot handle timeout: cannot start instance: error deploying pod for instance 'timeout-handler-b6be0473': failed to create service account: serviceaccounts "timeout-handler-b6be0473" is forbidden: unable to create new content in namespace test because it is being terminated

failed to get validators GRPC endpoints: error deploying service 'val16-d5aca67e': error deploying service 'val16-d5aca67e': error creating service val16-d5aca67e: services "val16-d5aca67e" is forbidden: unable to create new content in namespace test because it is being terminated

Pushing images fails

error pushing image for instance 'txsim5': failed to push image: failed to run command: exit status 1
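When triaging many failed runs, it can help to bucket raw error messages into the families listed above. The following is a minimal sketch only: the `failureKind` type and `classifyFailure` function are hypothetical names, and the matched substrings are simply taken from the sample logs in this issue.

```go
package main

import (
	"fmt"
	"strings"
)

// failureKind is a hypothetical label for the error families listed in this issue.
type failureKind string

const (
	kindDeadline    failureKind = "context deadline exceeded"
	kindTerminating failureKind = "namespace being terminated"
	kindImagePush   failureKind = "image push failed"
	kindUnknown     failureKind = "unknown"
)

// classifyFailure maps a raw error message to one of the families above
// by case-insensitive substring matching on text seen in the sample logs.
func classifyFailure(msg string) failureKind {
	m := strings.ToLower(msg)
	switch {
	case strings.Contains(m, "context deadline exceeded"):
		return kindDeadline
	case strings.Contains(m, "because it is being terminated"):
		return kindTerminating
	case strings.Contains(m, "failed to push image"):
		return kindImagePush
	default:
		return kindUnknown
	}
}

func main() {
	samples := []string{
		`serviceaccounts "timeout-handler-b6be0473" is forbidden: unable to create new content in namespace test because it is being terminated`,
		`error pushing image for instance 'txsim5': failed to push image: failed to run command: exit status 1`,
	}
	for _, s := range samples {
		fmt.Println(classifyFailure(s))
	}
}
```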

Replicate the issue

Fetch the branch from the PR "test(e2e benchmark): adds big block tests" #3415
Run:

go run ./test/e2e/benchmark LargeNetwork_BigBlock_8MiB -v


mojtaba-esk · 2024-05-17T09:13:47Z

@staheri14 It could be due to knuu's internal timeout handler. Have you tried increasing the timeout?

Simply set this env var before the call to knuu.InitializeWithScope referenced below:

if err := os.Setenv("KNUU_TIMEOUT", "360m"); err != nil {
	return nil, err
}

celestia-app/test/e2e/testnet/testnet.go

Line 42 in 6d83e7d

if err := knuu.InitializeWithScope(identifier); err != nil {

Note: in the coming refactor, the env var mechanism will be removed, which should hopefully make this easier for users.
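A minimal sketch of the suggestion above, which validates the value before exporting it. It assumes that `KNUU_TIMEOUT` accepts a Go duration string, based only on the `"360m"` value in the snippet; the helper name `setKnuuTimeout` is hypothetical and not part of knuu.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// setKnuuTimeout checks that value parses as a positive Go duration and then
// exports it as KNUU_TIMEOUT. Assumption: knuu reads this variable in the
// same format as the "360m" value shown in this thread.
func setKnuuTimeout(value string) (time.Duration, error) {
	d, err := time.ParseDuration(value)
	if err != nil {
		return 0, fmt.Errorf("invalid KNUU_TIMEOUT %q: %w", value, err)
	}
	if d <= 0 {
		return 0, fmt.Errorf("KNUU_TIMEOUT must be positive, got %s", d)
	}
	if err := os.Setenv("KNUU_TIMEOUT", value); err != nil {
		return 0, err
	}
	return d, nil
}

func main() {
	// This call would have to happen before knuu is initialized.
	d, err := setKnuuTimeout("360m")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("KNUU_TIMEOUT set to", d)
}
```

Validating first means a typo in the value surfaces immediately instead of later as an unexplained deployment timeout.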

smuu · 2024-05-22T13:09:47Z

I was able to run the following tests on the branch smuu/celestiaorg-celestia-app:smuu/improvements-to-big-block-tests, which is based on the branch celestiaorg/celestia-app:sanaz/big-block-test:

TwoNodeSimple
TwoNodeBigBlock_8MiB
TwoNodeBigBlock_32MiB
TwoNodeBigBlock_64MiB
LargeNetwork_BigBlock_8MiB
LargeNetwork_BigBlock_32MiB
LargeNetwork_BigBlock_64MiB

See the PR: #3493

I made some fixes to the tests and some improvements to speed up the process. While doing that, I observed the following.

Observed the following for all tests > 8MiB

INF Timed out dur=8945.650186 height=23 module=consensus round=0 step=1

Observed that for all LargeNetwork tests

ERR rejected valid incoming transaction; mempool is full err="mempool is full: number of txs 268 (max: 5000), total txs bytes 1072647220 (max: 1073741824)" tx=754F73CFD941C4D1AF82BE111B06F6F87E74BC8BC4D54C474472913E02D8C40C

Each block does not have more than 8 txs. Sometimes I even just saw 2 txs per block.

INF finalizing commit of block hash={} height=59 module=consensus num_txs=8 root=E5A106FF9549C30137FFABDCA454741A26645072C5D0E4BB81E4E7A55363D938

Reading the blockchain takes very long -> more than 10m. Is there a way to speed that up?
Creating the port forwards for all the validators takes some time. We are working on an improvement for that.
With that number of instances, we cause client-side throttling against the Kubernetes API when running the tests. This does not make the test fail, but it slows down the test. We are working on an improvement for that.
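One generic way to reduce bursts of requests when setting up many instances is to cap how many operations run at once. The sketch below is not how knuu does it or will do it; `forEachBounded` is a hypothetical helper that only illustrates bounded parallelism, with a placeholder function standing in for the per-validator work.

```go
package main

import (
	"fmt"
	"sync"
)

// forEachBounded calls fn for every index in [0, n), running at most limit
// calls concurrently, and returns the error (or nil) for each index.
func forEachBounded(n, limit int, fn func(i int) error) []error {
	if limit < 1 {
		limit = 1
	}
	errs := make([]error, n)
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit calls are in flight
		go func(i int) {
			defer wg.Done()
			defer func() { <-sem }()
			errs[i] = fn(i)
		}(i)
	}
	wg.Wait()
	return errs
}

func main() {
	// Placeholder work: in a real test this would be the per-validator setup.
	errs := forEachBounded(50, 5, func(i int) error {
		if i == 7 {
			return fmt.Errorf("val%d: setup failed", i)
		}
		return nil
	})
	for i, err := range errs {
		if err != nil {
			fmt.Println(i, err)
		}
	}
}
```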

Currently, the LargeNetwork tests are done with 50 validators. We will do testing to increase that number further.
To support tests with 100 validators, I expect that improvements will be needed both in knuu and at the test level.

staheri14 · 2024-05-22T18:53:07Z

Thanks a lot @smuu for your great work on this! I also ran the tests, and they worked for me as well!
Regarding the observed issues:

Observed the following for all tests > 8MiB
INF Timed out dur=8945.650186 height=23 module=consensus round=0 step=1

I suppose this was a flakey behaviour? or not?

Observed that for all LargeNetwork tests
ERR rejected valid incoming transaction; mempool is full err="mempool is full: number of txs 268 (max: 5000), total txs bytes 1072647220 (max: 1073741824)" tx=754F73CFD941C4D1AF82BE111B06F6F87E74BC8BC4D54C474472913E02D8C40C

It is interesting that none of the indicated values exceeds the reported limits, i.e., 268 < 5000 and 1072647220 < 1073741824. Why does it fail?
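One possible explanation, stated as an assumption rather than confirmed behaviour of the mempool used here: a common admission rule rejects a transaction when adding it would push the total bytes past the limit, even if the current total is still below it. The sketch below applies that hypothetical rule to the numbers in the log; `wouldBeRejected` is an illustrative name, and the 1.2 MiB incoming size is taken from the tx size mentioned elsewhere in this thread.

```go
package main

import "fmt"

// wouldBeRejected applies an assumed admission rule (not confirmed for this
// code base): reject when the tx count has reached the maximum, or when
// adding the incoming tx would push the total bytes past the byte limit.
func wouldBeRejected(numTxs, maxTxs int, txsBytes, maxTxsBytes, incomingBytes int64) bool {
	return numTxs >= maxTxs || txsBytes+incomingBytes > maxTxsBytes
}

func main() {
	const (
		numTxs      = 268
		maxTxs      = 5000
		txsBytes    = int64(1072647220)
		maxTxsBytes = int64(1073741824)
	)
	// Bytes still available before the limit is hit.
	fmt.Println("remaining bytes:", maxTxsBytes-txsBytes)

	// Roughly 1.2 MiB; the size of the rejected tx is not in the log.
	incoming := int64(1258291)
	fmt.Println("rejected:", wouldBeRejected(numTxs, maxTxs, txsBytes, maxTxsBytes, incoming))
}
```

Under this assumption the mempool has 1094604 bytes of headroom, so any incoming transaction larger than that would be reported as "mempool is full".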

Each block does not have more than 8 txs. Sometimes I even just saw 2 txs per block.
INF finalizing commit of block hash={} height=59 module=consensus num_txs=8 root=E5A106FF9549C30137FFABDCA454741A26645072C5D0E4BB81E4E7A55363D938

Well, it depends on the block size. All the submitted txs are 1.2 MiB in size, so a block of size 8 MiB cannot accommodate more than 8 txs. However, for block sizes of 32 MiB and 64 MiB, it should ideally go up to 26 and 53 txs, respectively. For which test did you see this issue?
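The per-block figures above follow from integer division of the block size by the tx size. A minimal sketch, assuming every tx is exactly 1.2 MiB and ignoring any per-block or per-tx overhead; `maxTxsPerBlock` is a hypothetical helper name.

```go
package main

import "fmt"

const mib = 1 << 20 // bytes in one MiB

// maxTxsPerBlock returns how many txs of txBytes fit into blockBytes,
// ignoring any encoding or protocol overhead.
func maxTxsPerBlock(blockBytes, txBytes int64) int64 {
	if txBytes <= 0 {
		return 0
	}
	return blockBytes / txBytes
}

func main() {
	txBytes := int64(12 * mib / 10) // 1.2 MiB, truncated to whole bytes
	for _, blockMiB := range []int64{8, 32, 64} {
		fmt.Printf("%d MiB block: at most %d txs\n",
			blockMiB, maxTxsPerBlock(blockMiB*mib, txBytes))
	}
}
```

This reproduces the 26 and 53 quoted for 32 MiB and 64 MiB blocks. For an 8 MiB block it gives 6, which is within the "no more than 8" bound; actual tx sizes in a run may differ from the nominal 1.2 MiB.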

Reading the blockchain takes very long -> more than 10m. Is there a way to speed that up?

Yes, I witnessed that too. We need to investigate. In my tests, I made the block height range smaller just to get a portion of the blocks to check if the network is operating. However, in the long term, we might want to speed up the reading process.
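The workaround of reading only a portion of the chain can be sketched as picking a window of the most recent heights. This is an illustration under assumptions: `heightWindow` is a hypothetical helper, and how the test actually fetches blocks is not shown in this thread.

```go
package main

import "fmt"

// heightWindow returns the inclusive range [from, to] covering the last n
// heights up to latest, clamped so that from is never below 1.
// It returns (0, 0) when latest or n is not positive.
func heightWindow(latest, n int64) (from, to int64) {
	if latest < 1 || n < 1 {
		return 0, 0
	}
	from = latest - n + 1
	if from < 1 {
		from = 1
	}
	return from, latest
}

func main() {
	// Read only the last 10 blocks of a chain currently at height 59.
	from, to := heightWindow(59, 10)
	fmt.Printf("reading heights %d..%d\n", from, to)
}
```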

evan-forbes · 2024-05-25T02:53:45Z

Observed the following for all tests > 8MiB

I suppose this was a flakey behaviour? or not?

If this is the timeout propose, then we should not expect to reach consensus over 8MB blocks within that period.

ERR rejected valid incoming transaction; mempool is full err="mempool is full: number of txs 268 (max: 5000), total txs bytes 1072647220 (max: 1073741824)" tx=754F73CFD941C4D1AF82BE111B06F6F87E74BC8BC4D54C474472913E02D8C40C

woah why do we have over a gigabyte of txs in the mempool? how many sequences/txsim instances are we running total?


smuu · 2024-06-03T07:24:15Z

Is there anything else the DevOps team can do to help close this issue?

staheri14 · 2024-06-24T22:38:43Z

Thanks @smuu and DevOps for your great help with this issue!

I am trying to run a 100-node test (100 validators + 100 txsims) but haven't been successful. Is it actually possible to run a test at this scale, or should we stick to a 50-node test (50 validators + 50 txsims)?

I also tried running the tests for 50 nodes but couldn't make a successful run. The issue is that the first txclient or validator is unable to start:

2024/06/24 15:27:36 failed to run the benchmark test: failed to start testnet: node val0 failed to start: timeout while waiting for instance 'val0-4f0b03bb' to be running

2024/06/24 15:37:53 failed to run the benchmark test: failed to start testnet: txsim txsim0 failed to start: timeout while waiting for instance 'txsim0-e6158a5a' to be running

When I attempt to access the logs, the nodes seem to have been torn down, so I am not sure what the root cause is.

I also tried running tests for a smaller network, and the same issue occurs, so maybe there is something off with the cluster?

I'll keep you posted if I find something new.

evan-forbes · 2024-11-18T13:48:54Z

closing as completed

github-actions bot added the needs:triage label May 16, 2024

staheri14 added the labels WS: Big Blonks 🔭 (Improving consensus critical gossiping protocols) and knuu (item is directly related to the usage of knuu), and removed the needs:triage label May 16, 2024

smuu self-assigned this May 17, 2024

smuu added this to Celestia DevOps/Testing May 17, 2024

evan-forbes mentioned this issue May 21, 2024

Sanity check testground #3147

Closed

evan-forbes mentioned this issue May 27, 2024

Replicate the dynamic timeout issues in a knuu test celestiaorg/celestia-core#1333

Closed

evan-forbes added and removed the needs:discussion label (item needs to be discussed as a group in the next sync. if marking an item, pls be prepped to talk) Jun 24, 2024

evan-forbes closed this as completed Nov 18, 2024

github-project-automation bot moved this to Done in Celestia DevOps/Testing Nov 18, 2024
