Investigating large e2e network test failures due to deployment issues #3488
Comments
@staheri14 It could be due to knuu's internal timeout handler. Have you tried increasing the timeout here? Simply set this env var before the following line (celestia-app/test/e2e/testnet/testnet.go, line 42 at 6d83e7d):

```go
if err := os.Setenv("KNUU_TIMEOUT", "360m"); err != nil {
    return nil, err
}
```

Note: in the coming refactor, the env var will be removed, and it will hopefully be easier for users.
I was able to run the following tests on the branch smuu/celestiaorg-celestia-app:smuu/improvements-to-big-block-tests, which is based on the branch celestiaorg/celestia-app:sanaz/big-block-test:
See the PR: #3493. I made some fixes to the tests and some improvements to speed up the process. While doing that, I observed the following.
Currently, the
Thanks a lot @smuu for your great work on this! I also ran the tests, and they worked for me as well!
I suppose this was flaky behaviour? Or not?
It is interesting: none of the indicated values exceed the reported limits, i.e., 286 < 5000 and 1072647220 < 1073741824. Why does it fail?
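For context, a quick check of the headroom in those numbers (this assumes they are the mempool's tx-count and total-bytes limits; that interpretation is mine, not confirmed in the thread): the byte total is only about 1 MiB under its cap, so a single additional large transaction could already be rejected even though both reported values sit below their limits.

```go
package main

import "fmt"

// Quick arithmetic on the values quoted above (illustration only).
func main() {
	const (
		txCount      = 286
		txCountLimit = 5000
		mempoolBytes = 1072647220
		byteLimit    = 1 << 30 // 1073741824 bytes, i.e. 1 GiB
	)
	fmt.Println("tx headroom:  ", txCountLimit-txCount)   // 4714 txs
	fmt.Println("byte headroom:", byteLimit-mempoolBytes) // 1094604 bytes ≈ 1.04 MiB
}
```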
Well, it depends on the block size, all the submitted txs are of size
Yes, I witnessed that too. We need to investigate. In my tests, I made the block height range smaller just to read a portion of the blocks and check whether the network is operating. However, in the long term, we might want to speed up the reading process.
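As an illustration of that kind of spot check, here is a minimal sketch that polls only a small slice of heights through a node's CometBFT RPC instead of reading every block; the RPC address and height range are placeholders, not values from the actual test.

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	const rpc = "http://localhost:26657" // placeholder node RPC address
	// Probe a narrow height range to confirm the chain is producing blocks.
	for h := int64(100); h <= 110; h++ {
		resp, err := http.Get(fmt.Sprintf("%s/block?height=%d", rpc, h))
		if err != nil {
			fmt.Println("height", h, "unreachable:", err)
			continue
		}
		resp.Body.Close()
		fmt.Println("height", h, "->", resp.Status)
	}
}
```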
If this is the timeout_propose, then we should not expect to reach consensus over 8 MB blocks within that period.
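If raising that timeout is the route taken, the sketch below shows where the knob lives in the consensus config; note the import path is the upstream CometBFT one and the 10s value is arbitrary, since I don't know which fork or value the e2e setup actually uses.

```go
package main

import (
	"fmt"
	"time"

	"github.com/cometbft/cometbft/config"
)

func main() {
	cfg := config.DefaultConfig()
	// Give proposers more time to build and gossip very large blocks.
	cfg.Consensus.TimeoutPropose = 10 * time.Second // arbitrary example value
	fmt.Println("timeout_propose:", cfg.Consensus.TimeoutPropose)
}
```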
Woah, why do we have over a gigabyte of txs in the mempool? How many sequences/txsim instances are we running in total?
Is there anything else the DevOps team can help with to close this issue?
Thanks @smuu and DevOps for your great help with this issue! I am trying to run a 100-node test (100 validators + 100 txsims) but haven't been successful. Is it actually possible to run a test at this scale, or should we stick to a 50-node test (50 validators + 50 txsims)? I also tried running the tests for 50 nodes but couldn't make a successful run. The issue is that the first txclient or validator is unable to start:
When I attempt to access the logs, the nodes seem to have been torn down, so I am not sure what the root cause is. I also tried running tests for a smaller network, and the same issue occurs, so maybe there is something off with the cluster? I'll keep you posted if I find something new.
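One way to avoid losing the evidence when instances are torn down is to stream pod logs to the test runner's stdout while the test is still alive. Below is a rough client-go sketch; the namespace and pod name are placeholders, and whether the e2e harness exposes the pods like this is an assumption on my part.

```go
package main

import (
	"context"
	"io"
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig the same way kubectl does.
	loader := clientcmd.NewDefaultClientConfigLoadingRules()
	restCfg, err := clientcmd.NewNonInteractiveDeferredLoadingClientConfig(
		loader, &clientcmd.ConfigOverrides{}).ClientConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(restCfg)
	if err != nil {
		panic(err)
	}
	// Placeholder namespace and pod name; follow the logs until the pod dies.
	req := clientset.CoreV1().Pods("e2e-test").GetLogs("txclient-0",
		&corev1.PodLogOptions{Follow: true})
	stream, err := req.Stream(context.Background())
	if err != nil {
		panic(err)
	}
	defer stream.Close()
	io.Copy(os.Stdout, stream) // mirror the pod's output so it survives teardown
}
```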
closing as completed
Running a large-network e2e test consisting of 100 Knuu instances currently runs into issues that cause the test to fail halfway through. The primary problems are that images cannot be reused, requiring the creation of a new Docker image for each instance, and that tests fail to deploy mid-process for various reasons. The former is to be resolved in the new release of knuu (there is another issue tracking the integration of the new release candidate); however, the latter still needs investigation.
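While the deployment failures are being investigated, one pragmatic mitigation is to retry the failing deployment step a few times before declaring the run failed. The sketch below is my own illustration, and deployTestnet is a hypothetical stand-in for whatever call actually fails.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// deployWithRetry re-runs a deployment step with a growing pause between
// attempts, which can paper over one-off, nondeterministic failures.
func deployWithRetry(ctx context.Context, attempts int, deployTestnet func(context.Context) error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = deployTestnet(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Duration(i+1) * 10 * time.Second):
		}
	}
	return fmt.Errorf("deployment failed after %d attempts: %w", attempts, err)
}

func main() {
	// Example usage with a stand-in deployment step that always succeeds.
	err := deployWithRetry(context.Background(), 3, func(ctx context.Context) error { return nil })
	fmt.Println("result:", err)
}
```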
Below are some of the errors observed when tests failed to complete:
Main Reasons for Failures with Sample Error Logs:
The following errors happen nondeterministically and usually disappear from one run to another.
Replicate the issue