ARM E2E test consistently failed #15647
Comments
@kevinzs2048 @dims @geetasg for awareness.
Is anyone looking at this problem? These tests have not been passing for quite some time. The latest errors happen when etcd is started and cannot find the db file [1]. To debug further we need access to the Equinix machines (link). I can ask for access if no one is looking at it. [1]
I see lots of leftover etcd processes. @dims, who has admin permission on the arm machines? The current workaround is to SSH into the machines and manually clean up all the running etcd processes, but in the meantime we also need to think about a long-term solution.
Hi @ahrtr @dims, I have login and root permission on this node: etcd-c3-large-arm64-runner-01, and have already cleaned up all the etcd processes as shown below.
Thank you @kevinzs2048 for the quick response.
Thanks @kevinzs2048!! Let's see whether the issue recurs and think about long-term solutions afterwards.
Maybe we can consider running it in a container with host networking and add …
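For illustration, a containerized variant of the job might look roughly like the sketch below; the image, runner labels, and make target are assumptions for this example, not the project's actual workflow.

```yaml
# Hypothetical sketch: run the arm64 e2e job inside a container on the
# self-hosted runner, so leftover test processes die with the container.
name: e2e-arm64-containerized-sketch
on: workflow_dispatch                      # illustrative trigger only
jobs:
  test-e2e-arm64:
    runs-on: [self-hosted, Linux, ARM64]   # assumed runner labels
    container:
      image: golang:1.19-bullseye          # assumed image
      options: --network host              # host networking, as suggested above
    steps:
      - uses: actions/checkout@v3
      - name: Run e2e tests
        run: make test-e2e                 # assumed make target
```

Since the tests run in the container's own PID namespace, anything they leak is reaped when the container exits, regardless of the state of the runner host.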
This seems like a GitHub runner issue, not an issue with our tests. I don't want to rewrite our tests just because the GitHub runner doesn't isolate its environment. Please consider having the GitHub runner isolated.
Latest arm64 e2e test failed https://github.com/etcd-io/etcd/actions/runs/4662910883/jobs/8253760517
Right before the timeout, it was testing … We can revisit it after #15637 is merged.
https://github.com/etcd-io/etcd/actions/runs/4695428620/jobs/8324592814 The issue has re-occurred. As discussed in #14748, if we want to announce arm64 as Tier 1 support, we need to resolve this process leak issue permanently. I remember the last arm64 test attempt failed due to lack of maintenance (#13181). @kevinzs2048 @dims Is there a way to safely share the permissions with etcd members so anyone can take a look at it?
@chaochn47 yes, please share your SSH public key in a Slack DM with me and I'll add it to the boxes.
@chaochn47 has confirmed that he has access!
The E2E test on ARM64 consistently succeeds now. However, I do see the risks of a self-hosted GitHub runner.
It's likely a GitHub runner issue.
Can we raise & track an issue for the GitHub runner? Thanks.
It seems that this isn't the correct way to specify the branch (see etcd/.github/workflows/e2e-arm64.yaml, line 23 in 4785f5a). I also do not see any workflows configured to run on 3.5 or 3.4. If we want to verify release-3.5 and release-3.4, the best way is probably to backport the workflows to those branches as well. But let's get the above GitHub runner issue sorted out first.
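As a side note on the branch question: branch filters in GitHub Actions only apply to push/pull_request triggers, while scheduled runs always execute against the default branch. A rough illustration (not the actual workflow file):

```yaml
# Illustrative only: how trigger-level branch selection behaves.
on:
  push:
    branches: [main, release-3.5, release-3.4]   # branch filter is honored here
  pull_request:
    branches: [main]                             # and here
  schedule:
    - cron: "0 2 * * *"   # scheduled runs always use the default branch; to run
                          # the job on release-3.5/release-3.4, the workflow file
                          # itself has to exist on those branches
```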
I want to reproduce it again manually with a fresh new self-hosted GitHub runner setup (https://github.com/chaochn47/etcd/actions/runs/4866105332/jobs/8679105017) before raising an issue in https://github.com/actions/runner/issues, just in case it's an environment issue.
Unfortunately :(, I am not able to reproduce the process leak issue after intentionally failing the test suite with the changes. All the etcd processes were cleaned up. I am inclined to continue looking into the other failures mentioned in #15647 (comment).
It looks like /home/runner/actions-runner/_work/etcd/etcd will not be cleaned up after a test fails, which can also introduce git command errors like this. That PR is closed, but it can serve as a reference anyway.
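One possible mitigation, sketched below purely for illustration (the test step, process pattern, and paths are assumptions), is an `if: always()` cleanup step that resets the persistent runner after every run:

```yaml
# Illustrative sketch only: reset the persistent self-hosted runner even
# when the tests fail.
name: e2e-arm64-cleanup-sketch
on: workflow_dispatch                      # illustrative trigger only
jobs:
  test-e2e-arm64:
    runs-on: [self-hosted, Linux, ARM64]
    steps:
      - uses: actions/checkout@v3
      - name: Run e2e tests
        run: make test-e2e                 # assumed test step
      - name: Clean up workspace and leftover processes
        if: always()                       # runs even when the test step fails
        run: |
          pkill -f '/bin/etcd' || true        # kill any etcd processes the tests leaked
          rm -rf "${GITHUB_WORKSPACE:?}"/*    # drop stale checkout and build artifacts
```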
I think the …
Happy to add some ideas specific to the Equinix configuration here, from my perspective at the CIL and at Equinix. If I understand things right, the E2E tests need to run on a machine with a clean state. They are also running on a system that is up 24x7, even though the testing itself is only one run daily.

A possible solution is to get a freshly built system for every test. The Equinix API lets you provision a system as needed, using something like Terraform or cloud-init to do the initial environment setup so that it's ready to run tests. You'd then execute the E2E tests, and once they were done and you had the results you'd tear down the machine(s). We have a similar setup/teardown pattern for a few other projects and it has been working well.

One big benefit is that you get a clean state every time. From a resources/sustainability point of view, a machine used only once a day is less expensive than something up 24x7. And if you needed more than 1 or 2 systems, you could burst to bigger runs and tear the machines down afterwards.

Getting the Terraform or cloud-init exactly right to set up a runner test environment is certainly a separate issue from this one and should be tracked separately.
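To make the provision-per-run idea a bit more concrete, here is a rough cloud-init sketch for a throwaway Equinix arm64 host that registers an ephemeral Actions runner; the runner version, download URL, and token handling are assumptions for illustration, not anything the project currently uses.

```yaml
#cloud-config
# Hypothetical user-data for a freshly provisioned, single-use arm64 host.
packages:
  - git
  - build-essential
runcmd:
  - curl -sL -o /opt/runner.tar.gz https://github.com/actions/runner/releases/download/v2.304.0/actions-runner-linux-arm64-2.304.0.tar.gz
  - mkdir -p /opt/actions-runner && tar xzf /opt/runner.tar.gz -C /opt/actions-runner
  # --ephemeral deregisters the runner after a single job, so the host can be
  # deleted through the Equinix API (or terraform destroy) once the run is done.
  - >-
    cd /opt/actions-runner &&
    export RUNNER_ALLOW_RUNASROOT=1 &&
    ./config.sh --url https://github.com/etcd-io/etcd
    --token <RUNNER_REGISTRATION_TOKEN> --ephemeral --unattended &&
    ./run.sh
```

The teardown half would simply delete the device through the Equinix API (or terraform destroy) once the workflow has reported its result.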
Hey @chaochn47, @ahrtr, @serathius - I would like to propose we close this issue. The work we have done to prevent process leaks in the workflows by running them inside containers looks to be paying off; all our recent arm64 runs have been green. We found one issue during the soak period with some unexpected cache behavior, but that has now been addressed. Please let me know if you have any objections; otherwise I will close it after a couple of days for lazy consensus.
SGTM. Appreciate the great work~ @jmhbnz
Great work, good to hear that the improved workflows make everything run smoothly. Thanks! Ed
Thanks @jmhbnz for driving and coordinating this, and thanks also to @chaochn47, @kevinzs2048, @fuweid and @vielmetti!
What happened?
All the arm64 E2E test workflow runs failed within 2 minutes, based on https://github.com/etcd-io/etcd/actions/workflows/e2e-arm64.yaml
What did you expect to happen?
As part of the effort toward Tier 1 support for Arm64, the periodic test job should pass continuously.
Anything else we need to know?
It has been failing for 2 months.