
ARM E2E test consistently failed #15647

Closed
chaochn47 opened this issue Apr 6, 2023 · 24 comments

@chaochn47
Member

chaochn47 commented Apr 6, 2023

What happened?

All the arm64 E2E test workflow runs failed within 2 minutes; see https://github.com/etcd-io/etcd/actions/workflows/e2e-arm64.yaml

What did you expect to happen?

As part of the effort to provide Tier 1 support for arm64, the periodic test job should pass consistently.

Anything else we need to know?

It has been failing for 2 months.

@chaochn47
Member Author

@kevinzs2048 @dims @geetasg for awareness.

@pchan
Contributor

pchan commented Apr 10, 2023

Is there anyone looking at this problem? These tests have not been passing for quite some time. The latest errors happen when trying to start etcd and then not finding the db file [1].

To debug further we need access to the Equinix machines (link). I can ask for access if no one else is looking at it.

[1]

2023-04-10T01:34:14.6932214Z /home/runner/actions-runner/_work/etcd/etcd/bin/etcd-last-release (TestReleaseUpgrade-test-1) (1799029): {"level":"panic","ts":"2023-04-10T01:34:12.320Z","caller":"backend/backend.go:182","msg":"failed to open database","path":"/tmp/TestReleaseUpgrade2842786494/003/member/snap/db","error":"open /tmp/TestReleaseUpgrade2842786494/003/member/snap/db: no such file or directory","stacktrace":"go.etcd.io/etcd/server/v3/mvcc/backend.newBackend /tmp/etcd-release-3.5.0/etcd/release/etcd/server/mvcc/backend/backend.go:182\ngo.etcd.io/etcd/server/v3/mvcc/backend.New /tmp/etcd-release-3.5.0/etcd/release/etcd/server/mvcc/backend/backend.go:156\ngo.etcd.io/etcd/server/v3/verify.Verify /tmp/etcd-release-3.5.0/etcd/release/etcd/server/verify/verify.go:76\ngo.etcd.io/etcd/server/v3/verify.VerifyIfEnabled /tmp/etcd-release-3.5.0/etcd/release/etcd/server/verify/verify.go:94\ngo.etcd.io/etcd/server/v3/verify.MustVerifyIfEnabled /tmp/etcd-release-3.5.0/etcd/release/etcd/server/verify/verify.go:103\ngo.etcd.io/etcd/server/v3/embed.(*Etcd).Close.func1 /tmp/etcd-release-3.5.0/etcd/release/etcd/server/embed/etcd.go:370\ngo.etcd.io/etcd/server/v3/embed.(*Etcd).Close /tmp/etcd-release-3.5.0/etcd/release/etcd/server/embed/etcd.go:430\ngo.etcd.io/etcd/server/v3/embed.StartEtcd.func1 /tmp/etcd-release-3.5.0/etcd/release/etcd/server/embed/etcd.go:120\ngo.etcd.io/etcd/server/v3/embed.StartEtcd /tmp/etcd-release-3.5.0/etcd/release/etcd/server/embed/etcd.go:144\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd /tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:227\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2 /tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:134\ngo.etcd.io/etcd/server/v3/etcdmain.Main /tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main
/tmp/etcd-release-3.5.0/etcd/release/etcd/server/main.go:32\nruntime.main
/home/remote/sbatsche/.gvm/gos/go1.16.3/src/runtime/proc.go:225"}

@ahrtr
Member

ahrtr commented Apr 10, 2023

I see lots of "bind: address already in use" errors; it could be that some etcd processes did not exit after the test finished. @tjungblu @serathius

@dims who has admin permission on the arm machines? The current workaround is to SSH into the machines and manually clean up all the running etcd processes.

But in the meantime, we also need to think about a long-term solution.

        {"level":"fatal","ts":"2023-04-10T01:34:12.62252Z","caller":"etcdmain/etcd.go:182","msg":"discovery failed","error":"listen tcp 127.0.0.1:20001: bind: address already in use","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\tgo.etcd.io/etcd/server/v3/etcdmain/etcd.go:182\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\tgo.etcd.io/etcd/server/v3/etcdmain/main.go:40\nmain.main\n\tgo.etcd.io/etcd/server/v3/main.go:31\nruntime.main\n\truntime/proc.go:250"}
        )
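
When this error shows up on the runner, a quick check like the following (a minimal sketch; either command works) can identify which leftover process is still holding the port from the log above:

# Show the process currently bound to the conflicting port.
sudo ss -ltnp | grep ':20001'

# Or, equivalently, with lsof.
sudo lsof -i :20001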

@kevinzs2048
Contributor

Hi @ahrtr @dims, I have login and root permission on this node: etcd-c3-large-arm64-runner-01, and have already cleaned up all the etcd processes shown below.

root@etcd-c3-large-arm64-runner-01:~# ps -ef | grep etcd
root     1810525 1810510  0 06:26 pts/4    00:00:00 grep --color=auto etcd
runner   3146958       1  1 Feb24 ?        21:20:24 /home/runner/actions-runner/_work/etcd/etcd/bin/etcd --name TestUserChangePasswordPeerTLS-test-2 --listen-client-urls http://localhost:20010 --advertise-client-urls http://localhost:20010 --listen-peer-urls https://localhost:20011 --initial-advertise-peer-urls https://localhost:20011 --initial-cluster-token new --data-dir /tmp/TestUserChangePasswordPeerTLS1635779488/003 --snapshot-count 100000 --strict-reconfig-check=false --peer-cert-file /home/runner/actions-runner/_work/etcd/etcd/tests/fixtures/server.crt --peer-key-file /home/runner/actions-runner/_work/etcd/etcd/tests/fixtures/server.key.insecure --peer-trusted-ca-file /home/runner/actions-runner/_work/etcd/etcd/tests/fixtures/ca.crt --initial-cluster TestUserChangePasswordPeerTLS-test-0=https://localhost:20001,TestUserChangePasswordPeerTLS-test-1=https://localhost:20006,TestUserChangePasswordPeerTLS-test-2=https://localhost:20011 --initial-cluster-state new
runner   3146959       1  1 Feb24 ?        21:30:15 /home/runner/actions-runner/_work/etcd/etcd/bin/etcd --name TestUserChangePasswordPeerTLS-test-0 --listen-client-urls http://localhost:20000 --advertise-client-urls http://localhost:20000 --listen-peer-urls https://localhost:20001 --initial-advertise-peer-urls https://localhost:20001 --initial-cluster-token new --data-dir /tmp/TestUserChangePasswordPeerTLS1635779488/001 --snapshot-count 100000 --strict-reconfig-check=false --peer-cert-file /home/runner/actions-runner/_work/etcd/etcd/tests/fixtures/server.crt --peer-key-file /home/runner/actions-runner/_work/etcd/etcd/tests/fixtures/server.key.insecure --peer-trusted-ca-file /home/runner/actions-runner/_work/etcd/etcd/tests/fixtures/ca.crt --initial-cluster TestUserChangePasswordPeerTLS-test-0=https://localhost:20001,TestUserChangePasswordPeerTLS-test-1=https://localhost:20006,TestUserChangePasswordPeerTLS-test-2=https://localhost:20011 --initial-cluster-state new
runner   3146960       1  2 Feb24 ?        21:44:01 /home/runner/actions-runner/_work/etcd/etcd/bin/etcd --name TestUserChangePasswordPeerTLS-test-1 --listen-client-urls http://localhost:20005 --advertise-client-urls http://localhost:20005 --listen-peer-urls https://localhost:20006 --initial-advertise-peer-urls https://localhost:20006 --initial-cluster-token new --data-dir /tmp/TestUserChangePasswordPeerTLS1635779488/002 --snapshot-count 100000 --strict-reconfig-check=false --peer-cert-file /home/runner/actions-runner/_work/etcd/etcd/tests/fixtures/server.crt --peer-key-file /home/runner/actions-runner/_work/etcd/etcd/tests/fixtures/server.key.insecure --peer-trusted-ca-file /home/runner/actions-runner/_work/etcd/etcd/tests/fixtures/ca.crt --initial-cluster TestUserChangePasswordPeerTLS-test-0=https://localhost:20001,TestUserChangePasswordPeerTLS-test-1=https://localhost:20006,TestUserChangePasswordPeerTLS-test-2=https://localhost:20011 --initial-cluster-state new
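
For reference, a minimal cleanup sketch along the same lines, assuming direct shell access to the runner and that no etcd process on the host should outlive a test run:

# List leftover etcd test processes from previous runs.
pgrep -af 'actions-runner/_work/etcd/etcd/bin/etcd'

# Terminate them (escalate to -9 only if they ignore SIGTERM).
sudo pkill -f 'actions-runner/_work/etcd/etcd/bin/etcd'

# Optionally remove stale test data directories left under /tmp.
sudo rm -rf /tmp/Test*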

@ahrtr
Member

ahrtr commented Apr 10, 2023

Thank you @kevinzs2048 for the quick response.

@chaochn47
Member Author

Thanks @kevinzs2048!! Let's see if the issue recurs and think about long-term solutions afterwards.

@fuweid
Member

fuweid commented Apr 11, 2023

Maybe we can consider running it in a container with host networking and adding an if: always() step to clean up the container.
All the leaked processes would then be cleaned up when the container is deleted.
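
A rough sketch of that idea using plain docker commands; the image, container name, and make target here are placeholders, and in a GitHub Actions workflow the final cleanup command would sit in a step guarded by if: always():

# Run the e2e suite inside a container that shares the host network.
docker run -d --name etcd-e2e --network host \
  -v "$PWD":/workspace -w /workspace golang:1.19 \
  make test-e2e

# Wait for the tests to finish and collect the output.
docker wait etcd-e2e
docker logs etcd-e2e

# Cleanup: removing the container also kills any etcd processes it leaked.
docker rm -f etcd-e2e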

@serathius
Member

serathius commented Apr 11, 2023

This seems like a GitHub runner issue, not an issue with our tests. I don't want to rewrite our tests just because the GitHub runner doesn't isolate its environment. Please consider isolating the GitHub runner instead.

@chaochn47
Member Author

chaochn47 commented Apr 11, 2023

Latest arm64 e2e test failed https://github.com/etcd-io/etcd/actions/runs/4662910883/jobs/8253760517

panic: test timed out after 30m0s

Right before the timeout, it was running TestUserDelete/MinorityLastVersion in tests/common.

We can revisit it after #15637 is merged.

@chaochn47
Member Author

chaochn47 commented Apr 14, 2023

{"level":"fatal","ts":"2023-04-14T01:32:45.585974Z","caller":"etcdmain/etcd.go:182","msg":"discovery failed","error":"listen tcp 127.0.0.1:20001: bind: address already in use","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\tgo.etcd.io/etcd/server/v3/etcdmain/etcd.go:182\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\tgo.etcd.io/etcd/server/v3/etcdmain/main.go:40\nmain.main\n\tgo.etcd.io/etcd/server/v3/main.go:31\nruntime.main\n\truntime/proc.go:250"}

https://github.com/etcd-io/etcd/actions/runs/4695428620/jobs/8324592814

The issue has re-occurred. As discussed in #14748, if we want to announce arm64 as Tier 1 supported, we need to resolve this process leak issue permanently. I remember the last arm64 test attempt failed due to lack of maintenance (#13181).

@kevinzs2048 @dims Is there a way to safely share access with etcd members so anyone can take a look at it?

@dims
Contributor

dims commented Apr 14, 2023

@chaochn47 yes, please share your SSH public key in a Slack DM with me and I'll add it to the boxes
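
For completeness, granting access on the box is just appending the key to the runner user's authorized_keys; a minimal sketch, assuming the runner user is named runner and using a hypothetical key file name:

sudo mkdir -p /home/runner/.ssh
cat chaochn47_id_ed25519.pub | sudo tee -a /home/runner/.ssh/authorized_keys
sudo chown -R runner:runner /home/runner/.ssh
sudo chmod 700 /home/runner/.ssh
sudo chmod 600 /home/runner/.ssh/authorized_keys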

@chaochn47
Member Author

chaochn47 commented Apr 14, 2023

Thanks @dims, I just sent you the SSH public keys.

Unfortunately, I don't have enough bandwidth to fix it right now; maybe later next week. I am currently focusing on #15708.

If anyone else is interested in helping maintain the arm64 tests, please jump in ~

@dims
Contributor

dims commented Apr 19, 2023

@chaochn47 has confirmed that he has access!

@chaochn47
Member Author

chaochn47 commented May 1, 2023

The arm64 E2E test consistently succeeds now. However, I do see some risks with the self-hosted GitHub runner:

  1. etcd processes were leaked after the previous failed test attempt.
  2. The next test run does not start from a clean state; the test result is cached (see the sketch at the end of this comment).
% (cd tests && 'env' 'ETCD_VERIFY=all' 'go' 'test' 'go.etcd.io/etcd/tests/v3/common' '--tags=e2e' '-timeout=30m')
ok  	go.etcd.io/etcd/tests/v3/common	(cached)
  3. failed to run actions/setup-go
Warning: Unexpected input(s) 'ref', valid inputs are ['go-version', 'go-version-file', 'check-latest', 'token', 'cache', 'cache-dependency-path', 'architecture']

/usr/bin/tar --use-compress-program zstd -d -xf /home/runner/actions-runner/_work/_temp/a5078ef8-33d1-48be-90cc-589cb85ce484/cache.tzst -P -C /home/runner/actions-runner/_work/etcd/etcd
...
/usr/bin/tar: ../../../../go/pkg/mod/golang.org/x/[email protected]/windows/mkknownfolderids.bash: Cannot open: File exists
/usr/bin/tar: Exiting with failure status due to previous errors
Warning: Failed to restore: Tar failed with error: The process '/usr/bin/tar' failed with exit code 2

It's likely a GitHub runner issue.
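
For observation 2, a minimal sketch for making sure the suite actually re-runs instead of reporting a cached result (assuming the runner keeps its Go build and test caches between jobs):

# Drop cached test results before the run.
go clean -testcache

# Or force re-execution for a single invocation with -count=1.
(cd tests && env ETCD_VERIFY=all go test go.etcd.io/etcd/tests/v3/common --tags=e2e -timeout=30m -count=1)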

@ahrtr
Member

ahrtr commented May 1, 2023

etcd processes were leaked after previous failed test attempt.

Can we raise & track an issue for github runner? thx.

Warning: Unexpected input(s) 'ref', valid inputs are ['go-version', 'go-version-file', 'check-latest', 'token', 'cache', 'cache-dependency-path', 'architecture']

It seems that this isn't the correct way to specify the branch. I also do not see any workflows configured to run on 3.5 and 3.4. If we want to verify release-3.5 and release-3.4, probably the best way is to backport the workflows to those branches as well. But let's get the above GitHub runner issue sorted out first.

@chaochn47
Member Author

chaochn47 commented May 3, 2023

Can we raise & track an issue for github runner? thx.

I want to reproduce it again manually with a fresh self-hosted GitHub runner setup (https://github.com/chaochn47/etcd/actions/runs/4866105332/jobs/8679105017) before raising an issue in https://github.com/actions/runner/issues, just in case it's an environment issue.

@chaochn47
Member Author

chaochn47 commented May 4, 2023

Unfortunately :( I was not able to reproduce the process leak issue after intentionally failing the test suite with the changes.

All the etcd processes were cleaned up.

I am inclined to continue looking into other failures mentioned in #15647 (comment).

@kevinzs2048
Contributor

It looks like /home/runner/actions-runner/_work/etcd/etcd is not cleaned up after a test fails, which can also introduce git command errors like this. That PR is closed, but it can still serve as a reference.

@serathius
Member

It looks like /home/runner/actions-runner/_work/etcd/etcd is not cleaned up after a test fails, which can also introduce git command errors like this. That PR is closed, but it can still serve as a reference.

I think the checkout action caches the git repo to avoid re-downloading files. There are two ways we could work around it (a sketch of the second option follows this list):

  • Disable the cache behavior
  • Ensure a clean state before/after the workflow is run.
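
A minimal sketch of that second option as a pre- or post-job cleanup on the self-hosted runner; the workspace path matches the one quoted above, and whether to prefer git clean or a full removal is a judgment call:

WORKSPACE=/home/runner/actions-runner/_work/etcd/etcd

# Reset anything a previous run left behind in the checkout.
git -C "$WORKSPACE" clean -ffdx || true
git -C "$WORKSPACE" reset --hard || true

# Or drop the workspace entirely and let actions/checkout re-clone it.
rm -rf "$WORKSPACE"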

@vielmetti

Happy to add some ideas specific to the Equinix configuration here, from my perspective at the CIL and at Equinix.

If I understand things right, the E2E tests should run on a machine with a clean state. They are also running on a system that is up 24x7, even though the testing itself is only one run daily.

A possible solution to this is to get a freshly built system for every test. The Equinix API lets you provision a system as needed, using something like Terraform or cloud-init to do the initial environment setup so that it's ready to run tests. You'd then execute the E2E tests, and when they were done and you had the results you'd tear down the machine(s).
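
A rough sketch of that provision-run-teardown cycle against the Equinix Metal API; the endpoint, JSON field names, plan, metro, and OS slug below are assumptions to double-check against the API documentation:

# Assumes METAL_TOKEN and METAL_PROJECT are set and jq is installed.
DEVICE_ID=$(curl -s -X POST \
  -H "X-Auth-Token: $METAL_TOKEN" -H "Content-Type: application/json" \
  -d '{"plan":"c3.large.arm64","metro":"da","operating_system":"ubuntu_22_04","hostname":"etcd-arm64-e2e"}' \
  "https://api.equinix.com/metal/v1/projects/$METAL_PROJECT/devices" | jq -r .id)

# ... wait for the device to become active, apply the cloud-init/Terraform setup,
# register it as a GitHub Actions runner, and run the E2E workflow ...

# Tear the machine down once the results have been collected.
curl -s -X DELETE -H "X-Auth-Token: $METAL_TOKEN" \
  "https://api.equinix.com/metal/v1/devices/$DEVICE_ID"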

We have a similar setup/teardown pattern for a few other projects and it has been working well. One big benefit is that you get a clean state every time. From a resources/sustainability point of view, it means that a machine used only once a day is less expensive than something up 24x7. And if you needed more than 1 or 2 systems you could burst to bigger runs where you did the teardown afterwards.

Getting the Terraform or cloud-init exactly right to set up a runner test environment is certainly a separate issue from this one & should be tracked separately.

@jmhbnz
Member

jmhbnz commented Jun 6, 2023

Hey @chaochn47, @ahrtr, @serathius - I would like to propose we close this issue.

The work we have done to prevent process leaks in the workflows by running them inside containers looks to be paying off; all our arm64 nightly workflows appear to be running smoothly and green across the board.

We found one issue during the soak period with some unexpected cache behavior, but that has now been addressed.

Please let me know if you have any objections, otherwise I will close it after a couple of days for lazy consensus.

@chaochn47
Member Author

SGTM. Appreciate the great work~ @jmhbnz

@vielmetti

Great work, good to hear that the improved workflows make everything run smoothly. Thanks! Ed

@ahrtr
Member

ahrtr commented Jun 7, 2023

Thanks @jmhbnz for driving and coordinating this, also thanks @chaochn47 , @kevinzs2048 , @fuweid and @vielmetti !

@jmhbnz closed this as completed Jun 9, 2023