
ARM E2E test consistently failed #15647

Closed
chaochn47 opened this issue Apr 6, 2023 · 24 comments

@chaochn47
Member

chaochn47 commented Apr 6, 2023

What happened?

All the arm64 E2E test workflow runs failed within 2 minutes; see https://github.com/etcd-io/etcd/actions/workflows/e2e-arm64.yaml

What did you expect to happen?

As part of the effort to provide Tier 1 support for arm64, the periodic test job should pass consistently.

Anything else we need to know?

It has been failing for 2 months.

@chaochn47
Member Author

@kevinzs2048 @dims @geetasg for awareness.

@pchan
Contributor

pchan commented Apr 10, 2023

Is there anyone looking at this problem? These tests have not been passing for quite some time. The latest errors happen when trying to start etcd and then not finding the db file [1].

To debug further we need access to the Equinix machines (link). I can ask for access if no one else is looking at it.

[1]

2023-04-10T01:34:14.6932214Z /home/runner/actions-runner/_work/etcd/etcd/bin/etcd-last-release (TestReleaseUpgrade-test-1) (1799029): {"level":"panic","ts":"2023-04-10T01:34:12.320Z","caller":"backend/backend.go:182","msg":"failed to open database","path":"/tmp/TestReleaseUpgrade2842786494/003/member/snap/db","error":"open /tmp/TestReleaseUpgrade2842786494/003/member/snap/db: no such file or directory","stacktrace":"go.etcd.io/etcd/server/v3/mvcc/backend.newBackend /tmp/etcd-release-3.5.0/etcd/release/etcd/server/mvcc/backend/backend.go:182\ngo.etcd.io/etcd/server/v3/mvcc/backend.New /tmp/etcd-release-3.5.0/etcd/release/etcd/server/mvcc/backend/backend.go:156\ngo.etcd.io/etcd/server/v3/verify.Verify /tmp/etcd-release-3.5.0/etcd/release/etcd/server/verify/verify.go:76\ngo.etcd.io/etcd/server/v3/verify.VerifyIfEnabled /tmp/etcd-release-3.5.0/etcd/release/etcd/server/verify/verify.go:94\ngo.etcd.io/etcd/server/v3/verify.MustVerifyIfEnabled /tmp/etcd-release-3.5.0/etcd/release/etcd/server/verify/verify.go:103\ngo.etcd.io/etcd/server/v3/embed.(*Etcd).Close.func1 /tmp/etcd-release-3.5.0/etcd/release/etcd/server/embed/etcd.go:370\ngo.etcd.io/etcd/server/v3/embed.(*Etcd).Close /tmp/etcd-release-3.5.0/etcd/release/etcd/server/embed/etcd.go:430\ngo.etcd.io/etcd/server/v3/embed.StartEtcd.func1 /tmp/etcd-release-3.5.0/etcd/release/etcd/server/embed/etcd.go:120\ngo.etcd.io/etcd/server/v3/embed.StartEtcd /tmp/etcd-release-3.5.0/etcd/release/etcd/server/embed/etcd.go:144\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd /tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:227\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2 /tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:134\ngo.etcd.io/etcd/server/v3/etcdmain.Main /tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main
/tmp/etcd-release-3.5.0/etcd/release/etcd/server/main.go:32\nruntime.main
/home/remote/sbatsche/.gvm/gos/go1.16.3/src/runtime/proc.go:225"}

@ahrtr
Member

ahrtr commented Apr 10, 2023

I see lots of "bind: address already in use" errors; it could be that some etcd processes did not exit after the test finished. @tjungblu @serathius

@dims who has admin permission on the arm machines? The current workaround is to SSH into the machines and manually clean up all the running etcd processes.

But in the meantime, we also need to think about a long-term solution.

        {"level":"fatal","ts":"2023-04-10T01:34:12.62252Z","caller":"etcdmain/etcd.go:182","msg":"discovery failed","error":"listen tcp 127.0.0.1:20001: bind: address already in use","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\tgo.etcd.io/etcd/server/v3/etcdmain/etcd.go:182\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\tgo.etcd.io/etcd/server/v3/etcdmain/main.go:40\nmain.main\n\tgo.etcd.io/etcd/server/v3/main.go:31\nruntime.main\n\truntime/proc.go:250"}
        )
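
When this error shows up on the runner, a quick check like the following (a minimal sketch; either command works) can identify which leftover process is still holding the port from the log above:

# Show the process currently bound to the conflicting port.
sudo ss -ltnp | grep ':20001'

# Or, equivalently, with lsof.
sudo lsof -i :20001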

@kevinzs2048
Contributor

Hi @ahrtr @dims, I have login and root permission on this node: etcd-c3-large-arm64-runner-01, and have already cleaned up all the etcd processes shown below.

root@etcd-c3-large-arm64-runner-01:~# ps -ef | grep etcd
root     1810525 1810510  0 06:26 pts/4    00:00:00 grep --color=auto etcd
runner   3146958       1  1 Feb24 ?        21:20:24 /home/runner/actions-runner/_work/etcd/etcd/bin/etcd --name TestUserChangePasswordPeerTLS-test-2 --listen-client-urls http://localhost:20010 --advertise-client-urls http://localhost:20010 --listen-peer-urls https://localhost:20011 --initial-advertise-peer-urls https://localhost:20011 --initial-cluster-token new --data-dir /tmp/TestUserChangePasswordPeerTLS1635779488/003 --snapshot-count 100000 --strict-reconfig-check=false --peer-cert-file /home/runner/actions-runner/_work/etcd/etcd/tests/fixtures/server.crt --peer-key-file /home/runner/actions-runner/_work/etcd/etcd/tests/fixtures/server.key.insecure --peer-trusted-ca-file /home/runner/actions-runner/_work/etcd/etcd/tests/fixtures/ca.crt --initial-cluster TestUserChangePasswordPeerTLS-test-0=https://localhost:20001,TestUserChangePasswordPeerTLS-test-1=https://localhost:20006,TestUserChangePasswordPeerTLS-test-2=https://localhost:20011 --initial-cluster-state new
runner   3146959       1  1 Feb24 ?        21:30:15 /home/runner/actions-runner/_work/etcd/etcd/bin/etcd --name TestUserChangePasswordPeerTLS-test-0 --listen-client-urls http://localhost:20000 --advertise-client-urls http://localhost:20000 --listen-peer-urls https://localhost:20001 --initial-advertise-peer-urls https://localhost:20001 --initial-cluster-token new --data-dir /tmp/TestUserChangePasswordPeerTLS1635779488/001 --snapshot-count 100000 --strict-reconfig-check=false --peer-cert-file /home/runner/actions-runner/_work/etcd/etcd/tests/fixtures/server.crt --peer-key-file /home/runner/actions-runner/_work/etcd/etcd/tests/fixtures/server.key.insecure --peer-trusted-ca-file /home/runner/actions-runner/_work/etcd/etcd/tests/fixtures/ca.crt --initial-cluster TestUserChangePasswordPeerTLS-test-0=https://localhost:20001,TestUserChangePasswordPeerTLS-test-1=https://localhost:20006,TestUserChangePasswordPeerTLS-test-2=https://localhost:20011 --initial-cluster-state new
runner   3146960       1  2 Feb24 ?        21:44:01 /home/runner/actions-runner/_work/etcd/etcd/bin/etcd --name TestUserChangePasswordPeerTLS-test-1 --listen-client-urls http://localhost:20005 --advertise-client-urls http://localhost:20005 --listen-peer-urls https://localhost:20006 --initial-advertise-peer-urls https://localhost:20006 --initial-cluster-token new --data-dir /tmp/TestUserChangePasswordPeerTLS1635779488/002 --snapshot-count 100000 --strict-reconfig-check=false --peer-cert-file /home/runner/actions-runner/_work/etcd/etcd/tests/fixtures/server.crt --peer-key-file /home/runner/actions-runner/_work/etcd/etcd/tests/fixtures/server.key.insecure --peer-trusted-ca-file /home/runner/actions-runner/_work/etcd/etcd/tests/fixtures/ca.crt --initial-cluster TestUserChangePasswordPeerTLS-test-0=https://localhost:20001,TestUserChangePasswordPeerTLS-test-1=https://localhost:20006,TestUserChangePasswordPeerTLS-test-2=https://localhost:20011 --initial-cluster-state new
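
For reference, a minimal cleanup sketch along the same lines, assuming direct shell access to the runner and that no etcd process on the host should outlive a test run:

# List leftover etcd test processes from previous runs.
pgrep -af 'actions-runner/_work/etcd/etcd/bin/etcd'

# Terminate them (escalate to -9 only if they ignore SIGTERM).
sudo pkill -f 'actions-runner/_work/etcd/etcd/bin/etcd'

# Optionally remove stale test data directories left under /tmp.
sudo rm -rf /tmp/Test*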

@ahrtr
Member

ahrtr commented Apr 10, 2023

Thank you @kevinzs2048 for the quick response.

@chaochn47
Member Author

Thanks @kevinzs2048!! Let's see if the issue recurs and think about long-term solutions afterwards.

@fuweid
Member

fuweid commented Apr 11, 2023

Maybe we can consider running it in a container with host networking and adding an if: always() step to clean up the container.
All the leaked processes would then be cleaned up when the container is deleted.
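
A rough sketch of that idea using plain docker commands; the image, container name, and make target here are placeholders, and in a GitHub Actions workflow the final cleanup command would sit in a step guarded by if: always():

# Run the e2e suite inside a container that shares the host network.
docker run -d --name etcd-e2e --network host \
  -v "$PWD":/workspace -w /workspace golang:1.19 \
  make test-e2e

# Wait for the tests to finish and collect the output.
docker wait etcd-e2e
docker logs etcd-e2e

# Cleanup: removing the container also kills any etcd processes it leaked.
docker rm -f etcd-e2e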

@serathius
Member

serathius commented Apr 11, 2023

This seems like a GitHub runner issue, not an issue with our tests. I don't want to rewrite our tests just because the GitHub runner doesn't isolate its environment. Please consider isolating the GitHub runner instead.

@chaochn47
Member Author

chaochn47 commented Apr 11, 2023

Latest arm64 e2e test failed https://github.com/etcd-io/etcd/actions/runs/4662910883/jobs/8253760517

panic: test timed out after 30m0s

Right before the timeout, it was running TestUserDelete/MinorityLastVersion in tests/common.

We can revisit it after #15637 is merged.

@chaochn47
Member Author

chaochn47 commented Apr 14, 2023

{"level":"fatal","ts":"2023-04-14T01:32:45.585974Z","caller":"etcdmain/etcd.go:182","msg":"discovery failed","error":"listen tcp 127.0.0.1:20001: bind: address already in use","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\tgo.etcd.io/etcd/server/v3/etcdmain/etcd.go:182\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\tgo.etcd.io/etcd/server/v3/etcdmain/main.go:40\nmain.main\n\tgo.etcd.io/etcd/server/v3/main.go:31\nruntime.main\n\truntime/proc.go:250"}

https://github.com/etcd-io/etcd/actions/runs/4695428620/jobs/8324592814

The issue has re-occurred. As discussed in #14748, if we want to announce arm64 as Tier 1 supported, we need to resolve this process leak issue permanently. I remember the last arm64 test attempt failed due to lack of maintenance (#13181).

@kevinzs2048 @dims Is there a way to safely share access with etcd members so anyone can take a look at it?

@dims
Contributor

dims commented Apr 14, 2023

@chaochn47 yes, please share your SSH public key in a Slack DM with me and I'll add it to the boxes
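
For completeness, granting access on the box is just appending the key to the runner user's authorized_keys; a minimal sketch, assuming the runner user is named runner and using a hypothetical key file name:

sudo mkdir -p /home/runner/.ssh
cat chaochn47_id_ed25519.pub | sudo tee -a /home/runner/.ssh/authorized_keys
sudo chown -R runner:runner /home/runner/.ssh
sudo chmod 700 /home/runner/.ssh
sudo chmod 600 /home/runner/.ssh/authorized_keys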

@chaochn47
Member Author

chaochn47 commented Apr 14, 2023

Thanks @dims, I just sent you the SSH public keys.

Unfortunately, I don't have enough bandwidth to fix it right now; maybe later next week. I am currently focusing on #15708.

If anyone else is interested in helping maintain the arm64 tests, please jump in ~

@dims
Contributor

dims commented Apr 19, 2023

@chaochn47 has confirmed that he has access!

@chaochn47
Member Author

chaochn47 commented May 1, 2023

The arm64 E2E test consistently succeeds now. However, I do see some risks with the self-hosted GitHub runner:

  1. etcd processes were leaked after the previous failed test attempt.
  2. The next test run does not start from a clean state; the test result is cached (see the sketch at the end of this comment).
% (cd tests && 'env' 'ETCD_VERIFY=all' 'go' 'test' 'go.etcd.io/etcd/tests/v3/common' '--tags=e2e' '-timeout=30m')
ok  	go.etcd.io/etcd/tests/v3/common	(cached)
  3. failed to run actions/setup-go
Warning: Unexpected input(s) 'ref', valid inputs are ['go-version', 'go-version-file', 'check-latest', 'token', 'cache', 'cache-dependency-path', 'architecture']

/usr/bin/tar --use-compress-program zstd -d -xf /home/runner/actions-runner/_work/_temp/a5078ef8-33d1-48be-90cc-589cb85ce484/cache.tzst -P -C /home/runner/actions-runner/_work/etcd/etcd
...
/usr/bin/tar: ../../../../go/pkg/mod/golang.org/x/[email protected]/windows/mkknownfolderids.bash: Cannot open: File exists
/usr/bin/tar: Exiting with failure status due to previous errors
Warning: Failed to restore: Tar failed with error: The process '/usr/bin/tar' failed with exit code 2

It's likely a GitHub runner issue.
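
For observation 2, a minimal sketch for making sure the suite actually re-runs instead of reporting a cached result (assuming the runner keeps its Go build and test caches between jobs):

# Drop cached test results before the run.
go clean -testcache

# Or force re-execution for a single invocation with -count=1.
(cd tests && env ETCD_VERIFY=all go test go.etcd.io/etcd/tests/v3/common --tags=e2e -timeout=30m -count=1)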

@ahrtr
Member

ahrtr commented May 1, 2023

etcd processes were leaked after previous failed test attempt.

Can we raise & track an issue for github runner? thx.

Warning: Unexpected input(s) 'ref', valid inputs are ['go-version', 'go-version-file', 'check-latest', 'token', 'cache', 'cache-dependency-path', 'architecture']

It seems that this isn't the correct way to specify the branch. I also do not see any workflows configured to run on 3.5 and 3.4. If we want to verify release-3.5 and release-3.4, probably the best way is to backport the workflows to those branches as well. But let's get the above GitHub runner issue sorted out first.

@chaochn47
Member Author

chaochn47 commented May 3, 2023

Can we raise & track an issue for github runner? thx.

I want to reproduce it again manually with a fresh self-hosted GitHub runner setup (https://github.com/chaochn47/etcd/actions/runs/4866105332/jobs/8679105017) before raising an issue in https://github.com/actions/runner/issues, just in case it's an environment issue.

@chaochn47
Member Author

chaochn47 commented May 4, 2023

Unfortunately :( I was not able to reproduce the process leak issue after intentionally failing the test suite with the changes.

All the etcd processes were cleaned up.

I am inclined to continue looking into other failures mentioned in #15647 (comment).

@kevinzs2048
Contributor

It looks like /home/runner/actions-runner/_work/etcd/etcd is not cleaned up after a test fails, which can also introduce git command errors like this. That PR is closed, but it can still serve as a reference.

@serathius
Member

It looks like /home/runner/actions-runner/_work/etcd/etcd is not cleaned up after a test fails, which can also introduce git command errors like this. That PR is closed, but it can still serve as a reference.

I think the checkout action caches the git repo to avoid re-downloading files. There are two ways we could work around it (a sketch of the second option follows this list):

  • Disable the cache behavior
  • Ensure a clean state before/after the workflow is run.
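
A minimal sketch of that second option as a pre- or post-job cleanup on the self-hosted runner; the workspace path matches the one quoted above, and whether to prefer git clean or a full removal is a judgment call:

WORKSPACE=/home/runner/actions-runner/_work/etcd/etcd

# Reset anything a previous run left behind in the checkout.
git -C "$WORKSPACE" clean -ffdx || true
git -C "$WORKSPACE" reset --hard || true

# Or drop the workspace entirely and let actions/checkout re-clone it.
rm -rf "$WORKSPACE"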

@vielmetti

Happy to add some ideas specific to the Equinix configuration here, from my perspective at the CIL and at Equinix.

If I understand things right, the E2E tests should run on a machine with a clean state. They are also running on a system that is up 24x7, even though the testing itself is only one run daily.

A possible solution to this is to get a freshly built system for every test. The Equinix API lets you provision a system as needed, using something like Terraform or cloud-init to do the initial environment setup so that it's ready to run tests. You'd then execute the E2E tests, and when they were done and you had the results you'd tear down the machine(s).
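
A rough sketch of that provision-run-teardown cycle against the Equinix Metal API; the endpoint, JSON field names, plan, metro, and OS slug below are assumptions to double-check against the API documentation:

# Assumes METAL_TOKEN and METAL_PROJECT are set and jq is installed.
DEVICE_ID=$(curl -s -X POST \
  -H "X-Auth-Token: $METAL_TOKEN" -H "Content-Type: application/json" \
  -d '{"plan":"c3.large.arm64","metro":"da","operating_system":"ubuntu_22_04","hostname":"etcd-arm64-e2e"}' \
  "https://api.equinix.com/metal/v1/projects/$METAL_PROJECT/devices" | jq -r .id)

# ... wait for the device to become active, apply the cloud-init/Terraform setup,
# register it as a GitHub Actions runner, and run the E2E workflow ...

# Tear the machine down once the results have been collected.
curl -s -X DELETE -H "X-Auth-Token: $METAL_TOKEN" \
  "https://api.equinix.com/metal/v1/devices/$DEVICE_ID"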

We have a similar setup/teardown pattern for a few other projects and it has been working well. One big benefit is that you get a clean state every time. From a resources/sustainability point of view, it means that a machine used only once a day is less expensive than something up 24x7. And if you needed more than 1 or 2 systems you could burst to bigger runs where you did the teardown afterwards.

Getting the Terraform or cloud-init exactly right to set up a runner test environment is certainly a separate issue from this one & should be tracked separately.

@jmhbnz
Member

jmhbnz commented Jun 6, 2023

Hey @chaochn47, @ahrtr, @serathius - I would like to propose we close this issue.

The work we have done to prevent process leaks in the workflows by running them inside containers looks to be paying off; all our arm64 nightly workflows appear to be running smoothly and green across the board.

We found one issue during the soak period with some unexpected cache behavior, but that has now been addressed.

Please let me know if you have any objections, otherwise I will close it after a couple of days for lazy consensus.

@chaochn47
Member Author

SGTM. Appreciate the great work~ @jmhbnz

@vielmetti

Great work, good to hear that the improved workflows make everything run smoothly. Thanks! Ed

@ahrtr
Member

ahrtr commented Jun 7, 2023

Thanks @jmhbnz for driving and coordinating this, also thanks @chaochn47 , @kevinzs2048 , @fuweid and @vielmetti !

@jmhbnz closed this as completed Jun 9, 2023