
tests/run: Add a backend argument and support libvirt #231

Merged: 4 commits merged into openshift:master from libvirt-smoke on Sep 13, 2018

Conversation

@wking (Member) commented Sep 10, 2018

Make it easier for folks to run the smoke tests on libvirt. This also shifts our teardown trap installation to right before we start creating resources that might need destroying.

Also add LEAVE_RUNNING to allow using the script for cluster setup. Running the script is easier than following the README and libvirt-howto notes by hand. We still automatically destroy clusters where tectonic install fails.

This PR is a follow-up to #121.
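As a rough usage sketch (the positional backend argument is shown in this PR; the exact value LEAVE_RUNNING accepts is an assumption):

$ ./tests/run.sh libvirt                    # run the smoke tests on libvirt
$ LEAVE_RUNNING=yes ./tests/run.sh libvirt  # hypothetical: keep the cluster up for manual use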

@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 10, 2018
@wking wking force-pushed the libvirt-smoke branch 2 times, most recently from a339d37 to f90b783 on September 10, 2018 at 22:40
@wking (Member, Author) commented Sep 10, 2018

The smoke errors were:

        	smoke_test.go:113: retrying in 10s
        	cluster_test.go:121: node ip-10-0-1-25.ec2.internal ready
        	cluster_test.go:121: node ip-10-0-143-164.ec2.internal ready
        	cluster_test.go:121: node ip-10-0-152-131.ec2.internal ready
        	cluster_test.go:121: node ip-10-0-171-172.ec2.internal ready
        	cluster_test.go:121: node ip-10-0-39-137.ec2.internal ready
        	cluster_test.go:121: node ip-10-0-7-240.ec2.internal ready
        	smoke_test.go:112: failed with error: expected 7 nodes, got 6
        	smoke_test.go:113: retrying in 10s
        	smoke_test.go:112: failed with error: failed to list nodes: Get https://ci-op-g4pv3xg1-b28c4-api.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/nodes: read tcp 172.16.146.36:60574->35.170.148.83:6443: read: connection timed out
        	smoke_test.go:113: retrying in 10s
        	cluster_test.go:146: Failed to find 7 ready nodes in 10m0s.
        --- FAIL: Test/Cluster/AllResourcesCreated (0.00s)
        	cluster_test.go:213: looking for resources defined by the provided manifests...
        	smoke_test.go:56: could not create client config: stat /tmp/cluster/generated/auth/kubeconfig: no such file or directory
        --- FAIL: Test/Cluster/AllPodsRunning (0.00s)
        	smoke_test.go:56: could not create client config: stat /tmp/cluster/generated/auth/kubeconfig: no such file or directory
FAIL

I dunno about the resource and pod issues, but I haven't thought about them much yet. The node error is really confusing. Where is 7 coming from? The choices are 5 or 4; I've already checked for busted ${BACKEND} values, and we don't have a fallback. Kicking again for sanity:

/retest

@wking (Member, Author) commented Sep 10, 2018

> The node error is really confusing. Where is 7 coming from?

🤦‍♂️ it's from here. And we're not even using run.sh in our smoke tests anymore, so I'm not clear on why this PR failed. Anyhow, the /retest should green us up.

@steveej (Contributor) left a comment

After a couple of unrelated edits this kicks off the smoke tests properly for me on libvirt.
I can't say anything about AWS (yet).

tests/run.sh (outdated diff):
@@ -9,6 +9,8 @@ set -e

set -eo pipefail

BACKEND="${1:-aws}"
@steveej (Contributor) commented Sep 11, 2018

Would you mind using BACKEND="${BACKEND:-aws}" here? Somehow I read it that way when browsing the script and ran into an accidental AWS run.

@wking (Member, Author):

> Would you mind using BACKEND="${BACKEND:-aws}" here?

I think this setting is important enough to be a positional arg. If you like, I can drop the default and print a usage error if the caller leaves it unset.

@steveej (Contributor):

> If you like, I can drop the default and print a usage error if the caller leaves it unset.

Please do

@wking (Member, Author):

> If you like, I can drop the default and print a usage error if the caller leaves it unset.
>
> Please do

Done with 1b0b9df -> 7d839ba4a, which also rebases us onto the current master.
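A minimal sketch of the agreed-upon behavior (the usage-message wording is illustrative, not copied from 7d839ba4a):

#!/usr/bin/env bash
set -eo pipefail

# Require an explicit backend; with no default, nobody launches an AWS
# cluster by accident.
if [ "$#" -lt 1 ]; then
    echo "usage: $0 <aws|libvirt>" >&2
    exit 1
fi
BACKEND="$1"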

tests/run.sh (outdated diff):
yaml.safe_dump(config, sys.stdout)
EOF

echo -e "\\e[36m Initializing Tectonic...\\e[0m"
tectonic init --config="${CLUSTER_NAME}".yaml

trap destroy EXIT
@steveej (Contributor):

While we're at it, how about trapping INT (control+c) too?

@wking (Member, Author):

> While we're at it, how about trapping INT (control+c) too?

I think EXIT covers that:

$ cat /tmp/test.sh
#!/bin/sh

testing()
{
  echo testing
}

trap testing EXIT
sleep 10
$ /tmp/test.sh
^Ctesting

@steveej (Contributor):

I haven't found out why, but this doesn't work in our case; i.e. the cluster isn't destroyed if you cancel the tests using ^C.

@wking (Member, Author):

> I haven't found out why, but this doesn't work in our case; i.e. the cluster isn't destroyed if you cancel the tests using ^C.

Looks like it works to me:

$ ./tests/run.sh libvirt
...
 Creating Tectonic configuration...
 Initializing Tectonic...
 Deploying Tectonic...
^C Exiting... Destroying Tectonic...
 Finished! Smoke test output: Never executed. Problem with one of previous stages
 So Long, and Thanks for All the Fish

You can see the trap's output in that session above. If you're seeing leaked resources, my guess is that you have a SIGINT in the middle of our multi-step Terraform initialization, and our multi-step Terraform destruction code is choking and dying on the partially initialized cluster.

@steveej (Contributor):

I don't see those echo messages printed when I cancel, and I have tried multiple times. I can't reproduce this in an MWE though; I'll have to recheck again tomorrow.

@steveej (Contributor) commented Sep 12, 2018

AFAIR I only pressed once, assuming the debouncing works on my system ;-)

@wking (Member, Author):

15959 is the main tee command:

$ grep exec strace.log  | grep -v ENOENT
15957 execve("./tests/run.sh", ["./tests/run.sh"], 0x7ffc1a181ec8 /* 90 vars */) = 0
15957 execve("/run/current-system/sw/bin/bash", ["bash", "./tests/run.sh"], 0x7ffd67c20198 /* 90 vars */) = 0
15957 read(3</home/steveej/src/go/src/github.com/openshift/installer/tests/run.sh>, "#!/usr/bin/env bash\n\nset -e\nexec"..., 80) = 80
15957 read(255</home/steveej/src/go/src/github.com/openshift/installer/tests/run.sh>, "#!/usr/bin/env bash\n\nset -e\nexec"..., 198) = 198
15959 execve("/home/steveej/.nix-profile/bin/tee", ["tee", "-a", "/dev/null"], 0x1a66008 /* 90 vars */ <unfinished ...>
15959 <... execve resumed> )            = 0
15962 execve("/home/steveej/.nix-profile/bin/tee", ["tee", "/dev/fd/63"], 0x1a66008 /* 90 vars */ <unfinished ...>
15962 <... execve resumed> )            = 0
15961 execve("/home/steveej/.nix-profile/bin/sleep", ["sleep", "1000"], 0x1a66008 /* 90 vars */ <unfinished ...>
15961 <... execve resumed> )            = 0
15964 execve("/home/steveej/.nix-profile/bin/cat", ["cat", "-"], 0x1a66008 /* 90 vars */ <unfinished ...>
15964 <... execve resumed> )            = 0

It makes sense that you'll get an EPIPE when that tee dies first. Do we know why it died first? Ideally, it would be the sleep that died first.
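A minimal sketch of the failure mode under discussion, assuming run.sh feeds stdout/stderr into a tee via process substitution (as the strace above suggests); this is an illustration, not the real run.sh:

#!/usr/bin/env bash
# ^C kills the whole foreground process group, including the tee below, so
# the EXIT trap's echo writes into a pipe with no reader and the handler
# dies on the broken pipe before doing any cleanup.
exec > >(tee /tmp/run.log) 2>&1

destroy() {
  echo " Exiting... Destroying Tectonic..."  # EPIPE here if the tee died first
}
trap destroy EXIT

sleep 1000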

@steveej (Contributor):

I certainly don't know; I'm not sure about you ;-) I'd regard this as something that should just work.

@wking (Member, Author):

The EPIPE issues should be fixed by 951e16c99. Let me know if you still see them.

@steveej (Contributor):

Fix confirmed! Awesome 🎉

@wking (Member, Author) commented Sep 11, 2018

I've also pushed f395b93db to make it easier to run run.sh multiple times without hitting:

cp: cannot create regular file ‘tectonic-dev/smoke’: Permission denied

Details in the commit message.
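A rough sketch of the idea (the source path is a placeholder; see f395b93db for the actual change):

# The Bazel-built binary is read-only (-r-xr-xr-x), so a plain cp on the next
# run cannot overwrite the previous copy.  Drop the stale copy first and make
# the fresh one writable.
rm -f tectonic-dev/smoke
cp "${BAZEL_SMOKE_BINARY}" tectonic-dev/smoke   # placeholder path
chmod u+w tectonic-dev/smoke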

@wking (Member, Author) commented Sep 11, 2018

Both e2e-aws and e2e-aws-smoke hit:

2018/09/11 16:29:08 Building bin
2018/09/11 16:29:53 Build bin failed, printing logs:

Pulling image docker-registry.default.svc:5000/ci-op-l6ts49pi/pipeline@sha256:287a165b92181f8ece9249500f4583e5f00dc1363de36c24b60aefec10f756b0 ...
error: build error: no such image

I'm trying to figure out what's going on there.

@wking (Member, Author) commented Sep 11, 2018

> error: build error: no such image

@smarterclayton and @bparees pointed me at rhbz#1626228 for this. Kicking off the tests again:

/retest

@sallyom (Contributor) commented Sep 11, 2018

/test e2e-aws-smoke

1 similar comment
@sallyom (Contributor) commented Sep 11, 2018

/test e2e-aws-smoke

@wking (Member, Author) commented Sep 12, 2018

Rebased around #224 with afc4807 -> 3a4c747.

@steveej (Contributor) commented Sep 12, 2018

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 12, 2018
@abhinavdahiya (Contributor) commented:

#243 might affect this PR.

@openshift-bot (Contributor) commented:

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot (Contributor) commented:

/retest

Please review the full test history for this PR and help us cut down flakes.

The four commit messages in this PR:

Commit 1:

Make it easier for folks to run the smoke tests on libvirt.  This also
shifts our teardown trap installation to right before we start
creating resources that might need destroying.

The lack of default is at Stefan's request to avoid having callers
launch AWS clusters by mistake [1,2].

[1]: openshift#231 (comment)
[2]: openshift#231 (comment)

Commit 2:

Running the script is easier than following the README and
libvirt-howto notes by hand.  We still automatically destroy clusters
where 'tectonic install' fails.

Commit 3:

Avoiding:

  $ ./tests/run.sh
  ...
  cp: cannot create regular file ‘tectonic-dev/smoke’: Permission denied
  $ ls -l tectonic-dev/smoke
  -r-xr-xr-x. 1 trking trking 48972051 Sep 10 12:14 tectonic-dev/smoke

Ideally the Bazel output would have appropriate permissions by
default, but it currently provides no way to set write permission on
its output files [1].  With this commit, you can run run.sh multiple
times in succession without blowing away tectonic-dev between runs.

[1]: bazelbuild/bazel#5588

Commit 4:

This should avoid issues where we get EPIPEs in the destroy trap if
the listening tee dies.  For example [1,2]:

1. The script uses exec to insert a tee capturing future stdout and
   stderr before writing them to the original stdout and stderr.
2. The SIGINT comes in and all our sub-processes (including the tee)
   die.
3. The exit trap launches the destroy handler.
4. The destroy handler tries to write to stdout, which is now the pipe
   into that tee, but the tee is dead, so we get an EPIPE and the
   destroy callback exits before actually doing any cleanup.

With this commit, the tees will survive until the program feeding them
closes its side of the pipes, so we'll continue to have working
tee-managed output even after receiving a control-c.

The option is in POSIX [3], so this should be portable.

[1]: openshift#231 (comment)
[2]: https://gist.github.com/steveeJ/86efe22e8d2195f5d19efe05d03225b2
[3]: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/tee.html
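
A minimal sketch of the pattern this commit message describes, assuming the POSIX option in question is tee's -i (ignore interrupts); the log file name is illustrative:

# Pipe stdout/stderr through tee without letting ^C kill the tee: with -i the
# tee ignores SIGINT and keeps reading until run.sh closes its end of the
# pipe, so the EXIT-trapped destroy handler can still print its messages.
exec > >(tee -ia "${LOG_FILE}") 2>&1

trap destroy EXIT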
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Sep 13, 2018
@wking (Member, Author) commented Sep 13, 2018

Rebased around #243 with 951e16c -> 16fd525.

@steveej (Contributor) commented Sep 13, 2018

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 13, 2018
@openshift-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: steveeJ, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 14f29f9 into openshift:master Sep 13, 2018
@wking wking deleted the libvirt-smoke branch September 13, 2018 15:07