
tests/run: Add a backend argument and support libvirt #231

Merged: 4 commits merged into openshift:master from libvirt-smoke on Sep 13, 2018

Conversation

@wking (Member) commented Sep 10, 2018

Make it easier for folks to run the smoke tests on libvirt. This also shifts our teardown trap installation to right before we start creating resources that might need destroying.

Also add LEAVE_RUNNING to allow using the script for cluster setup. Running the script is easier than following the README and libvirt-howto notes by hand. We still automatically destroy clusters where tectonic install fails.

This PR is a follow-up to #121.
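As a rough usage sketch (the positional backend argument is shown in this PR; the exact value LEAVE_RUNNING accepts is an assumption):

$ ./tests/run.sh libvirt                    # run the smoke tests on libvirt
$ LEAVE_RUNNING=yes ./tests/run.sh libvirt  # hypothetical: keep the cluster up for manual use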

@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 10, 2018
@wking wking force-pushed the libvirt-smoke branch 2 times, most recently from a339d37 to f90b783 on September 10, 2018 at 22:40
@wking (Member, Author) commented Sep 10, 2018

The smoke errors were:

        	smoke_test.go:113: retrying in 10s
        	cluster_test.go:121: node ip-10-0-1-25.ec2.internal ready
        	cluster_test.go:121: node ip-10-0-143-164.ec2.internal ready
        	cluster_test.go:121: node ip-10-0-152-131.ec2.internal ready
        	cluster_test.go:121: node ip-10-0-171-172.ec2.internal ready
        	cluster_test.go:121: node ip-10-0-39-137.ec2.internal ready
        	cluster_test.go:121: node ip-10-0-7-240.ec2.internal ready
        	smoke_test.go:112: failed with error: expected 7 nodes, got 6
        	smoke_test.go:113: retrying in 10s
        	smoke_test.go:112: failed with error: failed to list nodes: Get https://ci-op-g4pv3xg1-b28c4-api.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/nodes: read tcp 172.16.146.36:60574->35.170.148.83:6443: read: connection timed out
        	smoke_test.go:113: retrying in 10s
        	cluster_test.go:146: Failed to find 7 ready nodes in 10m0s.
        --- FAIL: Test/Cluster/AllResourcesCreated (0.00s)
        	cluster_test.go:213: looking for resources defined by the provided manifests...
        	smoke_test.go:56: could not create client config: stat /tmp/cluster/generated/auth/kubeconfig: no such file or directory
        --- FAIL: Test/Cluster/AllPodsRunning (0.00s)
        	smoke_test.go:56: could not create client config: stat /tmp/cluster/generated/auth/kubeconfig: no such file or directory
FAIL

I dunno about the resource and pod issues, but I haven't thought about them much yet. The node error is really confusing. Where is 7 coming from? The choices are 5 or 4; I've already checked for busted ${BACKEND} values, and we don't have a fallback. Kicking again for sanity:

/retest

@wking (Member, Author) commented Sep 10, 2018

> The node error is really confusing. Where is 7 coming from?

🤦‍♂️ it's from here. And we're not even using run.sh in our smoke tests anymore, so I'm not clear on why this PR failed. Anyhow, the /retest should green us up.

@steveej (Contributor) left a comment

After a couple of unrelated edits this kicks off the smoke tests properly for me on libvirt.
I can't say anything about AWS (yet).

tests/run.sh (outdated diff):
@@ -9,6 +9,8 @@ set -e

set -eo pipefail

BACKEND="${1:-aws}"
@steveej (Contributor) commented Sep 11, 2018

Would you mind using BACKEND="${BACKEND:-aws}" here? Somehow I read it that way when browsing the script and ran into an accidental AWS run.

@wking (Member, Author):

> Would you mind using BACKEND="${BACKEND:-aws}" here?

I think this setting is important enough to be a positional arg. If you like, I can drop the default and print a usage error if the caller leaves it unset.

@steveej (Contributor):

> If you like, I can drop the default and print a usage error if the caller leaves it unset.

Please do

@wking (Member, Author):

> If you like, I can drop the default and print a usage error if the caller leaves it unset.
>
> Please do

Done with 1b0b9df -> 7d839ba4a, which also rebases us onto the current master.
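A minimal sketch of the agreed-upon behavior (the usage-message wording is illustrative, not copied from 7d839ba4a):

#!/usr/bin/env bash
set -eo pipefail

# Require an explicit backend; with no default, nobody launches an AWS
# cluster by accident.
if [ "$#" -lt 1 ]; then
    echo "usage: $0 <aws|libvirt>" >&2
    exit 1
fi
BACKEND="$1"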

tests/run.sh (outdated diff):
yaml.safe_dump(config, sys.stdout)
EOF

echo -e "\\e[36m Initializing Tectonic...\\e[0m"
tectonic init --config="${CLUSTER_NAME}".yaml

trap destroy EXIT
@steveej (Contributor):

While we're at it, how about trapping INT (control+c) too?

@wking (Member, Author):

> While we're at it, how about trapping INT (control+c) too?

I think EXIT covers that:

$ cat /tmp/test.sh
#!/bin/sh

testing()
{
  echo testing
}

trap testing EXIT
sleep 10
$ /tmp/test.sh
^Ctesting

@steveej (Contributor):

I haven't found out why, but this doesn't work in our case; i.e. the cluster isn't destroyed if you cancel the tests using ^C.

@wking (Member, Author):

> I haven't found out why, but this doesn't work in our case; i.e. the cluster isn't destroyed if you cancel the tests using ^C.

Looks like it works to me:

$ ./tests/run.sh libvirt
...
 Creating Tectonic configuration...
 Initializing Tectonic...
 Deploying Tectonic...
^C Exiting... Destroying Tectonic...
 Finished! Smoke test output: Never executed. Problem with one of previous stages
 So Long, and Thanks for All the Fish

You can see the trap's output in that session above. If you're seeing leaked resources, my guess is that you have a SIGINT in the middle of our multi-step Terraform initialization, and our multi-step Terraform destruction code is choking and dying on the partially initialized cluster.

@steveej (Contributor):

I don't see those echo messages printed when I cancel, and I have tried multiple times. I can't reproduce this in an MWE though; I'll have to recheck again tomorrow.

@steveej (Contributor) commented Sep 12, 2018

AFAIR I only pressed once, assuming the debouncing works on my system ;-)

@wking (Member, Author):

15959 is the main tee command:

$ grep exec strace.log  | grep -v ENOENT
15957 execve("./tests/run.sh", ["./tests/run.sh"], 0x7ffc1a181ec8 /* 90 vars */) = 0
15957 execve("/run/current-system/sw/bin/bash", ["bash", "./tests/run.sh"], 0x7ffd67c20198 /* 90 vars */) = 0
15957 read(3</home/steveej/src/go/src/github.com/openshift/installer/tests/run.sh>, "#!/usr/bin/env bash\n\nset -e\nexec"..., 80) = 80
15957 read(255</home/steveej/src/go/src/github.com/openshift/installer/tests/run.sh>, "#!/usr/bin/env bash\n\nset -e\nexec"..., 198) = 198
15959 execve("/home/steveej/.nix-profile/bin/tee", ["tee", "-a", "/dev/null"], 0x1a66008 /* 90 vars */ <unfinished ...>
15959 <... execve resumed> )            = 0
15962 execve("/home/steveej/.nix-profile/bin/tee", ["tee", "/dev/fd/63"], 0x1a66008 /* 90 vars */ <unfinished ...>
15962 <... execve resumed> )            = 0
15961 execve("/home/steveej/.nix-profile/bin/sleep", ["sleep", "1000"], 0x1a66008 /* 90 vars */ <unfinished ...>
15961 <... execve resumed> )            = 0
15964 execve("/home/steveej/.nix-profile/bin/cat", ["cat", "-"], 0x1a66008 /* 90 vars */ <unfinished ...>
15964 <... execve resumed> )            = 0

It makes sense that you'll get an EPIPE when that tee dies first. Do we know why it died first? Ideally, it would be the sleep that died first.
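A minimal sketch of the failure mode under discussion, assuming run.sh feeds stdout/stderr into a tee via process substitution (as the strace above suggests); this is an illustration, not the real run.sh:

#!/usr/bin/env bash
# ^C kills the whole foreground process group, including the tee below, so
# the EXIT trap's echo writes into a pipe with no reader and the handler
# dies on the broken pipe before doing any cleanup.
exec > >(tee /tmp/run.log) 2>&1

destroy() {
  echo " Exiting... Destroying Tectonic..."  # EPIPE here if the tee died first
}
trap destroy EXIT

sleep 1000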

@steveej (Contributor):

I certainly don't know; I'm not sure about you ;-) I'd regard this as something that should just work.

@wking (Member, Author):

The EPIPE issues should be fixed by 951e16c99. Let me know if you still see them.

@steveej (Contributor):

Fix confirmed! Awesome 🎉

@wking (Member, Author) commented Sep 11, 2018

I've also pushed f395b93db to make it easier to run run.sh multiple times without hitting:

cp: cannot create regular file ‘tectonic-dev/smoke’: Permission denied

Details in the commit message.
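A rough sketch of the idea (the source path is a placeholder; see f395b93db for the actual change):

# The Bazel-built binary is read-only (-r-xr-xr-x), so a plain cp on the next
# run cannot overwrite the previous copy.  Drop the stale copy first and make
# the fresh one writable.
rm -f tectonic-dev/smoke
cp "${BAZEL_SMOKE_BINARY}" tectonic-dev/smoke   # placeholder path
chmod u+w tectonic-dev/smoke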

@wking (Member, Author) commented Sep 11, 2018

Both e2e-aws and e2e-aws-smoke hit:

2018/09/11 16:29:08 Building bin
2018/09/11 16:29:53 Build bin failed, printing logs:

Pulling image docker-registry.default.svc:5000/ci-op-l6ts49pi/pipeline@sha256:287a165b92181f8ece9249500f4583e5f00dc1363de36c24b60aefec10f756b0 ...
error: build error: no such image

I'm trying to figure out what's going on there.

@wking (Member, Author) commented Sep 11, 2018

> error: build error: no such image

@smarterclayton and @bparees pointed me at rhbz#1626228 for this. Kicking off the tests again:

/retest

@sallyom (Contributor) commented Sep 11, 2018

/test e2e-aws-smoke

1 similar comment
@sallyom (Contributor) commented Sep 11, 2018

/test e2e-aws-smoke

@wking (Member, Author) commented Sep 12, 2018

Rebased around #224 with afc4807 -> 3a4c747.

@steveej (Contributor) commented Sep 12, 2018

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 12, 2018
@abhinavdahiya (Contributor) commented:

#243 might affect this PR.

@openshift-bot (Contributor) commented:

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot (Contributor) commented:

/retest

Please review the full test history for this PR and help us cut down flakes.

The four commit messages in this PR:

Commit 1:

Make it easier for folks to run the smoke tests on libvirt.  This also
shifts our teardown trap installation to right before we start
creating resources that might need destroying.

The lack of default is at Stefan's request to avoid having callers
launch AWS clusters by mistake [1,2].

[1]: openshift#231 (comment)
[2]: openshift#231 (comment)

Commit 2:

Running the script is easier than following the README and
libvirt-howto notes by hand.  We still automatically destroy clusters
where 'tectonic install' fails.

Commit 3:

Avoiding:

  $ ./tests/run.sh
  ...
  cp: cannot create regular file ‘tectonic-dev/smoke’: Permission denied
  $ ls -l tectonic-dev/smoke
  -r-xr-xr-x. 1 trking trking 48972051 Sep 10 12:14 tectonic-dev/smoke

Ideally the Bazel output would have appropriate permissions by
default, but it currently provides no way to set write permission on
its output files [1].  With this commit, you can run run.sh multiple
times in succession without blowing away tectonic-dev between runs.

[1]: bazelbuild/bazel#5588

Commit 4:

This should avoid issues where we get EPIPEs in the destroy trap if
the listening tee dies.  For example [1,2]:

1. The script uses exec to insert a tee capturing future stdout and
   stderr before writing them to the original stdout and stderr.
2. The SIGINT comes in and all our sub-processes (including the tee)
   die.
3. The exit trap launches the destroy handler.
4. The destroy handler tries to write to stdout, which is now the pipe
   into that tee, but the tee is dead, so we get an EPIPE and the
   destroy callback exits before actually doing any cleanup.

With this commit, the tees will survive until the program feeding them
closes its side of the pipes, so we'll continue to have working
tee-managed output even after receiving a control-c.

The option is in POSIX [3], so this should be portable.

[1]: openshift#231 (comment)
[2]: https://gist.github.com/steveeJ/86efe22e8d2195f5d19efe05d03225b2
[3]: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/tee.html
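
A minimal sketch of the pattern this commit message describes, assuming the POSIX option in question is tee's -i (ignore interrupts); the log file name is illustrative:

# Pipe stdout/stderr through tee without letting ^C kill the tee: with -i the
# tee ignores SIGINT and keeps reading until run.sh closes its end of the
# pipe, so the EXIT-trapped destroy handler can still print its messages.
exec > >(tee -ia "${LOG_FILE}") 2>&1

trap destroy EXIT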
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Sep 13, 2018
@wking (Member, Author) commented Sep 13, 2018

Rebased around #243 with 951e16c -> 16fd525.

@steveej (Contributor) commented Sep 13, 2018

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 13, 2018
@openshift-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: steveeJ, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 14f29f9 into openshift:master Sep 13, 2018
@wking wking deleted the libvirt-smoke branch September 13, 2018 15:07