tests/run: Add a backend argument and support libvirt #231
a339d37
to
f90b783
Compare
The smoke errors were:
I dunno about the resource and pod issues, but I haven't thought about them much yet. The node error is really confusing. Where is 7 coming from? The choices are 5 or 4, I've already checked for busted
/retest
🤦♂️ it's from here. And we're not even using
f90b783
to
1b0b9df
Compare
After a couple of unrelated edits this kicks off the smoke tests properly for me on libvirt.
I can't say anything about AWS (yet).
tests/run.sh
Outdated
@@ -9,6 +9,8 @@ set -e

set -eo pipefail

BACKEND="${1:-aws}"
Would you mind using BACKEND="${BACKEND:-aws}" here? Somehow I read it that way when browsing the script and ran into an accidental AWS run.
Would you mind using BACKEND="${BACKEND:-aws}" here?
I think this setting is important enough to be a positional arg. If you like, I can drop the default and print a usage error if the caller leaves it unset.
If you like, I can drop the default and print a usage error if the caller leaves it unset.
Please do
If you like, I can drop the default and print a usage error if the caller leaves it unset.
Please do
Done with 1b0b9df -> 7d839ba4a, which also rebases us onto the current master.
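A minimal sketch of the "no default, usage error" behaviour agreed on above. The helper name parse_backend, the usage text, and the set of accepted backends (aws and libvirt, the two mentioned in this thread) are assumptions for illustration, not the code from the commit.

```shell
#!/bin/sh
# parse_backend is a hypothetical helper, not taken from tests/run.sh.
# It prints the backend name on success and returns non-zero with a
# usage message on stderr when the argument is missing or unknown.
parse_backend() {
    if [ "$#" -lt 1 ] || [ -z "$1" ]; then
        echo "usage: run.sh BACKEND (aws or libvirt)" >&2
        return 1
    fi
    case "$1" in
        aws|libvirt)
            echo "$1"
            ;;
        *)
            echo "unknown backend: $1" >&2
            return 1
            ;;
    esac
}
```

A caller could then write BACKEND="$(parse_backend "$@")" || exit 1 near the top of the script, so a bare ./tests/run.sh fails fast instead of launching an AWS cluster.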
tests/run.sh
Outdated
yaml.safe_dump(config, sys.stdout)
EOF

echo -e "\\e[36m Initializing Tectonic...\\e[0m"
tectonic init --config="${CLUSTER_NAME}".yaml

trap destroy EXIT
While we're at it, how about trapping INT (control+c) too?
While we're at it, how about trapping INT (control+c) too?
I think EXIT covers that:
$ cat /tmp/test.sh
#!/bin/sh
testing()
{
echo testing
}
trap testing EXIT
sleep 10
$ /tmp/test.sh
^Ctesting
I haven't found out why but this doesn't work in our case, i.e. the cluster isn't destroyed if you cancel the tests using ^C.
I haven't found out why but this doesn't work in our case, i.e. the cluster isn't destroyed if you cancel the tests using ^C.
Looks like it works to me:
$ ./tests/run.sh libvirt
...
Creating Tectonic configuration...
Initializing Tectonic...
Deploying Tectonic...
^C Exiting... Destroying Tectonic...
Finished! Smoke test output: Never executed. Problem with one of previous stages
So Long, and Thanks for All the Fish
You can see the output from here in that session. If you're seeing leaked resources, my guess is that you have a SIGINT in the middle of our multi-step Terraform initialization, and our multi-step Terraform destruction code is choking and dying on the partially initialized cluster.
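Whether an EXIT trap fires on control-c depends on the shell: bash runs it when it is killed by SIGINT, but some other sh implementations do not. A hedged sketch of a more defensive setup follows. destroy_cluster and install_teardown_traps are hypothetical names standing in for the script's destroy handler and trap installation, and the run-once guard is an addition that the thread does not discuss.

```shell
#!/bin/sh
# destroy_cluster is a hypothetical stand-in for the real destroy handler.
destroy_cluster() {
    # Guard so the body runs at most once, even if the handler is
    # invoked explicitly and then again by the EXIT trap.
    if [ "${destroyed:-0}" -eq 1 ]; then
        return 0
    fi
    destroyed=1
    echo "Destroying cluster..."
}

# install_teardown_traps is also hypothetical. It turns INT and TERM into
# ordinary exits so the EXIT trap runs regardless of which shell is in use.
install_teardown_traps() {
    trap destroy_cluster EXIT
    trap 'exit 130' INT
    trap 'exit 143' TERM
}
```

Calling install_teardown_traps right before the first resource is created keeps the behaviour described in this PR, while making the control-c path explicit rather than shell-dependent.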
I don't see those echo messages printed when I cancel, and I have tried multiple times. I can't reproduce this in a MWE though; I will have to recheck tomorrow.
AFAIR I only pressed once, assuming the debouncing works on my system ;-)
15959 is the main tee command:
$ grep exec strace.log | grep -v ENOENT
15957 execve("./tests/run.sh", ["./tests/run.sh"], 0x7ffc1a181ec8 /* 90 vars */) = 0
15957 execve("/run/current-system/sw/bin/bash", ["bash", "./tests/run.sh"], 0x7ffd67c20198 /* 90 vars */) = 0
15957 read(3</home/steveej/src/go/src/github.com/openshift/installer/tests/run.sh>, "#!/usr/bin/env bash\n\nset -e\nexec"..., 80) = 80
15957 read(255</home/steveej/src/go/src/github.com/openshift/installer/tests/run.sh>, "#!/usr/bin/env bash\n\nset -e\nexec"..., 198) = 198
15959 execve("/home/steveej/.nix-profile/bin/tee", ["tee", "-a", "/dev/null"], 0x1a66008 /* 90 vars */ <unfinished ...>
15959 <... execve resumed> ) = 0
15962 execve("/home/steveej/.nix-profile/bin/tee", ["tee", "/dev/fd/63"], 0x1a66008 /* 90 vars */ <unfinished ...>
15962 <... execve resumed> ) = 0
15961 execve("/home/steveej/.nix-profile/bin/sleep", ["sleep", "1000"], 0x1a66008 /* 90 vars */ <unfinished ...>
15961 <... execve resumed> ) = 0
15964 execve("/home/steveej/.nix-profile/bin/cat", ["cat", "-"], 0x1a66008 /* 90 vars */ <unfinished ...>
15964 <... execve resumed> ) = 0
It makes sense that you'll get an EPIPE when that tee dies first. Do we know why it died first? Ideally, it would be the sleep that died first.
I certainly don't know, I'm not sure about you ;-) I'd regard this as something that should just work.
The EPIPE issues should be fixed by 951e16c99. Let me know if you still see them.
Fix confirmed! Awesome 🎉
599313a
to
f395b93
Compare
I've also pushed f395b93db to make it easier to run
Details in the commit message.
f395b93
to
afc4807
Compare
Both
I'm trying to figure out what's going on there.
@smarterclayton and @bparees pointed me at bzrh#1626228 for this. Kicking off the tests again:
/retest
/test e2e-aws-smoke
afc4807
to
3a4c747
Compare
Rebased around #224 with afc4807 -> 3a4c747.
/lgtm
#243 might affect this PR.
/retest
Please review the full test history for this PR and help us cut down flakes.
Make it easier for folks to run the smoke tests on libvirt. This also shifts our teardown trap installation to right before we start creating resources that might need destroying.

The lack of default is at Stefan's request to avoid having callers launch AWS clusters by mistake [1,2].

[1]: openshift#231 (comment)
[2]: openshift#231 (comment)
Running the script is easier than following the README and libvirt-howto notes by hand. We still automatically destroy clusters where 'tectonic install' fails.
Avoiding:

$ ./tests/run.sh
...
cp: cannot create regular file ‘tectonic-dev/smoke’: Permission denied
$ ls -l tectonic-dev/smoke
-r-xr-xr-x. 1 trking trking 48972051 Sep 10 12:14 tectonic-dev/smoke

Ideally the Bazel output would have appropriate permissions by default, but it currently provides no way to set write permission on its output files [1]. With this commit, you can run run.sh multiple times in succession without blowing away tectonic-dev between runs.

[1]: bazelbuild/bazel#5588
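The commit's actual change is not shown in this thread. One way to get the behaviour described (repeated runs overwriting a read-only copy) is cp -f, which POSIX specifies as unlinking a destination that cannot be opened and then retrying. The helper name install_binary is hypothetical.

```shell
#!/bin/sh
# install_binary is a hypothetical helper, not the code from the commit.
# $1: source file (for example a read-only Bazel output)
# $2: destination path that may already exist without write permission
install_binary() {
    # With -f, cp removes a destination it cannot open for writing and
    # copies again, so a leftover read-only file does not block the run.
    cp -f "$1" "$2"
}
```

This needs write permission on the destination directory only, which is why it works even when the previous copy is mode -r-xr-xr-x.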
This should avoid issues where we get EPIPEs in the destroy trap if the listening tee dies. For example [1,2]:

1. The script uses exec to insert a tee capturing future stdout and stderr before writing them to the original stdout and stderr.
2. The SIGINT comes in and all our sub-processes (including the tee) die.
3. The exit trap launches the destroy handler.
4. The destroy handler tries to write to stdout, which is now the pipe into that tee, but the tee is dead, so we get an EPIPE and the destroy callback exits before actually doing any cleanup.

With this commit, the tees will survive until the program feeding them closes its side of the pipes, so we'll continue to have working tee-managed output even after receiving a control-c. The option is in POSIX [3], so this should be portable.

[1]: openshift#231 (comment)
[2]: https://gist.github.com/steveeJ/86efe22e8d2195f5d19efe05d03225b2
[3]: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/tee.html
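The commit message does not name the option, but POSIX tee defines only -a (append) and -i (ignore SIGINT), so -i is presumably the one meant. The sketch below shows the idea with a plain pipeline rather than the exec-inserted tee the script uses. log_through_tee is a hypothetical helper and the log path is supplied by the caller.

```shell
#!/bin/sh
# log_through_tee is a hypothetical helper, not taken from tests/run.sh.
# $1: log file to append to; remaining arguments: the command to run.
log_through_tee() {
    log_file="$1"
    shift
    # -i makes tee ignore SIGINT, so a control-c that kills the producer
    # leaves tee alive until the producer's end of the pipe is closed.
    # -a appends instead of truncating the log.
    "$@" 2>&1 | tee -i -a "$log_file"
}

# In bash, the exec-based form described in the commit message would look
# roughly like this (an assumption, since the script itself is not shown):
#   exec > >(tee -i "$log_file") 2>&1
```

With the tee still reading, a destroy handler that runs after the interrupt can keep writing to stdout without hitting EPIPE.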
951e16c
to
16fd525
Compare
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: steveeJ, wking
Make it easier for folks to run the smoke tests on libvirt. This also shifts our teardown trap installation to right before we start creating resources that might need destroying.
Also add LEAVE_RUNNING to allow using the script for cluster setup. Running the script is easier than following the README and libvirt-howto notes by hand. We still automatically destroy clusters where tectonic install fails.
This PR is a follow-up to #121.
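A sketch of how the LEAVE_RUNNING behaviour described above could be decided. Only the variable name comes from the PR description. The helper teardown_needed, and the reading that any non-empty value means "keep the cluster", are assumptions.

```shell
#!/bin/sh
# teardown_needed is a hypothetical helper, not taken from tests/run.sh.
# $1: exit status of the install step.
# Returns 0 when the cluster should be destroyed, non-zero otherwise.
teardown_needed() {
    install_status="$1"
    if [ "$install_status" -ne 0 ]; then
        # Failed installs are always destroyed, as the description states.
        return 0
    fi
    # After a successful install, keep the cluster only when LEAVE_RUNNING
    # is set to a non-empty value (assumed interpretation).
    [ -z "${LEAVE_RUNNING:-}" ]
}
```

An exit handler could call teardown_needed "$status" before running the destroy step, which would let LEAVE_RUNNING=y ./tests/run.sh libvirt act as a cluster setup command.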