Github QA occasionally hangs while running unit-tests #2016
Comments
I just saw this happen in the testing repo too, which may indicate it has nothing to do with recent changes to the main repo.
I noticed that in these cases, the logs aren't being uploaded. It looks like the timeout is set on the entire job, so the upload-artifact steps never run if the timeout is reached during the Maven build. This can be addressed by moving timeout-minutes to the individual steps, rather than setting it on the job. That way, if any individual step times out, the other steps (like uploading log artifacts) still run.
Also, the last thing a workflow does before the runner shuts down is kill orphaned processes. I noticed that there may be fewer processes running than expected. Here is an example from a recent timeout:
I'm not sure which java processes are still running at the end of the job, but it looks like mini isn't starting or shutting down correctly when everything times out and is terminated. It would be useful to get more information in the output about what these processes are, in order to determine what is causing the problem. A simple "ps aux | grep jav[a]" or similar might provide more insight into which processes are left running (and perhaps which aren't running that should have been).
For issue apache#2016, regarding unexplained hung test processes, update the job timeouts so that the Maven task terminates but the upload-artifact step can still run, making the logs available to help troubleshoot the failures. Make the individual Maven build steps time out, rather than the entire job.
I updated the timeout settings, but have not yet attempted to add any logging about the still-running processes. I'm hoping the logs will have some insights, now that they should be uploaded on timeouts.
Okay, so fixing the timeout works. I was able to get the logs. It looks like services are starting up okay, but cannot talk to each other. The services register themselves using the local host name determined by using reverse DNS on the local IP address. When services are reached on localhost, everything works fine (e.g. services can talk to zookeeper on Tservers and the master in 1.10 (the build I was testing) show that they are listening on hostname
There is an additional stack trace further along, but it doesn't have any additional information, just that there was a timeout trying to connect to the tserver. So, either there is a problem with the DNS/rDNS mapping between the hostname and IP address of the runner, or there is some other security / firewall policy preventing services from talking on the non-localhost IP address. This is clearly the result of some change in GitHub Actions runners, and not in our code, since it also affects minicluster in 1.10. The most likely change I can think of that could have caused this is the switch of
There are a few options forward, if it is an issue with Ubuntu 20.04:
If it helps, locally I run Ubuntu 20.04.2 LTS (Focal Fossa) and have not seen this occur.
According to https://github.blog/changelog/2020-10-29-github-actions-ubuntu-latest-workflows-will-use-ubuntu-20-04/, the change to 20.04 should have happened a few months ago (unless it took them this long to roll it out). I do not see any similar issue in their virtual-environments repository. If we are confident that the problem is on their end, we can create a ticket there. It also says 18.04 is still supported, so we test against that as well.
Okay, so I've spent a few hours looking into this today, and I think it's a bug in GitHub Actions DNS servers. If you create a GitHub Actions job that simply prints the output of Since Accumulo services are listening on I tried a few different methods to force the name lookup to resolve correctly, including using
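To check whether a runner's hostname resolves to an address the machine actually owns, something like the following Python sketch could be run in a workflow step. It is only an illustration, not what was run for this issue, and the function names are hypothetical. It assumes that binding a socket to an address fails with EADDRNOTAVAIL when no local interface has that address, which is the default behavior on Linux.

```python
import errno
import socket


def is_local_address(ip: str) -> bool:
    """Return True if this machine can bind a TCP socket to `ip`."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((ip, 0))  # port 0 lets the OS pick any free port
        except OSError as exc:
            if exc.errno == errno.EADDRNOTAVAIL:
                # No local interface owns this address.
                return False
            raise
        return True


def hostname_resolves_locally() -> bool:
    """Resolve this machine's hostname and check the result is a local address.

    Raises socket.gaierror if the hostname cannot be resolved at all.
    """
    name = socket.gethostname()
    ip = socket.gethostbyname(name)
    return is_local_address(ip)
```

If hostname_resolves_locally() returns False, services that register under the hostname would be advertising an address that nothing on the machine is listening on, which would match the connection timeouts seen in the logs.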
I filed a bug report with GitHub support.
Nice detective work @ctubbsii
Here's another report of the same issue.
Fix apache#2016 by adding an entry to /etc/hosts that forces the hostname to resolve to eth0's IP address. This works around incorrect DNS entries, which return an IP for the current machine's hostname that does not match any IP address on the machine.
My previous attempt to fix this had a typo in the /etc/hosts filename. Correcting that typo should fix the issue, even though it's a hack. See PR #2024.
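The idea behind the /etc/hosts workaround can be sketched in Python as below. This is not the code from the PR; the function name, the IP address, and the hostname in the usage note are made up for illustration. The sketch only computes the new file contents, so actually writing /etc/hosts (which needs root) is left out. It also assumes that the hosts file is consulted before DNS and that the file does not already map the hostname to a different address.

```python
def with_hosts_entry(hosts_text: str, ip: str, hostname: str) -> str:
    """Return hosts_text with an "ip hostname" line appended.

    The text is returned unchanged if it already maps hostname to ip.
    """
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split into address and names.
        fields = line.split("#", 1)[0].split()
        if fields and fields[0] == ip and hostname in fields[1:]:
            return hosts_text
    # Make sure the new entry starts on its own line.
    if hosts_text and not hosts_text.endswith("\n"):
        hosts_text += "\n"
    return hosts_text + f"{ip} {hostname}\n"
```

For example, with_hosts_entry("127.0.0.1 localhost\n", "10.1.0.4", "runner-host") appends the line "10.1.0.4 runner-host", and calling it again on the result changes nothing, so the step is safe to repeat.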
Upstream issue is now at actions/runner-images#3185. They have codified the recommended workaround (to update If the upstream workaround is removed, then our workaround will automatically apply. If they fix the DNS records so that a workaround is no longer needed at all, then we can revisit this (or change the way we check to see if a fix is needed).
Describe the bug
GitHub QA occasionally hangs while running unit tests and is eventually cancelled after the allotted 60 minutes have passed. It typically hangs around the test MiniAccumuloClusterImplTest. So far, this has not been reproduced locally and seems to be centered around a possible resource issue GitHub QA might run into.
Versions (OS, Maven, Java, and others, as appropriate):
To Reproduce
Steps to reproduce the behavior (or a link to an example repository that reproduces the problem):
Screenshots
![image](https://user-images.githubusercontent.com/29436247/114597045-4d241000-9c5e-11eb-8e0a-0ee785558759.png)