EAGAIN hazard in isolated runs using rsync + oc rsh #6692
Comments
Here is some speculation that will need substantiating:
In fact it is even debatable whether this is an […]
Here is why this issue doesn't show up with “ordinary” rsync over ssh.
Wow. First, thanks for spending so much time on this one - it's pretty gross 😂 and very complicated. Having read all of this, this comment captures my first impression:
Based on your description, this really sounds like a bug in […]. Also, any chance you've tried and found better results by adding `--blocking-io` at https://github.com/ansible/awx/blob/devel/awx/playbooks/check_isolated.yml#L41 ?
Yeah, holy shit. Thank you for digging so deep into this. I see you found my […]. If @ryanpetrello's suggestion doesn't work, we might want to move away from using […].
I've been distracted by other work lately, but I'll give this a try ASAP.
Applying the `--blocking-io` suggestion from ansible/awx#6692 results in six consecutive successful runs in the conditions described inside the bug report: https://awx-poc-vpsi.epfl.ch/#/jobs/playbook/817 https://awx-poc-vpsi.epfl.ch/#/jobs/playbook/820 https://awx-poc-vpsi.epfl.ch/#/jobs/playbook/823 https://awx-poc-vpsi.epfl.ch/#/jobs/playbook/826 https://awx-poc-vpsi.epfl.ch/#/jobs/playbook/828 https://awx-poc-vpsi.epfl.ch/#/jobs/playbook/829
Indeed, |
ISSUE TYPE
SUMMARY
Because `rsync` and `oc rsh` disagree on whether socket IO should be blocking or nonblocking by default, and `awx/playbooks/check_isolated.yml` uses a combination of both to pump the JSON job events out of runner containers running on Kubernetes, there exists a hazard of EAGAIN being treated as a crippling error by `oc rsh` (killing a goroutine, but leaving the process alive) and causing `rsync` to deadlock.
ENVIRONMENT
STEPS TO REPRODUCE
EXPECTED RESULTS
The job should succeed or fail depending solely on the underlying Ansible tasks' outcomes.
ACTUAL RESULTS
`rsync` deadlocks. That is, two `rsync` processes are visible both in the `awx-task` master container and on the runner container, but they make no progress. After exactly 10 minutes in that state, `awx/playbooks/check_isolated.yml` kills `rsync` and terminates in failure, AWX continues with no error (because failure is the expected outcome), and soon finds a `.~tmp~` directory (left over by the dead `rsync`) in a place where it shouldn't be, causing a misleading Python backtrace to appear in the AWX UI.
ADDITIONAL INFORMATION AND INVESTIGATION
[…] `rsync` to die, the logs are frozen in the Web UI and `JobEvent.objects.count()` makes no progress from the vantage point of `awx-manage shell_plus`
[…] `rsync` from within the ephemeral runner container, results in the same `ISADirectory` backtrace, except it does so immediately (as opposed to after 10 minutes).
Running `rsync` manually, by copying and pasting the deadlocked command line, demonstrates that the issue does not lie in Ansible / AWX code. E.g.
(where the `find -delete` is there to increase rsync workload, and thus the probability for the bug to be triggered).
In case of deadlock, a log line like this one will appear in the middle of the rsync spew:
Investigation shows that `v2.go:147` is a goroutine that pumps stdout around, and is implemented in terms of `io.Copy` (where `0cbc58b` is the precise commit reported by `oc version` in the `ansible/awx_task:10.0` image that I am using).
After that log message is shown, part of `oc rsh`'s byte-pumping apparatus becomes inoperative, and therefore `rsync` stops making progress.
Second-hand information only below this point:
[…] `rsync` on top of another piece of golang pipework (here, `docker exec`)
[…] `rsync` to use blocking I/O solved the problem.
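For readers unfamiliar with EAGAIN, here is a minimal, self-contained sketch of the two kernel behaviours the report relies on. It is not taken from AWX, rsync or oc, and the function names are made up for this illustration. It assumes a Unix-like system. The first function shows that writing to a full non-blocking pipe yields EAGAIN rather than waiting; the second shows that the non-blocking flag lives on the shared open file description, so one process (or one duplicated descriptor) setting it changes behaviour for whoever else holds the same pipe or socket. Whether this is exactly how `rsync` and `oc rsh` end up disagreeing is an assumption here, although the rsync man page does document non-blocking I/O as its default for remote shells other than rsh/remsh.

```python
import errno
import os


def fill_nonblocking_pipe(chunk=b"x" * 4096):
    """Write to a non-blocking pipe that nobody reads until the kernel refuses.

    Returns (bytes_accepted, errno_of_the_refusal).
    """
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)
    accepted = 0
    try:
        while True:
            accepted += os.write(write_fd, chunk)
    except BlockingIOError as exc:
        # A full pipe is not a fatal condition: a well-behaved writer waits
        # (e.g. with select/poll) and retries. A byte pump that treats this
        # as a hard error would simply stop pumping.
        return accepted, exc.errno
    finally:
        os.close(read_fd)
        os.close(write_fd)


def nonblocking_flag_is_shared_across_dup():
    """Set O_NONBLOCK through a duplicate fd and check the original fd."""
    read_fd, write_fd = os.pipe()
    alias_fd = os.dup(write_fd)  # same open file description as write_fd
    try:
        os.set_blocking(alias_fd, False)
        # The original descriptor is now non-blocking too, although this
        # code never touched it directly.
        return os.get_blocking(write_fd) is False
    finally:
        for fd in (read_fd, write_fd, alias_fd):
            os.close(fd)


if __name__ == "__main__":
    accepted, err = fill_nonblocking_pipe()
    print(f"pipe accepted {accepted} bytes, then failed with {errno.errorcode[err]}")
    print("flag shared across dup:", nonblocking_flag_is_shared_across_dup())
```

With blocking I/O (what `--blocking-io` asks rsync for), the same write would simply sleep until the reader drains the pipe, so the error path is never exercised.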