EAGAIN hazard in isolated runs using rsync + oc rsh #6692

Closed
domq opened this issue Apr 14, 2020 · 9 comments

@domq

domq commented Apr 14, 2020

ISSUE TYPE
  • Bug Report
SUMMARY

Because rsync and oc rsh disagree on whether socket I/O should be blocking or non-blocking by default, and awx/playbooks/check_isolated.yml uses a combination of both to pump the JSON job events out of runner containers running on Kubernetes, there is a hazard of oc rsh treating EAGAIN as a crippling error (killing a goroutine, but leaving the process alive) and causing rsync to deadlock.

ENVIRONMENT
  • AWX version: X.Y.Z
  • AWX install method: openshift, minishift, docker on linux, docker for mac, boot2docker
  • Ansible version: X.Y.Z
  • Operating System:
  • Web Browser:
STEPS TO REPRODUCE
  1. Deploy AWX on Kubernetes
  2. Create a container group to run on Kubernetes and configure it with the appropriate credentials
  3. Create a job template that uses the container group in a way that will create a lot of job events per unit of time, e.g. slicing = 1, a high inventory count, lots of short tasks in the playbook, verbosity set to 3
  4. Execute the job template
EXPECTED RESULTS

The job should succeed or fail depending solely on the underlying Ansible tasks' outcomes.

ACTUAL RESULTS

rsync deadlocks. That is, rsync processes are visible in both the awx-task master container and the runner container, but they make no progress. After exactly 10 minutes in that state, awx/playbooks/check_isolated.yml kills rsync and terminates in failure. AWX continues with no error (because failure is an expected outcome for that playbook), and soon finds a .~tmp~ directory (left over by the dead rsync) in a place where it shouldn't be, causing a misleading Python backtrace to appear in the AWX UI.

ADDITIONAL INFORMATION AND INVESTIGATION
  • During the 10 minutes it takes for rsync to die, the logs are frozen in the Web UI and JobEvent.objects.count() makes no progress from the vantage point of awx-manage shell_plus
  • Killing rsync from within the ephemeral runner container results in the same IsADirectoryError backtrace, except that it happens immediately (as opposed to after 10 minutes).

Running rsync manually, by copying and pasting the deadlocked command line, demonstrates that the issue does not lie in Ansible / AWX code, e.g.:

find /tmp/awx_*/artifacts/ -type f -name "*.json" -delete
/usr/bin/rsync --delay-updates -F --compress --delete-after --archive --rsh="oc rsh --config=/tmp/awx_792_zk0tq7xm/.kubeconfig9kwpv7nq" --out-format="<<CHANGED>>%i %n%L" awx-job-792:/tmp/awx_792_zk0tq7xm/artifacts/ /tmp/awx_792_zk0tq7xm/artifacts/

(The find -delete is there to increase the rsync workload, and thus the probability of triggering the bug.)

In case of deadlock, a log line like this one will appear in the middle of the rsync spew:

E0413 17:35:40.920139    6976 v2.go:147] write /dev/stdout: resource temporarily unavailable

Investigation shows that v2.go:147 is in a goroutine that pumps stdout around, implemented in terms of io.Copy (0cbc58b being the precise commit reported by oc version in the ansible/awx_task:10.0 image that I am using).

After that log message is shown, part of oc rsh's byte-pumping apparatus becomes inoperative, and therefore rsync stops making progress.
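
To make the failure mode concrete, here is a self-contained sketch (my own illustrative stand-in, not the actual oc code): a pipe whose write end has been switched to non-blocking mode plays the role of the descriptor that rsync hands to oc rsh, and an io.Copy-based goroutine plays the role of the pump at v2.go:147. One full pipe buffer is enough to surface EAGAIN; io.Copy returns on that first write error, and the goroutine exits while the process stays alive.

package main

import (
    "io"
    "log"
    "strings"
    "syscall"
)

// fdWriter writes to a raw file descriptor: it retries partial writes but,
// like a naive pump, treats EAGAIN as an ordinary (fatal) error.
type fdWriter struct{ fd int }

func (w fdWriter) Write(p []byte) (int, error) {
    total := 0
    for total < len(p) {
        n, err := syscall.Write(w.fd, p[total:])
        if n > 0 {
            total += n
        }
        if err != nil {
            return total, err
        }
    }
    return total, nil
}

func main() {
    // A pipe whose write end is non-blocking and that nobody reads,
    // standing in for the connection rsync sets up for its --rsh child.
    fds := make([]int, 2)
    if err := syscall.Pipe(fds); err != nil {
        log.Fatal(err)
    }
    syscall.SetNonblock(fds[1], true)

    done := make(chan struct{})
    go func() { // the "byte pump" goroutine
        defer close(done)
        src := strings.NewReader(strings.Repeat("x", 1<<20))
        if _, err := io.Copy(fdWriter{fds[1]}, src); err != nil {
            // Prints something like: pump died: resource temporarily unavailable
            log.Printf("pump died: %v", err)
        }
    }()
    <-done
    // The process is still alive here, but the pump is gone; from rsync's
    // point of view the stream has simply stopped moving data.
}

With a blocking write end, the same io.Copy would simply stall until the reader catches up instead of erroring out, which is the behaviour the rest of the pipeline implicitly relies on.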

Second-hand information only below this point:

  • This has possibly happened before while attempting to use rsync on top of another piece of Go pipework (in that case, docker exec)
  • In that previous instance, telling rsync to use blocking I/O solved the problem.
@domq
Author

domq commented Apr 14, 2020

Here is some speculation that will need substantiating:

  • Because io.Copy does not have access to a shareable event-loop abstraction that would let it wait and retry on EAGAIN, it likely has no choice but to assume EAGAIN never happens (i.e., blocking I/O); a sketch of the kind of retry that is missing follows this list
  • The fact that rsync --blocking-io helped there might indicate that rsync conflates the blocking mode of the descriptors it uses for its “own” I/O (i.e. the files) with that of the descriptors it creates only to pass to the --rsh= forked process (i.e. the sockets)
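
For comparison, here is a minimal sketch (illustration only, under assumptions of mine; not a proposed oc patch and not rsync code) of the kind of retry a byte pump needs in order to tolerate a non-blocking destination: on EAGAIN, back off and try again rather than treating the error as fatal.

package main

import (
    "fmt"
    "syscall"
    "time"
)

// writeAll keeps writing buf to fd; on EAGAIN it backs off briefly and
// retries instead of giving up, which is what a pump needs once its
// destination descriptor has been switched to non-blocking mode.
func writeAll(fd int, buf []byte) error {
    for len(buf) > 0 {
        n, err := syscall.Write(fd, buf)
        if n > 0 {
            buf = buf[n:]
        }
        if err == syscall.EAGAIN {
            time.Sleep(time.Millisecond) // crude; a real pump would poll for writability
            continue
        }
        if err != nil {
            return err
        }
    }
    return nil
}

func main() {
    // Non-blocking pipe with a deliberately slow reader: the writer hits
    // EAGAIN whenever the pipe buffer is full, retries, and still delivers
    // every byte.
    fds := make([]int, 2)
    if err := syscall.Pipe(fds); err != nil {
        panic(err)
    }
    syscall.SetNonblock(fds[1], true)
    go func() {
        tmp := make([]byte, 65536)
        for {
            time.Sleep(10 * time.Millisecond)
            if _, err := syscall.Read(fds[0], tmp); err != nil {
                return
            }
        }
    }()
    if err := writeAll(fds[1], make([]byte, 1<<20)); err != nil {
        panic(err)
    }
    fmt.Println("1 MiB written despite intermittent EAGAIN")
}

A real fix would live either there (teach the copier to wait and retry) or on the rsync side, where --blocking-io asks rsync to use blocking I/O for the remote-shell transport so that EAGAIN never surfaces in the first place.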

@domq
Author

domq commented Apr 14, 2020

In fact it is even debatable whether this is an rsync bug, an oc bug, or something that should somehow be worked around in AWX's own pipework.

@domq
Author

domq commented Apr 14, 2020

Here is why this issue doesn't show up with “ordinary” rsync over ssh.

@ryanpetrello
Contributor

ryanpetrello commented Apr 14, 2020

Wow. First, thanks for spending so much time on this one - it's pretty gross 😂 and very complicated.

Having read all of this, this comment captures my first impression:

In fact it is even debatable whether this is an rsync bug, an oc bug, or something that should somehow be worked around in AWX's own pipework.

Based on your description, this really sounds like a bug in oc itself. Have you considered reaching out to the OpenShift folks and seeing if they have any ideas?

Also, any chance you've tried and found better results by adding --blocking-io here?

https://github.com/ansible/awx/blob/devel/awx/playbooks/check_isolated.yml#L41
https://github.com/ansible/awx/blob/devel/awx/playbooks/check_isolated.yml#L54

@shanemcd
Member

Yeah, holy shit. Thank you for digging so deep into this. I see you found my synchronize + oc rsh hack. I knew that it was probably a matter of time before someone found a fatal flaw with it. 😄

If @ryanpetrello's suggestion doesn't work, we might want to move away from using run_isolated.yml and check_isolated.yml for container groups. Those (as the name implies) were for our Isolated Node feature (not an upstream concept), but I reused them because it seemed to work. If we're learning that it doesn't, it would be easy enough to add a conditional right before these playbooks are invoked and handle the container groups case differently.

@domq
Author

domq commented Apr 16, 2020

Also, any chance you've tried and found better results by adding --blocking-io here?

https://github.com/ansible/awx/blob/devel/awx/playbooks/check_isolated.yml#L41
https://github.com/ansible/awx/blob/devel/awx/playbooks/check_isolated.yml#L54

I've been distracted by other work lately, but I'll give this a try ASAP.

@domq
Author

domq commented Apr 16, 2020

Burn-in test is in progress; looks good so far (3 down, 3 to go).

[screenshot: rsync-EAGAIN]

@domq
Author

domq commented Apr 16, 2020

Also, any chance you've tried and found better results by adding --blocking-io here?

Indeed, --blocking-io helps. Pull request to follow.

@ryanpetrello
Contributor

👍 thanks a ton for testing this, @domq - if you want to let it marinate for a day or two, I'll merge your patch early next week once you're feeling confident. cc @shanemcd
