AWX stops gathering job output if kubernetes starts a new log #11338

dbanttari · 2021-11-11T20:27:23Z

Please confirm the following

I agree to follow this project's code of conduct.
I have checked the current issues for duplicates.
I understand that AWX is open source software provided for free and that I am not entitled to status updates or other assurances.

Summary

If a very large job is run, sometimes Kubernetes will start a new log, but AWX's reporting of job progess stalls. Eventually the job ends as an Error, even though the job itself continues to completion in the pod

AWX version

19.4.0

Installation method

kubernetes

Modifications

yes

Ansible version

2.11.3.post0

Operating system

Azure Kubernetes Service 1.21.2

Web browser

Chrome

Steps to reproduce

We can tail the log of the pod in real time using kubectl

->k logs -f -n awx automation-job-29506-2qlhn
...
...
...
{"uuid": "021508bc-6424-4b63-a394-a42f0aaeab36", "counter": 42648, "stdout": "", "start_line": 23274, "end_line": 23274, "runner_ident": "29506", "event": "runner_on_start", "job_id": 29506, "pid": 19, "created": "2021-11-11T19:51:19.974410", "parent_uuid": "261db21c-7d3e-ca0c-65e3-000000042eae", "event_data": {"playbook": "site.yml", "playbook_uuid": "3cc1df7c-23bc-4641-8262-1eaf9a73ea2f", "play": "tag_role_edapp:tag_role_eda:tag_role_edweb:tag_role_edw:tag_role_rhb*:tag_role_rwb1*:tag_role_vhb*:tag_role_vtxa*:tag_role_vtxw*:tag_role_vweb:tag_role_vws1*:tag_role_vws2*:tag_role_vwsa:tag_role_wvd1", "play_uuid": "261db21c-7d3e-ca0c-65e3-00000000031f", "play_pattern": "tag_role_edapp:tag_role_eda:tag_role_edweb:tag_role_edw:tag_role_rhb*:tag_role_rwb1*:tag_role_vhb*:tag_role_vtxa*:tag_role_vtxw*:tag_role_vweb:tag_role_vws1*:tag_role_vws2*:tag_role_vwsa:tag_role_wvd1", "task": "Create component in udeploy", "task_uuid": "261db21c-7d3e-ca0c-65e3-000000042eae", "task_action": "uri", "task_args": "", "task_path": "/runner/project/roles/udeploy-agent/tasks/add_component.yml:26", "role": "udeploy-agent", "host": "**********", "uuid": "021508bc-6424-4b63-a394-a42f0aaeab36"}}

*** this window stalls until the job is complete and the pod exits ***

*** opened a new window and it resumes: ***

->k logs -f -n awx automation-job-29506-2qlhn
{"uuid": "7193d6f6-1f7e-48c6-8f4d-b764817d31cb", "counter": 56862, "stdout": "\u001b[0;32mok: [dvuc-rhb1b-vm0]\u001b[0m", "start_line": 32869, "end_line": 32870, "runner_ident": "29506", "event": "runner_on_ok", "job_id": 29506, "pid": 19, "created": "2021-11-11T20:03:21.402116", "parent_uuid": "261db21c-7d3e-ca0c-65e3-0000000505be", "event_data": {"playbook": "site.yml", "playbook_uuid": "3cc1df7c-23bc-4641-8262-1eaf9a73ea2f", "play": "tag_role_rhb*:tag_role_rwb1*:tag_role_vtxa*:tag_role_vtxw*:tag_role_vweb:tag_role_vws1*:tag_role_vws2*:tag_role_vwsa", "play_uuid": "261db21c-7d3e-ca0c-65e3-0000000003c3", "play_pattern": "tag_role_rhb*:tag_role_rwb1*:tag_role_vtxa*:tag_role_vtxw*:tag_role_vweb:tag_role_vws1*:tag_role_vws2*:tag_role_vwsa", "task": "Make renamed executable for easier troubleshooting in multi-instance installations", "task_uuid": "261db21c-7d3e-ca0c-65e3-0000000505be", "task_action": "ansible.windows.win_copy", "task_args": "", "task_path": "/runner/project/roles/tomcat/tasks/install-tomcat.yml:20", "role": "tomcat", "host": "***********", "remote_addr": "**********", "res": {"src": "**********", "dest": "**********", "checksum": null, "size": 131072, "original_basename": "tomcat9.exe", "changed": false, "operation": "file_copy", "_ansible_no_log": false}, "start": "2021-11-11T20:03:20.256071", "end": "2021-11-11T20:03:21.401776", "duration": 1.145705, "event_loop": null, "uuid": "7193d6f6-1f7e-48c6-8f4d-b764817d31cb"}}
...
...
...
{"status": "failed", "runner_ident": "29506"}
{"zipfile": 85344}
*** zipfile contents here ***
{"eof": true}

Note that this example is discontiguous, which implies that this may happen multiple times during a pod's life

Expected results

Expected AWX to continue reporting the progress of the job, and noting its proper completion status (in this case, "Failed")

Actual results

The last item presented by AWX was the last entry in the log before it started anew:

Additional information

Ansible pod image is based on awx-ee:0.6.0 but adds things like helm, openssl, azure cli, zabbix-api, and other libraries that are used by various tasks

Installation was done via AWX Operator 0.14.0

The text was updated successfully, but these errors were encountered:

dbanttari · 2021-11-16T21:04:30Z

This seems similar/related? kubernetes/kubernetes#28369

shanemcd · 2021-11-18T09:54:54Z

We recently saw this in an AKS deployment. I've verified a workaround by setting kubelet --container-log-max-size to a higher number, or using containerLogMaxSize in a kubelet config file.

chrismeyersfsu · 2021-11-19T18:59:08Z

#10366 (comment) @dbanttari this might fix your issue. Note the below modification that is more generic and doesn't rely on the max-pods being set to 30. Can you try it and report back?

if ! $(grep q container-log-max-size /etc/default/kubelet); then sed -E -i 's/max-pods=([0-9]*)/max-pods=\1 --container-log-max-size=500Mi/g' /etc/default/kubelet; systemctl restart kubelet; fi

Tioborto · 2021-12-10T15:54:58Z

Hello,
We are in the same case with our AWX instance (same version).
We are currently in a GKE cluster so we don't have control over the kubelet configuration.
Do you plan to fix this soon?

shanemcd · 2021-12-14T08:51:24Z

I'm not sure how we would fix this without completely rearchitecting things to not rely on the kubernetes logging system. We don't have any plans to do this right now.

Tioborto · 2021-12-20T08:44:44Z

Hello @shanemcd ,
This is a real problem for us. We have users with jobs in "ERROR" status while we have full logs in our external tool.
Why is this not a top priority? This is a regression from previous versions

bschaeffer · 2022-01-07T19:38:58Z

I'm not sure how we would fix this without completely rearchitecting things to not rely on the kubernetes logging system. We don't have any plans to do this right now.

Running this in kubernetes 1.20+ should be called as strictly unsupported and broken I feel like

bschaeffer · 2022-01-18T14:20:54Z

From #11511:

FWIW, this does seem to be related to container runtime. In GKE, we recently upgraded to 1.20 which defaults to containerd at the same time we saw this error. When we rolled back to 1.19, it was also broken, but we realised it was still using containerd://1.4.8.

When we switched the runtime to docker://19.3.15, it actually fixed our problem.

So things are now in a working state, but it will be EOL for us in about 5 months.

MalfuncEddie · 2022-01-21T08:40:06Z

I'm not sure how we would fix this without completely rearchitecting things to not rely on the kubernetes logging system. We don't have any plans to do this right now.

I agree this is a contanerd/k8s issue. But it also best practice to "limit' logs from filling up filesystems.

But relying on the logging systems of k8s does not seem reliable.

Are te following options valid?
Move to openshift?
Is it possible to move the execution nodes to podman (and does podman have this issue?)

oliverf1 · 2022-02-01T05:18:14Z

I'm not sure how we would fix this without completely rearchitecting things to not rely on the kubernetes logging system. We don't have any plans to do this right now.

If this kubernetes/kubernetes#59902 was implemented, would not that help?

I can also confirm is it an issue for us too. For some problematic playbooks with lots of output and multiple hosts, the only solution we found, was to create a playbook that will create a pod which runs the playbook directly on the k8s cluster, so outside of AWX :-( In our case, increasing the log size is not an option, because we have other workloads on the cluster and increasing the log file size will drastically increase the disk space.

Thanks

chris93111 · 2022-02-24T12:27:14Z

Hi same problem in k3s, this is a real problem for us also

Rhel 8.4
Version: 0.1.0
RuntimeName: containerd
RuntimeVersion: v1.5.8-k3s1
RuntimeApiVersion: v1alpha2
k3s 1.22

Fixed with kubelet --container-log-max-size

jogleasonjr · 2022-02-24T15:29:53Z

Using AKS and Terraform we were able to get enough headroom by specifying:

resource "azurerm_kubernetes_cluster_node_pool" "pool" { 
  ...

  kubelet_config = {
    container_log_max_line    = 5
    container_log_max_size_mb = 500
  }
  ...
}

The default, I believe, was 10MB.

def1nity · 2022-02-25T16:40:38Z

#10366 (comment) @dbanttari this might fix your issue. Note the below modification that is more generic and doesn't rely on the max-pods being set to 30. Can you try it and report back?

if ! $(grep q container-log-max-size /etc/default/kubelet); then sed -E -i 's/max-pods=([0-9]*)/max-pods=\1 --container-log-max-size=500Mi/g' /etc/default/kubelet; systemctl restart kubelet; fi

Hey, I tried your suggestion but included--container-log-max-files=3 before I read that 5 is the default value... And even though the daemonset pods failed to start with Crashloopback and I deleted them a couple of minutes later somehow the kubelet config file now looks like this:

KUBELET_FLAGS=--address=0.0.0.0 --anonymous-auth=false --authentication-token-webhook=true --authorization-mode=Webhook --azure-container-registry-config=/etc/kubernetes/
azure.json --cgroups-per-qos=true --client-ca-file=/etc/kubernetes/certs/ca.crt --cloud-config=/etc/kubernetes/azure.json --cloud-provider=azure --cluster-dns=10.0.0.0
--cluster-domain=cluster.local --enforce-node-allocatable=pods --event-qps=0 --eviction-hard=memory.available<750Mi,nodefs.available<10%,nodefs.inodesFree<5% --feature-g
ates=RotateKubeletServerCertificate=true --image-gc-high-threshold=85 --image-gc-low-threshold=80 --image-pull-progress-deadline=30m --keep-terminated-pod-volumes=false -
-kube-reserved=cpu=100m,memory=1843Mi --kubeconfig=/var/lib/kubelet/kubeconfig --max-pods=35 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-ma
x-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container
-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --co
ntainer-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files
=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-ma
x-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container
-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --co
ntainer-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500
Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-s
ize=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-lo
g-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --conta
iner-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3
--container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-f
iles=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-lo
g-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --conta
iner-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi
--container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size
=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-m
ax-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --containe
r-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --c
ontainer-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-file
s=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-m
ax-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --containe
r-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --c
ontainer-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=50
0Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-
size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-l
og-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --cont
ainer-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3
--container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-
files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-l
og-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --cont
ainer-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi
--container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --container-log-max-size=500Mi --container-log-max-files=3 --network-plugin=cni --
node-status-update-frequency=10s --non-masquerade-cidr=0.0.0.0/0 --pod-infra-container-image=mcr.microsoft.com/oss/kubernetes/pause:3.5 --pod-manifest-path=/etc/kubernete
s/manifests --pod-max-pids=-1 --protect-kernel-defaults=true --read-only-port=0 --resolv-conf=/run/systemd/resolve/resolv.conf --rotate-certificates=false --streaming-con
nection-idle-timeout=4h --tls-cert-file=/etc/kubernetes/certs/kubeletserver.crt --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM
_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_
WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256 --tls-private-key-file=/etc/kubernetes/certs/kubeletserver.key
KUBELET_REGISTER_SCHEDULABLE=true
NETWORK_POLICY=azure

KUBELET_NODE_LABELS=kubernetes.azure.com/role=agent,agentpool=nodepoold,kubernetes.azure.com/agentpool=nodepoold,storageprofile=managed,storagetier=Premium_LRS,kube
rnetes.azure.com/storageprofile=managed,kubernetes.azure.com/storagetier=Premium_LRS,kubernetes.azure.com/os-sku=Ubuntu,kubernetes.azure.com/cluster=MC_rg_cluster_westeurope
,kubernetes.azure.com/mode=system,kubernetes.azure.com/node-image-version=AKSUbuntu-1804gen2containerd-2021.09.28,nodepool=defaultnodepoolnew

root@aks-nodepoolD:/etc/default#

Funny enough everything works exactly as before and as far as I can tell, nothing was removed, only the two parameters got added a gazzilion times... Anyway I would like to fix this mess and restore the default kubelet config file if possible.

Could anyone suggest how to fix this please?

ThomasADavis · 2022-03-01T20:19:22Z

Lucky for us, I've also setup ARA. so when the log file rotates, ARA still has the complete job data..

Tioborto · 2022-03-22T14:36:54Z

Any update about the resolution of this problem @shanemcd ? please

AlanCoding · 2022-04-08T14:55:44Z

Also see some good quality research in the ansible-runner issue ansible/ansible-runner#998

This is particularly a problem for inventory updates that produce large inventories, because the archive of the artifacts necessarily has to be large, and the way that is sent requires precise byte alignment or else it won't work.

Trying to think of where / how a patch might be developed... a change to awx-operator might just increase the size that it can handle but not fundamentally address the issue. If I could reproduce it, maybe I could get the syntax of what the controller node reads, and figure out how to identify these breaks in the ansible-runner code.

cerberek · 2022-06-15T14:08:41Z

For me the same behavior happens when automation-job pod finished and is destroyed.
In pod logs i could see that all happened correctly but AWX jobs have Error with trimmed log output.

kdelee · 2022-07-27T18:49:06Z

@AlanCoding to force a reproducer, you could try setting kubelet --container-log-max-size to a very small number

ethemcemozkan · 2022-09-22T06:05:52Z

Any update about the resolution?

andreasbourges · 2022-09-27T06:14:07Z

Hi same problem in k3s, this is a real problem for us also

Rhel 8.4 Version: 0.1.0 RuntimeName: containerd RuntimeVersion: v1.5.8-k3s1 RuntimeApiVersion: v1alpha2 k3s 1.22

Fixed with kubelet --container-log-max-size

Hi - trying to apply the mentioned workaround for k3s, too, but currently I'm unsure on where and how to apply the container-log-max-size configuration. Could you please give me a hint, where and how you deployed that?

Thanks,

Andreas

kurokobo · 2022-09-27T06:22:52Z

@andreasbourges
Modify /etc/systemd/system/k3s.service and restart K3s,

$ cat /etc/systemd/system/k3s.service
...
ExecStart=/usr/local/bin/k3s \
    server \
        '--write-kubeconfig-mode' \
        '644' \
        '--kubelet-arg' \     👈👈👈
        'container-log-max-files=4' \     👈👈👈
        '--kubelet-arg' \     👈👈👈
        'container-log-max-size=50Mi' \     👈👈👈

$ sudo systemctl daemon-reload
$ sudo systemctl restart k3s

Or re-install K3s using script with specific arguments,

$ curl -sfL https://get.k3s.io | sh -s - --write-kubeconfig-mode 644 --kubelet-arg "container-log-max-files=4" --kubelet-arg "container-log-max-size=50Mi"

andreasbourges · 2022-09-27T06:27:44Z

@andreasbourges Modify /etc/systemd/system/k3s.service and restart K3s,

$ cat /etc/systemd/system/k3s.service
...
ExecStart=/usr/local/bin/k3s \
    server \
        '--write-kubeconfig-mode' \
        '644' \
        '--kubelet-arg' \     👈👈👈
        'container-log-max-files=4' \     👈👈👈
        '--kubelet-arg' \     👈👈👈
        'container-log-max-size=50Mi' \     👈👈👈

$ sudo systemctl daemon-reload
$ sudo systemctl restart k3s

Or re-install K3s using script with specific arguments,

$ curl -sfL https://get.k3s.io | sh -s - --write-kubeconfig-mode 644 --kubelet-arg "container-log-max-files=4" --kubelet-arg "container-log-max-size=50Mi"

Thanks a lot - this saves me a lot of time!

Highly appreciated!

anasaizg · 2022-09-27T08:42:15Z

Hi! I have been reading all your valuable information and tests for the last days, because we are facing the same issue. I just wanted to let you know that, in our case, changing the log configuration didn't solve it, so I'm afraid we have come to the wall of "4h maximum connection time" that is commented in this other issue:

#11805 (comment)

andreasbourges · 2022-09-27T08:52:30Z

Hi, thanks for sharing your experience - I adjusted the amount of data processed in the playbook (way too much information was gathered via an http request to Autobot) *and* I adjusted the container-log size. What I can tell is, that opening the output log from the aborted tasks, still triggers the uwsgi consuming GBs of memory. We're hoping that adjusting memory consumption will prevent the tasks from failing and thus avoid the uwsgi problem. Well, let's see. Thanks, Andreas Edit by Gundalow to tidy up formatting

bschaeffer · 2022-10-12T14:19:55Z

Just want to say that increasing size before log rotation does not solve the problem. It just decreases the likelyhood that your run will take place or trigger a log rotation that causes the error. Its not a solution to this problem in any way.

andreasbourges · 2022-10-12T15:47:08Z

Hi Braden, .I totally agree - this is a non-deterministic workaround, but in combination with the reduction of the generated logs we're running fine so far - not a single error since the change (before we had a chance of 50:50). And in the absence of any alternatives, we are happy to have this one. Do you know of any other ways to solve this? Thanks, Andreas Von: Braden Schaeffer ***@***.***> Gesendet: Mittwoch, 12. Oktober 2022 16:20 An: ansible/awx ***@***.***> Cc: andreasbourges ***@***.***>; Mention ***@***.***> Betreff: Re: [ansible/awx] AWX stops gathering job output if kubernetes starts a new log (Issue #11338) Just want to say that increasing size before log rotation does not solve the problem. It just decreases the likelyhood that your run will take place or trigger a log rotation that causes the error. Its not a solution to this problem in any way. - Reply to this email directly, view it on GitHub <#11338 (comment)> , or unsubscribe <https://github.com/notifications/unsubscribe-auth/AIB4YVQ5FYXYS7JVDRJY5RDWC 3CJRANCNFSM5H3JFP5A> . You are receiving this because you were mentioned.Message ID: ***@***.***>

TheRealHaoLiu · 2022-12-14T16:33:03Z

fixed in ansible/receptor#683

domq · 2023-07-05T10:25:12Z

fixed in ansible/receptor#683

Not for me — I seem to still get only the first chunk of logs. Running AWX receptor version 1.4.1 against OpenShift 3.11's (I know, I know) Kubernetes 1.11.0+d4cacc0.

It is worth noting that the /tmp/receptor/awx-0/*/stdout files do not end with (or contain) an { "eof": true } marker in this case, indicating that Receptor at least understands that it has work remaining.

I'm not sure how we would fix this without completely rearchitecting things to not rely on the kubernetes logging system.

Indeed. I seem to remember that at some point, AWX used to rely on RabbitMQ for this task?

... You see, umm, we haven't applied `-t awx` for a long time, because [that would break logging in AWX](ansible/awx#11338 (comment)) 🤦

Klaas- · 2024-01-09T13:44:47Z

this issue seems finally fixed for me with kubernetes 1.29

As seen in ansible/awx#11338 and ansible/receptor#446 - Force `RECEPTOR_KUBE_SUPPORT_RECONNECT` as per ansible/receptor#683 - Pump up timeouts thereof

shanemcd mentioned this issue Nov 19, 2021

AWX RuntimeError resulting in loss of worker task logs and true outcome #9961

Closed

shanemcd mentioned this issue Jan 7, 2022

automation-job completes successfully but the ops-awx-task and UI show an error and stop producing output #11511

Closed

3 tasks

This was referenced Jan 31, 2022

AWX 19.2.0 - Job Output not in sync - /var/run/receptor/receptor.sock Deadlock ? #10366

Closed

Long job marked as error because logs stop to be retrieved while job continues and complete successfully #11451

Closed

chris93111 mentioned this issue Feb 24, 2022

job output no complete and appear on ERROR in UI Exceeded retries for reading stdout #11803

Closed

6 tasks

chris93111 mentioned this issue Mar 4, 2022

AWX and Execution Environment not synchronized #10211

Closed

mabashian added the component:api label Apr 7, 2022

AlanCoding mentioned this issue Apr 8, 2022

Runner seems to have difficulties to process large outputs ansible/ansible-runner#998

Closed

kladiv mentioned this issue Apr 28, 2022

Playbooks terminated unexpectedly after 4 hours #11805

Closed

6 tasks

kurokobo mentioned this issue Jul 30, 2022

Jobs with long running tasks and Inventory sync terminating after 5 minutes #12530

Closed

9 tasks

klauserber mentioned this issue Aug 17, 2022

Automation jobs are canceled unintentionally #12677

Closed

9 tasks

wtmthethird mentioned this issue Aug 31, 2022

502 Bad Gateway Errors in AWX GUI browsing Jobs, Templates #12644

Closed

9 tasks

shanemcd mentioned this issue Oct 13, 2022

Pod logs stop being pulled when container log files are rotated ansible/receptor#446

Open

shanemcd mentioned this issue Oct 26, 2022

Add retry mechanism to handle intermittent connection issues with Kubernetes logging stream ansible/receptor#683

Merged

TheRealHaoLiu closed this as completed Dec 14, 2022

filipprosovsky mentioned this issue Dec 21, 2022

EC2 Inventory Errors, but no error shown #13357

Closed

9 tasks

fosterseth mentioned this issue Jul 26, 2023

Task was marked as running but was not present in the job queue, so it has been marked as failed. #14277

Open

11 tasks

domq pushed a commit to epfl-si/wp-ops that referenced this issue Dec 12, 2023

[feature] -t awx.k8s.secrets

fb27364

... You see, umm, we haven't applied `-t awx` for a long time, because [that would break logging in AWX](ansible/awx#11338 (comment)) 🤦

domq mentioned this issue Aug 26, 2024

[upgrade] AWX to 22.3 epfl-si/wp-ops#558

Merged

domq pushed a commit to epfl-si/wp-ops that referenced this issue Sep 27, 2024

[workaround] AWX known log truncation bug

fd0df34

As seen in ansible/awx#11338 and ansible/receptor#446 - Force `RECEPTOR_KUBE_SUPPORT_RECONNECT` as per ansible/receptor#683 - Pump up timeouts thereof

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AWX stops gathering job output if kubernetes starts a new log #11338

AWX stops gathering job output if kubernetes starts a new log #11338

dbanttari commented Nov 11, 2021 •

edited

Loading

dbanttari commented Nov 16, 2021

shanemcd commented Nov 18, 2021

chrismeyersfsu commented Nov 19, 2021

Tioborto commented Dec 10, 2021

shanemcd commented Dec 14, 2021

Tioborto commented Dec 20, 2021

bschaeffer commented Jan 7, 2022

bschaeffer commented Jan 18, 2022

MalfuncEddie commented Jan 21, 2022

oliverf1 commented Feb 1, 2022

chris93111 commented Feb 24, 2022 •

edited

Loading

jogleasonjr commented Feb 24, 2022

def1nity commented Feb 25, 2022

ThomasADavis commented Mar 1, 2022

Tioborto commented Mar 22, 2022

AlanCoding commented Apr 8, 2022

cerberek commented Jun 15, 2022

kdelee commented Jul 27, 2022

ethemcemozkan commented Sep 22, 2022

andreasbourges commented Sep 27, 2022

kurokobo commented Sep 27, 2022

andreasbourges commented Sep 27, 2022

anasaizg commented Sep 27, 2022

andreasbourges commented Sep 27, 2022 via email •

edited by gundalow

Loading

bschaeffer commented Oct 12, 2022

andreasbourges commented Oct 12, 2022 via email

TheRealHaoLiu commented Dec 14, 2022

domq commented Jul 5, 2023 •

edited

Loading

Klaas- commented Jan 9, 2024

AWX stops gathering job output if kubernetes starts a new log #11338

AWX stops gathering job output if kubernetes starts a new log #11338

Comments

dbanttari commented Nov 11, 2021 • edited Loading

Please confirm the following

Summary

AWX version

Installation method

Modifications

Ansible version

Operating system

Web browser

Steps to reproduce

Expected results

Actual results

Additional information

dbanttari commented Nov 16, 2021

shanemcd commented Nov 18, 2021

chrismeyersfsu commented Nov 19, 2021

Tioborto commented Dec 10, 2021

shanemcd commented Dec 14, 2021

Tioborto commented Dec 20, 2021

bschaeffer commented Jan 7, 2022

bschaeffer commented Jan 18, 2022

MalfuncEddie commented Jan 21, 2022

oliverf1 commented Feb 1, 2022

chris93111 commented Feb 24, 2022 • edited Loading

jogleasonjr commented Feb 24, 2022

def1nity commented Feb 25, 2022

ThomasADavis commented Mar 1, 2022

Tioborto commented Mar 22, 2022

AlanCoding commented Apr 8, 2022

cerberek commented Jun 15, 2022

kdelee commented Jul 27, 2022

ethemcemozkan commented Sep 22, 2022

andreasbourges commented Sep 27, 2022

kurokobo commented Sep 27, 2022

andreasbourges commented Sep 27, 2022

anasaizg commented Sep 27, 2022

andreasbourges commented Sep 27, 2022 via email • edited by gundalow Loading

bschaeffer commented Oct 12, 2022

andreasbourges commented Oct 12, 2022 via email

TheRealHaoLiu commented Dec 14, 2022

domq commented Jul 5, 2023 • edited Loading

Klaas- commented Jan 9, 2024

dbanttari commented Nov 11, 2021 •

edited

Loading

chris93111 commented Feb 24, 2022 •

edited

Loading

andreasbourges commented Sep 27, 2022 via email •

edited by gundalow

Loading

domq commented Jul 5, 2023 •

edited

Loading