Recover rsyslog from 4xx error #14719

Merged

Member

@TheRealHaoLiu TheRealHaoLiu commented Dec 11, 2023

SUMMARY

Fixes #7560

Related to rsyslog/rsyslog#4348

The 'omhttp' module for rsyslog completely stops forwarding messages to the external log aggregator after receiving a 4xx error from it.

This PR is a "workaround" for this problem: it restarts rsyslogd after detecting that rsyslog received a 4xx error.

NOTE: this workaround will cause message loss! It's best to resolve the root cause of the 4xx.

ISSUE TYPE
  • Bug, Docs Fix or other nominal change
COMPONENT NAME
  • Other
AWX VERSION
awx: 23.5.2.dev11+g85e9e02a41
ADDITIONAL INFORMATION

events=PROCESS_LOG_STDERR
priority=0
autorestart=true
stdout_events_enabled = true
Member

I thought this flag needed to be on the service you want to monitor. I expected it under [program:awx-rsyslogd]

Member Author

it feels like

stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
stderr_logfile=/dev/stderr
stderr_logfile_maxbytes=0

and

stdout_events_enabled = true
stderr_events_enabled = true

are equivalent,
and the rsyslogd section already has

stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
stderr_logfile=/dev/stderr
stderr_logfile_maxbytes=0

Member Author

trying it now

Member Author

An event type emitted when a process writes to stdout or stderr. The event will only be emitted if the file descriptor is not in capture mode and if stdout_events_enabled or stderr_events_enabled config options are set to true.

nvm
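
For reference, a minimal sketch of how the payload of a PROCESS_LOG_STDOUT / PROCESS_LOG_STDERR event could be split into its `key:value` header line and the logged data. The function name `parse_log_payload` and the sample values are made up for illustration and are not taken from this PR; the layout (one line of space-separated tokens, then the data) is assumed from the supervisor events documentation.

```python
def parse_log_payload(payload):
    """Split a PROCESS_LOG_* payload into (fields, data).

    Assumed layout: the first line holds space-separated key:value tokens
    (processname, groupname, pid, ...); everything after it is the text
    the process wrote.
    """
    header_line, _, data = payload.partition("\n")
    # maxsplit=1 keeps any value that itself contains ":" intact
    fields = dict(token.split(":", 1) for token in header_line.split())
    return fields, data


# Hypothetical sample payload, for illustration only.
fields, data = parse_log_payload(
    "processname:awx-rsyslogd groupname:awx-rsyslogd pid:42\nexample stderr line\n"
)
print(fields["pid"], repr(data))
```

Note that every parsed value, including `pid`, is still a string at this point.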

if headers["eventname"] == "PROCESS_STATE_FATAL":
    headers.update(
        dict(
            [x.split(":") for x in sys.stdin.read(int(headers["len"])).split()]
Member

For any historians that come back to this PR comment.

We aren't sure that this ever worked. Or, if it did, it was at a time when the header and data were not read in the general while loop above.

This sys.stdin.read() would consume input beyond the current message boundary; we think it may read into another message from supervisor. The logic that follows only "replies" once, which could lead to the supervisor event buffer backing up.
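
To make the message-boundary point concrete, here is a hedged sketch of one listener cycle: announce READY, read exactly one header line, read exactly `len` characters of payload, and acknowledge exactly once. The helper names (`parse_header`, `handle_one_event`) and the sample events are hypothetical, and this is not the code merged in this PR. It uses text streams, so it assumes the payload is ASCII (supervisor's `len` counts bytes).

```python
import io


def parse_header(line):
    """Turn 'key:value key:value ...' into a dict of strings."""
    return dict(token.split(":", 1) for token in line.split())


def handle_one_event(stdin, stdout, on_event):
    """Process exactly one event and reply exactly once."""
    stdout.write("READY\n")          # tell supervisor we can accept an event
    stdout.flush()
    headers = parse_header(stdin.readline())
    # Read only this event's payload; reading more would eat the next event.
    payload = stdin.read(int(headers["len"]))
    on_event(headers, payload)
    stdout.write("RESULT 2\nOK")     # one acknowledgement per event
    stdout.flush()
    return headers, payload


# Two back-to-back events on the same stream (sample data, for illustration).
first = "processname:a groupname:a pid:1\nboom\n"
second = "processname:b groupname:b pid:2\nfine\n"
stream = io.StringIO(
    f"ver:3.0 eventname:PROCESS_LOG_STDERR len:{len(first)}\n{first}"
    f"ver:3.0 eventname:PROCESS_LOG_STDERR len:{len(second)}\n{second}"
)
out = io.StringIO()
seen = []
for _ in range(2):
    handle_one_event(stream, out, lambda h, p: seen.append(p))
```

Because each cycle consumes one header and one payload, a second unbounded or repeated read inside the handler would desynchronize the stream, which is the failure mode described above.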

write_stderr(
    f"{datetime.datetime.now(timezone.utc)} - sending SIGTERM to proc={headers} with data={headers}\n"
)
os.kill(headers["pid"], signal.SIGTERM)
Member

A PROCESS_STATE_FATAL event is supervisor telling us that it has given up starting our process and will not try to start it again. There will be no pid in the header.
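
A small sketch of guarding against that case, assuming the event fields have already been parsed into a dict of strings. The names `pid_from_fields` and `terminate_if_possible` are hypothetical, and the `kill` parameter exists only so the example can run without signalling a real process. Parsed values are strings, while `os.kill` expects an integer pid, so the conversion is done explicitly.

```python
import os
import signal


def pid_from_fields(fields):
    """Return the pid as an int, or None when the event carries no usable pid."""
    raw = fields.get("pid")
    if raw is None or not raw.isdigit():
        return None
    return int(raw)


def terminate_if_possible(fields, kill=os.kill):
    """Send SIGTERM only when a pid is present; report whether a signal was sent."""
    pid = pid_from_fields(fields)
    if pid is None:
        return False  # e.g. a PROCESS_STATE_FATAL event: nothing to signal
    kill(pid, signal.SIGTERM)
    return True


# Demonstration with a fake kill function so no real process is touched.
calls = []
fake_kill = lambda pid, sig: calls.append((pid, sig))
sent_fatal = terminate_if_possible({"processname": "x", "from_state": "BACKOFF"}, kill=fake_kill)
sent_log = terminate_if_possible({"processname": "x", "pid": "42"}, kill=fake_kill)
```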

@@ -133,3 +165,7 @@ command = supervisor_stdout
buffer_size = 100
events = PROCESS_LOG
result_handler = supervisor_stdout:event_handler
stdout_logfile=/dev/stdout
Member

[eventlistener:stdout] What does this eventlistener do?

Member Author

It comes from https://github.com/coderanger/supervisor-stdout
It seems to just dump supervisor logs to stdout so we can see them in the container log.

events=PROCESS_LOG_STDERR
priority=0
autorestart=true
stderr_logfile=/dev/stderr
Member

I noticed this eventlistener doesn't specify a stdout_logfile. Is that so that restarting rsyslog will be seen in the docker container output?

Member Author

stdout contains only the

READY
RESULT 2\nOK

messages, and I don't want them to spam a file.

stderr will go into the container output (which contains the messages about restarting the service).

@TheRealHaoLiu TheRealHaoLiu force-pushed the hack-recover-rsyslog-from-4xx branch 3 times, most recently from bed006d to 7656bb2 Compare December 12, 2023 17:13
@@ -8,13 +8,14 @@ pidfile = /var/run/supervisor/supervisor.rsyslog.pid
[program:awx-rsyslogd]
command = rsyslogd -n -i /var/run/awx-rsyslog/rsyslog.pid -f /var/lib/awx/rsyslog/rsyslog.conf
autorestart = true
-startsecs = 30
+startsecs = 0
Member Author

prevent PROCESS_STATE_FATAL

@TheRealHaoLiu TheRealHaoLiu force-pushed the hack-recover-rsyslog-from-4xx branch 3 times, most recently from 9764f38 to 8f95784 Compare December 13, 2023 15:24
Due to ansible#7560

The 'omhttp' module for rsyslog completely stops forwarding messages to the external log aggregator after receiving a 4xx error from it.

This PR is a "workaround" for this problem: it restarts rsyslogd after detecting that rsyslog received a 4xx error.
Not every log message needs to be emitted as an event!
@TheRealHaoLiu TheRealHaoLiu enabled auto-merge (rebase) December 14, 2023 15:23
@TheRealHaoLiu TheRealHaoLiu merged commit 6440e3c into ansible:devel Dec 14, 2023
21 checks passed
Successfully merging this pull request may close these issues.

[External Logger] Rsyslogd unexpectedly stop sending events to Splunk http collector
3 participants