[Auditbeat] Recover from errors in audit monitoring routine #22673
The auditd module spawns a monitoring goroutine that fetches auditd status every 15s. Because this routine uses a single audit client, if an update fails (because a netlink message is late, or for other causes), the audit client can get out of sync with the stream, and all subsequent requests fail.

For reasons that aren't 100% clear to me at the moment, this error condition leads to a lot of `[audit_send_repl]` (2.6.x) / `[audit_send_reply]` (3.x+) kernel threads being created.

```
ERROR [auditd] auditd/audit_linux.go:183 get status request failed:failed to get audit status ack: unexpected sequence number for reply (expected 6286 but got 6285)
```

```
$ ps -ef
[...]
root 27790 2 0 12:52 ? 00:00:00 [audit_send_repl]
root 27791 2 0 12:52 ? 00:00:00 [audit_send_repl]
root 27792 2 0 12:52 ? 00:00:00 [audit_send_repl]
root 27793 2 0 12:52 ? 00:00:00 [audit_send_repl]
root 27794 2 0 12:52 ? 00:00:00 [audit_send_repl]
root 27795 2 0 12:52 ? 00:00:00 [audit_send_repl]
root 27796 2 0 12:52 ? 00:00:00 [audit_send_repl]
root 27797 2 0 12:52 ? 00:00:00 [audit_send_repl]
root 27798 2 0 12:52 ? 00:00:00 [audit_send_repl]
root 27799 2 0 12:52 ? 00:00:00 [audit_send_repl]
root 27800 2 0 12:52 ? 00:00:00 [audit_send_repl]
root 27801 2 0 12:52 ? 00:00:00 [audit_send_repl]
root 27802 2 0 12:52 ? 00:00:00 [audit_send_repl]
root 27803 2 0 12:52 ? 00:00:00 [audit_send_repl]
root 27804 2 0 12:52 ? 00:00:00 [audit_send_repl]
root 27805 2 0 12:52 ? 00:00:00 [audit_send_repl]
root 27806 2 0 12:52 ? 00:00:00 [audit_send_repl]
root 27807 2 0 12:52 ? 00:00:00 [audit_send_repl]
root 27808 2 0 12:52 ? 00:00:00 [audit_send_repl]
[...]
```

This patch updates the error-handling logic to create a new audit client when a status update fails, allowing the routine to recover and preventing the proliferation of `audit_send_repl` kernel threads.
Pinging @elastic/security-external-integrations (Team:Security-External Integrations)
    }
    client, err = libaudit.NewAuditClient(nil)
    if err != nil {
        ms.log.Errorw("Failure creating audit monitoring client", "error", err)
If this one fails then I think `client == nil`, which will cause problems on subsequent iterations.
D'oh! Good catch
…22673) (cherry picked from commit ca9550f)
…22725) (cherry picked from commit ca9550f)
…22724) (cherry picked from commit ca9550f)
The kernel-thread buildup was reproduced in 2.6.32; no other versions were tested. The "unexpected sequence number for reply" error appears every 15s, and `ps -ef` shows a lot of `audit_send_repl` threads.
Checklist
An entry in `CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`.
How to test this PR locally
It's easy to reproduce this issue by modifying the code at `beats/auditbeat/module/auditd/audit_linux.go` (lines 159 to 182 in bb973c4) to call `client.GetStatusAsync(false)` outside of the polling loop. A similar change can be used to validate this fix, ideally sending an async getstatus every few iterations of the loop.
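To make the test idea concrete, here is a toy model of the desynchronization. It does not use netlink or go-libaudit; the `client` type, `sendAsync`, `getStatus`, and `run` are all hypothetical. It assumes each client has its own ordered reply stream and that every request produces exactly one reply carrying the request's sequence number, so a reply that nobody reads shifts every later read by one.

```go
package main

import "fmt"

// client is a toy audit client with its own reply stream. Each request is
// numbered, and each reply echoes the number of the request it answers.
type client struct {
	seq     uint32
	pending []uint32 // replies queued for this client, in arrival order
}

// sendAsync sends a request and returns without reading its reply.
func (c *client) sendAsync() {
	c.seq++
	c.pending = append(c.pending, c.seq)
}

// getStatus sends a request and reads the next queued reply, failing when
// that reply does not carry the sequence number that was just sent.
func (c *client) getStatus() error {
	c.sendAsync()
	got := c.pending[0]
	c.pending = c.pending[1:]
	if got != c.seq {
		return fmt.Errorf("unexpected sequence number for reply (expected %d but got %d)", c.seq, got)
	}
	return nil
}

// run polls for the given number of iterations, injecting a stray async
// request every injectEvery iterations (0 disables injection). When recreate
// is true, a failed poll replaces the client with a fresh one. It returns
// the number of failed polls.
func run(iterations, injectEvery int, recreate bool) int {
	c := &client{}
	failures := 0
	for i := 1; i <= iterations; i++ {
		if injectEvery > 0 && i%injectEvery == 0 {
			c.sendAsync() // stray request whose reply is never read
		}
		if err := c.getStatus(); err != nil {
			failures++
			if recreate {
				c = &client{} // fresh client, fresh reply stream
			}
		}
	}
	return failures
}

func main() {
	fmt.Println("failures without recovery:", run(10, 5, false)) // 6
	fmt.Println("failures with recovery:", run(10, 5, true))     // 2
}
```

In this model, a single stray request makes every later poll fail when the client is kept, while recreating the client limits the damage to one failed poll per injection. That is the behaviour the test described above would be expected to show, under the stated assumptions.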