-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: fix race condition on audit logging CCL hook #105227
sql: fix race condition on audit logging CCL hook #105227
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 5 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @THardy98)
-- commits
line 4 at r1:
Don't forget to make the PR description reflect the release note.
-- commits
line 8 at r1:
Something is off with your URLs and issue/epic references here. They seem totally unrelated.
-- commits
line 11 at r1:
Something very wrong happened with these release notes.
pkg/server/server_sql.go
line 1385 at r1 (raw file):
initialized, _ := auditlogging.ReadEnterpriseParamsHook() if !initialized { auditlogging.WriteEnterpriseParamsHook(ctx, base.CheckEnterpriseEnabled(execCfg.Settings, cfg.ClusterIDContainer.Get(), "role-based auditing"))
This way of capturing the settings and the cluster ID container is very "not OK".
The only dependency you can preserve in a global variable is a reference to the function base.CheckEnterpriseEnabled
. Everything else needs to be passed as argument to validateAuditLogConfig
.
b42dfaa
to
a809c2f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz)
Previously, knz (Raphael 'kena' Poss) wrote…
Don't forget to make the PR description reflect the release note.
Done.
Previously, knz (Raphael 'kena' Poss) wrote…
Something is off with your URLs and issue/epic references here. They seem totally unrelated.
Fixed.
Previously, knz (Raphael 'kena' Poss) wrote…
Something very wrong happened with these release notes.
Fixed.
pkg/server/server_sql.go
line 1385 at r1 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
This way of capturing the settings and the cluster ID container is very "not OK".
The only dependency you can preserve in a global variable is a reference to the function
base.CheckEnterpriseEnabled
. Everything else needs to be passed as argument tovalidateAuditLogConfig
.
It's difficult to pass arguments to validateAuditLogConfig
as the cluster setting validation function has an expected signature of: (st *settings.Values, input string) error
, and doesn't have the necessary arguments to call CheckEnterpriseEnabled
. I'm not sure how to pass server state into a cluster setting's validation function aside from a hook like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @THardy98)
pkg/server/server_sql.go
line 1385 at r1 (raw file):
Previously, THardy98 (Thomas Hardy) wrote…
It's difficult to pass arguments to
validateAuditLogConfig
as the cluster setting validation function has an expected signature of:(st *settings.Values, input string) error
, and doesn't have the necessary arguments to callCheckEnterpriseEnabled
. I'm not sure how to pass server state into a cluster setting's validation function aside from a hook like this.
This should teach you that validateAuditLogConfig
is not the right place to do an enterprise check.
Generally, we do not do license checks on cluster settings. This is undesirable anyway; what would it mean if the value is set properly, and then the enterprise license expires? We want the feature to be disabled in that case.
The proper approach wrt license check is to do it in the SQL session. Either once when the session is opened or when we actually need to do audit logging, when/if the log config is non-empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz)
pkg/server/server_sql.go
line 1385 at r1 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
This should teach you that
validateAuditLogConfig
is not the right place to do an enterprise check.Generally, we do not do license checks on cluster settings. This is undesirable anyway; what would it mean if the value is set properly, and then the enterprise license expires? We want the feature to be disabled in that case.
The proper approach wrt license check is to do it in the SQL session. Either once when the session is opened or when we actually need to do audit logging, when/if the log config is non-empty.
I had thought about doing the enterprise check elsewhere, but doesn't it seem odd to allow users to change the cluster setting value even if they can't use its underlying feature?
We have precedent: using kerberos authentication in the HBA config. |
a809c2f
to
da1afd6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this is better. LGTM with nit.
Reviewed 7 of 7 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @THardy98)
pkg/sql/audit_logging.go
line 122 at r4 (raw file):
// Do not emit audit events for reserved users/roles. This does not omit the root user. // Do not emit audit events for internal planners. return auditlogging.UserAuditEnabled(p.execCfg.Settings, p.EvalContext().ClusterID) || p.User().IsReserved() || p.isInternalPlanner
nit: consider checking whether the audit log config is empty first and take a shortcut. The enterprise license check is comparatively more expensive.
da1afd6
to
d6e0211
Compare
d6e0211
to
329c7c1
Compare
I'm having a lot of trouble trying to determine the cause of #105174. It seems difficult to reproduce, running It often produces what looks like the test hanging: even when running for over 30mins. I've tried running the above on commits prior to the introduction of any audit logging logic (I'm not certain if the CCL commit tagged in the issue is the cause of the deadlock, I wanted to As such, this PR will only address the race condition bug. I'll mention this on the deadlock issue as well. |
Fixes cockroachdb#104660. This change fixes a race condition that can occur when multiple SQLServers are created simulatenously, causing simultaneous write to an unprotected global variable used to configure a CCL audit logging feature. This change removes the global variable and instead invokes the hook to check enterprise status at log time. Release note (bug fix): This change fixes a race condition that can occur when multiple SQLServers are created simulatenously, causing simultaneous writes to an unprotected global variable used to configure a CCL audit logging feature.
329c7c1
to
a5bbd98
Compare
It's mildly possible that the race condition (the issue you're fixing here) is a different issue from the deadlock. So I would be OK merging this (and closing the race condition issue) while keeping the deadlock issue investigation separate. |
Coming across another data race: Stack trace:
This one is less clear and I don't have a clear idea of how to tackle this. Would love to pair on this with someone who has a hunch of what's going on and what's causing this. |
This data race is not related to your work. O believe we have an issue for it already, and it's for the kv team. Try looking for that stack trace in the issue tracker. |
Ah, here it is: #104837 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 5 files at r6, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @knz)
pkg/sql/audit_logging.go
line 122 at r4 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
nit: consider checking whether the audit log config is empty first and take a shortcut. The enterprise license check is comparatively more expensive.
Doesn't it do that shortcut in the method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @j82w)
pkg/sql/audit_logging.go
line 122 at r4 (raw file):
Previously, j82w (Jake) wrote…
Doesn't it do that shortcut in the method?
The first time around I thought the call to shouldNotRoleBasedAudit
came first. If that's not the case, please ignore my comment.
It did :) I made the change. |
@knz are there still blocking changes? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks
Wanted to post a benchmark with these changes before merging.
Negligible loss (~2%) compared to results in #101658. Could be accounted for by different runs not using the same data initialization. This is only experienced by CCL users. |
bors r+ |
Build succeeded: |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from a5bbd98 to blathers/backport-release-23.1-105227: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 23.1.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained
pkg/sql/audit_logging.go
line 122 at r4 (raw file):
Previously, THardy98 (Thomas Hardy) wrote…
It did :) I made the change.
Just wanted to callout that p.isInternalPlanner
- confusingly - is likely to not be set to true
when the query is run via the InternalExecutor, so there may a bug in here. It is similar to a9b83b5 in nature, and I'll look into fixing this isInternalPlanner
on master (but not might be backportable).
Fixes: #104660, #104838
This change fixes a race condition that can occur when multiple
SQLServers are created simultaneously, causing simultaneous writes to an
unprotected global variable used to configure a CCL audit logging
feature.
This change removes the global variable and instead invokes the hook to
check enterprise status at log time.
Release note (bug fix): This change fixes a race condition that
can occur when multiple SQLServers are created simultaneously, causing
simultaneous writes to an unprotected global variable used to configure
a CCL audit logging feature.