-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
oidcccl: fix SSO login fails if IdP is down when CRDB starts #119013
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Testing the context cancellation issue seems difficult and not worthwhile, but we should be able to add a test in authentication_oidc_test.go that mocks NewOIDCManager to return an error on the first call then succeed on the second.
Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @souravcrl)
-- commits
line 2 at r1:
We start our commit messages with a directory/package name, so this should be oidccl:
instead of sso:
pkg/ccl/oidcccl/authentication_oidc.go
line 718 at r1 (raw file):
} if oidcAuthentication.manager == nil {
There's an initialized
boolean; we should be able to use that instead of checking the manager directly. I'd put this in the block that calls reloadConfigLocked above:
if `enabled && !initialized {
reloadConfig()
if !initialized {
http.Error("auth manager could not be initialized")
return
}
}
Or maybe modify reloadConfigLocked to return an error, but we don't want to leak our internal error to the (unauthenticated) user here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @souravcrl)
-- commits
line 7 at r1:
I would keep the commit message talking about the code and the scenarios under which the bug manifests more directly. The discussion of wifi on/off is confusing here.
Instead of "X failed due to networking being unavailable when I turned off wifi" prefer language like "When the network is unavailable, X fails". This makes it easier to understand the scenario you are trying to fix.
-- commits
line 13 at r1:
It sounds like the context fix is for problem #2 and the initialized/manager check is for problem #1, can you clarify this?
nit: obejct
-> object
We also wrap commit messages at 100 chars or fewer.
pkg/ccl/oidcccl/authentication_oidc.go
line 237 at r1 (raw file):
// We need to provide a context which cannot be cancelled because of a specific implementation which // prohibits context to be cancelled if we are to reuse the provider object // https://github.com/coreos/go-oidc/issues/339
It sounds like the latest version of the library has fixed this problem since that issue is resolved. We should consider upgrading perhaps.
Can you add a // TODO
comment here that upgrading can resolve this. I understand you might not want to do that in this particular PR.
c6150b5
to
2ef5585
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the test case TestOIDCManagerInitialisationNetworkUnavailable
to check oidcManager initialisation for both available and unavailable provider URL.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @bdarnell, and @dhartunian)
Previously, bdarnell (Ben Darnell) wrote…
We start our commit messages with a directory/package name, so this should be
oidccl:
instead ofsso:
updated the commit title to reflect correct package name
Previously, dhartunian (David Hartunian) wrote…
I would keep the commit message talking about the code and the scenarios under which the bug manifests more directly. The discussion of wifi on/off is confusing here.
Instead of "X failed due to networking being unavailable when I turned off wifi" prefer language like "When the network is unavailable, X fails". This makes it easier to understand the scenario you are trying to fix.
Updated the commit message to remove unnecessary information and provide more concise information.
Previously, dhartunian (David Hartunian) wrote…
It sounds like the context fix is for problem #2 and the initialized/manager check is for problem #1, can you clarify this?
nit:
obejct
->object
We also wrap commit messages at 100 chars or fewer.
Remove the message and moved the context to the problem statement itself. problem #1 fix in the same paragraph and problem #2 fix with the problem statement in same paragraph.
pkg/ccl/oidcccl/authentication_oidc.go
line 237 at r1 (raw file):
Previously, dhartunian (David Hartunian) wrote…
It sounds like the latest version of the library has fixed this problem since that issue is resolved. We should consider upgrading perhaps.
Can you add a
// TODO
comment here that upgrading can resolve this. I understand you might not want to do that in this particular PR.
Added a TODO comment with the possible fix that will be needed going forward.
pkg/ccl/oidcccl/authentication_oidc.go
line 718 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
There's an
initialized
boolean; we should be able to use that instead of checking the manager directly. I'd put this in the block that calls reloadConfigLocked above:if `enabled && !initialized { reloadConfig() if !initialized { http.Error("auth manager could not be initialized") return } }
Or maybe modify reloadConfigLocked to return an error, but we don't want to leak our internal error to the (unauthenticated) user here.
Moved the check to conditional block after reloadConfig()
.
2ef5585
to
b48dd21
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed it to TestOIDCManagerInitialisationUnderNetworkAvailability
for better naming.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @bdarnell, and @dhartunian)
018f4d6
to
8c5e018
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @bdarnell, and @souravcrl)
-- commits
line 5 at r3:
You don't need to be so specific detailing the lines on which changes are made. You can assume that your reader is a developer who can figure that out. What's helpful is to provide a summary of your intent in the commit message. The reason for that is that by documenting your intent and pairing it with your code diff, you give the reviewer a chance to understand whether your change accomplishes your intent appropriately.
There are a few things you're fixing here so it's good to lay those problems out directly and how you intend to fix them:
- OIDC attempts to re-initialize at the start of the request handler, this can fail due to the cancelled context bug in the OIDC library
- OIDC re-initialization assumes that the operation will succeed and continues serving the request even though initialization might have failed
-- commits
line 10 at r3:
nit: accounts.google.com
should be omitted here since it's not relevant to the failure. I don't think specifying a URL is necessary.
-- commits
line 12 at r3:
nit: we use American spelling with a z
.
pkg/ccl/oidcccl/authentication_oidc_test.go
line 499 at r3 (raw file):
if !tc.wantError { require.NoError(t, err)
this check is not necessary because you've already checked err != nil
above.
pkg/ccl/oidcccl/authentication_oidc_test.go
line 503 at r3 (raw file):
if resp.StatusCode != 302 { t.Fatalf("expected 302 status code but got: %d, responseBody: %s", resp.StatusCode, string(bodyBytes)) }
if you're using require
above, use require.Equal
for the resp.Status
code check.
pkg/ccl/oidcccl/authentication_oidc_test.go
line 509 at r3 (raw file):
if resp.StatusCode != 500 { t.Fatalf("expected 500 status code but got: %d", resp.StatusCode) }
same comment as above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 2 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @dhartunian, and @souravcrl)
pkg/ccl/oidcccl/authentication_oidc_test.go
line 449 at r3 (raw file):
realNewManager := NewOIDCManager NewOIDCManager = func(ctx context.Context, conf oidcAuthenticationConf, redirectURL string, scopes []string) (IOIDCManager, error) { if conf.providerURL == unavailableProviderURL {
This isn't what I had in mind - here you're testing two different OIDCManagers, each one just used once. I was thinking of having a single OIDCManager (with a single URL) that just used a counter to return an error the first time and succeed the second. Don't tear down all the TestServer stuff in between.
20ca53b
to
2d24ef3
Compare
Just for my clarification, can the value of |
d2ea21c
to
38e6d47
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @BabuSrithar, @bdarnell, and @dhartunian)
Previously, dhartunian (David Hartunian) wrote…
You don't need to be so specific detailing the lines on which changes are made. You can assume that your reader is a developer who can figure that out. What's helpful is to provide a summary of your intent in the commit message. The reason for that is that by documenting your intent and pairing it with your code diff, you give the reviewer a chance to understand whether your change accomplishes your intent appropriately.
There are a few things you're fixing here so it's good to lay those problems out directly and how you intend to fix them:
- OIDC attempts to re-initialize at the start of the request handler, this can fail due to the cancelled context bug in the OIDC library
- OIDC re-initialization assumes that the operation will succeed and continues serving the request even though initialization might have failed
fixed the commit message to elaborate on the problem description and solution.
Previously, dhartunian (David Hartunian) wrote…
nit:
accounts.google.com
should be omitted here since it's not relevant to the failure. I don't think specifying a URL is necessary.
updated.
Previously, dhartunian (David Hartunian) wrote…
nit: we use American spelling with a
z
.
updated.
pkg/ccl/oidcccl/authentication_oidc.go
line 713 at r4 (raw file):
Previously, BabuSrithar (BabuSrithar) wrote…
the #2 and #3 later in the file seems to be user visible. If yes, how are the users expected to consume the numbers ?
I was running into issues with stress and stress_race CI pipelines in github team viewer. I was testing by adding debug points.
pkg/ccl/oidcccl/authentication_oidc.go
line 725 at r4 (raw file):
Previously, BabuSrithar (BabuSrithar) wrote…
Just for my clarification, can the value of
oidcAuthentication.enabled
between line 712 and here ? Where we are already checking for same condition ? If yes, can we add a comment to explain it. Just looking at the code it seems like unreachable block to me,
We are acquiring lock on oidcAuthentication, so should not be possible. I was looking for potential flaws and fixing data race in stress_race CI pipeline.
pkg/ccl/oidcccl/authentication_oidc_test.go
line 449 at r3 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
This isn't what I had in mind - here you're testing two different OIDCManagers, each one just used once. I was thinking of having a single OIDCManager (with a single URL) that just used a counter to return an error the first time and succeed the second. Don't tear down all the TestServer stuff in between.
I updated the mock function to evaluate a boolean variable named networkAvailable
, so as to make it less ambiguous. I ran into race condition problems with counter variable when I was setting cluster settings using sqlDB.Exec
This was because we call reloadConfig
every time the value of cluster setting changes(handled using SetOnChange
). Hence we can't be sure number of calls made to NewOIDCManager()
. In current implementation this isn't a problem since I am using clusterSetting.Override()
which directly sets the value being accessed and SetOnChange()
is not triggered.
There is another problem which I encountered with sqlDB.Exec
, where I set the value but later while accessing it, the oidcAuthServer.enabled
value was found to be not set. oidcAuthServer.enabled
gets set using a hook(settingChanged
) on the in-RAM cluster setting OIDCEnabled
which was not getting set via sqlDB.Exec
in stress conditions. I tried doing OIDCEnabled.Overrider()
which solved the issue.
pkg/ccl/oidcccl/authentication_oidc_test.go
line 499 at r3 (raw file):
Previously, dhartunian (David Hartunian) wrote…
this check is not necessary because you've already checked
err != nil
above.
removed.
pkg/ccl/oidcccl/authentication_oidc_test.go
line 503 at r3 (raw file):
Previously, dhartunian (David Hartunian) wrote…
if you're using
require
above, userequire.Equal
for theresp.Status
code check.
updated.
pkg/ccl/oidcccl/authentication_oidc_test.go
line 509 at r3 (raw file):
Previously, dhartunian (David Hartunian) wrote…
same comment as above.
used require
here also.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @BabuSrithar, @bdarnell, and @souravcrl)
pkg/ccl/oidcccl/authentication_oidc.go
line 725 at r4 (raw file):
Previously, souravcrl wrote…
We are acquiring lock on oidcAuthentication, so should not be possible. I was looking for potential flaws and fixing data race in stress_race CI pipeline.
@BabuSrithar the two checks are not the same and you can reach the second one if the first one is not true. you could argue that this one should come first for readability though and to exit early if OIDC isn't enabled at all.
pkg/ccl/oidcccl/authentication_oidc.go
line 714 at r6 (raw file):
if oidcAuthentication.enabled && !oidcAuthentication.initialized { reloadConfigLocked(ctx, oidcAuthentication, locality, st) if !oidcAuthentication.initialized {
there are two other handlers in this file (callback, and jwt) which have similar conditional logic to attempt to initialize. can you amend them similarly?
pkg/ccl/oidcccl/authentication_oidc_test.go
line 450 at r6 (raw file):
defer func() { restoreHook() }()
You can just use defer restoreHook()
or even simpler: defer testutils.TestingHook(...)()
. All the other usage in our codebase uses the second example. (I know you're probably following the example in authentication_jwt_test.go
, but better to match the rest of the codebase.)
pkg/ccl/oidcccl/authentication_oidc_test.go
line 478 at r6 (raw file):
}, } { networkAvailable = tc.network
You should move this line into the t.Run
func just to keep all the code that uses tc
together.
Previously, dhartunian (David Hartunian) wrote…
The code is changed now, at the time of my comment, the line 712 had the exact same condition check and return statement. I agree that this condition should go first for readability. |
38e6d47
to
b3c3881
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @BabuSrithar, @bdarnell, and @dhartunian)
pkg/ccl/oidcccl/authentication_oidc.go
line 714 at r6 (raw file):
Previously, dhartunian (David Hartunian) wrote…
there are two other handlers in this file (callback, and jwt) which have similar conditional logic to attempt to initialize. can you amend them similarly?
I have added the check post reloadConfigLocked() to the other two handlers. One problem I see is reloadConfigLocked() is called from SetOnChange
for the cluster settings in-mem variables. If reloadConfigLocked
fails inside SetOnChange
, then oidcAuthentication
is never initialized and hence we won't be calling reloadConfigLocked() here. Is this valid scenario? Please let me know.
pkg/ccl/oidcccl/authentication_oidc_test.go
line 450 at r6 (raw file):
Previously, dhartunian (David Hartunian) wrote…
You can just use
defer restoreHook()
or even simpler:defer testutils.TestingHook(...)()
. All the other usage in our codebase uses the second example. (I know you're probably following the example inauthentication_jwt_test.go
, but better to match the rest of the codebase.)
I have updated it now.
pkg/ccl/oidcccl/authentication_oidc_test.go
line 478 at r6 (raw file):
Previously, dhartunian (David Hartunian) wrote…
You should move this line into the
t.Run
func just to keep all the code that usestc
together.
I have moved it to t.Run
, so it gets executed in single go routine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @BabuSrithar, @bdarnell, and @dhartunian)
pkg/ccl/oidcccl/authentication_oidc.go
line 714 at r6 (raw file):
Previously, souravcrl wrote…
I have added the check post reloadConfigLocked() to the other two handlers. One problem I see is reloadConfigLocked() is called from
SetOnChange
for the cluster settings in-mem variables. IfreloadConfigLocked
fails insideSetOnChange
, thenoidcAuthentication
is never initialized and hence we won't be calling reloadConfigLocked() here. Is this valid scenario? Please let me know.
Hey I see we are checking for !initialized and this won't be the case. Nvm please ignore the above message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 3 files at r5, 1 of 2 files at r7, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @BabuSrithar, @dhartunian, and @souravcrl)
pkg/ccl/oidcccl/authentication_oidc.go
line 507 at r7 (raw file):
defer oidcAuthentication.mutex.Unlock() if oidcAuthentication.enabled && !oidcAuthentication.initialized {
This seven-line block is repeated so it should probably be replaced with a function like
if err := oidcAuthentication.maybeInitialize(ctx, locality, st); err != nil {
http.Error(...)
return
}
e51178b
to
95b0a05
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @abarganier, @BabuSrithar, @bdarnell, and @souravcrl)
-- commits
line 22 at r7:
nit: please wrap commit at 80 characters.
pkg/ccl/oidcccl/authentication_oidc_test.go
line 495 at r8 (raw file):
if !tc.wantError { log.Infof(ctx, "response body %v", string(bodyBytes))
nit: you can remove this from the test. if you want the body printed in the case of error, add it to the require
call.
We are encountering errors both when IdP is down and when it comes back up: 1. When IdP is down(network is unavailable): - OIDC attempts to re-initialize at the start of the request handler, this can fail as it is unable to fetch openid-configuration required for initializing the provider for oidc manager - OIDC re-initialization assumes that the operation will succeed and continues serving the request even though initialization might have failed. Hence we need to add check to verify if `oidcAuthentication.initialized` is set to true post successful call to `reloadConfigLocked`. 2. After IdP comes back up (network becomes available): - OIDC is initialized successfully and there is a call to oidc callback request handler. - OIDC callback request handler is failing when we are trying to `ExchangeVerifyGetClaims`. This is because `oidcManager.Verify()` is using the older context used when creating the verifier in oidc manager initialization. - The previous context got cancelled and there is an error thrown in `Verify()` code for cancelled context. The fix is to pass a non-cancellable context to the oidcManager instance. Release note: None Epic: CRDB-35370
95b0a05
to
44d7da5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @abarganier, @BabuSrithar, @bdarnell, and @dhartunian)
Previously, dhartunian (David Hartunian) wrote…
nit: please wrap commit at 80 characters.
I have wrapped to 80 characters.
pkg/ccl/oidcccl/authentication_oidc.go
line 507 at r7 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
This seven-line block is repeated so it should probably be replaced with a function like
if err := oidcAuthentication.maybeInitialize(ctx, locality, st); err != nil { http.Error(...) return }
I have replaced it with maybeInitializeLocked
since we are acquiring lock previously and need to hold it till request handler completes execution.
pkg/ccl/oidcccl/authentication_oidc_test.go
line 495 at r8 (raw file):
Previously, dhartunian (David Hartunian) wrote…
nit: you can remove this from the test. if you want the body printed in the case of error, add it to the
require
call.
I have removed the log stmt from here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r8.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @abarganier, @BabuSrithar, @dhartunian, and @souravcrl)
TFTR! bors r=bdarnell,dhartunian |
Build succeeded: |
blathers backport 23.2 |
oidcccl: fix SSO login fails if IdP is down when CRDB starts
We are encountering errors both when IdP is down and when it comes back up:
fail as it is unable to fetch openid-configuration required for initializing
the provider for oidc manager
serving the request even though initialization might have failed. Hence we need
to add check to verify if
oidcAuthentication.initialized
is set to true postsuccessful call to
reloadConfigLocked
.handler.
ExchangeVerifyGetClaims
. This is becauseoidcManager.Verify()
is using theolder context used when creating the verifier in oidc manager initialization.
Verify()
code for cancelled context. The fix is to pass a non-cancellable context to the
oidcManager instance.
Release note: None
Epic: CRDB-35370