sqlliveness: bug fixes and improvements #90586
Conversation
Force-pushed from 0d28995 to bea29b8
First two commits are #90777.
Force-pushed from 9ac0ec2 to 95f2dde
#90777 is now merged. All commits in this PR are original.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi and @ajwerner)
Reviewed 22 of 22 files at r11, 18 of 18 files at r12, 17 of 17 files at r13, 18 of 18 files at r14, 17 of 17 files at r15, 17 of 17 files at r16, 5 of 5 files at r17, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @ajwerner, and @andreimatei)
-- commits
line 45 at r16:
I have no objection to the change, but I don't understand the rationale. How come the amount of state protected makes a difference in the choice between a regular mutex vs an RWMutex?
When I was at school, I was taught that the tradeoff is only a matter of the number of readers vs the number of writers: when we expect more readers, an RWMutex is better.
What of the last field that remains under the mutex? Is it truly written and read about equally?
pkg/sql/sqlliveness/slinstance/slinstance.go
line 19 at r17 (raw file):
import (
	"context"
	"errors"
cockroachdb/errors
pkg/sql/sqlliveness/slinstance/slinstance.go
line 127 at r17 (raw file):
testKnobs sqlliveness.TestingKnobs
mu struct {
	syncutil.Mutex
maybe also a q of normal vs rw here.
pkg/sql/sqlliveness/slinstance/slinstance.go
line 292 at r17 (raw file):
// Keep track of the fact that this Instance is not usable anymore. Further
// Session() calls will return errors.
l.mu.Lock()
defer l.mu.Unlock
pkg/sql/sqlliveness/slinstance/slinstance_test.go
line 93 at r17 (raw file):
// and causes further Session() calls to fail.
stopper.Stop(ctx)
testutils.SucceedsSoon(t, func() error {
How come you need SucceedsSoon here? The heartbeat loop is tied to the stopper via RunAsync, so the goroutine is guaranteed to have completed after Stop returns?
Force-pushed from 95f2dde to c5c8090
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @ajwerner, and @knz)
Previously, knz (Raphael 'kena' Poss) wrote…
I have no objection to the change, but I don't understand the rationale. How come the amount of state protected makes a difference in the choice between regular vs rw?
When I was at school, i was taught that the tradeoff is only a matter of the number of readers vs the number of writers. When we expect more readers, rwmutex is better.
What of the last field that remains under the mutex? Is it truly mostly equally written and read from?
The Go RWMutex is not a scalable implementation: all readers synchronize with each other to enter/exit the critical section, unfortunately (though once a reader is inside, another one can also enter, of course). In fact, the Mutex implementation scales better for small critical sections than RLock/RUnlock until 8 cores, where the two become pretty similar. RWMutex's Lock/Unlock is both slower in absolute terms and scales worse than the other two.
So it's not about the "amount of state"; it's about the duration of the reader critical sections. When all critical sections are small (which is the case here, and is also commonly the case), the RWMutex is worse. There are special situations where you might want an RWMutex even for fairly small critical sections: when one of those sections, even though small, is prone to preemption and you don't want that to affect the tail latency of other readers. But those are rare.
More importantly, seeing an RWMutex sends a message to the reader that there are particular considerations around that code. In the case we're discussing here, the use of the RWMutex never really made sense; the commit message is being polite. It did used to be the case, however, that the mutex was held while running a list of callbacks, so from the perspective of this struct there was some theoretical justification for it.
pkg/sql/sqlliveness/slinstance/slinstance.go
line 19 at r17 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
cockroachdb/errors
done
pkg/sql/sqlliveness/slinstance/slinstance.go
line 127 at r17 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
maybe also a q of normal vs rw here.
I don't think it's a good idea, for the same reasons. Switching to RWMutex would not be a good idea for the vast majority of mutexes in the codebase.
pkg/sql/sqlliveness/slinstance/slinstance.go
line 292 at r17 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
defer l.mu.Unlock
done
pkg/sql/sqlliveness/slinstance/slinstance_test.go
line 93 at r17 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
How come you need SucceedsSoon here? The heartbeat loop is tied to the stopper via RunAsync, so the goroutine is guaranteed to have completed after Stop returns?
you're right, done
Reviewed 2 of 18 files at r12, 1 of 5 files at r17, 2 of 2 files at r19, 5 of 5 files at r24, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi, @andreimatei, and @knz)
pkg/server/server_sql.go
line 425 at r17 (raw file):
// OnSessionDeleted implements the slinstance.SessionEventListener interface.
func (s *stopperSessionEventListener) OnSessionDeleted(ctx context.Context) bool {
nit: name the return param for readability.
pkg/sql/sqlliveness/slinstance/slinstance.go
line 20 at r24 (raw file):
"context"
"fmt"
"github.com/cockroachdb/errors"
I think this needs a gofmt.
pkg/sql/sqlliveness/slinstance/slinstance.go
line 186 at r24 (raw file):
}
// When the session is set, blockCh should not be set.
if l.mu.blockCh != nil {
nit: panic errors and not strings
Code quote:
panic("expected session to be set")
}
// When the session is set, blockCh should not be set.
if l.mu.blockCh != nil {
panic("unexpected blockCh")
}
pkg/sql/sqlliveness/slinstance/slinstance.go
line 269 at r24 (raw file):
if err != nil {
	if ctx.Err() == nil {
		panic(fmt.Sprintf("expected canceled ctx on err: %s", err))
Can you panic an error rather than a string? I think if you panic a string, it'll get redacted when it goes to sentry.
return errors.NewWithDepthf(depth+1, "panic: %v", r)
pkg/sql/sqlliveness/slinstance/slinstance.go
line 285 at r24 (raw file):
err := l.heartbeatLoopInner(ctx)
if err == nil {
	panic("expected heartbeat to always terminate with an error")
Same nit about panicking an error
This field is accessed under the mutex, as it needs to be, but it was declared outside the `mu` struct by mistake. Release note: None
They were referring to an interface that doesn't exist any more. Release note: None
Release note: None
Before this patch, the async heartbeat loop was respecting the cancelation of the context passed to Instance.Start(). That's a bad idea, since the async task should be detached from the caller. Release note: None
This field was used to protect against doubly-starting sqlliveness.Instance, and against requesting a session before the Instance was started. Neither of these are real risks. This patch removes such protection and the field backing it. Release note: None
This mutex used to protect more stuff and arguably used to make more sense as a RWMutex, but now it protects a single field, so a regular mutex is clearly better. Release note: None
This patch improves the code of slinstance.Instance and fixes some bugs.
Before this patch, the background sqlliveness heartbeat loop was very inconsistent about what events it signaled on exit. Some exit paths would raise an event to the event consumer, some wouldn't. The ones that didn't are very dubious, because future calls to Session() might block forever. Even for the code paths that did call into clearSession(), which is supposed to raise an event, the event would only be produced if the current session was already expired. Tying the event to the session expiration didn't make sense because, if the session is not currently expired when the heartbeat loop exits, there will be no further opportunity to signal an event. This patch makes the raising of events clear and consistent: whenever the heartbeat loop exits, it arranges for future Session() calls to error out, and it signals the event listener if there currently was a session.
This patch also improves the shutdown process triggered in tenant servers when a session is found to have been deleted. This shutdown works through the event raised by the Instance whenever a session is deleted, giving the sqlServer the opportunity to initiate a shutdown. Before this patch, the event handler would trigger a shutdown asynchronously. While the shutdown trigger propagated, the sqlliveness Instance could create new sessions. I don't think there were clear bad consequences of this but, at the very least, creating new sessions in a tenant server is confusing because the server is supposed to be bound to a single session throughout its lifetime; that's why the server shuts down when it finds out that its session has been deleted. This patch makes the heartbeat loop, which generally creates new sessions, shut down synchronously when the event handler tells it to do so.
Release note: None
Force-pushed from c5c8090 to 09b04f6
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi, @ajwerner, and @knz)
pkg/server/server_sql.go
line 425 at r17 (raw file):
Previously, ajwerner wrote…
nit: name the return param for readability.
done
pkg/sql/sqlliveness/slinstance/slinstance.go
line 20 at r24 (raw file):
Previously, ajwerner wrote…
I think this needs a gofmt.
done
pkg/sql/sqlliveness/slinstance/slinstance.go
line 186 at r24 (raw file):
Previously, ajwerner wrote…
nit: panic errors and not strings
switched to log.Fatal(), which hopefully achieves the same
pkg/sql/sqlliveness/slinstance/slinstance.go
line 269 at r24 (raw file):
Previously, ajwerner wrote…
Can you panic an error rather than a string? I think if you panic a string, it'll get redacted when it goes to sentry.
return errors.NewWithDepthf(depth+1, "panic: %v", r)
switched to log.Fatal()
pkg/sql/sqlliveness/slinstance/slinstance.go
line 285 at r24 (raw file):
Previously, ajwerner wrote…
Same nit about panicking an error
Done.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aadityasondhi, @ajwerner, and @andreimatei)
pkg/sql/sqlliveness/slinstance/slinstance.go
line 186 at r24 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
switched to
log.Fatal()
, which hopefully achieves the same
panic+error is objectively better FYI: it gives us a chance to recover in a test suite.
I don't mind about it in this case (given how unlikely it is), but I've noticed you're biased towards log.Fatal in other work, so I want to remind you to over-compensate on this when you can.
pkg/sql/sqlliveness/slinstance/slinstance.go
line 269 at r24 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
switched to log.Fatal()
ditto
I mean it gives a chance to the test runner to still run the tests after it. log.Fatal will stupidly stop the entire test process and doesn't give a chance to the other tests to run.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aadityasondhi, @ajwerner, and @knz)
pkg/sql/sqlliveness/slinstance/slinstance.go
line 186 at r24 (raw file):
I mean it gives a chance to the test runner to still run the tests after it.
log.Fatal will stupidly stop the entire test process and doesn't give a chance to the other tests to run.
(Rafa, I think you sometimes comment from Github (or email?) and your comments don't make it to the right Reviewable thread.)
TIL that the bazel test runner seems to continue running tests after a panic; go test doesn't do that.
Anyway, I think log.Fatal() is fine for assertion failures. I don't think whether or not tests continue to run is an important consideration; I think ergonomics are more important. I usually prefer log.Fatal() because it uses the log tags, and because it does the string formatting. With panics, you frequently have to combine them with fmt.Sprintf() (or better, with errors.Newf(), I guess). I don't think we should discourage people away from log.Fatal().
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aadityasondhi, @ajwerner, and @andreimatei)
pkg/sql/sqlliveness/slinstance/slinstance.go
line 186 at r24 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I mean it gives a chance to the test runner to still run the tests after it.
log.Fatal will stupidly stop the entire test process and doesn't give a chance to the other tests to run.
(Rafa, I think you sometimes comment from Github (or email?) and your comments don't make it to the right Reviewable thread.)
TIL that the bazel test runner seems to continue running tests after a panic; go test doesn't do that.
Anyway, I think log.Fatal() is fine for assertion failures. I don't think whether or not tests continue to run is an important consideration; I think ergonomics are more important. I usually prefer log.Fatal() because it uses the log tags, and because it does the string formatting. With panics, you frequently have to combine them with fmt.Sprintf() (or better, with errors.Newf(), I guess). I don't think we should discourage people away from log.Fatal().
Yeah I was treating the bias, not suggesting moving away from log.Fatal entirely. We should have better guidance.
(I don't consider myself to be a stakeholder here - don't wait for my review to merge this)
TFTRs! bors r+
Build succeeded.
Please see individual commits.
Release note: None
Epic: None