
sqlliveness: bug fixes and improvements #90586

Merged
merged 7 commits into cockroachdb:master from sqlliveness.callback on Oct 31, 2022

Conversation

andreimatei
Contributor

@andreimatei andreimatei commented Oct 24, 2022

Please see individual commits.

Release note: None
Epic: None

@cockroach-teamcity
Member

This change is Reviewable

@andreimatei andreimatei force-pushed the sqlliveness.callback branch 4 times, most recently from 0d28995 to bea29b8 Compare October 27, 2022 17:27
@andreimatei andreimatei changed the title Sqlliveness.callback sqlliveness: bug fixes and improvements Oct 27, 2022
@andreimatei andreimatei marked this pull request as ready for review October 27, 2022 17:28
@andreimatei andreimatei requested a review from a team October 27, 2022 17:28
@andreimatei andreimatei requested review from a team as code owners October 27, 2022 17:28
@andreimatei
Contributor Author

First two commits are #90777.

@andreimatei andreimatei force-pushed the sqlliveness.callback branch 2 times, most recently from 9ac0ec2 to 95f2dde Compare October 27, 2022 21:47
Contributor Author

@andreimatei andreimatei left a comment


#90777 is now merged. All commits in this PR are original.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi and @ajwerner)

Contributor

@knz knz left a comment


Reviewed 22 of 22 files at r11, 18 of 18 files at r12, 17 of 17 files at r13, 18 of 18 files at r14, 17 of 17 files at r15, 17 of 17 files at r16, 5 of 5 files at r17, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @ajwerner, and @andreimatei)


-- commits line 45 at r16:
I have no objection to the change, but I don't understand the rationale. How come the amount of state protected makes a difference in the choice between regular vs rw?

When I was at school, I was taught that the tradeoff is only a matter of the number of readers vs the number of writers. When we expect more readers, rwmutex is better.

What of the last field that remains under the mutex? Is it truly mostly equally written and read from?


pkg/sql/sqlliveness/slinstance/slinstance.go line 19 at r17 (raw file):

import (
	"context"
	"errors"

cockroachdb/errors


pkg/sql/sqlliveness/slinstance/slinstance.go line 127 at r17 (raw file):

	testKnobs     sqlliveness.TestingKnobs
	mu            struct {
		syncutil.Mutex

maybe also a question of normal vs rw here.


pkg/sql/sqlliveness/slinstance/slinstance.go line 292 at r17 (raw file):

	// Keep track of the fact that this Instance is not usable anymore. Further
	// Session() calls will return errors.
	l.mu.Lock()

defer l.mu.Unlock()


pkg/sql/sqlliveness/slinstance/slinstance_test.go line 93 at r17 (raw file):

	// and causes further Session() calls to fail.
	stopper.Stop(ctx)
	testutils.SucceedsSoon(t, func() error {

How come you need SucceedsSoon here? The heartbeat loop is tied to the stopper via RunAsync, so the goroutine is guaranteed to have completed after Stop returns?

Contributor Author

@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @ajwerner, and @knz)


-- commits line 45 at r16:

Previously, knz (Raphael 'kena' Poss) wrote…

I have no objection to the change, but I don't understand the rationale. How come the amount of state protected makes a difference in the choice between regular vs rw?

When I was at school, I was taught that the tradeoff is only a matter of the number of readers vs the number of writers. When we expect more readers, rwmutex is better.

What of the last field that remains under the mutex? Is it truly mostly equally written and read from?

The Go RWMutex is not a scalable implementation - all readers synchronize with each other to enter/exit the critical section, unfortunately (but once a reader is inside, another one can also enter, of course). In fact, the Mutex implementation scales better for small critical sections than RLock/RUnlock up to 8 cores, where the two become pretty similar. RWMutex's Lock/Unlock is both slower in absolute terms and scales worse than the other two.
So, it's not about the "amount of state", it's about the duration of reader critical sections. When all critical sections are small (which is the case here, and is also commonly the case), the RWMutex is worse. There are special situations where you might want to use RWMutex even for fairly small critical sections - which is when one of those sections, even though it's small, is prone to preemption and you don't want that to affect the tail latency of other readers. But these are rare.
More importantly, seeing RWMutex sends a message to the reader that there are particular considerations around that code.

In the case we're discussing here, the use of the RWMutex never really made sense. The commit message is being polite. It used to be the case, however, that the mutex was held while running a list of callbacks, so from the perspective of this struct, there was some theoretical justification for it.


pkg/sql/sqlliveness/slinstance/slinstance.go line 19 at r17 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

cockroachdb/errors

done


pkg/sql/sqlliveness/slinstance/slinstance.go line 127 at r17 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

maybe also a question of normal vs rw here.

I don't think it's a good idea, for the same reasons. Switching to RWMutex would not be a good idea for the vast majority of mutexes in the codebase.


pkg/sql/sqlliveness/slinstance/slinstance.go line 292 at r17 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

defer l.mu.Unlock()

done


pkg/sql/sqlliveness/slinstance/slinstance_test.go line 93 at r17 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

How come you need SucceedsSoon here? The heartbeat loop is tied to the stopper via RunAsync, so the goroutine is guaranteed to have completed after Stop returns?

you're right, done

Contributor

@ajwerner ajwerner left a comment


:lgtm:

Reviewed 2 of 18 files at r12, 1 of 5 files at r17, 2 of 2 files at r19, 5 of 5 files at r24, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi, @andreimatei, and @knz)


pkg/server/server_sql.go line 425 at r17 (raw file):

// OnSessionDeleted implements the slinstance.SessionEventListener interface.
func (s *stopperSessionEventListener) OnSessionDeleted(ctx context.Context) bool {

nit: name the return param for readability.


pkg/sql/sqlliveness/slinstance/slinstance.go line 20 at r24 (raw file):

	"context"
	"fmt"
	"github.com/cockroachdb/errors"

I think this needs a gofmt.


pkg/sql/sqlliveness/slinstance/slinstance.go line 186 at r24 (raw file):

	}
	// When the session is set, blockCh should not be set.
	if l.mu.blockCh != nil {

nit: panic errors and not strings

Code quote:

		panic("expected session to be set")
	}
	// When the session is set, blockCh should not be set.
	if l.mu.blockCh != nil {
		panic("unexpected blockCh")
	}

pkg/sql/sqlliveness/slinstance/slinstance.go line 269 at r24 (raw file):

	if err != nil {
		if ctx.Err() == nil {
			panic(fmt.Sprintf("expected canceled ctx on err: %s", err))

Can you panic an error rather than a string? I think if you panic a string, it'll get redacted when it goes to sentry.

return errors.NewWithDepthf(depth+1, "panic: %v", r)


pkg/sql/sqlliveness/slinstance/slinstance.go line 285 at r24 (raw file):

	err := l.heartbeatLoopInner(ctx)
	if err == nil {
		panic("expected heartbeat to always terminate with an error")

Same nit about panicking an error

This field is accessed under the mutex, as it needs to be, but it was declared
outside the `mu` struct by mistake.

Release note: None
They were referring to an interface that doesn't exist any more.

Release note: None
Release note: None
Before this patch, the async heartbeat loop was respecting the
cancelation of the context passed to Instance.Start(). That's a bad
idea, since the async task should be detached from the caller.

Release note: None
This field was used to protect against doubly-starting
sqlliveness.Instance, and against requesting a session before the
Instance was started. Neither of these are real risks. This patch
removes such protection and the field backing it.

Release note: None
This mutex used to protect more stuff and arguably used to make more
sense as a RWMutex, but now it protects a single field, so a regular
mutex is clearly better.

Release note: None
This patch improves the code of slinstance.Instance and fixes some bugs.

Before this patch, the background sqlliveness heartbeat loop was very
inconsistent about what events it signaled on exit. Some exit paths
would raise an event to the event consumer, some wouldn't. The ones that
didn't are very dubious because future calls to Session() might block
forever.
Even for the code paths that did call into clearSession(), which is
supposed to raise an event, the event would only be produced if the
current session was expired. Tying the event to session expiration didn't
make sense: if the session is not expired when the heartbeat loop exits,
there will be no further opportunity to signal an event.

This patch makes the raising of events clear and consistent: whenever
the heartbeat loop exits, it arranges that future Session() calls error
out, and it signals the event listener if there was currently a session.

This patch also improves the shutdown process triggered in tenant
servers when a session is found to have been deleted. This shutdown
works through this event raised by the Instance whenever a session is
deleted, giving the sqlServer the opportunity to initiate a shutdown.
Before this patch, the event handler would trigger a shutdown
asynchronously. While the shutdown trigger propagated, the sqlliveness
Instance could create new sessions. I don't think there were clear bad
consequences of this, but, at the very least, creating new sessions in a
tenant server is confusing because the server is supposed to be bound to
a single session throughout its lifetime - that's why the server shuts
down when it finds out that the session has been deleted. This patch
makes it so that the heartbeat loop that generally creates new sessions
shuts down synchronously when the event handler tells it to do so.

Release note: None
Contributor Author

@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi, @ajwerner, and @knz)


pkg/server/server_sql.go line 425 at r17 (raw file):

Previously, ajwerner wrote…

nit: name the return param for readability.

done


pkg/sql/sqlliveness/slinstance/slinstance.go line 20 at r24 (raw file):

Previously, ajwerner wrote…

I think this needs a gofmt.

done


pkg/sql/sqlliveness/slinstance/slinstance.go line 186 at r24 (raw file):

Previously, ajwerner wrote…

nit: panic errors and not strings

switched to log.Fatal(), which hopefully achieves the same


pkg/sql/sqlliveness/slinstance/slinstance.go line 269 at r24 (raw file):

Previously, ajwerner wrote…

Can you panic an error rather than a string? I think if you panic a string, it'll get redacted when it goes to sentry.

return errors.NewWithDepthf(depth+1, "panic: %v", r)

switched to log.Fatal()


pkg/sql/sqlliveness/slinstance/slinstance.go line 285 at r24 (raw file):

Previously, ajwerner wrote…

Same nit about panicking an error

Done.

Contributor

@knz knz left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aadityasondhi, @ajwerner, and @andreimatei)


pkg/sql/sqlliveness/slinstance/slinstance.go line 186 at r24 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

switched to log.Fatal(), which hopefully achieves the same

panic+error is objectively better FYI - it gives us a chance to recover in a test suite.

I don't mind about it in this case (given how unlikely it is), but I've noticed you're biased towards log.Fatal in other work so I want to remind you to over-compensate on this when you can.


pkg/sql/sqlliveness/slinstance/slinstance.go line 269 at r24 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

switched to log.Fatal()

ditto

@knz
Contributor

knz commented Oct 31, 2022

panic+error is objectively better FYI - it gives us a chance to recover in a test suite.

I mean it gives a chance to the test runner to still run the tests after it.

log.Fatal will stupidly stop the entire test process and doesn't give a chance to the other tests to run.

Contributor Author

@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aadityasondhi, @ajwerner, and @knz)


pkg/sql/sqlliveness/slinstance/slinstance.go line 186 at r24 (raw file):

I mean it gives a chance to the test runner to still run the tests after it.

log.Fatal will stupidly stop the entire test process and doesn't give a chance to the other tests to run.

(Rafa, I think you sometimes comment from Github (or email?) and your comments don't make it to the right Reviewable thread)

TIL that the bazel test runner seems to continue running tests after a panic. go test doesn't do that.
Anyway, I think log.Fatal() is fine for assertion failures. I don't think whether or not tests continue to run is an important consideration; I think ergonomics are more important. I usually prefer log.Fatal() because it uses the log tags, and because it does the string formatting. With panics, you frequently have to combine it with fmt.Sprintf() (or better with errors.Newf(), I guess). I don't think we should discourage people away from log.Fatal().

Contributor

@knz knz left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aadityasondhi, @ajwerner, and @andreimatei)


pkg/sql/sqlliveness/slinstance/slinstance.go line 186 at r24 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I mean it gives a chance to the test runner to still run the tests after it.

log.Fatal will stupidly stop the entire test process and doesn't give a chance to the other tests to run.

(Rafa, I think you sometimes comment from Github (or email?) and your comments don't make it to the right Reviewable thread)

TIL that the bazel test runner seems to continue running tests after a panic. go test doesn't do that.
Anyway, I think log.Fatal() is fine for assertion failures. I don't think whether or not tests continue to run is an important consideration; I think ergonomics are more important. I usually prefer log.Fatal() because it uses the log tags, and because it does the string formatting. With panics, you frequently have to combine it with fmt.Sprintf() (or better with errors.Newf(), I guess). I don't think we should discourage people away from log.Fatal().

Yeah, I was addressing the bias, not suggesting moving away from log.Fatal entirely. We should have better guidance.

@knz
Contributor

knz commented Oct 31, 2022

(i don't consider myself to be a stakeholder here - don't wait for my review to merge this)

@andreimatei
Contributor Author

TFTRs!

bors r+

@craig
Contributor

craig bot commented Oct 31, 2022

Build succeeded:

@craig craig bot merged commit 1d5374c into cockroachdb:master Oct 31, 2022
@andreimatei andreimatei deleted the sqlliveness.callback branch October 31, 2022 22:59