
sqlliveness: bug fixes and improvements #90586

Merged
merged 7 commits into cockroachdb:master from sqlliveness.callback on Oct 31, 2022

Conversation

andreimatei
Contributor

@andreimatei andreimatei commented Oct 24, 2022

Please see individual commits.

Release note: None
Epic: None

@cockroach-teamcity
Member

This change is Reviewable

@andreimatei andreimatei force-pushed the sqlliveness.callback branch 4 times, most recently from 0d28995 to bea29b8 Compare October 27, 2022 17:27
@andreimatei andreimatei changed the title Sqlliveness.callback sqlliveness: bug fixes and improvements Oct 27, 2022
@andreimatei andreimatei marked this pull request as ready for review October 27, 2022 17:28
@andreimatei andreimatei requested a review from a team October 27, 2022 17:28
@andreimatei andreimatei requested review from a team as code owners October 27, 2022 17:28
@andreimatei
Contributor Author

First two commits are #90777.

@andreimatei andreimatei force-pushed the sqlliveness.callback branch 2 times, most recently from 9ac0ec2 to 95f2dde Compare October 27, 2022 21:47
Contributor Author

@andreimatei andreimatei left a comment


#90777 is now merged. All commits in this PR are original.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi and @ajwerner)

Contributor

@knz knz left a comment


Reviewed 22 of 22 files at r11, 18 of 18 files at r12, 17 of 17 files at r13, 18 of 18 files at r14, 17 of 17 files at r15, 17 of 17 files at r16, 5 of 5 files at r17, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @ajwerner, and @andreimatei)


-- commits line 45 at r16:
I have no objection to the change, but I don't understand the rationale. How come the amount of state protected makes a difference in the choice between regular vs rw?

When I was at school, I was taught that the tradeoff is only a matter of the number of readers vs the number of writers. When we expect more readers, rwmutex is better.

What of the last field that remains under the mutex? Is it truly mostly equally written and read from?


pkg/sql/sqlliveness/slinstance/slinstance.go line 19 at r17 (raw file):

import (
	"context"
	"errors"

cockroachdb/errors


pkg/sql/sqlliveness/slinstance/slinstance.go line 127 at r17 (raw file):

	testKnobs     sqlliveness.TestingKnobs
	mu            struct {
		syncutil.Mutex

maybe also a question of normal vs rw here.


pkg/sql/sqlliveness/slinstance/slinstance.go line 292 at r17 (raw file):

	// Keep track of the fact that this Instance is not usable anymore. Further
	// Session() calls will return errors.
	l.mu.Lock()

defer l.mu.Unlock()


pkg/sql/sqlliveness/slinstance/slinstance_test.go line 93 at r17 (raw file):

	// and causes further Session() calls to fail.
	stopper.Stop(ctx)
	testutils.SucceedsSoon(t, func() error {

How come you need SucceedsSoon here? The heartbeat loop is tied to the stopper via RunAsync, so the goroutine is guaranteed to have completed after Stop returns?

Contributor Author

@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @ajwerner, and @knz)


-- commits line 45 at r16:

Previously, knz (Raphael 'kena' Poss) wrote…

I have no objection to the change, but I don't understand the rationale. How come the amount of state protected makes a difference in the choice between regular vs rw?

When I was at school, I was taught that the tradeoff is only a matter of the number of readers vs the number of writers. When we expect more readers, rwmutex is better.

What of the last field that remains under the mutex? Is it truly mostly equally written and read from?

The Go RWMutex is not a scalable implementation - all readers synchronize with each other to enter/exit the critical section, unfortunately (but once a reader is inside, another one can also enter, of course). In fact, the Mutex implementation scales better for small critical sections than RLock/RUnlock up to 8 cores, where the two become pretty similar. RWMutex's Lock/Unlock is both slower in absolute terms and scales worse than the other two.
So, it's not about the "amount of state", it's about the duration of reader critical sections. When all critical sections are small (which is the case here, and is also commonly the case), the RWMutex is worse. There are special situations where you might want to use RWMutex even for fairly small critical sections - which is when one of those sections, even though it's small, is prone to preemption and you don't want that to affect the tail latency of other readers. But these are rare.
More importantly, seeing RWMutex sends a message to the reader that there are particular considerations around that code.

In the case we're discussing here, the use of the RWMutex never really made sense. The commit message is being polite. It used to be the case, however, that the mutex was held while running a list of callbacks, so from the perspective of this struct, there was some theoretical justification for it.


pkg/sql/sqlliveness/slinstance/slinstance.go line 19 at r17 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

cockroachdb/errors

done


pkg/sql/sqlliveness/slinstance/slinstance.go line 127 at r17 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

maybe also a question of normal vs rw here.

I don't think it's a good idea, for the same reasons. Switching to RWMutex would not be a good idea for the vast majority of mutexes in the codebase.


pkg/sql/sqlliveness/slinstance/slinstance.go line 292 at r17 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

defer l.mu.Unlock()

done


pkg/sql/sqlliveness/slinstance/slinstance_test.go line 93 at r17 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

How come you need SucceedsSoon here? The heartbeat loop is tied to the stopper via RunAsync, so the goroutine is guaranteed to have completed after Stop returns?

you're right, done

Contributor

@ajwerner ajwerner left a comment


:lgtm:

Reviewed 2 of 18 files at r12, 1 of 5 files at r17, 2 of 2 files at r19, 5 of 5 files at r24, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi, @andreimatei, and @knz)


pkg/server/server_sql.go line 425 at r17 (raw file):

// OnSessionDeleted implements the slinstance.SessionEventListener interface.
func (s *stopperSessionEventListener) OnSessionDeleted(ctx context.Context) bool {

nit: name the return param for readability.


pkg/sql/sqlliveness/slinstance/slinstance.go line 20 at r24 (raw file):

	"context"
	"fmt"
	"github.com/cockroachdb/errors"

I think this needs a gofmt.


pkg/sql/sqlliveness/slinstance/slinstance.go line 186 at r24 (raw file):

	}
	// When the session is set, blockCh should not be set.
	if l.mu.blockCh != nil {

nit: panic errors and not strings

Code quote:

		panic("expected session to be set")
	}
	// When the session is set, blockCh should not be set.
	if l.mu.blockCh != nil {
		panic("unexpected blockCh")
	}

pkg/sql/sqlliveness/slinstance/slinstance.go line 269 at r24 (raw file):

	if err != nil {
		if ctx.Err() == nil {
			panic(fmt.Sprintf("expected canceled ctx on err: %s", err))

Can you panic an error rather than a string? I think if you panic a string, it'll get redacted when it goes to sentry.

return errors.NewWithDepthf(depth+1, "panic: %v", r)


pkg/sql/sqlliveness/slinstance/slinstance.go line 285 at r24 (raw file):

	err := l.heartbeatLoopInner(ctx)
	if err == nil {
		panic("expected heartbeat to always terminate with an error")

Same nit about panicking an error

This field is accessed under the mutex, as it needs to be, but it was declared
outside the `mu` struct by mistake.

Release note: None
They were referring to an interface that doesn't exist any more.

Release note: None
Release note: None
Before this patch, the async heartbeat loop was respecting the
cancelation of the context passed to Instance.Start(). That's a bad
idea, since the async task should be detached from the caller.

Release note: None
This field was used to protect against doubly-starting
sqlliveness.Instance, and against requesting a session before the
Instance was started. Neither of these are real risks. This patch
removes such protection and the field backing it.

Release note: None
This mutex used to protect more stuff and arguably used to make more
sense as a RWMutex, but now it protects a single field, so a regular
mutex is clearly better.

Release note: None
This patch improves the code of slinstance.Instance and fixes some bugs.

Before this patch, the background sqlliveness heartbeat loop was very
inconsistent about what events it signaled on exit. Some exit paths
would raise an event to the event consumer, some wouldn't. The ones that
didn't are very dubious because future calls to Session() might block
forever.
Even for the code paths that did call into clearSession(), which is
supposed to raise an event, the event would only be produced if the
current session was expired. Tying the event to session expiration didn't
make sense: if the session is not expired when the heartbeat loop exits,
there will be no further opportunity to signal an event.

This patch makes the raising of events clear and consistent: whenever
the heartbeat loop exits, it arranges that future Session() calls error
out, and it signals the event listener if there was currently a session.

This patch also improves the shutdown process triggered in tenant
servers when a session is found to have been deleted. This shutdown
works through this event raised by the Instance whenever a session is
deleted, giving the sqlServer the opportunity to initiate a shutdown.
Before this patch, the event handler would trigger a shutdown
asynchronously. While the shutdown trigger propagated, the sqlliveness
Instance could create new sessions. I don't think there were clear bad
consequences of this, but, at the very least, creating new sessions in a
tenant server is confusing because the server is supposed to be bound to
a single session throughout its lifetime - that's why the server shuts
down when it finds out that the session has been deleted. This patch
makes it so that the heartbeat loop that generally creates new sessions
shuts down synchronously when the event handler tells it to do so.

Release note: None
Contributor Author

@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi, @ajwerner, and @knz)


pkg/server/server_sql.go line 425 at r17 (raw file):

Previously, ajwerner wrote…

nit: name the return param for readability.

done


pkg/sql/sqlliveness/slinstance/slinstance.go line 20 at r24 (raw file):

Previously, ajwerner wrote…

I think this needs a gofmt.

done


pkg/sql/sqlliveness/slinstance/slinstance.go line 186 at r24 (raw file):

Previously, ajwerner wrote…

nit: panic errors and not strings

switched to log.Fatal(), which hopefully achieves the same


pkg/sql/sqlliveness/slinstance/slinstance.go line 269 at r24 (raw file):

Previously, ajwerner wrote…

Can you panic an error rather than a string? I think if you panic a string, it'll get redacted when it goes to sentry.

return errors.NewWithDepthf(depth+1, "panic: %v", r)

switched to log.Fatal()


pkg/sql/sqlliveness/slinstance/slinstance.go line 285 at r24 (raw file):

Previously, ajwerner wrote…

Same nit about panicking an error

Done.

Contributor

@knz knz left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aadityasondhi, @ajwerner, and @andreimatei)


pkg/sql/sqlliveness/slinstance/slinstance.go line 186 at r24 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

switched to log.Fatal(), which hopefully achieves the same

panic+error is objectively better FYI - it gives us a chance to recover in a test suite.

I don't mind about it in this case (given how unlikely it is), but I've noticed you're biased towards log.Fatal in other work so I want to remind you to over-compensate on this when you can.


pkg/sql/sqlliveness/slinstance/slinstance.go line 269 at r24 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

switched to log.Fatal()

ditto

@knz
Contributor

knz commented Oct 31, 2022

panic+error is objectively better FYI - it gives us a chance to recover in a test suite.

I mean it gives a chance to the test runner to still run the tests after it.

log.Fatal will stupidly stop the entire test process and doesn't give a chance to the other tests to run.

Contributor Author

@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aadityasondhi, @ajwerner, and @knz)


pkg/sql/sqlliveness/slinstance/slinstance.go line 186 at r24 (raw file):

I mean it gives a chance to the test runner to still run the tests after it.

log.Fatal will stupidly stop the entire test process and doesn't give a chance to the other tests to run.

(Rafa, I think you sometimes comment from Github (or email?) and your comments don't make it to the right Reviewable thread)

TIL that the bazel test runner seems to continue running tests after a panic. go test doesn't do that.
Anyway, I think log.Fatal() is fine for assertion failures. I don't think whether or not tests continue to run is an important consideration; I think ergonomics are more important. I usually prefer log.Fatal() because it uses the log tags, and because it does the string formatting. With panics, you frequently have to combine it with fmt.Sprintf() (or better with errors.Newf(), I guess). I don't think we should discourage people away from log.Fatal().

Contributor

@knz knz left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aadityasondhi, @ajwerner, and @andreimatei)


pkg/sql/sqlliveness/slinstance/slinstance.go line 186 at r24 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I mean it gives a chance to the test runner to still run the tests after it.

log.Fatal will stupidly stop the entire test process and doesn't give a chance to the other tests to run.

(Rafa, I think you sometimes comment from Github (or email?) and your comments don't make it to the right Reviewable thread)

TIL that the bazel test runner seems to continue running tests after a panic. go test doesn't do that.
Anyway, I think log.Fatal() is fine for assertion failures. I don't think whether or not tests continue to run is an important consideration; I think ergonomics are more important. I usually prefer log.Fatal() because it uses the log tags, and because it does the string formatting. With panics, you frequently have to combine it with fmt.Sprintf() (or better with errors.Newf(), I guess). I don't think we should discourage people away from log.Fatal().

Yeah, I was addressing the bias, not suggesting moving away from log.Fatal entirely. We should have better guidance.

@knz
Contributor

knz commented Oct 31, 2022

(i don't consider myself to be a stakeholder here - don't wait for my review to merge this)

@andreimatei
Contributor Author

TFTRs!

bors r+

@craig
Contributor

craig bot commented Oct 31, 2022

Build succeeded:

@craig craig bot merged commit 1d5374c into cockroachdb:master Oct 31, 2022
@andreimatei andreimatei deleted the sqlliveness.callback branch October 31, 2022 22:59