-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cli/start: unify code between cockroach start
and cockroach mt start-sql
#90176
Conversation
cockroach start
and cockroach mt start-sql
`
cockroach start
and cockroach mt start-sql
`cockroach start
and cockroach mt start-sql
4e39869
to
bc858f2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sure this is all God's work and I'm excited to review it, but please do me a favor and review #90208 so we can rebase this on it. They'll conflict (hopefully for the better), and I was first :P
First 10 commits in this PR from #90143 and #90173. Please ignore them during review.
Did you mean first 9 commits? I can't figure out where r10 - cli/start: extract the server init deps into an interface
comes from, and I think I should be looking at it.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz)
pkg/cli/mt_start_sql.go
line 65 at r12 (raw file):
return err } // This value is injected in order to have something populated during startup.
since you're taking all this blame, perhaps check whether all or some this comment can go away.
pkg/cli/start.go
line 570 at r11 (raw file):
// Run SQL for new clusters. // // TODO(knz): If/when we want auto-creation of an initial admin user,
think twice about whether this TODO needs to exist
pkg/cli/start.go
line 615 at r11 (raw file):
maybeRunInitialSQL func(context.Context, serverStartupInterface) error, serverType redact.SafeString, ) (getS func() serverShutdownInterface, srvStatus *serverStatus, serverStartupErrC <-chan error) {
Shameless plug, but this getS
thing bothered me and I believe I've gotten rid of it somehow in #90208.
It might look better if you rebase :P.
pkg/cli/start.go
line 637 at r11 (raw file):
} }() // When the start up goroutine completes, so can the start up span
this comment doesn't fully make sense any more
TODO(self): mention job prevention start #90524 |
90384: server: split the tenant server creation into 3 stages r=dhartunian a=knz Needed as prerequisite to #90176, towards fixing #89974. First two commits from #90523. The original motivation (and ultimate goal) for this PR is to split the tenant server initialization into three phases: `New()`, `PreStart()`, `AcceptClients()`, so as to reuse a common process startup logic in a separate PR (#90176). To achieve this goal, this PR re-orders the initialization steps in `server.NewTenantServer` (previously known as `server.StartTenant`), and extracts many of them into a new method `(*SQLServerWrapper).PreStart()`. The specific order of the code in `NewTenantServer()` and `(*SQLServerWrapper).PreStart()` was chosen to mirror the order of things in `NewServer()` and `(*Server).PreStart()`. Reason for using the same order: - it makes the review of this change easier: the reviewer can pull `server.go` and `tenant.go` and read them side-by-side, to satisfy themselves that the two implementations of `NewServer`/`NewTenantServer` and `PreStart` are equivalent. - it will make it easier for future maintainers to keep them in sync. - it helps us discover the fact that both sides share a lot of code. This opens an opportunity to merge them to a common implementation at a later stage. While doing this work, care was also taken to discover which steps of `(*Server).PreStart()` were *missing* from the tenant server initialization. We found the following: - the Sentry context enhancement (to report cluster ID, etc) was missing. This commit fixes that. - several log entries that report the server configuration to the OPS channel were not emitted. This commit fixes that. - the Graphite metric reporting was never enabled, even when configured. This commit fixes that. - the Obs Service testing knobs (TestingKnobs.EventExporter) were not configured on the ObsServer instance. This commit fixes that. - the `go.scheduler_latency` metric was not being measured. This commit fixes that. Additionally, two followup issues were filed for the following missing steps: - missing support for the special file that blocks background jobs. (#90524) - missing support for the system.eventlog cleanup loop. (#90521) Epic: CRDB-14537 Co-authored-by: Raphael 'kena' Poss <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @knz)
-- commits
line 2 at r1:
I've red this patch several times, and I have no idea what its effects are - like I don't know what the default for what store was before, and how it changes. I don't really care, but consider spelling it out more in the commit message - particularly if it actually has some user effect.
pkg/cli/mt_start_sql.go
line 89 at r8 (raw file):
// Change the permission mask for all created files. // // We're considering everything produced by a cockroach node
I'm trying to not police bro, but you never wrap :)
pkg/cli/mt_start_sql.go
line 110 at r8 (raw file):
// flapping exit code can affect alerting, including the alerting // performed within CockroachCloud. if err := exitIfDiskFull(vfs.Default, serverCfg.Stores.Specs); err != nil {
does this apply to sql nodes? Same with serverCfg.StorageEngine
below.
pkg/cli/mt_start_sql.go
line 286 at r8 (raw file):
log.Flush() // When the start up completes, so can the start up span
why isn't this done immediately after opening the span, both here and in runStart()
?
pkg/cli/start.go
line 434 at r4 (raw file):
return clierror.NewError(err, exit.FatalError()) } stopper.SetTracer(serverCfg.BaseConfig.AmbientCtx.Tracer)
serverCfg.Tracer
pkg/cli/start.go
line 434 at r4 (raw file):
return clierror.NewError(err, exit.FatalError()) } stopper.SetTracer(serverCfg.BaseConfig.AmbientCtx.Tracer)
Feel free to ignore, but any chance I can snipe you into figuring out a better patch for the tracer?
There's a couple of things that bother me with the status quo:
- server.NewServer()` also does what you're doing here, so one of the two calls is unnecessary.
- The stopper allocates a useless tracer that gets discarded here.
I think ideally a function like this runStart()
should create the tracer explicitly, and then passing it both into the stopper, and into serverCfg
; I also think we should make serverCfg
not be a global any more (what's up with that anyway),
pkg/cli/start.go
line 473 at r6 (raw file):
// Configure the default storage engine. // NB: we only support one engine type for now.
scratch the "for now". Better yet, scratch the whole comment :P - we're hardly "configuring the default storage engine".
In fact, is this engine.EngineType
vestigial / can it be deleted?
pkg/cli/start.go
line 653 at r10 (raw file):
// Run SQL for new clusters. // // TODO(knz): If/when we want auto-creation of an initial admin user,
If it starts with a "if we want", is it really a TODO
?
I am curious what this is referring to, though. We already create a root
user for example somehow, don't we?
In any case, consider scratching the comment. I doubt it serves enough purpose to justify its existence.
pkg/cli/start.go
line 656 at r10 (raw file):
// this can be achieved here. // // TODO(knz): It's unfortunate that this conditional is piercing the
it's not clear what the TODO
is, so make it a note, if anything.
But how about making runInitialSQL
be a method on the Server
, and then it can be in the interface too.
pkg/cli/start.go
line 651 at r11 (raw file):
// Instantiate the server. var err error s, err = newServerFn(ctx, *serverCfg, stopper)
you sure this call needs to be async? If it doesn't, let's make it sync and return the server instead of the getS()
things. If an async scheme needs to stay for some reason, return a channel instead.
pkg/server/config.go
line 814 at r8 (raw file):
func (cfg *Config) readSQLEnvironmentVariables() { cfg.SpanConfigsDisabled = envutil.EnvOrDefaultBool("COCKROACH_DISABLE_SPAN_CONFIGS", cfg.SpanConfigsDisabled)
is this something that applies to sql servers?
bc858f2
to
c785f8c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei)
Previously, andreimatei (Andrei Matei) wrote…
I've red this patch several times, and I have no idea what its effects are - like I don't know what the default for what store was before, and how it changes. I don't really care, but consider spelling it out more in the commit message - particularly if it actually has some user effect.
Good point. Done.
pkg/cli/mt_start_sql.go
line 89 at r8 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I'm trying to not police bro, but you never wrap :)
These comments are copy-pasted from start.go. I didn't write them. Also I wanted them to be identical across the two files so I could more easily see what differs with diff -y
.
Anyway this is moot as all this code disappears in the last commit.
pkg/cli/mt_start_sql.go
line 110 at r8 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
does this apply to sql nodes? Same with
serverCfg.StorageEngine
below.
Yes it does. They use a store for disk spilling.
pkg/cli/mt_start_sql.go
line 286 at r8 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
why isn't this done immediately after opening the span, both here and in
runStart()
?
Because we want only finish the span when the startup goroutine completes. In this file in this commit it's not a goroutine yet -- the goroutine is introduced in a later commit.
pkg/cli/mt_start_sql.go
line 65 at r12 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
since you're taking all this blame, perhaps check whether all or some this comment can go away.
Yah good question.
Filed #90831 to follow up.
pkg/cli/start.go
line 434 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
serverCfg.Tracer
I'll do this in the next PR.
pkg/cli/start.go
line 434 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Feel free to ignore, but any chance I can snipe you into figuring out a better patch for the tracer?
There's a couple of things that bother me with the status quo:
- server.NewServer()` also does what you're doing here, so one of the two calls is unnecessary.
- The stopper allocates a useless tracer that gets discarded here.
I think ideally a function like this
runStart()
should create the tracer explicitly, and then passing it both into the stopper, and intoserverCfg
; I also think we should makeserverCfg
not be a global any more (what's up with that anyway),
I'll do this in the next PR.
pkg/cli/start.go
line 473 at r6 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
scratch the "for now". Better yet, scratch the whole comment :P - we're hardly "configuring the default storage engine".
In fact, is thisengine.EngineType
vestigial / can it be deleted?
Removed the comment. I'll defer commenting on the StorageEngine field to someone in the storage team, in #90832
pkg/cli/start.go
line 653 at r10 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
If it starts with a "if we want", is it really a
TODO
?
I am curious what this is referring to, though. We already create aroot
user for example somehow, don't we?In any case, consider scratching the comment. I doubt it serves enough purpose to justify its existence.
Done.
pkg/cli/start.go
line 656 at r10 (raw file):
how about making runInitialSQL be a method on the Server
Good idea. I'll do this in a separate followup PR.
pkg/cli/start.go
line 570 at r11 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
think twice about whether this TODO needs to exist
It's removed in a later commit.
pkg/cli/start.go
line 615 at r11 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Shameless plug, but this
getS
thing bothered me and I believe I've gotten rid of it somehow in #90208.
It might look better if you rebase :P.
Done.
pkg/cli/start.go
line 637 at r11 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
this comment doesn't fully make sense any more
It does! startupSpan is bound to startupCtx whose scope is bound to this function/goroutine.
pkg/cli/start.go
line 651 at r11 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
you sure this call needs to be async? If it doesn't, let's make it sync and return the server instead of the
getS()
things. If an async scheme needs to stay for some reason, return a channel instead.
Yes it needs to remain async because we've seen it hang in the past. (It can also hang because of a disk stall etc)
The async scheme is solved by your new definition of the serverStatus interface. So there's nothing else to do here.
pkg/server/config.go
line 814 at r8 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
is this something that applies to sql servers?
Yep it does - the logic for span configs can be disabled sql-side with the SpanConfigsDisabled flag (this is even used in tests already)
06f5372
to
abfaeab
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei)
pkg/cli/start.go
line 656 at r10 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
how about making runInitialSQL be a method on the Server
Good idea. I'll do this in a separate followup PR.
abfaeab
to
2bd08d3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @knz)
pkg/cli/start.go
line 637 at r11 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
It does! startupSpan is bound to startupCtx whose scope is bound to this function/goroutine.
The "defined above" part doesn't really make much sense now that this is a separate function taking the span as an argument. Consider putting a comment on the argument saying that the function takes ownership.
This was already done by `runStartSQL` (in `mt_start_sql.go`). Release note: None
When a `_CRITICAL_ALERT.txt` file is present in the filesystem, we want to detect that early before anything else is set up. Release note: None
This was also done in `mt_start_sql.go` for convenience. Release note: None
Prior to this patch, `hintServerCmdFlags()` would only work for `cockroach start` and `start-single-node`, which are the only commands with support for the hidden, backward-compatible `--host` / `--advertise-host` flags. This commit generalizes the code so that it also works when these flags are not defined. Release note: None
Prior to this patch, the code in `runStartSQL` was using a different sequence of steps than in `runStart`, even for those steps that are beneficial to run in both cases. This commit fixes that. In particular it adds the following missing bits to `cockroach mt start-sql`: - it fixes support for the (test-only) `COCKROACH_EXPERIMENTAL_LINEARIZABLE` env var. (from cockroachdb#4754) - it adds a tracing span for the startup code. (from cockroachdb#8712) - it properly supports `--listening-url-file`. (from cockroachdb#15468) - it properly does sanitization of `--external-io-dir`. (from cockroachdb#19725) - it sets the proper log severity level for gRPC. (from cockroachdb#20308) - it reports the command-line and env config to logs. (from cockroachdb#21344) - it prevents startup if there is a `_CRITICAL_ALERT.txt` file in the store directory. (from cockroachdb#42401) - sets the umask for newly created file to remove "other" permission bits. This was a security team request originally. (from cockroachdb#44043) - it sets `GOMAXPROCS` from current cgroup limits. (from cockroachdb#57390) - it stops the server early if the storage disk is full. (from cockroachdb#66893) - it fixes support for the `COCKROACH_DISABLE_SPAN_CONFIGS` config env var. (from cockroachdb#73876) To review this commit, it is useful to open the files `start.go` and `mt_start_sql.go` side-by-side, which clarifies how much closer they have become to each other. Looking at the `mt_start_sql.go` changes without that context likely makes the review more difficult. A later commit will actually merge the two code paths together so there is just one thing to maintain. Release note: None
This again enables sharing more code. Release note: None
… servers Prior to this commit, the default store path (stored in `serverCfg.Stores.Specs[0].Path`) computed in `mtStartSQLFlagsInit` used a path relative to the current directory. This prevented us from reusing the same external IO path validation code as used for regular servers. This commit changes the logic in `mtStartSQLFlagsInit` to make the default store path absolute. This is needed to reuse more startup code. Release note: None
Release note: None
2bd08d3
to
9ccde8d
Compare
This is a prerequisite to moving the startup goroutine to its own function. Release note: None
74369ca
to
4c615e8
Compare
Release note: None
…rt-sql` Arguably, this should really have been done from the start. This incidentally introduces the ability to terminate the process if the SQL initialization hangs somehow (e.g. because the remote KV layer is not ready). Release note: None
Release note: None
Release note: None
4c615e8
to
c3b62c8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei)
pkg/cli/start.go
line 637 at r11 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
The "defined above" part doesn't really make much sense now that this is a separate function taking the span as an argument. Consider putting a comment on the argument saying that the function takes ownership.
Done.
This reverts commit dab29d9. It also fixes cockroachdb#90524. Release note: None
TFYR bors r=andreimatei |
Build succeeded: |
90835: cli,server: move runInitialSQL to the server package r=andreimatei a=knz Requested/suggested by `@andreimatei` in #90176 (review). This also enables us to get rid of the ill-advised `RunLocalSQL` method on `Server`. 90899: storage: handle errors in `pebbleMVCCScanner.enablePointSynthesis` r=erikgrinaker a=erikgrinaker Epic: CRDB-2624 Release note: None Co-authored-by: Raphael 'kena' Poss <[email protected]> Co-authored-by: Erik Grinaker <[email protected]>
Fixes #89974.
Fixes #90524.
This PR merges the server initialization code between
cockroach start
andcockroach mt start-sql
.In doing so, it brings
cockroach mt start-sql
closer to what we expect from proper CockroachDB server processes:it fixes support for the (test-only)
COCKROACH_EXPERIMENTAL_LINEARIZABLE
env var. (from cli: Remove unnecessary command line flags #4754)it adds a tracing span for the startup code. (from server,etc: Add detailed tracing to the startup process #8712!!)
it properly supports
--listening-url-file
. (from cli: Add --listening-url-file flag #15468)it properly does sanitization of
--external-io-dir
. (from storageccl,cli: add external IO dir cli flag #19725)it sets the proper log severity level for gRPC. (from cli: add timeouts to
node status
andnode ls
#20308)it reports the command-line and env config to logs. (from cli/start: enhance and better comment the startup process #21344)
it prevents startup if there is a
_CRITICAL_ALERT.txt
filein the store directory. (from storage: prevent node from restarting after consistency check fatal #42401)
sets the umask for newly created file to remove "other" permission
bits. This was a security team request originally. (from cli,log: change the default file permissions #44043)
it recovers support for
DISABLE_STARTING_BACKGROUND_JOBS
. (from jobs: allow blocking job adoption via sentinel file #44786)it sets
GOMAXPROCS
from current cgroup limits. (from cli: Set GOMAXPROCS on start if in a CPU-limited cgroup #57390)it stops the server early if the storage disk is full. (from cli,storage: add emergency ballast #66893)
it fixes support for the
COCKROACH_DISABLE_SPAN_CONFIGS
config envvar. (from *: enable span configs by default #73876)
See the individual commit for details.