Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cli/start: unify code between cockroach start and cockroach mt start-sql #90176

Merged
merged 14 commits into from
Oct 31, 2022

Conversation

knz
Copy link
Contributor

@knz knz commented Oct 18, 2022

Fixes #89974.
Fixes #90524.

This PR merges the server initialization code between cockroach start and cockroach mt start-sql.

In doing so, it brings cockroach mt start-sql closer to what we expect from proper CockroachDB server processes:

See the individual commit for details.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@knz knz changed the title cli: fix a comment cli/start: unify code between cockroach start and cockroach mt start-sql` Oct 18, 2022
@knz knz changed the title cli/start: unify code between cockroach start and cockroach mt start-sql` cli/start: unify code between cockroach start and cockroach mt start-sql Oct 18, 2022
@knz knz force-pushed the 20221018-server-startup branch from 4e39869 to bc858f2 Compare October 20, 2022 17:55
Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure this is all God's work and I'm excited to review it, but please do me a favor and review #90208 so we can rebase this on it. They'll conflict (hopefully for the better), and I was first :P

First 10 commits in this PR from #90143 and #90173. Please ignore them during review.

Did you mean first 9 commits? I can't figure out where r10 - cli/start: extract the server init deps into an interface comes from, and I think I should be looking at it.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)


pkg/cli/mt_start_sql.go line 65 at r12 (raw file):

			return err
		}
		// This value is injected in order to have something populated during startup.

since you're taking all this blame, perhaps check whether all or some this comment can go away.


pkg/cli/start.go line 570 at r11 (raw file):

		// Run SQL for new clusters.
		//
		// TODO(knz): If/when we want auto-creation of an initial admin user,

think twice about whether this TODO needs to exist


pkg/cli/start.go line 615 at r11 (raw file):

	maybeRunInitialSQL func(context.Context, serverStartupInterface) error,
	serverType redact.SafeString,
) (getS func() serverShutdownInterface, srvStatus *serverStatus, serverStartupErrC <-chan error) {

Shameless plug, but this getS thing bothered me and I believe I've gotten rid of it somehow in #90208.
It might look better if you rebase :P.


pkg/cli/start.go line 637 at r11 (raw file):

			}
		}()
		// When the start up goroutine completes, so can the start up span

this comment doesn't fully make sense any more

@knz
Copy link
Contributor Author

knz commented Oct 23, 2022

TODO(self): mention job prevention start #90524

craig bot pushed a commit that referenced this pull request Oct 26, 2022
90384: server: split the tenant server creation into 3 stages r=dhartunian a=knz

Needed as prerequisite to #90176, towards fixing #89974.
First two commits from #90523.

The original motivation (and ultimate goal) for this PR is to
split the tenant server initialization into three phases: `New()`,
`PreStart()`, `AcceptClients()`, so as to reuse a common process startup
logic in a separate PR (#90176).

To achieve this goal, this PR re-orders the initialization steps
in `server.NewTenantServer` (previously known as
`server.StartTenant`), and extracts many of them into a new method
`(*SQLServerWrapper).PreStart()`.

The specific order of the code in `NewTenantServer()` and
`(*SQLServerWrapper).PreStart()` was chosen to mirror the order of
things in `NewServer()` and `(*Server).PreStart()`.

Reason for using the same order:

- it makes the review of this change easier: the reviewer can pull
  `server.go` and `tenant.go` and read them side-by-side, to satisfy
  themselves that the two implementations of
  `NewServer`/`NewTenantServer` and `PreStart` are equivalent.

- it will make it easier for future maintainers to keep them in sync.

- it helps us discover the fact that both sides share a lot of code.
  This opens an opportunity to merge them to a common implementation
  at a later stage.

While doing this work, care was also taken to discover which steps of
`(*Server).PreStart()` were *missing* from the tenant server
initialization. We found the following:

- the Sentry context enhancement (to report cluster ID, etc)
  was missing. This commit fixes that.

- several log entries that report the server configuration
  to the OPS channel were not emitted. This commit fixes that.

- the Graphite metric reporting was never enabled, even when
  configured. This commit fixes that.

- the Obs Service testing knobs (TestingKnobs.EventExporter) were
  not configured on the ObsServer instance. This commit fixes that.

- the `go.scheduler_latency` metric was not being measured.
  This commit fixes that.

Additionally, two followup issues were filed for the following
missing steps:

- missing support for the special file that blocks background jobs. (#90524)
- missing support for the system.eventlog cleanup loop. (#90521)

Epic: CRDB-14537

Co-authored-by: Raphael 'kena' Poss <[email protected]>
Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @knz)


-- commits line 2 at r1:
I've red this patch several times, and I have no idea what its effects are - like I don't know what the default for what store was before, and how it changes. I don't really care, but consider spelling it out more in the commit message - particularly if it actually has some user effect.


pkg/cli/mt_start_sql.go line 89 at r8 (raw file):

	// Change the permission mask for all created files.
	//
	// We're considering everything produced by a cockroach node

I'm trying to not police bro, but you never wrap :)


pkg/cli/mt_start_sql.go line 110 at r8 (raw file):

	// flapping exit code can affect alerting, including the alerting
	// performed within CockroachCloud.
	if err := exitIfDiskFull(vfs.Default, serverCfg.Stores.Specs); err != nil {

does this apply to sql nodes? Same with serverCfg.StorageEngine below.


pkg/cli/mt_start_sql.go line 286 at r8 (raw file):

	log.Flush()

	// When the start up completes, so can the start up span

why isn't this done immediately after opening the span, both here and in runStart()?


pkg/cli/start.go line 434 at r4 (raw file):

		return clierror.NewError(err, exit.FatalError())
	}
	stopper.SetTracer(serverCfg.BaseConfig.AmbientCtx.Tracer)

serverCfg.Tracer


pkg/cli/start.go line 434 at r4 (raw file):

		return clierror.NewError(err, exit.FatalError())
	}
	stopper.SetTracer(serverCfg.BaseConfig.AmbientCtx.Tracer)

Feel free to ignore, but any chance I can snipe you into figuring out a better patch for the tracer?
There's a couple of things that bother me with the status quo:

  • server.NewServer()` also does what you're doing here, so one of the two calls is unnecessary.
  • The stopper allocates a useless tracer that gets discarded here.

I think ideally a function like this runStart() should create the tracer explicitly, and then passing it both into the stopper, and into serverCfg; I also think we should make serverCfg not be a global any more (what's up with that anyway),


pkg/cli/start.go line 473 at r6 (raw file):

	// Configure the default storage engine.
	// NB: we only support one engine type for now.

scratch the "for now". Better yet, scratch the whole comment :P - we're hardly "configuring the default storage engine".
In fact, is this engine.EngineType vestigial / can it be deleted?


pkg/cli/start.go line 653 at r10 (raw file):

				// Run SQL for new clusters.
				//
				// TODO(knz): If/when we want auto-creation of an initial admin user,

If it starts with a "if we want", is it really a TODO?
I am curious what this is referring to, though. We already create a root user for example somehow, don't we?

In any case, consider scratching the comment. I doubt it serves enough purpose to justify its existence.


pkg/cli/start.go line 656 at r10 (raw file):

				// this can be achieved here.
				//
				// TODO(knz): It's unfortunate that this conditional is piercing the

it's not clear what the TODO is, so make it a note, if anything.
But how about making runInitialSQL be a method on the Server, and then it can be in the interface too.


pkg/cli/start.go line 651 at r11 (raw file):

			// Instantiate the server.
			var err error
			s, err = newServerFn(ctx, *serverCfg, stopper)

you sure this call needs to be async? If it doesn't, let's make it sync and return the server instead of the getS() things. If an async scheme needs to stay for some reason, return a channel instead.


pkg/server/config.go line 814 at r8 (raw file):

func (cfg *Config) readSQLEnvironmentVariables() {
	cfg.SpanConfigsDisabled = envutil.EnvOrDefaultBool("COCKROACH_DISABLE_SPAN_CONFIGS", cfg.SpanConfigsDisabled)

is this something that applies to sql servers?

@knz knz force-pushed the 20221018-server-startup branch from bc858f2 to c785f8c Compare October 28, 2022 12:35
Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei)


-- commits line 2 at r1:

Previously, andreimatei (Andrei Matei) wrote…

I've red this patch several times, and I have no idea what its effects are - like I don't know what the default for what store was before, and how it changes. I don't really care, but consider spelling it out more in the commit message - particularly if it actually has some user effect.

Good point. Done.


pkg/cli/mt_start_sql.go line 89 at r8 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I'm trying to not police bro, but you never wrap :)

These comments are copy-pasted from start.go. I didn't write them. Also I wanted them to be identical across the two files so I could more easily see what differs with diff -y.

Anyway this is moot as all this code disappears in the last commit.


pkg/cli/mt_start_sql.go line 110 at r8 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

does this apply to sql nodes? Same with serverCfg.StorageEngine below.

Yes it does. They use a store for disk spilling.


pkg/cli/mt_start_sql.go line 286 at r8 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

why isn't this done immediately after opening the span, both here and in runStart()?

Because we want only finish the span when the startup goroutine completes. In this file in this commit it's not a goroutine yet -- the goroutine is introduced in a later commit.


pkg/cli/mt_start_sql.go line 65 at r12 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

since you're taking all this blame, perhaps check whether all or some this comment can go away.

Yah good question.
Filed #90831 to follow up.


pkg/cli/start.go line 434 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

serverCfg.Tracer

I'll do this in the next PR.


pkg/cli/start.go line 434 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Feel free to ignore, but any chance I can snipe you into figuring out a better patch for the tracer?
There's a couple of things that bother me with the status quo:

  • server.NewServer()` also does what you're doing here, so one of the two calls is unnecessary.
  • The stopper allocates a useless tracer that gets discarded here.

I think ideally a function like this runStart() should create the tracer explicitly, and then passing it both into the stopper, and into serverCfg; I also think we should make serverCfg not be a global any more (what's up with that anyway),

I'll do this in the next PR.


pkg/cli/start.go line 473 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

scratch the "for now". Better yet, scratch the whole comment :P - we're hardly "configuring the default storage engine".
In fact, is this engine.EngineType vestigial / can it be deleted?

Removed the comment. I'll defer commenting on the StorageEngine field to someone in the storage team, in #90832


pkg/cli/start.go line 653 at r10 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

If it starts with a "if we want", is it really a TODO?
I am curious what this is referring to, though. We already create a root user for example somehow, don't we?

In any case, consider scratching the comment. I doubt it serves enough purpose to justify its existence.

Done.


pkg/cli/start.go line 656 at r10 (raw file):

how about making runInitialSQL be a method on the Server

Good idea. I'll do this in a separate followup PR.


pkg/cli/start.go line 570 at r11 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

think twice about whether this TODO needs to exist

It's removed in a later commit.


pkg/cli/start.go line 615 at r11 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Shameless plug, but this getS thing bothered me and I believe I've gotten rid of it somehow in #90208.
It might look better if you rebase :P.

Done.


pkg/cli/start.go line 637 at r11 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

this comment doesn't fully make sense any more

It does! startupSpan is bound to startupCtx whose scope is bound to this function/goroutine.


pkg/cli/start.go line 651 at r11 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

you sure this call needs to be async? If it doesn't, let's make it sync and return the server instead of the getS() things. If an async scheme needs to stay for some reason, return a channel instead.

Yes it needs to remain async because we've seen it hang in the past. (It can also hang because of a disk stall etc)

The async scheme is solved by your new definition of the serverStatus interface. So there's nothing else to do here.


pkg/server/config.go line 814 at r8 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

is this something that applies to sql servers?

Yep it does - the logic for span configs can be disabled sql-side with the SpanConfigsDisabled flag (this is even used in tests already)

@knz knz force-pushed the 20221018-server-startup branch 2 times, most recently from 06f5372 to abfaeab Compare October 28, 2022 13:20
@knz knz marked this pull request as ready for review October 28, 2022 13:26
@knz knz requested a review from a team October 28, 2022 13:26
@knz knz requested review from a team as code owners October 28, 2022 13:26
Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei)


pkg/cli/start.go line 656 at r10 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

how about making runInitialSQL be a method on the Server

Good idea. I'll do this in a separate followup PR.

#90835

@knz knz force-pushed the 20221018-server-startup branch from abfaeab to 2bd08d3 Compare October 28, 2022 14:20
Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: towards a unified future

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @knz)


pkg/cli/start.go line 637 at r11 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

It does! startupSpan is bound to startupCtx whose scope is bound to this function/goroutine.

The "defined above" part doesn't really make much sense now that this is a separate function taking the span as an argument. Consider putting a comment on the argument saying that the function takes ownership.

knz added 6 commits October 31, 2022 09:46
This was already done by `runStartSQL` (in `mt_start_sql.go`).

Release note: None
When a `_CRITICAL_ALERT.txt` file is present in the filesystem,
we want to detect that early before anything else is set up.

Release note: None
This was also done in `mt_start_sql.go` for convenience.

Release note: None
Prior to this patch, `hintServerCmdFlags()` would only work
for `cockroach start` and `start-single-node`, which
are the only commands with support for the hidden, backward-compatible
`--host` / `--advertise-host` flags.

This commit generalizes the code so that it also works
when these flags are not defined.

Release note: None
Prior to this patch, the code in `runStartSQL` was using
a different sequence of steps than in `runStart`, even
for those steps that are beneficial to run in both cases.

This commit fixes that. In particular it adds the following
missing bits to `cockroach mt start-sql`:

- it fixes support for the (test-only)
  `COCKROACH_EXPERIMENTAL_LINEARIZABLE` env var. (from cockroachdb#4754)

- it adds a tracing span for the startup code. (from cockroachdb#8712)

- it properly supports `--listening-url-file`. (from cockroachdb#15468)

- it properly does sanitization of `--external-io-dir`. (from cockroachdb#19725)

- it sets the proper log severity level for gRPC. (from cockroachdb#20308)

- it reports the command-line and env config to logs. (from cockroachdb#21344)

- it prevents startup if there is a `_CRITICAL_ALERT.txt` file
  in the store directory. (from cockroachdb#42401)

- sets the umask for newly created file to remove "other" permission
  bits. This was a security team request originally. (from cockroachdb#44043)

- it sets `GOMAXPROCS` from current cgroup limits. (from cockroachdb#57390)

- it stops the server early if the storage disk is full. (from cockroachdb#66893)

- it fixes support for the `COCKROACH_DISABLE_SPAN_CONFIGS` config env
  var. (from cockroachdb#73876)

To review this commit, it is useful to open the files
`start.go` and `mt_start_sql.go` side-by-side, which
clarifies how much closer they have become to each other.
Looking at the `mt_start_sql.go` changes without
that context likely makes the review more difficult.

A later commit will actually merge the two code paths together so
there is just one thing to maintain.

Release note: None
This again enables sharing more code.

Release note: None
knz added 2 commits October 31, 2022 09:46
… servers

Prior to this commit, the default store path (stored in
`serverCfg.Stores.Specs[0].Path`) computed in `mtStartSQLFlagsInit`
used a path relative to the current directory.

This prevented us from reusing the same external IO path validation
code as used for regular servers.

This commit changes the logic in `mtStartSQLFlagsInit` to make the
default store path absolute.

This is needed to reuse more startup code.

Release note: None
@knz knz force-pushed the 20221018-server-startup branch from 2bd08d3 to 9ccde8d Compare October 31, 2022 08:49
This is a prerequisite to moving the startup goroutine to its own
function.

Release note: None
@knz knz force-pushed the 20221018-server-startup branch 2 times, most recently from 74369ca to 4c615e8 Compare October 31, 2022 13:46
knz added 4 commits October 31, 2022 14:48
…rt-sql`

Arguably, this should really have been done from the start.

This incidentally introduces the ability to terminate the process if
the SQL initialization hangs somehow (e.g. because the remote KV layer
is not ready).

Release note: None
@knz knz force-pushed the 20221018-server-startup branch from 4c615e8 to c3b62c8 Compare October 31, 2022 13:48
Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei)


pkg/cli/start.go line 637 at r11 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

The "defined above" part doesn't really make much sense now that this is a separate function taking the span as an argument. Consider putting a comment on the argument saying that the function takes ownership.

Done.

@knz
Copy link
Contributor Author

knz commented Oct 31, 2022

TFYR

bors r=andreimatei

@craig
Copy link
Contributor

craig bot commented Oct 31, 2022

Build succeeded:

@craig craig bot merged commit b7b658f into cockroachdb:master Oct 31, 2022
@knz knz deleted the 20221018-server-startup branch October 31, 2022 16:24
craig bot pushed a commit that referenced this pull request Oct 31, 2022
90835: cli,server: move runInitialSQL to the server package r=andreimatei a=knz

Requested/suggested by `@andreimatei` in #90176 (review).

This also enables us to get rid of the ill-advised `RunLocalSQL`
method on `Server`.






90899: storage: handle errors in `pebbleMVCCScanner.enablePointSynthesis` r=erikgrinaker a=erikgrinaker

Epic: CRDB-2624
Release note: None

Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants