
server: adopt for settings rangefeed-backed settingswatcher, remove g… #69269

Conversation

ajwerner (Contributor)

…ossip

This commit removes the code which connected the settings to their backing
table via the gossiped system config. Instead, it unconditionally enables the
rangefeed-backed settingswatcher which was developed to support tenants.

Note that this is well-tested code that has been used in multi-tenant SQL
pods for about a year now, and all the existing tests still pass.

Release justification: Low risk, high benefit change to existing functionality

Release note: None

@cockroach-teamcity (Member)

This change is Reviewable

@ajwerner (Contributor, Author)

Actually there's a bit more to do here.

@ajwerner ajwerner marked this pull request as draft August 24, 2021 01:39
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Sep 16, 2021
Grafted from cockroachdb#69269. This seems like a useful primitive for users of
this library. We intend to use it in cockroachdb#69661 and cockroachdb#69614.

Release note: None

Co-authored-by: irfan sharif <[email protected]>
craig bot pushed a commit that referenced this pull request Sep 21, 2021
70317: kvclient/rangefeed: emit checkpoint events r=irfansharif a=irfansharif

Grafted from #69269. This seems like a useful primitive for users of
this library. We intend to use it in #69661 and #69614.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
@ajwerner ajwerner force-pushed the ajwerner/adopt-settingswatcher-on-system-tenant branch from dec8f50 to 1c00077 Compare October 25, 2021 05:30
…ossip

This commit removes the code which connected the settings to their backing
table via the gossiped system config. Instead, it unconditionally enables the
rangefeed-backed `settingswatcher` which was developed to support tenants.

Note that this is well-tested code that has been used in multi-tenant SQL
pods for about a year now, and all the existing tests still pass.

Release note: None
@ajwerner ajwerner force-pushed the ajwerner/adopt-settingswatcher-on-system-tenant branch from 1c00077 to 2e8c2c4 Compare December 22, 2021 00:07
@ajwerner ajwerner marked this pull request as ready for review December 22, 2021 00:49
@ajwerner ajwerner requested a review from RaduBerinde December 22, 2021 00:49
@ajwerner (Contributor, Author)

@RaduBerinde I've rebased this and I think it's ready for a pass. I suspect there's missing testing somewhere. Please let me know what you're looking for.

@RaduBerinde (Member) left a comment

Thanks for working on this.

It feels like we're processing each rangefeed event in two different places and in two different ways (one indirectly, after buffering). What's the benefit of buffering events? Why not just keep mu.data up to date in the main rangefeed callback? That would make everything a lot simpler. I don't think we'd even need to keep track of frontier timestamps anymore: whenever we get an event, we either spawn the async storage task or, if it is already running, we set a flag indicating that it needs to run again (and the task can check that flag and restart).
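For concreteness, a minimal sketch of this no-buffering approach; all names here (onEvent, persistLoop, the dirty/taskRunning fields) are hypothetical illustration, not the PR's actual code, and construction/initialization is elided:

```go
package settingssketch

import (
	"context"
	"sync"
)

// watcher is a stripped-down stand-in used only to illustrate the idea: the
// rangefeed callback updates the in-memory state directly and ensures at most
// one async storage task runs at a time.
type watcher struct {
	writeSnapshot func(ctx context.Context, snap map[string]string) error

	mu struct {
		sync.Mutex
		data        map[string]string // in-memory settings state
		taskRunning bool              // a persist task is in flight
		dirty       bool              // events arrived while the task was running
	}
}

// onEvent is the rangefeed callback: update in-memory state, then either start
// the persist task or flag the running one to go around again.
func (w *watcher) onEvent(ctx context.Context, key, value string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.mu.data[key] = value
	if w.mu.taskRunning {
		w.mu.dirty = true
		return
	}
	w.mu.taskRunning = true
	go w.persistLoop(ctx)
}

// persistLoop keeps writing snapshots until no new event arrives during a write.
func (w *watcher) persistLoop(ctx context.Context) {
	for {
		w.mu.Lock()
		snap := make(map[string]string, len(w.mu.data))
		for k, v := range w.mu.data {
			snap[k] = v
		}
		w.mu.Unlock()

		_ = w.writeSnapshot(ctx, snap) // error handling elided

		w.mu.Lock()
		if !w.mu.dirty {
			w.mu.taskRunning = false
			w.mu.Unlock()
			return
		}
		w.mu.dirty = false
		w.mu.Unlock()
	}
}
```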

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @RaduBerinde)


pkg/server/settingswatcher/settings_watcher.go, line 66 at r3 (raw file):

// bootstrap settings state.
type Storage interface {
	WriteKVs(ctx context.Context, kvs []roachpb.KeyValue) error

[nit] This suggests that we may be writing KVs in batches, whereas IIUC each call is a full snapshot. Maybe SaveKVs or SnapshotKVs?

We guarantee that only one instance of the call will be running at any given time, right? We should advertise that here.
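For concreteness, a sketch of the suggested rename plus documentation (hypothetical: the PR's actual interface keeps the name WriteKVs, and the context/roachpb imports are elided):

```go
// Storage is used by the settings watcher to persist a bootstrap snapshot of
// the settings.
type Storage interface {
	// SnapshotKVs overwrites the previously stored settings with a full
	// snapshot of the settings table. At most one call is in flight at any
	// given time.
	SnapshotKVs(ctx context.Context, kvs []roachpb.KeyValue) error
}
```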


pkg/server/settingswatcher/settings_watcher.go, line 255 at r3 (raw file):

				return
			}
		}

[nit] add a comment here saying that a call was already running and we need to try again.

This singleflight+retry mechanism feels awkward to me (perhaps because each call still spawns an async task separately waiting for what should really be a single process). Wouldn't it be simpler to have at most one async task running, along the lines of the following (see the sketch after this list)?

  • invariant: if frontierSaved < frontierToSave then there is an async task running or starting. If frontierSaved >= frontierToSave the async task is not running (or it's exiting).
  • Before we forward frontierToSave, we check the above condition and if we didn't have an async task running, we start it after the forward.
  • In the async task, we run a loop until frontierSaved >= frontierToSave. The latter can change during the loop, causing more iterations.
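A minimal sketch of that shape, with hypothetical names; plain int64 values stand in for hlc.Timestamp, error handling is elided, and construction is omitted:

```go
package settingssketch

import (
	"context"
	"sync"
)

// frontierPersister illustrates the invariant-based alternative: an async task
// exists exactly while frontierSaved < frontierToSave.
type frontierPersister struct {
	save func(ctx context.Context) error // persists the current in-memory snapshot

	mu struct {
		sync.Mutex
		frontierToSave int64 // highest frontier we want persisted
		frontierSaved  int64 // highest frontier we have persisted
		taskRunning    bool  // true exactly while frontierSaved < frontierToSave
	}
}

// onFrontierAdvance forwards frontierToSave and starts the async task if it is
// not already running.
func (p *frontierPersister) onFrontierAdvance(ctx context.Context, ts int64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if ts <= p.mu.frontierToSave {
		return
	}
	p.mu.frontierToSave = ts
	if !p.mu.taskRunning {
		p.mu.taskRunning = true
		go p.saveLoop(ctx)
	}
}

// saveLoop runs until frontierSaved catches up with frontierToSave, which may
// keep moving forward while the loop is running.
func (p *frontierPersister) saveLoop(ctx context.Context) {
	for {
		p.mu.Lock()
		target := p.mu.frontierToSave
		if p.mu.frontierSaved >= target {
			p.mu.taskRunning = false
			p.mu.Unlock()
			return
		}
		p.mu.Unlock()

		_ = p.save(ctx) // retries/error handling elided in this sketch

		p.mu.Lock()
		p.mu.frontierSaved = target
		p.mu.Unlock()
	}
}
```

The loop re-reads frontierToSave on every iteration, so a frontier bump that lands mid-write simply causes one more pass rather than a second goroutine.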

pkg/server/settingswatcher/settings_watcher_external_test.go, line 186 at r3 (raw file):

func (f *fakeStorage) WriteKVs(ctx context.Context, kvs []roachpb.KeyValue) error {
	f.Lock()
	defer f.Unlock()

[nit] should we introduce a random delay here? I want to make sure to test the situation where an event comes in while the async storage task is running. If we do that, we should also assert that only one instance of the method is running at a time: we can increment and defer(decrement) an atomic counter and check that it was 0 (a sketch follows).
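A sketch of what that could look like in the fake; the inFlight field, numWrites field, and the delay bound are hypothetical, and sync.Mutex stands in for whatever mutex the test actually embeds:

```go
package settingssketch_test

import (
	"context"
	"fmt"
	"math/rand"
	"sync"
	"sync/atomic"
	"time"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// fakeStorage records the last snapshot written and checks that only one
// WriteKVs call is ever in flight.
type fakeStorage struct {
	sync.Mutex
	inFlight  int32 // number of WriteKVs calls currently executing
	kvs       []roachpb.KeyValue
	numWrites int
}

func (f *fakeStorage) WriteKVs(ctx context.Context, kvs []roachpb.KeyValue) error {
	// Assert that at most one call is in flight at a time.
	if n := atomic.AddInt32(&f.inFlight, 1); n != 1 {
		return fmt.Errorf("expected exactly one WriteKVs call in flight, found %d", n)
	}
	defer atomic.AddInt32(&f.inFlight, -1)

	// Random delay so rangefeed events can arrive while a write is running.
	time.Sleep(time.Duration(rand.Intn(10)) * time.Millisecond)

	f.Lock()
	defer f.Unlock()
	f.kvs = kvs
	f.numWrites++
	return nil
}
```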

@ajwerner (Contributor, Author) left a comment

What's the benefit of buffering events? Why not just keep mu.data up to date in the main rangefeed callback? That would make everything a lot simpler.

The complexity exists to ensure that whenever we write out a snapshot of settings, it corresponds to a snapshot which actually existed in the settings table at some point in time. The problem is that updates may come out-of-order. The buffer is a hack to avoid needing to maintain a versioned store for data. I'm not sure it's saving much in the way of complexity.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @RaduBerinde)

@RaduBerinde (Member)

Huh, the range feed events don't have monotonically increasing timestamps?? That is a huge thing, it should be documented in at least 3 places (the RangeFeed Internal API itself, kvDB.RangeFeed(), and in rangefeed.RangeFeed).

If we care about the ordering in the settings snapshot, it feels like we should care about it in memory as well (even though the window of potential problems would be much smaller in practice).

I assume that more often than not, we will care about the proper ordering when we write a range feed "client" (which means each use will have to get involved with buffering and frontiers). At the very least, it's much easier to reason about things if you can just get the nice semantics.

So I think we should build a small, optional layer into rangefeed.RangeFeed (or maybe even kvDB.RangeFeed) that internally buffers the KV events and provides the nice monotonic guarantee on the callback. We can specify a memory limit after which the range feed errors out.

@ajwerner (Contributor, Author)

ajwerner commented Jan 7, 2022

Huh, the range feed events don't have monotonically increasing timestamps??

It has very specific ordering guarantees. It guarantees that, for any individual key, you will see writes for the first time in increasing timestamp order. It makes no statement at all about the ordering of events corresponding to different rows. In cockroach, you can have a txn write to row A at t2 and then subsequently a txn write to row B at t1. Refusing to emit the t2 event until a t1 event becomes impossible would effectively negate the design of rangefeeds.

That is a huge thing, it should be documented in at least 3 places (the RangeFeed Internal API itself, kvDB.RangeFeed(), and in rangefeed.RangeFeed).

I don't disagree that it is under-documented. Here's a comment which says some things but is far afield from where one might look.

// Changefeeds offer the following two ordering guarantees to external clients:
//
// 1. Rows are emitted with a timestamp. Individual rows are emitted in
// timestamp order. There may be duplicates, but once a row is seen at a given
// timestamp no previously unseen version of that row will be emitted at a less
// (or equal) timestamp. For example, you may see 1 2 1 2, or even 1 2 1, but
// never simply 2 1.
// 2. Periodically, a resolved timestamp is emitted. This is a changefeed-wide
// guarantee that no previously unseen row will later be seen with a timestamp
// less (or equal) to the resolved one. The cloud storage sink is structured as
// a number of distsql processors that each emit some part of the total changefeed.
// These processors only write files containing row data (initially only in ndjson
// format in this cloudStorageSink). This mapping is stable for a given distsql
// flow of a changefeed (meaning any given row is always emitted by the same
// processor), but it's not stable across restarts (pause/unpause). Each of these
// processors report partial progress information to a central coordinator
// (changeFrontier), which is responsible for writing the resolved timestamp files.

If we care about the ordering in the settings snapshot, it feels like we should care about it in memory as well (even though the window of potential problems would be much smaller in practice).

It's a tradeoff. If we wanted to wait for a snapshot, we'd have to wait for the closed timestamp which is on the order of seconds. Even today we don't update the settings atomically with the gossip update, though the window for things to be out of sync is extremely small; the updater is non-atomic.

One approach to reducing the delay is #73399

So I think we should build a small, optional layer into rangefeed.RangeFeed (or maybe even kvDB.RangeFeed) that internally buffers the KV events and provides the nice monotonic guarantee on the callback.

This is what @irfansharif was setting out to do with

// Buffer provides a thin memory-bounded buffer to sit on top of a rangefeed. It
// accumulates raw events which can then be flushed out in timestamp sorted
// order en-masse whenever the rangefeed frontier is bumped. If we accumulate
// more events than the limit allows for, we error out to the caller.
type Buffer struct { /* ... */ }

but I agree it's too low-level.

We can specify a memory limit after which the range feed errors out.

Yeah. The data structure can in principle maintain exactly one entry per key while waiting for a checkpoint. Hitting a memory error for the use cases we intend to use this for ought to be extremely rare and indicative of something pathological. I'm all for it existing as a guard rail, but at least for settings, it feels like the sort of thing where, if we use too much RAM, we ought to be crashing the server or doing something similarly drastic. Thinking through more complex handling doesn't seem worth it.
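For illustration, a toy, self-contained version of the buffering pattern being discussed — not the actual rangefeedbuffer API — showing accumulation, a timestamp-ordered flush when the frontier advances, and a hard limit that errors out:

```go
package settingssketch

import (
	"errors"
	"sort"
)

// event pairs a raw KV update with its MVCC timestamp (an int64 for brevity).
type event struct {
	key, value string
	ts         int64
}

// miniBuffer accumulates events up to a limit, then flushes everything at or
// below the frontier in timestamp order once the frontier is bumped.
type miniBuffer struct {
	limit  int
	events []event
}

var errBufferLimitExceeded = errors.New("rangefeed buffer limit exceeded")

// Add buffers an event, erroring out if the entry limit would be exceeded.
func (b *miniBuffer) Add(ev event) error {
	if len(b.events) >= b.limit {
		return errBufferLimitExceeded
	}
	b.events = append(b.events, ev)
	return nil
}

// Flush returns all buffered events with ts <= frontier, sorted by timestamp,
// and retains the rest for a later flush.
func (b *miniBuffer) Flush(frontier int64) []event {
	sort.Slice(b.events, func(i, j int) bool { return b.events[i].ts < b.events[j].ts })
	n := sort.Search(len(b.events), func(i int) bool { return b.events[i].ts > frontier })
	flushed := b.events[:n]
	b.events = append([]event(nil), b.events[n:]...)
	return flushed
}
```

Flush only hands events to the consumer once the frontier guarantees no earlier event can still show up, which is what restores the monotonic ordering on top of the rangefeed.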

@RaduBerinde (Member) left a comment

:lgtm:

If we wanted to wait for a snapshot, we'd have to wait for the closed timestamp which is on the order of seconds.

This is another critical detail that should be better advertised, e.g. in the comments for WithOnFrontierAdvance / OnFrontierAdvance.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @RaduBerinde)


pkg/server/settingswatcher/settings_watcher.go, line 53 at r4 (raw file):

	// State used to store settings values to disk.
	buffer *rangefeedbuffer.Buffer

Can you add a comment here explaining that we need the buffer because the range feed callbacks can be out-of-order? Also mention that the buffer will hold a few seconds worth of changes in practice.
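Something along these lines might do (hypothetical wording, not the comment that was ultimately committed):

```go
	// buffer accumulates rangefeed events until the frontier advances. It is
	// needed because events for different keys can arrive out of timestamp
	// order; only once the frontier reaches a timestamp do we know we have
	// seen every update at or below it. In practice this holds a few seconds'
	// worth of changes (roughly the closed timestamp target duration).
	buffer *rangefeedbuffer.Buffer
```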

@irfansharif (Contributor)

One approach to reducing the delay is #73399

For posterity, another would be setting a more aggressive kv.closed_timestamp.target_duration over specific tables (we default to 3s): #70614 (comment).

@ajwerner (Contributor, Author)

ajwerner commented Jan 7, 2022

One approach to reducing the delay is #73399

For posterity, another would be setting a more aggressive kv.closed_timestamp.target_duration over specific tables (we default to 3s): #70614 (comment).

That's a particularly legit idea for the settings table. We don't allow transactional writes to that table. If we set it to zero and implemented the logic to stop tracking point reads to which we've written, as mentioned in #52768 (comment), then, hopefully, we'd insta-close the range for the settings table. This would also rely on the whole table being in one range, but that is not that far-fetched in a future release where we break up the system config span.

@ajwerner (Contributor, Author)

Replacing this with some form of #74612. It's much cleaner.

@ajwerner ajwerner closed this Jan 10, 2022
craig bot pushed a commit that referenced this pull request Feb 2, 2022
74612: rangefeedcache,settingswatcher,systemcfgwatcher: lose gossip deps r=ajwerner a=ajwerner

This is a pile of commits which supersedes #69269 and pretty much puts in place all of the pieces to move on #70560. 

75726: sql: refactor pg_has_role to remove privilege.GRANT usage r=RichardJCai a=ecwall

refs #73129

Also combines some layers of privilege checking code.

Release note: None

75770: vendor: bump cockroachdb/apd to v3.1.0, speed up decimal division r=nvanbenschoten a=nvanbenschoten

Picks up two PRs that improved the performance of `Quo`, `Sqrt`, `Cbrt`, `Exp`, `Ln`, `Log`, and `Pow`:
- cockroachdb/apd#114
- cockroachdb/apd#115

Almost all of the testing changes here are due to the rounding behavior in cockroachdb/apd#115. This brings us closer to PG's behavior, but also creates a lot of noise in this diff. To verify that this noise wasn't hiding any correctness regressions caused by the rewrite of `Context.Quo` in the first PR, I created #75757, which only includes the first PR. #75757 passes CI with minimal testing changes. The testing changes that PR did require all have to do with trailing zeros, and most of them are replaced in this PR.

Release note (performance improvement): The performance of many DECIMAL arithmetic operators has been improved by as much as 60%. These operators include division (`/`), `sqrt`, `cbrt`, `exp`, `ln`, `log`, and `pow`.

----

### Speedup on TPC-DS dataset

The TPC-DS dataset is full of decimal columns, so it's a good playground to test this change. Unfortunately, the variance in the runtime performance of the TPC-DS queries themselves is high (many queries varied by 30-40% per attempt), so it was hard to get signal out of them. Instead, I imported the TPC-DS dataset with a scale factor of 10 and ran some custom aggregation queries against the largest table (web_sales, row count = 7,197,566):

```sql
# q1
select sum(ws_wholesale_cost / ws_ext_list_price) from web_sales;

# q2
select sum(ws_wholesale_cost / ws_ext_list_price - sqrt(ws_net_paid_inc_tax)) from web_sales;
```

Here's the difference in runtime of these two queries before and after this change on an `n2-standard-8` instance:

```
name              old s/op   new s/op   delta
TPC-DS/custom/q1  22.4 ± 0%   8.6 ± 0%  -61.33%  (p=0.002 n=6+6)
TPC-DS/custom/q2  91.4 ± 0%  32.1 ± 0%  -64.85%  (p=0.008 n=5+5)
```

75771: colexec: close the ordered sync used by the external sorter r=yuzefovich a=yuzefovich

**colexec: close the ordered sync used by the external sorter**

Previously, we forgot to close the ordered synchronizer that is used by
the external sorter to merge already sorted partitions. This could
result in some tracing spans never being finished and is now fixed.

Release note: None

**colexec: return an error rather than logging it on Close in some cases**

This error eventually will be logged anyway, but we should try to
propagate it up the stack as much as possible.

Release note: None

75807: ui: Add CircleFilled component r=ericharmeling a=ericharmeling

Previously, there was no component for a filled circle icon in the `ui` package.
Recent product designs have requested this icon for the DB Console (see #67510, #73463).
This PR adds a `CircleFilled` component to the `ui` package.

Release note: None

75813: sql: fix flakey TestShowRangesMultipleStores r=ajwerner a=ajwerner

Fixes #75699

Release note: None

75836: dev: add generate protobuf r=ajwerner a=ajwerner

Convenient, fast.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: Evan Wall <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Eric Harmeling <[email protected]>