sql: reuse already allocated memory for the cache in a row container #34767
Conversation
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @jordanlewis, and @yuzefovich)
pkg/sql/rowcontainer/row_container.go, line 759 at r1 (raw file):
```go
	for {
		if f.indexedRowsCache.Len() == 0 {
			return errors.Errorf("unexpectedly the cache of DiskBackedIndexedRowContainer contains zero rows")
```
Couldn't this happen in an out of memory scenario, or in the case of an improbably large row? The loop would keep calling RemoveFirst until there were no rows left in the cache.
I'm not really sure what to do about this other than to say that we should probably just propagate the error in that case. Can you work that out?
Also, this is complicated enough that it really deserves a unit test. Would you mind adding some?
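To make the scenario concrete, here is a minimal sketch of the eviction loop under discussion, with hypothetical names and a toy cache rather than the actual row_container.go code; the point is that when even an empty cache cannot fit the row (OOM or an improbably large row), the loop should stop and surface the budget error instead of spinning on RemoveFirst:

```go
// Toy cache with a memory budget. The names (rowCache, addRow, memBudget) are
// hypothetical and only illustrate the control flow being reviewed.
package main

import (
	"container/list"
	"errors"
	"fmt"
)

type rowCache struct {
	rows      *list.List // cached row sizes, oldest at the front
	memUsed   int64
	memBudget int64
}

// addRow makes room for a row of rowSize bytes by evicting from the front.
// If the cache drains completely and the row still does not fit, it propagates
// an error instead of looping forever.
func (c *rowCache) addRow(rowSize int64) error {
	for c.memUsed+rowSize > c.memBudget {
		if c.rows.Len() == 0 {
			return errors.New("memory budget exceeded: row does not fit even in an empty cache")
		}
		evicted := c.rows.Remove(c.rows.Front()).(int64) // stand-in for RemoveFirst
		c.memUsed -= evicted
	}
	c.rows.PushBack(rowSize)
	c.memUsed += rowSize
	return nil
}

func main() {
	c := &rowCache{rows: list.New(), memBudget: 10}
	fmt.Println(c.addRow(4))  // <nil>: fits
	fmt.Println(c.addRow(20)) // evicts everything, then reports the budget error
}
```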
Force-pushed: 4d84524 to 6808751
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @jordanlewis)
pkg/sql/rowcontainer/row_container.go, line 759 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
Couldn't this happen in an out of memory scenario, or in the case of an improbably large row? The loop would keep calling RemoveFirst until there were no rows left in the cache.
I'm not really sure what to do about this other than to say that we should probably just propagate the error in that case. Can you work that out?
Also, this is complicated enough that it really deserves a unit test. Would you mind adding some?
You're right, this could happen, so I added the error propagation. Now, that "unexpected zero rows" error, I believe, should never be reachable, so should I panic there instead?
I added a unit test, and it showed that I had a bug (the cache was not being updated properly when memory was reused: only the copy ended up being updated), so thanks for this suggestion. Also, I realized that there is no change required to ring.Buffer and removed those changes.
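The write-back bug mentioned here is the classic Go value-copy pitfall; a tiny hedged sketch (hypothetical types, not the real DiskBackedIndexedRowContainer) of what the unit test caught:

```go
// Hypothetical cache entry type; this only illustrates the value-copy pitfall.
package main

import "fmt"

type cachedRow struct {
	idx  int
	vals []int64
}

func main() {
	cache := []cachedRow{{idx: 1, vals: []int64{1, 2, 3}}}

	// Buggy pattern: cache[0] is read into a local copy and only the copy is
	// overwritten with the reused row; the slot in the cache is never updated.
	local := cache[0]
	local = cachedRow{idx: 42, vals: []int64{7, 8, 9}}
	_ = local
	fmt.Println("buggy:", cache[0]) // still {1 [1 2 3]}

	// Fixed pattern: write the updated entry back into the cache slot.
	cache[0] = cachedRow{idx: 42, vals: []int64{7, 8, 9}}
	fmt.Println("fixed:", cache[0]) // {42 [7 8 9]}
}
```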
Force-pushed: 6808751 to 399785d
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @asubiotto and @yuzefovich)
pkg/sql/rowcontainer/row_container.go, line 759 at r1 (raw file):
Previously, yuzefovich wrote…
You're right, this could happen, so I added the error propagation. Now, that "unexpected zero rows" error, I believe, should never be reachable, so should I panic there instead?
I added a unit test, and it showed that I had a bug (the cache was not being updated properly when memory was reused: only the copy ended up being updated), so thanks for this suggestion. Also, I realized that there is no change required to ring.Buffer and removed those changes.
Keeping the error is fine, I think.
pkg/sql/rowcontainer/row_container.go, line 761 at r2 (raw file):
```go
	for {
		if f.indexedRowsCache.Len() == 0 {
			if err != nil {
```
Do you need this?
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @asubiotto and @jordanlewis)
pkg/sql/rowcontainer/row_container.go, line 759 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
Keeping the error is fine I think.
Ok, thanks.
pkg/sql/rowcontainer/row_container.go, line 761 at r2 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
Do you need this?
This is only used in the case I described in the second paragraph of the comment on the function. The point is to return the original OOM error instead of the not-very-useful one that I generate below. Are you suggesting returning the generated one and not taking err in as an argument? I agree, this approach seems a little weird to me.
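A small sketch of the trade-off being debated, using standard-library errors only and made-up messages and sizes, showing why the original memory-account error is more useful to the caller than the locally generated fallback:

```go
// Standard-library errors only; the messages and sizes below are made up.
package main

import (
	"errors"
	"fmt"
)

var errMemBudget = errors.New("memory budget exceeded")

func main() {
	// Error returned when growing the memory account fails: carries the details.
	original := fmt.Errorf("cannot cache 512 MiB row, budget is 64 MiB: %w", errMemBudget)

	// Fallback generated when the cache has drained to zero rows: generic.
	generated := errors.New("unexpectedly the cache contains zero rows")

	fmt.Println(errors.Is(original, errMemBudget))  // true: callers can still detect OOM
	fmt.Println(errors.Is(generated, errMemBudget)) // false: that context is lost
}
```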
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @asubiotto, @jordanlewis, and @yuzefovich)
pkg/sql/rowcontainer/row_container.go, line 761 at r2 (raw file):
Previously, yuzefovich wrote…
This is only used in the case I described in the second paragraph of the comment on the function. The point is to return the original OOM error instead of the not-very-useful one that I generate below. Are you suggesting returning the generated one and not taking err in as an argument? I agree, this approach seems a little weird to me.
Oh, I see. Yeah, I thought you could just return the final OOM error that you get down below in this case. I thought you already had a case for this; I'll comment where it is.
pkg/sql/rowcontainer/row_container.go, line 780 at r2 (raw file):
```go
		f.maxCacheSize--
		f.firstCachedRowPos++
		if f.indexedRowsCache.Len() == 0 {
```
Seems to me you can rely on this case without having to pass in the outer error above.
Force-pushed: 399785d to a132513
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @asubiotto, @jordanlewis, and @yuzefovich)
pkg/sql/rowcontainer/row_container.go, line 780 at r2 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
Seems to me you can rely on this case without having to pass in the outer error above.
This case will not be reached when the cache is empty and we're trying to cache a first row of a very big size. I added an extra check within GetRow() to propagate the OOM error in this unlikely case.
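A hedged sketch of that extra check (hypothetical names, not the real GetRow implementation): after trying to cache the requested row, the getter notices the cache is still empty and propagates the original error instead of indexing into an empty cache:

```go
// Hypothetical one-row cache; this is only the shape of the check, not the real code.
package main

import (
	"errors"
	"fmt"
)

type container struct {
	cache     [][]byte // at most one cached row in this sketch
	memBudget int64
}

// tryCache caches the row unless it exceeds the budget on its own.
func (c *container) tryCache(row []byte) error {
	if int64(len(row)) > c.memBudget {
		return errors.New("memory budget exceeded")
	}
	c.cache = append(c.cache[:0], row)
	return nil
}

// getRow surfaces the caching error when the cache stayed empty, instead of
// indexing into an empty cache or returning a generic "zero rows" error.
func (c *container) getRow(row []byte) ([]byte, error) {
	if err := c.tryCache(row); err != nil && len(c.cache) == 0 {
		return nil, err // the very first row did not fit: propagate the OOM error
	}
	return c.cache[0], nil
}

func main() {
	c := &container{memBudget: 4}
	if _, err := c.getRow(make([]byte, 1<<20)); err != nil {
		fmt.Println("propagated:", err)
	}
}
```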
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @asubiotto)
Thanks for the reviews! bors r+
By the way, I forgot to mention that I looked again into that growing-shrinking behavior of the memory account with
34296: storage: improve message on slow Raft proposal r=petermattis a=tbg

Touches #33007.

Release note: None

34589: importccl: fix flaky test TestImportCSVStmt r=rytaft a=rytaft

`TestImportCSVStmt` tests that `IMPORT` jobs appear in a certain order in the `system.jobs` table. Automatic statistics were causing this test to be flaky since `CreateStats` jobs were present in the jobs table as well, in an unpredictable order. This commit fixes the problem by only selecting `IMPORT` jobs from the jobs table.

Fixes #34568

Release note: None

34660: storage: make RaftTruncatedState unreplicated r=bdarnell a=tbg

This isn't 100% polished yet, but generally ready for review.

----

See #34287.

Today, Raft (or preemptive) snapshots include the past Raft log, that is, log entries which are already reflected in the state of the snapshot. Fundamentally, this is because we have historically used a replicated TruncatedState.

TruncatedState essentially tells us what the first index in the log is (though it also includes a Term). If the TruncatedState cannot diverge across replicas, we *must* send the whole log in snapshots, as the first log index must match what the TruncatedState claims it is.

The Raft log is typically, but not necessarily small. Log truncations are driven by a queue and use a complex decision process. That decision process can be faulty and even if it isn't, the queue could be held up. Besides, even when the Raft log contains only very few entries, these entries may be quite large (see SSTable ingestion during RESTORE).

All this motivates that we don't want to (be forced to) send the Raft log as part of snapshots, and in turn we need the TruncatedState to be unreplicated.

This change migrates the TruncatedState into unreplicated keyspace. It does not yet allow snapshots to avoid sending the past Raft log, but that is a relatively straightforward follow-up change.

VersionUnreplicatedRaftTruncatedState, when active, moves the truncated state into unreplicated keyspace on log truncations.

The migration works as follows:

1. At any log position, the replicas of a Range either use the new (unreplicated) key or the old one, and exactly one of them exists.
2. When a log truncation evaluates under the new cluster version, it initiates the migration by deleting the old key. Under the old cluster version, it behaves like today, updating the replicated truncated state.
3. The deletion signals new code downstream of Raft and triggers a write to the new, unreplicated, key (atomic with the deletion of the old key).
4. Future log truncations don't write any replicated data any more, but (like before) send along the TruncatedState which is written downstream of Raft atomically with the deletion of the log entries. This actually uses the same code as 3. What's new is that the truncated state needs to be verified before replacing a previous one. If replicas disagree about their truncated state, it's possible for replica X at FirstIndex=100 to apply a truncated state update that sets FirstIndex to, say, 50 (proposed by a replica with a "longer" historical log). In that case, the truncated state update must be ignored (this is straightforward downstream-of-Raft code).
5. When a split trigger evaluates, it seeds the RHS with the legacy key iff the LHS uses the legacy key, and the unreplicated key otherwise. This makes sure that the invariant that all replicas agree on the state of the migration is upheld.
6. When a snapshot is applied, the receiver is told whether the snapshot contains a legacy key. If not, it writes the truncated state (which is part of the snapshot metadata) in its unreplicated version. Otherwise it doesn't have to do anything (the range will migrate later).

The following diagram visualizes the above. Note that it abuses sequence diagrams to get a nice layout; the vertical lines belonging to NewState and OldState don't imply any particular ordering of operations.

[ASCII sequence-style diagram of OldState and NewState showing bootstraps, log truncations, and snapshots under the old and new cluster versions; the original layout did not survive extraction.]

Release note: None

34762: distsqlplan: fix error in union planning r=jordanlewis a=jordanlewis

Previously, if 2 inputs to a UNION ALL had identical post processing except for renders, further post processing on top of that union all could invalidate the plan and cause errors or crashes.

Fixes #34437.

Release note (bug fix): fix a planning crash during UNION ALL operations that had projections, filters or renders directly on top of the UNION ALL in some cases.

34767: sql: reuse already allocated memory for the cache in a row container r=yuzefovich a=yuzefovich

Previously, we would always allocate new memory for every row that we put in the cache of DiskBackedIndexedRowContainer and simply discard the memory underlying the row that we remove from the cache. Now, we're reusing that memory.

Release note: None

34779: opt: add stats to tpch xform test r=justinj a=justinj

Since we have stats by default now, this should be the default testing mechanism. I left in tpch-no-stats since we also have that for tpcc, just for safety.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
Co-authored-by: Rebecca Taft <[email protected]>
Co-authored-by: Jordan Lewis <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Justin Jaffray <[email protected]>
Build succeeded
Previously, we would always allocate new memory for every row that
we put in the cache of DiskBackedIndexedRowContainer and simply
discard the memory underlying the row that we remove from the cache.
Now, we're reusing that memory.
Release note: None
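As a rough illustration of the change described in this commit message (hypothetical, simplified types rather than the real EncDatumRow machinery), the old behavior allocates a fresh slice for every cached row, while the new behavior copies the incoming row into the memory of the row just evicted:

```go
// Hypothetical datum/row types standing in for the container's rows.
package main

import "fmt"

type datum int64
type encRow []datum

// allocatingCopy is the old behavior: a fresh allocation for every cached row.
func allocatingCopy(src encRow) encRow {
	dst := make(encRow, len(src))
	copy(dst, src)
	return dst
}

// reusingCopy is the new behavior: copy into the evicted row's memory when its
// capacity suffices, falling back to an allocation otherwise.
func reusingCopy(dst, src encRow) encRow {
	if cap(dst) >= len(src) {
		dst = dst[:len(src)]
		copy(dst, src)
		return dst
	}
	return allocatingCopy(src)
}

func main() {
	evicted := encRow{1, 2, 3}  // row just removed from the cache
	incoming := encRow{4, 5, 6} // row being added
	reused := reusingCopy(evicted, incoming)
	fmt.Println(reused, &reused[0] == &evicted[0]) // [4 5 6] true: same backing memory
}
```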