Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: separate raft log #16624

Open
irfansharif opened this issue Jun 20, 2017 · 55 comments
Open

kvserver: separate raft log #16624

irfansharif opened this issue Jun 20, 2017 · 55 comments
Labels
C-performance Perf of queries or internals. Solution not expected to change functional behavior. sync-me-8 T-kv KV Team
Milestone

Comments

@irfansharif
Copy link
Contributor

irfansharif commented Jun 20, 2017

Umbrella issue for #16361.

Jira issue: CRDB-12188

Epic CRDB-220

@dianasaur323 dianasaur323 added this to the 1.1 milestone Jun 23, 2017
@irfansharif
Copy link
Contributor Author

~  env COCKROACH_DEDICATED_RAFT_STORAGE=DISABLED make bench PKG=./pkg/storage BENCHES="BenchmarkReplicaRaftStorage" TESTFLAGS="-count 10"  > perf-disabled
~  env COCKROACH_DEDICATED_RAFT_STORAGE=ENABLED make bench PKG=./pkg/storage BENCHES="BenchmarkReplicaRaftStorage" TESTFLAGS="-count 10"  > perf-enabled
~  benchstat perf-enabled perf-disabled

  name                             old time/op  new time/op  delta
  ReplicaRaftStorage/vs=1024-4     1.70ms ± 4%  0.44ms ±12%  -73.90%  (p=0.000 n=10+10)
  ReplicaRaftStorage/vs=4096-4     1.91ms ± 2%  0.63ms ±11%  -66.95%  (p=0.000 n=10+10)
  ReplicaRaftStorage/vs=16384-4    2.74ms ± 2%  1.46ms ± 4%  -46.68%  (p=0.000 n=10+10)
  ReplicaRaftStorage/vs=65536-4    5.16ms ± 5%  4.38ms ± 3%  -15.10%  (p=0.000 n=10+10)
  ReplicaRaftStorage/vs=262144-4   13.6ms ± 4%  16.4ms ± 3%  +20.51%  (p=0.000 n=10+10)
  ReplicaRaftStorage/vs=1048576-4  50.9ms ± 3%  71.0ms ± 4%  +39.70%  (p=0.000 n=10+8)

Early, untuned & possibly incorrect benchmarks. Will update comment once verified.

@petermattis
Copy link
Collaborator

Any insight as to why performance gets worse with a dedicated Raft storage engine at larger value sizes?

@irfansharif
Copy link
Contributor Author

Not yet, no. It shouldn't happen as far as I can tell, doesn't align with the early experiments done in #16361 either. Investigating.

@irfansharif
Copy link
Contributor Author

there was particularly ill-suited reflect.DeepEqual debug assertion here skewing things. That aside we have:

~ benchstat perf-disabled perf-enabled

  name                             old time/op  new time/op  delta
  ReplicaRaftStorage/vs=1024-4      320µs ± 6%   385µs ±18%  +20.29%  (p=0.000 n=10+10)
  ReplicaRaftStorage/vs=4096-4      613µs ±14%   580µs ± 2%     ~     (p=0.278 n=10+9)
  ReplicaRaftStorage/vs=16384-4    2.59ms ± 3%  2.05ms ± 4%  -20.87%  (p=0.000 n=10+9)
  ReplicaRaftStorage/vs=65536-4    4.11ms ± 7%  3.29ms ± 3%  -19.97%  (p=0.000 n=10+10)
  ReplicaRaftStorage/vs=262144-4   13.4ms ± 8%  10.7ms ± 3%  -20.39%  (p=0.000 n=10+10)
  ReplicaRaftStorage/vs=1048576-4  56.8ms ± 3%  36.4ms ± 2%  -35.91%  (p=0.000 n=10+10)

@tbg
Copy link
Member

tbg commented Jun 30, 2017 via email

@irfansharif
Copy link
Contributor Author

nope, I had sprinkled them all over when implementing this to guarantee parity across the two instances. Just forgot to remove all of them when posting the first benchmark.

@petermattis
Copy link
Collaborator

petermattis commented Jul 17, 2017 via email

@irfansharif
Copy link
Contributor Author

irfansharif commented Jul 17, 2017

The following are some results from experiments to determine the relative lack of speed ups from #16809 (pardon the verbosity it's primarily self-documentation). I ran two long running experiments on navy, ycsb -workload F (pure writes) with the same initial-load. From 7/13 04:00 (grafana time) to 7/13 21:11 for the first run (with new changes), second from 7/13 21:27 to 7/14 14:39.

First, the results:

--- Before
17h21m56s       1145.0           0 / 0 / 0        1145 / 0           0 / 0
elapsed__ops/sec(total)__errors(total)
  62516.6s          919.5              0

--- After
17h49m5s       1099.0           0 / 0 / 0        1099 / 0           0 / 0
elapsed__ops/sec(total)__errors(total)
  64145.9s          896.2              0

Overall a 2.59% decrease in throughput. I did the same for a shorter run:

--- Before
1h12m25s       1084.9           0 / 0 / 0        1085 / 0           0 / 0
elapsed__ops/sec(total)__errors(total)
  4345.3s          900.9              0

--- After
1h18m40s       1060.0           0 / 0 / 0        1060 / 0           0 / 0
elapsed__ops/sec(total)__errors(total)
  4720.8s          877.0              0

Similarly, 2.62% decrease in throughput.

Here are some interesting graphs from the longer run, the first half is navy with the dedicated raft engine changes and the second without:
image

Overall the change, as is, seems to increase Raft command commit latency. Picking navy-006 to remove the outlier:
image

We have more CGo calls than we did before (this is expected, we're separately committing to a whole another engine without any form of batching across the two):
image

We have slightly longer raft log processing times:
image

I tried using go-torch after, here are some flame graphs operating on the same data:

Before:
image

After:
image

They both mostly have the same structure (good, expected) except after we have a much wider second flame (I don't know what else to call it).
Looking into this further using callgrind (screenshots not shown) I was able to confirm a larger proportion of time spent in CGo land, happening right about here:

         .          .   4540:   isLogTruncationRequest := rResult.RaftLogDelta != nil
         .      3.62s   4541:   if err := batch.Commit(isLogTruncationRequest); err != nil {
         .          .   4542:           return enginepb.MVCCStats{}, roachpb.NewError(NewReplicaCorruptionError(
         .          .   4543:                   errors.Wrap(err, "could not commit batch")))
         .          .   4544:   }
         .      700ms   4545:   if err := raftBatch.Commit(isLogTruncationRequest); err != nil {
         .          .   4546:           return enginepb.MVCCStats{}, roachpb.NewError(NewReplicaCorruptionError(
         .          .   4547:                   errors.Wrap(err, "could not commit raft batch")))
         .          .   4548:   }
      10ms       10ms   4549:   elapsed := timeutil.Since(start)
         .       30ms   4550:   r.store.metrics.RaftCommandCommitLatency.RecordValue(elapsed.Nanoseconds())
         .       10ms   4551:   return rResult.Delta, nil
         .          .   4552:}
         .          .   4553:

I tried disabling raft log syncing (for both engines, before and after these changes to get a fair comparison) but still saw similar results. As for what I think is happening is our current batching process for RocksDB batch commits. They are specific to individual engines and now that we have two, we batch all writes for the first engine and then separately do the same for the second. This creates this two-phased batching process which might explain the drop in throughput. Compare this to a better batching strategy batching all writes going down to RocksDB, regardless of the engine addressed. Looking into what else can be shared across the two instances, by default the total number of compaction threads are but there probably are more (see include/rocksdb/env.h).

@irfansharif
Copy link
Contributor Author

Some other asides:

  • Overall memory usage goes slightly higher
    image
  • We seem to have more replicas/ranges per node (I don't know if this a valid inference, or why this comes about)
    image
  • Ignore the graph headings but here are RocksDB stats for the new engine. I'm surfacing these as separate metrics to track.
    image

@irfansharif
Copy link
Contributor Author

+cc @tschottdorf.

@petermattis
Copy link
Collaborator

I tried disabling raft log syncing (for both engines, before and after these changes to get a fair comparison) but still saw similar results. As for what I think is happening is our current batching process for RocksDB batch commits. They are specific to individual engines and now that we have two, we batch all writes for the first engine and then separately do the same for the second. This creates this two-phased batching process which might explain the drop in throughput. Compare this to a better batching strategy batching all writes going down to RocksDB, regardless of the engine addressed.

Yeah, that does seem problematic. When adding the dedicated syncing goroutine, I considered adding a dedicated goroutine to perform the commits. With some restructuring, I think we could have a dedicated per-engine goroutine for committing batches, queue both batches and then wait for both of the commits to happen.

@irfansharif
Copy link
Contributor Author

irfansharif commented Jul 19, 2017

Rebasing on top of new master I'm seeing 0 QPS occurring quite frequently, this is ycsb -workload F on ultramarine:

  9m17s          0.0           0 / 0 / 0           0 / 0           0 / 0
  9m18s          0.0           0 / 0 / 0           0 / 0           0 / 0
  9m19s          0.0           0 / 0 / 0           0 / 0           0 / 0
  9m20s          0.0           0 / 0 / 0           0 / 0           0 / 0
elapsed______ops/sec__reads/empty/errors___writes/errors____scans/errors
  9m21s          0.0           0 / 0 / 0           0 / 0           0 / 0
  9m22s          0.0           0 / 0 / 0           0 / 0           0 / 0
  9m23s        116.0           0 / 0 / 0         116 / 0           0 / 0
  9m24s        499.0           0 / 0 / 0         499 / 0           0 / 0
  9m25s        420.0           0 / 0 / 0         420 / 0           0 / 0
  9m26s          0.0           0 / 0 / 0           0 / 0           0 / 0
  9m27s          0.0           0 / 0 / 0           0 / 0           0 / 0
  9m28s          0.0           0 / 0 / 0           0 / 0           0 / 0
  9m29s          0.0           0 / 0 / 0           0 / 0           0 / 0
  9m30s        405.0           0 / 0 / 0         405 / 0           0 / 0
  9m31s         59.0           0 / 0 / 0          59 / 0           0 / 0
  9m32s          0.0           0 / 0 / 0           0 / 0           0 / 0
  9m33s          0.0           0 / 0 / 0           0 / 0           0 / 0
   ...
elapsed__ops/sec(total)__errors(total)
 773.4s          155.4              0

This is new post-rebase, I suspect it's due to #16942 and the reworked batching mechanisms. Isolating a build without those changes.

For posterity these are the numbers for without the raft log changes:

elapsed______ops/sec__reads/empty/errors___writes/errors____scans/errors
   6m1s        558.0           0 / 0 / 0         558 / 0           0 / 0
   6m2s        541.0           0 / 0 / 0         541 / 0           0 / 0
   6m3s        596.0           0 / 0 / 0         596 / 0           0 / 0
   6m4s        553.0           0 / 0 / 0         553 / 0           0 / 0
   6m5s        629.0           0 / 0 / 0         629 / 0           0 / 0
   6m6s        538.0           0 / 0 / 0         538 / 0           0 / 0
   6m7s        566.0           0 / 0 / 0         566 / 0           0 / 0
   6m8s        597.0           0 / 0 / 0         597 / 0           0 / 0
 ... 
elapsed__ops/sec(total)__errors(total)
 368.0s          457.6              0

No zero QPS, almost a thrice as fast (!).

@irfansharif
Copy link
Contributor Author

Nope, something else going on here.
Here's a build reverting three related commits to the new batching/syncing mechanism:

a86bba2f6 Revert "storage/engine: use SyncWAL instead of sync on commit"
f9379de0e Revert "storage/engine: unify code to commit write-only and read-write batches"
ec8bd15c0 Revert "storage/engine: lazily initialize the RocksDB batch"
   4m1s          0.0           0 / 0 / 0           0 / 0           0 / 0
   4m2s          0.0           0 / 0 / 0           0 / 0           0 / 0
   4m3s        244.0           0 / 0 / 0         244 / 0           0 / 0
   4m4s         12.0           0 / 0 / 0          12 / 0           0 / 0
   4m5s          0.0           0 / 0 / 0           0 / 0           0 / 0
   4m6s          0.0           0 / 0 / 0           0 / 0           0 / 0
   4m7s          0.0           0 / 0 / 0           0 / 0           0 / 0
   4m8s          0.0           0 / 0 / 0           0 / 0           0 / 0
   4m9s          0.0           0 / 0 / 0           0 / 0           0 / 0
  4m10s          0.0           0 / 0 / 0           0 / 0           0 / 0
  4m11s        280.0           0 / 0 / 0         280 / 0           0 / 0
 ... 
elapsed__ops/sec(total)__errors(total)
 251.3s          245.9              0

@irfansharif
Copy link
Contributor Author

Zero QPS points are during lots of completely idle CPU time:
image

@petermattis
Copy link
Collaborator

0 QPS like that often indicates a problem with leases.

@irfansharif
Copy link
Contributor Author

Circling back to this issue, still visible after restructuring around truncations moving downstream and happening on it's own dedicated queue after the fact ("truncations" only update TruncatedState, actual deletion of log entries happen after the fact). I'm sure there's a very simple oversight causing this but still investigating. Below are some graphs from a longer run last night where we observe gradual degradation to 0 QPS point, my initial theory is some funky interaction with the existing raft log queue (queueing truncations) and the snapshot queue (queueing snapshots; on initial investigation however the requests seem to originate out of raft directly). We see continuous cycle of snapshot generation + application for the same small subset of ranges (this should have been mitigated after the proposal quota pool, I've come full circle). The workload here is ycsb -workload F -splits 1000 -concurrency 1000.
Forgive the verbosity/noise, I don't know what's relevant as yet.

Slow degradation in throughput followed by intermittent peaks:
image

Enqueued replicas despite 0 throughput, also a lot of failures:
image
image

Surge of non-preemptive snapshots generated (master doesn't show this for the same workload):
image
image

Raft log self behind creeps up:
image

Replicas turn quiescent:
image

Other side effects, raft is blocked as well:
image

@irfansharif
Copy link
Contributor Author

Seeing a lot of snapshot generation/application isolated to few specific ranges:

I170725 03:58:16.971243 1078385 storage/replica_raftstorage.go:769  [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applying Raft snapshot at index 8570 (id=1d594482, encoded size=8836147, 34 rocksdb batches, 24 log entries)
I170725 03:58:17.112937 1078385 storage/replica_raftstorage.go:777  [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applied Raft snapshot in 142ms [clear=0ms batch=99ms entries=0ms commit=42ms]
I170725 03:58:18.162192 1083635 storage/replica_raftstorage.go:769  [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applying Raft snapshot at index 8619 (id=194c1442, encoded size=8895367, 34 rocksdb batches, 30 log entries)
I170725 03:58:18.315653 1083635 storage/replica_raftstorage.go:777  [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applied Raft snapshot in 153ms [clear=5ms batch=57ms entries=1ms commit=90ms]
I170725 03:58:19.455216 1088924 storage/replica_raftstorage.go:769  [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applying Raft snapshot at index 8665 (id=9afacbdb, encoded size=8941200, 34 rocksdb batches, 29 log entries)
I170725 03:58:19.535733 1088924 storage/replica_raftstorage.go:777  [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applied Raft snapshot in 80ms [clear=0ms batch=36ms entries=0ms commit=44ms]
I170725 03:58:20.617621 1094108 storage/replica_raftstorage.go:769  [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applying Raft snapshot at index 8703 (id=07eca121, encoded size=8972789, 35 rocksdb batches, 24 log entries)
I170725 03:58:20.861394 1094108 storage/replica_raftstorage.go:777  [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applied Raft snapshot in 244ms [clear=0ms batch=174ms entries=1ms commit=69ms]
I170725 03:58:21.939922 1099239 storage/replica_raftstorage.go:769  [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applying Raft snapshot at index 8748 (id=59467f09, encoded size=9017571, 35 rocksdb batches, 23 log entries)
I170725 03:58:22.004040 1099239 storage/replica_raftstorage.go:777  [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applied Raft snapshot in 64ms [clear=0ms batch=8ms entries=0ms commit=55ms]
I170725 03:58:23.110255 1104397 storage/replica_raftstorage.go:769  [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applying Raft snapshot at index 8802 (id=f619d9ce, encoded size=9087894, 35 rocksdb batches, 33 log entries)

@bdarnell
Copy link
Contributor

I'm not sure what it means yet, but I think that last graph is going to be the key: we have periodic bursts of messages getting through, coinciding with an increase in the level of dropped messages (and each time the spikes/increases get smaller). It looks like it repeats at regular intervals, but not at a round number (11 minutes?)

This suggests problems in the raft transport, maybe flow control related. I'd try increasing initialConnWindowSize in rpc/context.go.

@irfansharif
Copy link
Contributor Author

irfansharif commented Jul 27, 2017

I'm sure there's a very simple oversight causing this but still investigating

Found the issue, it was due to Replica.mu.raftLogSize not getting updated correctly despite actual truncations taking place. This happened after a live "migration" over to the new engine when upstream of raft the "simulated truncation" (

ms.SysBytes = -ms.SysBytes // simulate the deletion
) did not play well with "actual" truncations happening downstream. The result of this was that despite "actual" truncations happening (log entries got deleted in RockDB), the raft log queue still accrued replicas as individual Replica.mu.raftLogSize's grew indefinitely. All the following emergent side effects could be traced down to just that.

The graphs looks steady following a 10hr run last night, albeit with a reduced throughput.

@irfansharif
Copy link
Contributor Author

Here are the new degradations (implementation is as yet incomplete, awkward state debugging the previous issue), things to look out for:

Still we have more replicas queued for log truncations (I have a handle on why, addressing):
image

Despite this we seem to be processing fewer replica log truncations:
image

Side effect perhaps, raft throughput takes a hit:
image

Most pressing, raft log commit latencies shoot way up. To a lesser degree the same is true for commit latencies. For this run log truncations, though processed downstream, happen just as frequently. The Raft RocksDB instance is untuned (same defaults as the base engine). Investigating.
image

@irfansharif
Copy link
Contributor Author

irfansharif commented Jul 27, 2017

Another stab at this, w/ dedicated raft log gc queue "actually" deleting after the fact. Live migration takes place around 16:52, we maintain steady performance for a while before eventual 0 QPS again.
Similar symptoms as before, and at first glance similar causes (over-aggressive raft log truncations).

Dropping throughput, eventually zero, starting at 16:52:
image

Uptick in replicas queued in the raft log queue, again, starting at 16:52:
image

Again we eventually get to cyclic snapshot generation:
image
image

Right around then more replicas report themselves to be "behind" (hence the snapshots):
image

Replicas eventually quiesce, we're now at 0 QPS:
image

Raft log commit latencies shoot up, raft command commits too but not so much. Consider that the new engine is as yet untuned:
image


As an aside, some new metrics we're tracking now. Will aid when tuning:

image

image

image

image

@petermattis
Copy link
Collaborator

Heh, I've used the term "snapshot death spiral" for this behavior. Similar problems occurred in the past when Raft log truncation was truncating while we were trying to catch up or rebalance a replica. The heuristics now in raft_log_queue.go are supposed to avoid this, but perhaps something in this PR has broken those heuristics.

@irfansharif
Copy link
Contributor Author

Fixing the zero QPS issue (more raft log truncation simulation badness, sigh) I'm running into OOM/high RocksDB memory usage behavior.

image

First two steep runs are running truncations in the separate raft log gc queue using tombstones, fairly infrequently at that. The disk usage for the raft engine just grows out of bound indefinitely until we hit 47 GiB when we are killed. The long running stable run is our disk usage without raft engine enabled where we seem to stay steady under the same constant load (ycsb -workload F -splits 500 -concurrency 500).
The differences between the three runs where we max out disk space:

  • First run, using raft deletion tombstones instead of actual deletions
  • Second run, actual deletions
  • Third run, increasing how aggressively the new raft log gc queue runs. This very negatively affects throughput (by 43% or so) but our memory usage, though still growing indefinitely, is somewhat staggered

The blip towards the end of the first run is what happens when the first node is killed having run out of memory and queries are stalled temporarily allowing the other nodes to presumably run compactions and lower disk usage.

Here are some compactions stats right before this happens on n2,s2:

I170802 14:47:44.643162 92 storage/store.go:4386  [n2,s2] sstables (raft eng read amplification = 3):
25438 0 [ 38M 1 ]: 38M
25439 5 [  4M 1 ]: 4M
25440 6 [ 79M 2 ]: 77M 1M
25441 I170802 14:47:44.643226 92 storage/store.go:4387  [n2,s2]
25442 ** Compaction Stats [default] **
25443 Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
25444 ----------------------------------------------------------------------------------------------------------------------------------------------------------
25445   L0      1/0   37.67 MB   0.9      0.0     0.0      0.0       3.7      3.7       0.0   1.0      0.0     49.6        76        99    0.770       0      0
25446   L5      1/0    3.60 MB   0.5      3.5     3.5      0.0       2.8      2.8       0.0   0.8     46.4     37.2        78        47    1.654   6896K   686K
25447   L6      2/0   78.83 MB   0.0      6.5     3.0      3.5       3.6      0.1       0.0   1.2     62.5     34.6       106        63    1.678   9644K  6277K
25448  Sum      4/0   120.09 MB   0.0     10.0     6.5      3.5      10.1      6.6       0.0   2.7     39.3     39.8       260       209    1.243     16M  6963K
25449  Int      0/0    0.00 KB   0.0      1.3     0.8      0.5       1.3      0.9       0.0   2.7     28.4     29.4        46        25    1.837   2117K   849K
Uptime(secs): 4810.2 total, 4810.2 interval
25451 Flush(GB): cumulative 3.693, interval 0.491
25452 AddFile(GB): cumulative 0.000, interval 0.000
25453 AddFile(Total Files): cumulative 0, interval 0
25454 AddFile(L0 Files): cumulative 0, interval 0
25455 AddFile(Keys): cumulative 0, interval 0
25456 Cumulative compaction: 10.09 GB write, 2.15 MB/s write, 9.98 GB read, 2.12 MB/s read, 259.7 seconds
25457 Interval compaction: 1.32 GB write, 0.28 MB/s write, 1.27 GB read, 0.27 MB/s read, 45.9 seconds
25458 Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_byte      s, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count

Here are some after:

29674 I170802 15:07:44.645798 92 storage/store.go:4386  [n2,s2] sstables (raft eng read amplification = 1):
29675 6 [ 7M 1 ]: 7M
29676 I170802 15:07:44.645980 92 storage/store.go:4387  [n2,s2]
29677 ** Compaction Stats [default] **
29678 Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
29679 ----------------------------------------------------------------------------------------------------------------------------------------------------------
29680   L0      0/0    0.00 KB   0.0      0.0     0.0      0.0       3.8      3.8       0.0   1.0      0.0     49.5        78       102    0.766       0      0
29681   L5      0/0    0.00 KB   0.0      3.6     3.6      0.0       2.9      2.9       0.0   0.8     47.0     37.3        80        49    1.623   7267K   743K
29682   L6      1/0    7.23 MB   0.0      6.7     3.0      3.6       3.7      0.0       0.0   1.2     61.2     33.5       112        65    1.718     10M  6628K
29683  Sum      1/0    7.23 MB   0.0     10.3     6.7      3.7      10.3      6.7       0.0   2.7     39.2     39.3       269       216    1.247     17M  7371K
29684  Int      0/0    0.00 KB   0.0      0.1     0.1      0.1       0.0     -0.1       0.0   3.2    171.1     28.6         1         3    0.264    436K   263K
29685 Uptime(secs): 6010.2 total, 6010.2 interval
29686 Flush(GB): cumulative 3.776, interval 0.007
29687 AddFile(GB): cumulative 0.000, interval 0.000
29688 AddFile(Total Files): cumulative 0, interval 0
29689 AddFile(L0 Files): cumulative 0, interval 0
29690 AddFile(Keys): cumulative 0, interval 0
29691 Cumulative compaction: 10.33 GB write, 1.76 MB/s write, 10.32 GB read, 1.76 MB/s read, 269.3 seconds
29692 Interval compaction: 0.02 GB write, 0.00 MB/s write, 0.13 GB read, 0.02 MB/s read, 0.8 seconds
29693 Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_byte      s, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count

Disk size usage before (raft data is only stored in the /raft subdir, the delta is the disk space used by the single base engine):

~ crl-ssh ultramarine all sudo du -sh /mnt/data1/raft
1: [email protected]
11G     /mnt/data1/raft

2: [email protected]
11G     /mnt/data1/raft

3: [email protected]
12G     /mnt/data1/raft

4: [email protected]
10G     /mnt/data1/raft

~ crl-ssh ultramarine all sudo du -sh /mnt/data1
1: [email protected]
16G     /mnt/data1

2: [email protected]
15G     /mnt/data1

3: [email protected]
16G     /mnt/data1

4: [email protected]
15G     /mnt/data1

Disk usage after, n1 was killed and restarted. Note the disk usage for the other nodes dropped in the interim.

➜  server git:(dedicated-raft-storage) crl-ssh ultramarine all sudo du -sh /mnt/data1/raft
1: [email protected]
80M     /mnt/data1/raft

2: [email protected]
11G     /mnt/data1/raft

3: [email protected]
12G     /mnt/data1/raft

4: [email protected]
11G     /mnt/data1/raft

➜  server git:(dedicated-raft-storage) crl-ssh ultramarine all sudo du -sh /mnt/data1
1: [email protected]
4.8G    /mnt/data1

2: [email protected]
17G     /mnt/data1

3: [email protected]
18G     /mnt/data1

4: [email protected]
17G     /mnt/data1

Before looking into this further, is there something in place currently to bound disk usage for the base engine? I'm surprised it grows so rapidly having simply moved raft log entries + truncations to a separate engine and reducing the frequency in which it occurs. Raft engines for all three runs are using the same config/tunings as the base engine.

@petermattis
Copy link
Collaborator

As mentioned offline, there is no limiter on the amount of disk space RocksDB uses. The significant disk growth is very surprising. I wonder if RocksDB isn't compacting sstables in the Raft log engine for some reason. Nothing is jumping out at me in the compaction stats. Do you have compaction stats from the Raft log engine when its disk usage grew to 47GB?

@irfansharif
Copy link
Contributor Author

irfansharif commented Aug 2, 2017

The snippets above are the stats from the raft log engine. Investigating.

@petermattis
Copy link
Collaborator

The snippets above are the stats from the raft log engine. Investigating.

Huh? The total data size in sstables there is tiny. I guess that makes sense on the one hand, we don't expect the Raft log engine to take much space on disk. So where is the 47GB number coming from?

@irfansharif
Copy link
Contributor Author

strange, the compaction stats doesn't match up with the files on disk.

I170802 17:39:07.336853 10 storage/store.go:4386  [n1,s1] sstables (raft eng read amplification = 1):
6 [ 2M 1 ]: 2M
I170802 17:39:07.336915 10 storage/store.go:4387  [n1,s1]
** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   1.0      0.0     23.7         1        15    0.087       0      0
  L6      1/0    1.82 MB   0.0      0.1     0.0      0.0       0.0     -0.0       0.0   0.9     81.5     37.7         1        14    0.047    613K   588K
 Sum      1/0    1.82 MB   0.0      0.1     0.0      0.0       0.1      0.0       0.0   1.8     27.5     28.4         2        29    0.068    613K   588K
 Int      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0         0         0    0.000       0      0

Where I'm supposed to have a single 2MB L6 file, instead I have:

cockroach@cockroach-ultramarine-0001:/mnt/data1/raft$ du -sh *
2.0M    000007.sst
2.4M    000011.sst
2.0M    000014.sst
1.8M    000017.sst
1.7M    000020.sst
1.2M    000023.sst
2.0M    000026.sst
1.5M    000029.sst
2.1M    000032.sst
1.5M    000035.sst
2.1M    000038.sst
2.1M    000041.sst
2.2M    000044.sst
1.4M    000047.sst
55M     000048.log
1.9M    000050.sst
4.0K    auxiliary
4.0K    COCKROACHDB_VERSION
4.0K    CURRENT
4.0K    IDENTITY
0       LOCK
4.0K    MANIFEST-000008
8.0K    OPTIONS-000005

@andrewbaptist
Copy link
Collaborator

A slightly different alternative is to not use fsync on either WAL and instead write them using AIO/O_DIRECT. If we go down that path, we would want to preallocate the WAL and then treat it as a circular buffer. Using the O_DIRECT path would mean we would need to be more explicit about our writes (need to be 4K aligned), but if we do that, then we also get some of the benefits of having an async interface to the storage layer since only the WAL write is in the critical path. This could also be done with io_uring, but I don't know a lot about that...

@petermattis
Copy link
Collaborator

The Pebble WAL is not pre-allocated (well it is, but not in the way that you're describing), but the WAL recycling functionality means it is more or less a circular buffer when operating in the steady state. We don't seem to have a good description of WAL recycling in a comment or design doc, but @jbowens or @sumeerbhola know where the bodies are buried (i.e. how this all works).

@petermattis
Copy link
Collaborator

https://github.com/cockroachdb/pebble/blob/master/record/record.go#L60-L98 talks about the low-level WAL format changes used for WAL recycling and mentions that WAL recycling is a prerequisite for using direct IO.

@andrewbaptist
Copy link
Collaborator

Completely agree that we would need to use recycling of the file to use direct io. I also don't have a good feel for what the complexity of this change would be. That said, this is one of the "easier" format changes as for an upgrade you could read the old format, delete that file and then start writing in the new format.

For this to work would require the WAL would have to be fixed sized and preallocated. The fixed size is not completely true as it could be handled like a hash table resize where there is a very rare "freeze the world" event, and all data is copied from the old log to the new log. This would be a very high latency (2-3s) event however so should be avoided as much as possible.

If this is done, there are 3 primary benefits

  1. The WAL could be written without explicit syncs (O_DIRECT will sync if the file is preallocated). There are some questions about write caches on the drives, but talking to Seagate, they guarantee on all SSD drives that there is enough capacitance so that a write that makes it to the drive will make it to persistent storage
  2. The LSM files could be written without concern about the OS level synces freezing the WAL writes
  3. The removal of block allocation need for the WAL would reduce any unnecessary fragmentation and allow the file system to generally be more healthy.

@andrewbaptist
Copy link
Collaborator

@petermattis
Copy link
Collaborator

I'm not sure if any additional WAL format changes are needed. WAL recycling already works (those format changes are already in) and is battle hardened. RocksDB already has code for writing the WAL using O_DIRECT and it is using the same WAL format. Even expanding the size of the WAL already works. If you need to add another WAL file code exists to do that. Perhaps there are some additional parts here, but I suspect most of the pieces are in place.

There are some questions about write caches on the drives, but talking to Seagate, they guarantee on all SSD drives that there is enough capacitance so that a write that makes it to the drive will make it to persistent storage

We'd have to chat with the cloud providers to see what guarantees they provide with regard to EBS and similar.

@andrewbaptist
Copy link
Collaborator

andrewbaptist commented Oct 17, 2022

Reading a little more on this, the file can be opened with O_DIRECT|O_DSYNC which will be a no-op if the underlying block device has protection and will slow down writes if it doesn't by setting the FUA (Forced Unit Access). It then guarantees that as long as the file is preallocated (no allocations during writing) the data will be persistent on media before the call returns. Mixing this will AIO allows an async interface to the device so that writes can be "pipeline" and still notify the higher level app on a sync.

I think block-level sync on EBS is a no-op since it always assumes FUA is set (https://twitter.com/molson/status/1350854839090102273), and (https://www.evanjones.ca/durability-cloud.html).

If most of the pieces are in place for this that would be great! I haven't looked through pebble (or RocksDB) much before, so this might "just work". If this does all work, it would address some of the issues related to writing to both RAFT and the state in the same pebble instance, but not all of them.

tbg pushed a commit to tbg/cockroach that referenced this issue Nov 5, 2022
The significant missing piece is part of the Init
implementation to handle a RecoveryInconsistentReplica that
requires applying committed raft log entries to the state
machine. This missing piece will need to wait until cockroachdb#75729
is fixed.

There are multiple TODOs, including related to concurrency,
but the implementation is complete enough for the datadriven
test to exercise many state transitions. Additionally, the
test exercises loss of unsynced state, and fixup of that
state in ReplicasStorage.Init, by using vfs.NewStrictMem.

Informs cockroachdb#16624

Release note: None
@tbg tbg changed the title storage: dedicated raft storage engine kvserver: separate raft log Dec 8, 2022
@tbg tbg removed their assignment Dec 8, 2022
tbg added a commit to tbg/cockroach that referenced this issue Dec 16, 2022
To support a separate raft log[^1] we need to perform certain
reconciliation operations at start-up to recover from a lack
of atomicity between the state and log engines.

This commit gets us closer to being able to do so by listing
all replicas before starting the store, which means we now
have a handle on which uninitialized replicas exist in the
system.

As a first application of this knowledge, we now ensure that every
replica has a persisted FullReplicaID. Since this would not necessarily
be true for stores that have seen older releases, we backfill the
ReplicaID in 23.1 and can then require it to be present in a future
release that forces a migration through 23.1.

[^1]: cockroachdb#16624

Epic: CRDB-220
Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Dec 19, 2022
This introduces the concept of an Engine specific for use in a log
store. In other words, it's the beginnings of a logical separation of
the state machine and log engines, i.e. cockroachdb#16624.

For now, both continue to be backed by the same storage.Engine, and
LogEngine is not correctly used in all places. For example, snapshot
application hasn't yet been updated to account for the possibility
of two separate engines and writes a set of SSTS that is atomically
ingested into the single engine currently present but which logically
spans both engines (cockroachdb#93251).

Epic: CRDB-220
Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Dec 19, 2022
To support a separate raft log[^1] we need to perform certain
reconciliation operations at start-up to recover from a lack
of atomicity between the state and log engines.

This commit gets us closer to being able to do so by listing
all replicas before starting the store, which means we now
have a handle on which uninitialized replicas exist in the
system.

As a first application of this knowledge, we now ensure that every
replica has a persisted FullReplicaID. Since this would not necessarily
be true for stores that have seen older releases, we backfill the
ReplicaID in 23.1 and can then require it to be present in a future
release that forces a migration through 23.1.

[^1]: cockroachdb#16624

Epic: CRDB-220
Release note: None
craig bot pushed a commit that referenced this issue Dec 19, 2022
93317: kvserver: also load uninitialized replicas, verify replicaID r=pavelkalinnikov a=tbg

To support a separate raft log[^1] we need to perform certain
reconciliation operations at start-up to recover from a lack
of atomicity between the state and log engines.

This commit gets us closer to being able to do so by listing
all replicas before starting the store, which means we now
have a handle on which uninitialized replicas exist in the
system.

As a first application of this knowledge, we now ensure that every
replica has a persisted FullReplicaID. Since this would not necessarily
be true for stores that have seen older releases, we backfill the
ReplicaID in 23.1 and can then require it to be present in a future
release that forces a migration through 23.1.

[^1]: #16624

Epic: CRDB-220
Release note: None


93721: roachpb: include tenant name in invalid name error r=andreimatei a=andreimatei

Release note: none
Epic: none

93736: ring: make ring.Buffer generic r=ajwerner a=ajwerner

Epic: none

Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
@exalate-issue-sync exalate-issue-sync bot added T-kv KV Team and removed T-kv-replication labels Jun 28, 2024
@github-project-automation github-project-automation bot moved this to Incoming in KV Aug 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-performance Perf of queries or internals. Solution not expected to change functional behavior. sync-me-8 T-kv KV Team
Projects
No open projects
Status: Incoming
Development

No branches or pull requests