
Optionally wait on bytes_per_sync to smooth I/O #29

Merged
merged 1 commit into from
Apr 23, 2019

Conversation


@ajkr ajkr commented Apr 11, 2019

The existing implementation does not guarantee bytes reach disk every `bytes_per_sync` when writing SST files, or every `wal_bytes_per_sync` when writing WALs. This can cause confusing behavior for users who enable this feature to avoid large syncs during flush and compaction, but then end up hitting them anyways.

My understanding of the existing behavior is we used `sync_file_range` with `SYNC_FILE_RANGE_WRITE` to submit ranges for async writeback, such that we could continue processing the next range of bytes while that I/O is happening. I believe we can preserve that benefit while also limiting how far the processing can get ahead of the I/O, which prevents huge syncs from happening when the file finishes.

Consider this `sync_file_range` usage: `sync_file_range(fd_, 0, static_cast<off_t>(offset + nbytes), SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE)`. Expanding the range to start at 0 and adding the `SYNC_FILE_RANGE_WAIT_BEFORE` flag causes any pending writeback (like from a previous call to `sync_file_range`) to finish before it proceeds to submit the latest `nbytes` for writeback. The latest `nbytes` are still written back asynchronously, unless processing exceeds I/O speed, in which case the following `sync_file_range` will need to wait on it.

There is a second change in this PR to use `fdatasync` when `sync_file_range` is unavailable (determined statically) or has some known problem with the underlying filesystem (determined dynamically).

The above two changes only apply when the user enables a new option, `strict_bytes_per_sync`.

Test Plan: ran it in Cockroach large-scale tests on ext4 and ZFS. It fixed problems caused by huge SST syncs in both scenarios.



@ajkr

ajkr commented Apr 11, 2019

Upstream PR is facebook#5183.


@petermattis petermattis left a comment


Reviewable status: 0 of 2 files reviewed, 2 unresolved discussions (waiting on @ajkr)


env/io_posix.cc, line 989 at r1 (raw file):

```cpp
  // upholds the contract of `bytes_per_sync`, it has disadvantages: (1) other
  // non-Posix `Env`s do not have this behavior yet; and (2) unlike
  // `sync_file_range`, `fdatasync` can sync metadata, increasing write-amp.
```

Perhaps the default behavior of WritableFile::RangeSync() should be to call Sync(). Then we would get uniform behavior across all systems. It is super surprising that bytes_per_sync only applies to Linux.


env/io_posix.cc, line 997 at r1 (raw file):

```cpp
  assert(nbytes <= std::numeric_limits<off_t>::max());
  if (sync_file_range(fd_, 0, static_cast<off_t>(offset + nbytes),
                      SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE) == 0) {
```

I think you should add a comment explaining why SYNC_FILE_RANGE_WAIT_BEFORE is specified.

@ajkr ajkr force-pushed the sync-file-range-fallback branch 2 times, most recently from 586a8f9 to fb8d7a9 on April 15, 2019 at 23:59
@ajkr ajkr changed the title from "Ensure data reaches disk according to bytes_per_sync" to "Optionally wait on bytes_per_sync to smooth I/O" on Apr 16, 2019

@ajkr ajkr left a comment


Reviewable status: 0 of 14 files reviewed, 2 unresolved discussions (waiting on @petermattis)


env/io_posix.cc, line 989 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

> Perhaps the default behavior of WritableFile::RangeSync() should be to call Sync(). Then we would get uniform behavior across all systems. It is super surprising that bytes_per_sync only applies to Linux.

Done.


env/io_posix.cc, line 997 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

> I think you should add a comment explaining why SYNC_FILE_RANGE_WAIT_BEFORE is specified.

Done.

@ajkr ajkr force-pushed the sync-file-range-fallback branch from fb8d7a9 to 2dffb28 on April 16, 2019 at 18:07

@petermattis petermattis left a comment


:lgtm:

Seems worthwhile testing this again, though let's wait until the upstream PR lands before merging.

Reviewable status: 0 of 15 files reviewed, all discussions resolved

@ajkr

ajkr commented Apr 16, 2019

Sure, I'll do the measurements on XFS with this PR applied since we haven't started those yet.


@ajkr ajkr left a comment


Even after fixing the bug mentioned here I am not seeing the same perf improvement anymore. The runs still show improvement more often than regression, just not consistently like before. But they do still consistently show it fixes the disk stall check (set at three seconds in my experiments) so I think we should proceed anyways.

```cpp
}
#endif // ROCKSDB_RANGESYNC_PRESENT
return WritableFile::RangeSync(offset, nbytes);
```
@ajkr ajkr left a comment


This wasn't supposed to be called if sync_file_range was already called.

@ajkr ajkr force-pushed the sync-file-range-fallback branch from 2dffb28 to 449835f on April 17, 2019 at 18:04
@petermattis

> Even after fixing the bug mentioned here I am not seeing the same perf improvement anymore. The runs still show improvement more often than regression, just not consistently like before. But they do still consistently show it fixes the disk stall check (set at three seconds in my experiments) so I think we should proceed anyways.

Huh, I wonder where the perf improvement went.

Did you have a chance to test on xfs?

@ajkr

ajkr commented Apr 17, 2019

> > Even after fixing the bug mentioned here I am not seeing the same perf improvement anymore. The runs still show improvement more often than regression, just not consistently like before. But they do still consistently show it fixes the disk stall check (set at three seconds in my experiments) so I think we should proceed anyways.
>
> Huh, I wonder where the perf improvement went.
>
> Did you have a chance to test on xfs?

Yes, but it took ~4 hours last night for both with/without my changes, and ~2 hours this morning for both with/without. The following results are an average of two runs for xfs and one run for ext4. Write barriers were always disabled.

| Filesystem | Changes | Import time (s) | Speedup |
|---|---|---|---|
| xfs | None | 10647 | |
| xfs | (1), (2), (4) | 10137 | 1.05 |
| ext4 | None | 6417 | |
| ext4 | (1), (2), (4) | 6420 | 1.00 |

@ajkr

ajkr commented Apr 17, 2019

My unvalidated suspicion is that making flush/compaction to the tempstore slower actually benefits us. I think it schedules too many L0->L0 compactions when it's unthrottled, which ends up increasing overall writes. That could explain why we saw improvement from making compactions marginally slower with `strict_bytes_per_sync=true`. It could also explain why we saw that (1) (eliminating `fdatasync` on finished WAL) or (2) (setting `wal_bytes_per_sync=0`) alone made things slower.

@petermattis

> I think it schedules too many L0->L0 compactions when it's unthrottled which ends up increasing overall writes.

Do you know what the heuristics are for selecting an L0->L0 compaction? I'm not familiar with them.

@ajkr

ajkr commented Apr 18, 2019

> > I think it schedules too many L0->L0 compactions when it's unthrottled which ends up increasing overall writes.
>
> Do you know what the heuristics are for selecting an L0->L0 compaction? I'm not familiar with them.

I think it was when L0 file count exceeds the compaction threshold by at least two, and L0->base is not possible. L0->base is not possible when another L0->base or a base->base+1 compaction is ongoing.

@ajkr

ajkr commented Apr 18, 2019

> I think it was when L0 file count exceeds the compaction threshold by at least two, and L0->base is not possible. L0->base is not possible when another L0->base or a base->base+1 compaction is ongoing.

Although, reviewing cockroachdb/cockroach#34897 (comment), it looks like there were cases it did L0->L0 without contention at the base level. Seems suspicious. I will have to look again, maybe tomorrow.

@petermattis

> I think it was when L0 file count exceeds the compaction threshold by at least two, and L0->base is not possible. L0->base is not possible when another L0->base or a base->base+1 compaction is ongoing.

I took a look at RocksDB just now and I see the first condition. I wonder how this interacts with ingestion. In CRDB we set `level0_file_num_compaction_trigger = 2`, so if we ingest a bunch of sstables rapidly we can easily see the number of L0 files climb above the L0->L0 trigger threshold of 4.

It is possible we should bump `level0_file_num_compaction_trigger` higher. The default value is 4. I recall setting this to 2 based on a recommendation somewhere and seeing a benefit in some workload. I don't recall doing a thorough investigation. And I believe I chose that setting before L0->L0 compactions were a thing.

@ajkr ajkr merged commit 2597e93 into crl-release-5.17.2 Apr 23, 2019
@ajkr ajkr deleted the sync-file-range-fallback branch April 23, 2019 22:16
ajkr added a commit to ajkr/cockroach that referenced this pull request Apr 27, 2019
Includes the following changes, all of which have landed upstream.

- cockroachdb/rocksdb#27: "ldb: set `total_order_seek` for scans"
- cockroachdb/rocksdb#28: "Fix cockroachdb#3840: only `SyncClosedLogs` for multiple CFs"
- cockroachdb/rocksdb#29: "Optionally wait on bytes_per_sync to smooth I/O"
- cockroachdb/rocksdb#30: "Option string/map/file can set env from object registry"

Also made the RocksDB changes that we decided in cockroachdb#34897:

- Do not sync WAL before installing flush result. This is achieved by backporting cockroachdb/rocksdb#28; no configuration change is necessary.
- Do not sync WAL ever for temp stores. This is achieved by setting `wal_bytes_per_sync = 0`.
- Limit size of final syncs when generating SSTs. This is achieved by backporting cockroachdb/rocksdb#29 and turning it on with `strict_bytes_per_sync = true`.

Release note: None
craig bot pushed a commit to cockroachdb/cockroach that referenced this pull request Apr 27, 2019
37172: c-deps: bump rocksdb for multiple backported PRs r=ajkr a=ajkr


Co-authored-by: Andrew Kryczka <[email protected]>
ajwerner pushed a commit to ajwerner/cockroach that referenced this pull request May 15, 2019