
Optionally wait on bytes_per_sync to smooth I/O #29

Merged
merged 1 commit into from
Apr 23, 2019

Conversation


@ajkr ajkr commented Apr 11, 2019

The existing implementation does not guarantee bytes reach disk every `bytes_per_sync` when writing SST files, or every `wal_bytes_per_sync` when writing WALs. This can cause confusing behavior for users who enable this feature to avoid large syncs during flush and compaction, but then end up hitting them anyways.

My understanding of the existing behavior is we used `sync_file_range` with `SYNC_FILE_RANGE_WRITE` to submit ranges for async writeback, such that we could continue processing the next range of bytes while that I/O is happening. I believe we can preserve that benefit while also limiting how far the processing can get ahead of the I/O, which prevents huge syncs from happening when the file finishes.

Consider this `sync_file_range` usage: `sync_file_range(fd_, 0, static_cast<off_t>(offset + nbytes), SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE)`. Expanding the range to start at 0 and adding the `SYNC_FILE_RANGE_WAIT_BEFORE` flag causes any pending writeback (like from a previous call to `sync_file_range`) to finish before it proceeds to submit the latest `nbytes` for writeback. The latest `nbytes` are still written back asynchronously, unless processing exceeds I/O speed, in which case the following `sync_file_range` will need to wait on it.

There is a second change in this PR to use `fdatasync` when `sync_file_range` is unavailable (determined statically) or has some known problem with the underlying filesystem (determined dynamically).

The above two changes only apply when the user enables a new option, `strict_bytes_per_sync`.

Test Plan: ran it in Cockroach large-scale tests on ext4 and ZFS. It fixed problems caused by huge SST syncs in both scenarios.



@ajkr

ajkr commented Apr 11, 2019

Upstream PR is facebook#5183.


@petermattis petermattis left a comment


Reviewable status: 0 of 2 files reviewed, 2 unresolved discussions (waiting on @ajkr)


env/io_posix.cc, line 989 at r1 (raw file):

```cpp
  // upholds the contract of `bytes_per_sync`, it has disadvantages: (1) other
  // non-Posix `Env`s do not have this behavior yet; and (2) unlike
  // `sync_file_range`, `fdatasync` can sync metadata, increasing write-amp.
```

Perhaps the default behavior of WritableFile::RangeSync() should be to call Sync(). Then we would get uniform behavior across all systems. It is super surprising that bytes_per_sync only applies to Linux.


env/io_posix.cc, line 997 at r1 (raw file):

```cpp
  assert(nbytes <= std::numeric_limits<off_t>::max());
  if (sync_file_range(fd_, 0, static_cast<off_t>(offset + nbytes),
                      SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE) == 0) {
```

I think you should add a comment explaining why SYNC_FILE_RANGE_WAIT_BEFORE is specified.

@ajkr ajkr force-pushed the sync-file-range-fallback branch 2 times, most recently from 586a8f9 to fb8d7a9 on April 15, 2019 at 23:59
@ajkr ajkr changed the title from "Ensure data reaches disk according to bytes_per_sync" to "Optionally wait on bytes_per_sync to smooth I/O" on Apr 16, 2019

@ajkr ajkr left a comment


Reviewable status: 0 of 14 files reviewed, 2 unresolved discussions (waiting on @petermattis)


env/io_posix.cc, line 989 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

> Perhaps the default behavior of WritableFile::RangeSync() should be to call Sync(). Then we would get uniform behavior across all systems. It is super surprising that bytes_per_sync only applies to Linux.

Done.


env/io_posix.cc, line 997 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

> I think you should add a comment explaining why SYNC_FILE_RANGE_WAIT_BEFORE is specified.

Done.

@ajkr ajkr force-pushed the sync-file-range-fallback branch from fb8d7a9 to 2dffb28 on April 16, 2019 at 18:07

@petermattis petermattis left a comment


:lgtm:

Seems worthwhile testing this again, though let's wait until the upstream PR lands before merging.

Reviewable status: 0 of 15 files reviewed, all discussions resolved

@ajkr

ajkr commented Apr 16, 2019

Sure, I'll do the measurements on XFS with this PR applied since we haven't started those yet.


@ajkr ajkr left a comment


Even after fixing the bug mentioned here I am not seeing the same perf improvement anymore. The runs still show improvement more often than regression, just not consistently like before. But they do still consistently show it fixes the disk stall check (set at three seconds in my experiments) so I think we should proceed anyways.

```cpp
}
#endif // ROCKSDB_RANGESYNC_PRESENT
return WritableFile::RangeSync(offset, nbytes);
```
@ajkr ajkr left a comment


This wasn't supposed to be called if sync_file_range was already called.

@ajkr ajkr force-pushed the sync-file-range-fallback branch from 2dffb28 to 449835f on April 17, 2019 at 18:04
@petermattis

> Even after fixing the bug mentioned here I am not seeing the same perf improvement anymore. The runs still show improvement more often than regression, just not consistently like before. But they do still consistently show it fixes the disk stall check (set at three seconds in my experiments) so I think we should proceed anyways.

Huh, I wonder where the perf improvement went.

Did you have a chance to test on xfs?

@ajkr

ajkr commented Apr 17, 2019

> > Even after fixing the bug mentioned here I am not seeing the same perf improvement anymore. The runs still show improvement more often than regression, just not consistently like before. But they do still consistently show it fixes the disk stall check (set at three seconds in my experiments) so I think we should proceed anyways.
>
> Huh, I wonder where the perf improvement went.
>
> Did you have a chance to test on xfs?

Yes, but it took ~4 hours last night for both with/without my changes, and ~2 hours this morning for both with/without. The following results are an average of two runs for xfs and one run for ext4. Write barriers were always disabled.

| Filesystem | Changes | Import time (s) | Speedup |
|---|---|---|---|
| xfs | None | 10647 | |
| xfs | (1), (2), (4) | 10137 | 1.05 |
| ext4 | None | 6417 | |
| ext4 | (1), (2), (4) | 6420 | 1.00 |

@ajkr

ajkr commented Apr 17, 2019

My unvalidated suspicion is that making flush/compaction to the tempstore slower actually benefits us. I think it schedules too many L0->L0 compactions when it's unthrottled, which ends up increasing overall writes. That could explain why we saw improvement from making compactions marginally slower with `strict_bytes_per_sync=true`. It could also explain why we saw that (1) (eliminating `fdatasync` on finished WAL) or (2) (setting `wal_bytes_per_sync=0`) alone made things slower.

@petermattis

> I think it schedules too many L0->L0 compactions when it's unthrottled which ends up increasing overall writes.

Do you know what the heuristics are for selecting an L0->L0 compaction? I'm not familiar with them.

@ajkr

ajkr commented Apr 18, 2019

> > I think it schedules too many L0->L0 compactions when it's unthrottled which ends up increasing overall writes.
>
> Do you know what the heuristics are for selecting an L0->L0 compaction? I'm not familiar with them.

I think it was when L0 file count exceeds the compaction threshold by at least two, and L0->base is not possible. L0->base is not possible when another L0->base or a base->base+1 compaction is ongoing.

@ajkr

ajkr commented Apr 18, 2019

> I think it was when L0 file count exceeds the compaction threshold by at least two, and L0->base is not possible. L0->base is not possible when another L0->base or a base->base+1 compaction is ongoing.

Although, reviewing cockroachdb/cockroach#34897 (comment), it looks like there were cases it did L0->L0 without contention at the base level. Seems suspicious. I will have to look again, maybe tomorrow.

@petermattis

> I think it was when L0 file count exceeds the compaction threshold by at least two, and L0->base is not possible. L0->base is not possible when another L0->base or a base->base+1 compaction is ongoing.

I took a look at RocksDB just now and I see the first condition. I wonder how this interacts with ingestion. In CRDB we set `level0_file_num_compaction_trigger = 2`, so if we ingest a bunch of sstables rapidly we can easily see the number of L0 files climb above the L0->L0 trigger threshold of 4.

It is possible we should bump `level0_file_num_compaction_trigger` higher. The default value is 4. I recall setting this to 2 based on a recommendation somewhere and seeing a benefit in some workload. I don't recall doing a thorough investigation. And I believe I chose that setting before L0->L0 compactions were a thing.

@ajkr ajkr merged commit 2597e93 into crl-release-5.17.2 Apr 23, 2019
@ajkr ajkr deleted the sync-file-range-fallback branch April 23, 2019 22:16
ajkr added a commit to ajkr/cockroach that referenced this pull request Apr 27, 2019
Includes the following changes, all of which have landed upstream.

- cockroachdb/rocksdb#27: "ldb: set `total_order_seek` for scans"
- cockroachdb/rocksdb#28: "Fix cockroachdb#3840: only `SyncClosedLogs` for multiple CFs"
- cockroachdb/rocksdb#29: "Optionally wait on bytes_per_sync to smooth I/O"
- cockroachdb/rocksdb#30: "Option string/map/file can set env from object registry"

Also made the RocksDB changes that we decided in cockroachdb#34897:

- Do not sync WAL before installing flush result. This is achieved by backporting cockroachdb/rocksdb#28; no configuration change is necessary.
- Do not sync WAL ever for temp stores. This is achieved by setting `wal_bytes_per_sync = 0`.
- Limit size of final syncs when generating SSTs. This is achieved by backporting cockroachdb/rocksdb#29 and turning it on with `strict_bytes_per_sync = true`.

Release note: None
craig bot pushed a commit to cockroachdb/cockroach that referenced this pull request Apr 27, 2019
37172: c-deps: bump rocksdb for multiple backported PRs r=ajkr a=ajkr


Co-authored-by: Andrew Kryczka <[email protected]>
ajwerner pushed a commit to ajwerner/cockroach that referenced this pull request May 15, 2019