Optionally wait on bytes_per_sync to smooth I/O #29
Conversation
Upstream PR is facebook#5183.
Force-pushed from 69335b9 to 9f1d884.
Force-pushed from 9f1d884 to 164348e.
Reviewable status: 0 of 2 files reviewed, 2 unresolved discussions (waiting on @ajkr)
env/io_posix.cc, line 989 at r1 (raw file):
```cpp
// upholds the contract of `bytes_per_sync`, it has disadvantages: (1) other
// non-Posix `Env`s do not have this behavior yet; and (2) unlike
// `sync_file_range`, `fdatasync` can sync metadata, increasing write-amp.
```
Perhaps the default behavior of `WritableFile::RangeSync()` should be to call `Sync()`. Then we would get uniform behavior across all systems. It is super surprising that `bytes_per_sync` only applies to Linux.
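For concreteness, a minimal hedged sketch of that suggestion, using an illustrative class rather than RocksDB's actual `WritableFile`:

```cpp
// Hypothetical sketch (illustrative class, not RocksDB's actual WritableFile):
// give RangeSync() a default that falls back to a full Sync(), so
// bytes_per_sync-style behavior exists on every platform and the Linux-only
// sync_file_range() path becomes an override rather than the only path.
#include <cstdint>

class WritableFileSketch {
 public:
  virtual ~WritableFileSketch() = default;

  // Full-file durability primitive; platform subclasses implement this.
  virtual bool Sync() { return true; }

  // Default range sync: delegate to Sync(). A POSIX subclass would override
  // this with sync_file_range() where available.
  virtual bool RangeSync(uint64_t /*offset*/, uint64_t /*nbytes*/) {
    return Sync();
  }
};
```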
env/io_posix.cc, line 997 at r1 (raw file):
```cpp
assert(nbytes <= std::numeric_limits<off_t>::max());
if (sync_file_range(fd_, 0, static_cast<off_t>(offset + nbytes),
                    SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE) == 0) {
```
I think you should add a comment explaining why `SYNC_FILE_RANGE_WAIT_BEFORE` is specified.
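For illustration, a hedged standalone sketch (not RocksDB code) of the kind of comment being asked for; `fd`, `offset`, and `nbytes` mirror the snippet above:

```cpp
// SYNC_FILE_RANGE_WAIT_BEFORE: wait for writeback already submitted anywhere
// in [0, offset + nbytes), e.g. by a previous call, to complete first.
// SYNC_FILE_RANGE_WRITE: then submit the still-dirty pages in that range for
// asynchronous writeback. Together they let the writer run at most one chunk
// ahead of the disk instead of arbitrarily far ahead.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>

#include <cassert>
#include <cstdint>
#include <limits>

bool SubmitAndBoundWriteback(int fd, uint64_t offset, uint64_t nbytes) {
  assert(offset + nbytes <=
         static_cast<uint64_t>(std::numeric_limits<off_t>::max()));
  return sync_file_range(fd, 0, static_cast<off_t>(offset + nbytes),
                         SYNC_FILE_RANGE_WAIT_BEFORE |
                             SYNC_FILE_RANGE_WRITE) == 0;
}
```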
Force-pushed from 586a8f9 to fb8d7a9.
Reviewable status: 0 of 14 files reviewed, 2 unresolved discussions (waiting on @petermattis)
env/io_posix.cc, line 989 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
> Perhaps the default behavior of `WritableFile::RangeSync()` should be to call `Sync()`. Then we would get uniform behavior across all systems. It is super surprising that `bytes_per_sync` only applies to Linux.
Done.
env/io_posix.cc, line 997 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
> I think you should add a comment explaining why `SYNC_FILE_RANGE_WAIT_BEFORE` is specified.
Done.
Force-pushed from fb8d7a9 to 2dffb28.
Seems worthwhile testing this again, though let's wait until the upstream PR lands before merging.
Reviewable status: 0 of 15 files reviewed, all discussions resolved
Sure, I'll do the measurements on XFS with this PR applied since we haven't started those yet.
Even after fixing the bug mentioned here I am not seeing the same perf improvement anymore. The runs still show improvement more often than regression, just not consistently like before. But they do still consistently show it fixes the disk stall check (set at three seconds in my experiments) so I think we should proceed anyways.
```cpp
}
#endif  // ROCKSDB_RANGESYNC_PRESENT
  return WritableFile::RangeSync(offset, nbytes);
```
This wasn't supposed to be called if `sync_file_range` was already called.
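A hedged sketch of the structure this comment implies, with stand-in functions rather than the actual RocksDB ones:

```cpp
// Once sync_file_range() has been used, return its result directly; the
// generic fallback should run only when sync_file_range() was never attempted
// (compiled out, or known to misbehave on this filesystem).
#include <cstdint>

bool SyncFileRangeWriteback(int, uint64_t, uint64_t) { return true; }  // stand-in for the sync_file_range path
bool GenericRangeSync(int, uint64_t, uint64_t) { return true; }        // stand-in for WritableFile::RangeSync()

bool RangeSyncSketch(int fd, uint64_t offset, uint64_t nbytes,
                     bool range_sync_present, bool range_sync_supported) {
  if (range_sync_present && range_sync_supported) {
    // Return here; do not fall through and sync the same bytes a second time.
    return SyncFileRangeWriteback(fd, offset, nbytes);
  }
  return GenericRangeSync(fd, offset, nbytes);
}
```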
The existing implementation does not guarantee bytes reach disk every `bytes_per_sync` when writing SST files, or every `wal_bytes_per_sync` when writing WALs. This can cause confusing behavior for users who enable this feature to avoid large syncs during flush and compaction, but then end up hitting them anyways.

My understanding of the existing behavior is we used `sync_file_range` with `SYNC_FILE_RANGE_WRITE` to submit ranges for async writeback, such that we could continue processing the next range of bytes while that I/O is happening. I believe we can preserve that benefit while also limiting how far the processing can get ahead of the I/O, which prevents huge syncs from happening when the file finishes.

Consider this `sync_file_range` usage: `sync_file_range(fd_, 0, static_cast<off_t>(offset + nbytes), SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE)`. Expanding the range to start at 0 and adding the `SYNC_FILE_RANGE_WAIT_BEFORE` flag causes any pending writeback (like from a previous call to `sync_file_range`) to finish before it proceeds to submit the latest `nbytes` for writeback. The latest `nbytes` are still written back asynchronously, unless processing exceeds I/O speed, in which case the following `sync_file_range` will need to wait on it.

There is a second change in this PR to use `fdatasync` when `sync_file_range` is unavailable (determined statically) or has some known problem with the underlying filesystem (determined dynamically).

The above two changes only apply when the user enables a new option, `strict_bytes_per_sync`.

Test Plan: ran it in Cockroach large-scale tests on ext4 and ZFS. It fixed problems caused by huge SST syncs in both scenarios.
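For illustration, a hedged usage sketch of how an application might opt in to the behavior described above; option names follow this PR's description, while the header path and exact types are assumptions based on standard RocksDB usage:

```cpp
#include <rocksdb/options.h>

rocksdb::Options SmoothedSyncOptions() {
  rocksdb::Options options;
  options.bytes_per_sync = 1 << 20;      // request SST writeback roughly every 1 MB
  options.wal_bytes_per_sync = 1 << 20;  // request WAL writeback roughly every 1 MB
  // New opt-in behavior: also bound how far buffered writes may run ahead of
  // the writeback requested above, falling back to fdatasync where
  // sync_file_range is unavailable or unreliable.
  options.strict_bytes_per_sync = true;
  return options;
}
```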
Force-pushed from 2dffb28 to 449835f.
Huh, I wonder where the perf improvement went. Did you have a chance to test on xfs?
Yes, but it took ~4 hours last night for both with/without my changes, and ~2 hours this morning for both with/without. The following results are an average of two runs for xfs and one run for ext4. Write barriers were always disabled.
My unvalidated suspicion is making flush/compaction to the tempstore slower actually benefits us. I think it schedules too many L0->L0 compactions when it's unthrottled, which ends up increasing overall writes. That could explain why we saw improvement by making compactions marginally slower with …
Do you know what the heuristics are for selecting an L0->L0 compaction? I'm not familiar with them.
I think it was when L0 file count exceeds the compaction threshold by at least two, and L0->base is not possible. L0->base is not possible when another L0->base or a base->base+1 compaction is ongoing.
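For concreteness, a hedged sketch of that condition as just described; the function and parameter names are illustrative, not the actual RocksDB compaction-picker code:

```cpp
// Intra-L0 (L0->L0) compaction is considered only when L0 has built up well
// past the trigger and an L0->base compaction cannot be scheduled right now
// (e.g. another L0->base or base->base+1 compaction already holds the base level).
bool ShouldTryIntraL0Compaction(int l0_file_count,
                                int level0_file_num_compaction_trigger,
                                bool l0_to_base_possible) {
  return l0_file_count >= level0_file_num_compaction_trigger + 2 &&
         !l0_to_base_possible;
}
```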
Although, reviewing cockroachdb/cockroach#34897 (comment), it looks like there were cases it did L0->L0 without contention at the base level. Seems suspicious. I will have to look again, maybe tomorrow.
I took a look at RocksDB just now and I see the first condition. I wonder how this interacts with ingestion. In CRDB we set … It is possible we should bump …
37172: c-deps: bump rocksdb for multiple backported PRs r=ajkr a=ajkr

Includes the following changes, all of which have landed upstream.

- cockroachdb/rocksdb#27: "ldb: set `total_order_seek` for scans"
- cockroachdb/rocksdb#28: "Fix #3840: only `SyncClosedLogs` for multiple CFs"
- cockroachdb/rocksdb#29: "Optionally wait on bytes_per_sync to smooth I/O"
- cockroachdb/rocksdb#30: "Option string/map/file can set env from object registry"

Also made the RocksDB changes that we decided in #34897:

- Do not sync WAL before installing flush result. This is achieved by backporting cockroachdb/rocksdb#28; no configuration change is necessary.
- Do not sync WAL ever for temp stores. This is achieved by setting `wal_bytes_per_sync = 0`.
- Limit size of final syncs when generating SSTs. This is achieved by backporting cockroachdb/rocksdb#29 and turning it on with `strict_bytes_per_sync = true`.

Release note: None

Co-authored-by: Andrew Kryczka <[email protected]>