Support an improved/global limit on BlobDB's space amp #10399

Open · wants to merge 11 commits into main

Conversation

@gangliao (Contributor) commented Jul 21, 2022

Summary:

BlobDB currently supports limiting space amplification via the configuration option `blob_garbage_collection_force_threshold` (https://github.com/facebook/rocksdb/blob/main/include/rocksdb/advanced_options.h#L958-L969). It works by computing the ratio of garbage (i.e. garbage bytes divided by total bytes) over the oldest batch of blob files, and if the ratio exceeds the specified threshold, it triggers a special type of compaction targeting the SST files that point to the blob files in question. (There is a coarse mapping between SSTs and blob files, which we track in the MANIFEST.)
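To make that mechanism concrete, here is a minimal sketch of the ratio check described above. The struct and function names are illustrative assumptions, not the actual RocksDB internals:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-blob-file byte counters, as tracked in the MANIFEST.
struct BlobFileStats {
  uint64_t total_blob_bytes;
  uint64_t garbage_blob_bytes;
};

// Sum garbage and total bytes over the oldest batch of blob files and
// force-compact the SSTs linked to them if the garbage ratio exceeds
// blob_garbage_collection_force_threshold.
bool ShouldForceCompactOldestBatch(
    const std::vector<BlobFileStats>& oldest_batch,
    double blob_garbage_collection_force_threshold) {
  uint64_t total = 0;
  uint64_t garbage = 0;
  for (const auto& f : oldest_batch) {
    total += f.total_blob_bytes;
    garbage += f.garbage_blob_bytes;
  }
  if (total == 0) {
    return false;  // nothing to collect
  }
  const double garbage_ratio = static_cast<double>(garbage) / total;
  return garbage_ratio >= blob_garbage_collection_force_threshold;
}
```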

This existing option can be difficult to use or tune. There are (at least) two challenges:

(1) The occupancy of blob files is not uniform: older blob files tend to have more garbage, so if a service owner has a specific space amp goal, it is far from obvious what value they should set for `blob_garbage_collection_force_threshold`.
(2) BlobDB keeps track of the exact amount of garbage in blob files, which enables us to compute the blob files' "space amp" precisely. Even though it's an exact value, there is a disconnect between this metric and people's expectations regarding space amp. The problem is that while people tend to think of LSM tree space amp as the ratio between the total size of the DB and the total size of the live/current KVs, for the purposes of blob space amp, a blob is only considered garbage once the corresponding blob reference has already been compacted out from the LSM tree. (One could say the LSM tree space amp notion described above is "logical", while the blob one is "physical".)

To make the users' lives easier and solve (1), we would want to add a new configuration option (working title: `blob_garbage_collection_space_amp_limit`) that would enable customers to directly set a space amp target (as opposed to a per-blob-file-batch garbage threshold). To bridge the gap between the above notion of LSM tree space amp and the blob space amp (2), we would want this limit to apply to the entire data structure/database (the LSM tree plus the blob files). Note that this will necessarily be an estimate, since we don't know exactly how much space the obsolete KVs take up in the LSM tree. One simple idea would be to estimate the LSM tree space amp using `VersionStorageInfo::EstimateLiveDataSize`, take its reciprocal, and scale the number of live blob bytes by that factor.

Example: let's say the LSM tree space amp is 1.5, which means that the live KVs take up two thirds of the LSM. Then, we can use the same 2/3 factor to multiply the value of (total blob bytes - garbage blob bytes) to get an estimate of the live blob bytes from the user's perspective.
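A minimal sketch of that arithmetic follows. The parameter names are hypothetical; in RocksDB, `estimated_live_lsm_bytes` would come from `VersionStorageInfo::EstimateLiveDataSize`, and the blob byte counts from the blob file metadata tracked in the MANIFEST:

```cpp
#include <cstdint>

double EstimateOverallSpaceAmp(uint64_t total_lsm_bytes,
                               uint64_t estimated_live_lsm_bytes,
                               uint64_t total_blob_bytes,
                               uint64_t garbage_blob_bytes) {
  if (estimated_live_lsm_bytes == 0) {
    return 0.0;  // no live data; no meaningful space amp to report
  }

  // LSM tree space amp, e.g. 1.5 when live KVs take up 2/3 of the tree.
  const double lsm_space_amp =
      static_cast<double>(total_lsm_bytes) / estimated_live_lsm_bytes;

  // Scale the "physical" live blob bytes (total minus known garbage) by
  // the reciprocal of the LSM space amp to approximate the "logical"
  // live blob bytes from the user's perspective.
  const double live_blob_bytes =
      static_cast<double>(total_blob_bytes - garbage_blob_bytes) /
      lsm_space_amp;

  // Overall space amp across the whole database: LSM tree plus blobs.
  return static_cast<double>(total_lsm_bytes + total_blob_bytes) /
         (static_cast<double>(estimated_live_lsm_bytes) + live_blob_bytes);
}
```

With the 1.5 space amp from the example above, a blob file set with 90 GB total and 30 GB known garbage would count as (90 - 30) / 1.5 = 40 GB of estimated live blob bytes.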

Note: if the above limit is breached, we would still want to do the same thing as in the case of `blob_garbage_collection_force_threshold`, i.e. force-compact the SSTs pointing to the oldest blob files (potentially repeatedly, until the limit is satisfied).
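A sketch of what that enforcement might look like at compaction-picking time; the hook and the `BlobFileLink` structure are invented for illustration:

```cpp
#include <cstdint>
#include <vector>

// Invented for illustration: the SSTs linked to one blob file, per the
// coarse blob-file-to-SST mapping tracked in the MANIFEST.
struct BlobFileLink {
  uint64_t blob_file_number;
  std::vector<uint64_t> linked_sst_file_numbers;
};

// Hypothetical hook run when picking compactions: if the estimated
// overall space amp exceeds the proposed limit, return the SSTs that
// point to the oldest blob file so they can be force-compacted.
// Because the check runs again after each compaction, this naturally
// repeats until the limit is satisfied.
std::vector<uint64_t> PickForcedBlobGcInputs(
    double estimated_overall_space_amp,
    double blob_garbage_collection_space_amp_limit,
    const std::vector<BlobFileLink>& blob_files_oldest_first) {
  if (blob_garbage_collection_space_amp_limit <= 0.0 ||  // 0.0 = disabled
      estimated_overall_space_amp <=
          blob_garbage_collection_space_amp_limit ||
      blob_files_oldest_first.empty()) {
    return {};
  }
  return blob_files_oldest_first.front().linked_sst_file_numbers;
}
```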

  • Added a new option `blob_garbage_collection_space_amp_limit`
  • Added Java and C APIs.
  • Added internal code to enable this space amp target for BlobDB (see the usage sketch after this list).
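A possible usage sketch. Since `blob_garbage_collection_space_amp_limit` is only proposed in this PR, this would compile only against the PR branch; the other options shown are existing RocksDB options:

```cpp
#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;

  // Existing BlobDB options.
  options.enable_blob_files = true;
  options.enable_blob_garbage_collection = true;

  // Proposed option from this PR (not in a released RocksDB): target an
  // overall space amp of at most 2.0 across the LSM tree plus blob
  // files. 0.0 (the default) disables the limit.
  options.blob_garbage_collection_space_amp_limit = 2.0;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s =
      rocksdb::DB::Open(options, "/tmp/blobdb_example", &db);
  if (s.ok()) {
    delete db;
  }
  return 0;
}
```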

This task is part of #10156.

gangliao added 5 commits July 24, 2022 11:00
@gangliao mentioned this pull request Jul 25, 2022
@@ -4,6 +4,7 @@
* Added `prepopulate_blob_cache` to ColumnFamilyOptions. If enabled, prepopulate warm/hot blobs which are already in memory into the blob cache at the time of flush. On a flush, the blob that is in memory (in memtables) gets flushed to the device. If using Direct IO, additional IO is incurred to read this blob back into memory again, which is avoided by enabling this option. This further helps if the workload exhibits high temporal locality, where most of the reads go to recently written data. This also helps in the case of a remote file system, since it involves network traffic and higher latencies.
* Support using a secondary cache with the blob cache. When creating a blob cache, the user can set a secondary blob cache by configuring `secondary_cache` in LRUCacheOptions.
* Charge memory usage of the blob cache when the backing cache of the blob cache and the block cache are different. If an operation reserving memory for the blob cache exceeds the available space left in the block cache at some point (i.e., causing a cache full under `LRUCacheOptions::strict_capacity_limit` = true), creation will fail with `Status::MemoryLimit()`. To opt in to this feature, enable charging `CacheEntryRole::kBlobCache` in `BlockBasedTableOptions::cache_usage_options`.
* Added a new blob garbage collection option `blob_garbage_collection_space_amp_limit` to enable customers to directly set a space amplification target (as opposed to a per-blob-file-batch garbage threshold), supporting an improved/global limit on BlobDB's space amplification. `blob_garbage_collection_space_amp_limit` is set to 0.0 (disabled) by default. To enable this feature, set `blob_garbage_collection_space_amp_limit` to a positive value between 1.0 and 50.0. The lower the value, the more aggressive the garbage collection. This option is only available when `blob_garbage_collection` is enabled, and when set to a valid value it takes precedence over `blob_garbage_collection_force_threshold`.
@gangliao (Contributor, Author) commented:
Not sure about the most suitable upper limit.
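For reference, the sanitization implied by the HISTORY.md entry might look like the following sketch; the [1.0, 50.0] bounds come from the entry above, and the upper bound is exactly the value being questioned here:

```cpp
// Sketch of option validation per the proposed HISTORY.md entry:
// 0.0 disables the limit; otherwise the value must fall in [1.0, 50.0].
bool IsValidBlobGcSpaceAmpLimit(double limit) {
  if (limit == 0.0) {
    return true;  // default: feature disabled
  }
  return limit >= 1.0 && limit <= 50.0;  // 50.0 is the tentative cap
}
```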

@facebook-github-bot (Contributor) commented:
Hi @gangliao!

Thank you for your pull request.

We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@GOGOYAO commented Aug 25, 2022

Looking forward to this PR.

@GOGOYAO commented Sep 1, 2022

Are there no more plans to resolve the CLA checks?

@GOGOYAO commented Sep 6, 2022

@riversand963 @akankshamahajan15 @ltamasi Looking forward to merging this commit.
