Skip to content

Commit

Permalink
New stable, fixed-length cache keys (#9126)
Browse files Browse the repository at this point in the history
Summary:
This change standardizes on a new 16-byte cache key format for
block cache (incl compressed and secondary) and persistent cache (but
not table cache and row cache).

The goal is a really fast cache key with practically ideal stability and
uniqueness properties without external dependencies (e.g. from FileSystem).
A fixed key size of 16 bytes should enable future optimizations to the
concurrent hash table for block cache, which is a heavy CPU user /
bottleneck, but there appears to be measurable performance improvement
even with no changes to LRUCache.

This change replaces a lot of disjointed and ugly code handling cache
keys with calls to a simple, clean new internal API (cache_key.h).
(Preserving the old cache key logic under an option would be very ugly
and likely negate the performance gain of the new approach. Complete
replacement carries some inherent risk, but I think that's acceptable
with sufficient analysis and testing.)

The scheme for encoding new cache keys is complicated but explained
in cache_key.cc.

Also: EndianSwapValue is moved to math.h to be next to other bit
operations. (Explains some new include "math.h".) ReverseBits operation
added and unit tests added to hash_test for both.

Fixes #7405 (presuming a root cause)

Pull Request resolved: #9126

Test Plan:
### Basic correctness
Several tests needed updates to work with the new functionality, mostly
because we are no longer relying on filesystem for stable cache keys
so table builders & readers need more context info to agree on cache
keys. This functionality is so core, a huge number of existing tests
exercise the cache key functionality.

### Performance
Create db with
`TEST_TMPDIR=/dev/shm ./db_bench -bloom_bits=10 -benchmarks=fillrandom -num=3000000 -partition_index_and_filters`
And test performance with
`TEST_TMPDIR=/dev/shm ./db_bench -readonly -use_existing_db -bloom_bits=10 -benchmarks=readrandom -num=3000000 -duration=30 -cache_index_and_filter_blocks -cache_size=250000 -threads=4`
using DEBUG_LEVEL=0 and simultaneous before & after runs.
Before ops/sec, avg over 100 runs: 121924
After ops/sec, avg over 100 runs: 125385 (+2.8%)

### Collision probability
I have built a tool, ./cache_bench -stress_cache_key to broadly simulate host-wide cache activity
over many months, by making some pessimistic simplifying assumptions:
* Every generated file has a cache entry for every byte offset in the file (contiguous range of cache keys)
* All of every file is cached for its entire lifetime

We use a simple table with skewed address assignment and replacement on address collision
to simulate files coming & going, with quite a variance (super-Poisson) in ages. Some output
with `./cache_bench -stress_cache_key -sck_keep_bits=40`:

```
Total cache or DBs size: 32TiB  Writing 925.926 MiB/s or 76.2939TiB/day
Multiply by 9.22337e+18 to correct for simulation losses (but still assume whole file cached)
```

These come from default settings of 2.5M files per day of 32 MB each, and
`-sck_keep_bits=40` means that to represent a single file, we are only keeping 40 bits of
the 128-bit cache key.  With file size of 2\*\*25 contiguous keys (pessimistic), our simulation
is about 2\*\*(128-40-25) or about 9 billion billion times more prone to collision than reality.

More default assumptions, relatively pessimistic:
* 100 DBs in same process (doesn't matter much)
* Re-open DB in same process (new session ID related to old session ID) on average
every 100 files generated
* Restart process (all new session IDs unrelated to old) 24 times per day

After enough data, we get a result at the end:

```
(keep 40 bits)  17 collisions after 2 x 90 days, est 10.5882 days between (9.76592e+19 corrected)
```

If we believe the (pessimistic) simulation and the mathematical generalization, we would need to run a billion machines all for 97 billion days to expect a cache key collision. To help verify that our generalization ("corrected") is robust, we can make our simulation more precise with `-sck_keep_bits=41` and `42`, which takes more running time to get enough data:

```
(keep 41 bits)  16 collisions after 4 x 90 days, est 22.5 days between (1.03763e+20 corrected)
(keep 42 bits)  19 collisions after 10 x 90 days, est 47.3684 days between (1.09224e+20 corrected)
```

The generalized prediction still holds. With the `-sck_randomize` option, we can see that we are beating "random" cache keys (except offsets still non-randomized) by a modest amount (roughly 20x less collision prone than random), which should make us reasonably comfortable even in "degenerate" cases:

```
197 collisions after 1 x 90 days, est 0.456853 days between (4.21372e+18 corrected)
```

I've run other tests to validate other conditions behave as expected, never behaving "worse than random" unless we start chopping off structured data.

Reviewed By: zhichao-cao

Differential Revision: D33171746

Pulled By: pdillinger

fbshipit-source-id: f16a57e369ed37be5e7e33525ace848d0537c88f
  • Loading branch information
pdillinger authored and facebook-github-bot committed Dec 17, 2021
1 parent 9918e1e commit 0050a73
Show file tree
Hide file tree
Showing 36 changed files with 1,009 additions and 433 deletions.
1 change: 1 addition & 0 deletions CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -636,6 +636,7 @@ find_package(Threads REQUIRED)
set(SOURCES
cache/cache.cc
cache/cache_entry_roles.cc
cache/cache_key.cc
cache/cache_reservation_manager.cc
cache/clock_cache.cc
cache/lru_cache.cc
Expand Down
5 changes: 4 additions & 1 deletion HISTORY.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,15 +9,18 @@
* Fixed a bug affecting custom memtable factories which are not registered with the `ObjectRegistry`. The bug could result in failure to save the OPTIONS file.
* Fixed a bug causing two duplicate entries to be appended to a file opened in non-direct mode and tracked by `FaultInjectionTestFS`.
* Fixed a bug in TableOptions.prepopulate_block_cache to support block-based filters also.
* Block cache keys no longer use `FSRandomAccessFile::GetUniqueId()` (previously used when available), so a filesystem recycling unique ids can no longer lead to incorrect result or crash (#7405). For files generated by RocksDB >= 6.24, the cache keys are stable across DB::Open and DB directory move / copy / import / export / migration, etc. Although collisions are still theoretically possible, they are (a) impossible in many common cases, (b) not dependent on environmental factors, and (c) much less likely than a CPU miscalculation while executing RocksDB.

### Behavior Changes
* MemTableList::TrimHistory now use allocated bytes when max_write_buffer_size_to_maintain > 0(default in TrasactionDB, introduced in PR#5022) Fix #8371.

### Public API change
* Extend WriteBatch::AssignTimestamp and AssignTimestamps API so that both functions can accept an optional `checker` argument that performs additional checking on timestamp sizes.
* Introduce a new EventListener callback that will be called upon the end of automatic error recovery.
* Introduce a new EventListener callback that will be called upon the end of automatic error recovery.

### Performance Improvements
* Replaced map property `TableProperties::properties_offsets` with uint64_t property `external_sst_file_global_seqno_offset` to save table properties's memory.
* Block cache accesses are faster by RocksDB using cache keys of fixed size (16 bytes).

### Java API Changes
* Removed Java API `TableProperties.getPropertiesOffsets()` as it exposed internal details to external users.
Expand Down
2 changes: 2 additions & 0 deletions TARGETS
Original file line number Diff line number Diff line change
Expand Up @@ -143,6 +143,7 @@ cpp_library(
srcs = [
"cache/cache.cc",
"cache/cache_entry_roles.cc",
"cache/cache_key.cc",
"cache/cache_reservation_manager.cc",
"cache/clock_cache.cc",
"cache/lru_cache.cc",
Expand Down Expand Up @@ -472,6 +473,7 @@ cpp_library(
srcs = [
"cache/cache.cc",
"cache/cache_entry_roles.cc",
"cache/cache_key.cc",
"cache/cache_reservation_manager.cc",
"cache/clock_cache.cc",
"cache/lru_cache.cc",
Expand Down
221 changes: 221 additions & 0 deletions cache/cache_bench_tool.cc
Original file line number Diff line number Diff line change
Expand Up @@ -5,11 +5,14 @@

#ifdef GFLAGS
#include <cinttypes>
#include <cstddef>
#include <cstdio>
#include <limits>
#include <memory>
#include <set>
#include <sstream>

#include "db/db_impl/db_impl.h"
#include "monitoring/histogram.h"
#include "port/port.h"
#include "rocksdb/cache.h"
Expand All @@ -18,6 +21,8 @@
#include "rocksdb/env.h"
#include "rocksdb/secondary_cache.h"
#include "rocksdb/system_clock.h"
#include "rocksdb/table_properties.h"
#include "table/block_based/block_based_table_reader.h"
#include "table/block_based/cachable_entry.h"
#include "util/coding.h"
#include "util/gflags_compat.h"
Expand Down Expand Up @@ -73,6 +78,36 @@ static class std::shared_ptr<ROCKSDB_NAMESPACE::SecondaryCache> secondary_cache;

DEFINE_bool(use_clock_cache, false, "");

// ## BEGIN stress_cache_key sub-tool options ##
DEFINE_bool(stress_cache_key, false,
"If true, run cache key stress test instead");
DEFINE_uint32(sck_files_per_day, 2500000,
"(-stress_cache_key) Simulated files generated per day");
DEFINE_uint32(sck_duration, 90,
"(-stress_cache_key) Number of days to simulate in each run");
DEFINE_uint32(
sck_min_collision, 15,
"(-stress_cache_key) Keep running until this many collisions seen");
DEFINE_uint32(
sck_file_size_mb, 32,
"(-stress_cache_key) Simulated file size in MiB, for accounting purposes");
DEFINE_uint32(sck_reopen_nfiles, 100,
"(-stress_cache_key) Re-opens DB average every n files");
DEFINE_uint32(
sck_restarts_per_day, 24,
"(-stress_cache_key) Simulated process restarts per day (across DBs)");
DEFINE_uint32(sck_db_count, 100,
"(-stress_cache_key) Parallel DBs in operation");
DEFINE_uint32(sck_table_bits, 20,
"(-stress_cache_key) Log2 number of tracked files");
DEFINE_uint32(sck_keep_bits, 50,
"(-stress_cache_key) Number of cache key bits to keep");
DEFINE_bool(sck_randomize, false,
"(-stress_cache_key) Randomize (hash) cache key");
DEFINE_bool(sck_footer_unique_id, false,
"(-stress_cache_key) Simulate using proposed footer unique id");
// ## END stress_cache_key sub-tool options ##

namespace ROCKSDB_NAMESPACE {

class CacheBench;
Expand Down Expand Up @@ -548,9 +583,195 @@ class CacheBench {
}
};

// TODO: better description (see PR #9126 for some info)
class StressCacheKey {
public:
void Run() {
if (FLAGS_sck_footer_unique_id) {
FLAGS_sck_db_count = 1;
}

uint64_t mb_per_day =
uint64_t{FLAGS_sck_files_per_day} * FLAGS_sck_file_size_mb;
printf("Total cache or DBs size: %gTiB Writing %g MiB/s or %gTiB/day\n",
FLAGS_sck_file_size_mb / 1024.0 / 1024.0 *
std::pow(2.0, FLAGS_sck_table_bits),
mb_per_day / 86400.0, mb_per_day / 1024.0 / 1024.0);
multiplier_ = std::pow(2.0, 128 - FLAGS_sck_keep_bits) /
(FLAGS_sck_file_size_mb * 1024.0 * 1024.0);
printf(
"Multiply by %g to correct for simulation losses (but still assume "
"whole file cached)\n",
multiplier_);
restart_nfiles_ = FLAGS_sck_files_per_day / FLAGS_sck_restarts_per_day;
double without_ejection =
std::pow(1.414214, FLAGS_sck_keep_bits) / FLAGS_sck_files_per_day;
printf(
"Without ejection, expect random collision after %g days (%g "
"corrected)\n",
without_ejection, without_ejection * multiplier_);
double with_full_table =
std::pow(2.0, FLAGS_sck_keep_bits - FLAGS_sck_table_bits) /
FLAGS_sck_files_per_day;
printf(
"With ejection and full table, expect random collision after %g "
"days (%g corrected)\n",
with_full_table, with_full_table * multiplier_);
collisions_ = 0;

for (int i = 1; collisions_ < FLAGS_sck_min_collision; i++) {
RunOnce();
if (collisions_ == 0) {
printf(
"No collisions after %d x %u days "
" \n",
i, FLAGS_sck_duration);
} else {
double est = 1.0 * i * FLAGS_sck_duration / collisions_;
printf("%" PRIu64
" collisions after %d x %u days, est %g days between (%g "
"corrected) \n",
collisions_, i, FLAGS_sck_duration, est, est * multiplier_);
}
}
}

void RunOnce() {
const size_t db_count = FLAGS_sck_db_count;
dbs_.reset(new TableProperties[db_count]{});
const size_t table_mask = (size_t{1} << FLAGS_sck_table_bits) - 1;
table_.reset(new uint64_t[table_mask + 1]{});
if (FLAGS_sck_keep_bits > 64) {
FLAGS_sck_keep_bits = 64;
}
uint32_t shift_away = 64 - FLAGS_sck_keep_bits;
uint32_t shift_away_b = shift_away / 3;
uint32_t shift_away_a = shift_away - shift_away_b;

process_count_ = 0;
session_count_ = 0;
ResetProcess();

Random64 r{std::random_device{}()};

uint64_t max_file_count =
uint64_t{FLAGS_sck_files_per_day} * FLAGS_sck_duration;
uint64_t file_count = 0;
uint32_t report_count = 0;
uint32_t collisions_this_run = 0;
// Round robin through DBs
for (size_t db_i = 0;; ++db_i) {
if (db_i >= db_count) {
db_i = 0;
}
if (file_count >= max_file_count) {
break;
}
if (!FLAGS_sck_footer_unique_id && r.OneIn(FLAGS_sck_reopen_nfiles)) {
ResetSession(db_i);
} else if (r.OneIn(restart_nfiles_)) {
ResetProcess();
}
OffsetableCacheKey ock;
dbs_[db_i].orig_file_number += 1;
// skip some file numbers, unless 1 DB so that that can simulate
// better (DB-independent) unique IDs
if (db_count > 1) {
dbs_[db_i].orig_file_number += (r.Next() & 3);
}
BlockBasedTable::SetupBaseCacheKey(&dbs_[db_i], "", 42, 42, &ock);
CacheKey ck = ock.WithOffset(0);
uint64_t stripped;
if (FLAGS_sck_randomize) {
stripped = GetSliceHash64(ck.AsSlice()) >> shift_away;
} else if (FLAGS_sck_footer_unique_id) {
uint32_t a = DecodeFixed32(ck.AsSlice().data() + 4) >> shift_away_a;
uint32_t b = DecodeFixed32(ck.AsSlice().data() + 12) >> shift_away_b;
stripped = (uint64_t{a} << 32) + b;
} else {
uint32_t a = DecodeFixed32(ck.AsSlice().data()) << shift_away_a;
uint32_t b = DecodeFixed32(ck.AsSlice().data() + 12) >> shift_away_b;
stripped = (uint64_t{a} << 32) + b;
}
if (stripped == 0) {
// Unlikely, but we need to exclude tracking this value
printf("Hit Zero! \n");
continue;
}
file_count++;
uint64_t h = NPHash64(reinterpret_cast<char*>(&stripped), 8);
// Skew lifetimes
size_t pos =
std::min(Lower32of64(h) & table_mask, Upper32of64(h) & table_mask);
if (table_[pos] == stripped) {
collisions_this_run++;
// To predict probability of no collisions, we have to get rid of
// correlated collisions, which this takes care of:
ResetProcess();
} else {
// Replace
table_[pos] = stripped;
}

if (++report_count == FLAGS_sck_files_per_day) {
report_count = 0;
// Estimate fill %
size_t incr = table_mask / 1000;
size_t sampled_count = 0;
for (size_t i = 0; i <= table_mask; i += incr) {
if (table_[i] != 0) {
sampled_count++;
}
}
// Report
printf(
"%" PRIu64 " days, %" PRIu64 " proc, %" PRIu64
" sess, %u coll, occ %g%%, ejected %g%% \r",
file_count / FLAGS_sck_files_per_day, process_count_,
session_count_, collisions_this_run, 100.0 * sampled_count / 1000.0,
100.0 * (1.0 - sampled_count / 1000.0 * table_mask / file_count));
fflush(stdout);
}
}
collisions_ += collisions_this_run;
}

void ResetSession(size_t i) {
dbs_[i].db_session_id = DBImpl::GenerateDbSessionId(nullptr);
session_count_++;
}

void ResetProcess() {
process_count_++;
DBImpl::TEST_ResetDbSessionIdGen();
for (size_t i = 0; i < FLAGS_sck_db_count; ++i) {
ResetSession(i);
}
if (FLAGS_sck_footer_unique_id) {
dbs_[0].orig_file_number = 0;
}
}

private:
// Use db_session_id and orig_file_number from TableProperties
std::unique_ptr<TableProperties[]> dbs_;
std::unique_ptr<uint64_t[]> table_;
uint64_t process_count_ = 0;
uint64_t session_count_ = 0;
uint64_t collisions_ = 0;
uint32_t restart_nfiles_ = 0;
double multiplier_ = 0.0;
};

int cache_bench_tool(int argc, char** argv) {
ParseCommandLineFlags(&argc, &argv, true);

if (FLAGS_stress_cache_key) {
// Alternate tool
StressCacheKey().Run();
return 0;
}

if (FLAGS_threads <= 0) {
fprintf(stderr, "threads number <= 0\n");
exit(1);
Expand Down
16 changes: 9 additions & 7 deletions cache/cache_entry_stats.h
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,7 @@
#include <mutex>

#include "cache/cache_helpers.h"
#include "cache/cache_key.h"
#include "port/lang.h"
#include "rocksdb/cache.h"
#include "rocksdb/status.h"
Expand Down Expand Up @@ -112,13 +113,7 @@ class CacheEntryStatsCollector {
// entry in cache until all refs are destroyed.
static Status GetShared(Cache *cache, SystemClock *clock,
std::shared_ptr<CacheEntryStatsCollector> *ptr) {
std::array<uint64_t, 3> cache_key_data{
{// First 16 bytes == md5 of class name
0x7eba5a8fb5437c90U, 0x8ca68c9b11655855U,
// Last 8 bytes based on a function pointer to make unique for each
// template instantiation
reinterpret_cast<uint64_t>(&CacheEntryStatsCollector::GetShared)}};
Slice cache_key = GetSlice(&cache_key_data);
const Slice &cache_key = GetCacheKey();

Cache::Handle *h = cache->Lookup(cache_key);
if (h == nullptr) {
Expand Down Expand Up @@ -166,6 +161,13 @@ class CacheEntryStatsCollector {
delete static_cast<CacheEntryStatsCollector *>(value);
}

static const Slice &GetCacheKey() {
// For each template instantiation
static CacheKey ckey = CacheKey::CreateUniqueForProcessLifetime();
static Slice ckey_slice = ckey.AsSlice();
return ckey_slice;
}

std::mutex saved_mutex_;
Stats saved_stats_;

Expand Down
Loading

0 comments on commit 0050a73

Please sign in to comment.