New stable, fixed-length cache keys (#9126)

Summary:
This change standardizes on a new 16-byte cache key format for block cache (incl compressed and secondary) and persistent cache (but not table cache and row cache).

The goal is a really fast cache key with practically ideal stability and uniqueness properties without external dependencies (e.g. from FileSystem). A fixed key size of 16 bytes should enable future optimizations to the concurrent hash table for block cache, which is a heavy CPU user / bottleneck, but there appears to be measurable performance improvement even with no changes to LRUCache.

This change replaces a lot of disjointed and ugly code handling cache keys with calls to a simple, clean new internal API (cache_key.h). (Preserving the old cache key logic under an option would be very ugly and likely negate the performance gain of the new approach. Complete replacement carries some inherent risk, but I think that's acceptable with sufficient analysis and testing.)

The scheme for encoding new cache keys is complicated but explained in cache_key.cc.

Also:
* EndianSwapValue is moved to math.h to be next to other bit operations. (Explains some new include "math.h".)
* ReverseBits operation added and unit tests added to hash_test for both.

Fixes #7405 (presuming a root cause)

Pull Request resolved: #9126

Test Plan:
### Basic correctness
Several tests needed updates to work with the new functionality, mostly because we are no longer relying on filesystem for stable cache keys so table builders & readers need more context info to agree on cache keys. This functionality is so core, a huge number of existing tests exercise the cache key functionality.

### Performance
Create db with
`TEST_TMPDIR=/dev/shm ./db_bench -bloom_bits=10 -benchmarks=fillrandom -num=3000000 -partition_index_and_filters`
And test performance with
`TEST_TMPDIR=/dev/shm ./db_bench -readonly -use_existing_db -bloom_bits=10 -benchmarks=readrandom -num=3000000 -duration=30 -cache_index_and_filter_blocks -cache_size=250000 -threads=4`
using DEBUG_LEVEL=0 and simultaneous before & after runs.
Before ops/sec, avg over 100 runs: 121924
After ops/sec, avg over 100 runs: 125385 (+2.8%)

### Collision probability
I have built a tool, `./cache_bench -stress_cache_key` to broadly simulate host-wide cache activity over many months, by making some pessimistic simplifying assumptions:
* Every generated file has a cache entry for every byte offset in the file (contiguous range of cache keys)
* All of every file is cached for its entire lifetime

We use a simple table with skewed address assignment and replacement on address collision to simulate files coming & going, with quite a variance (super-Poisson) in ages. Some output with `./cache_bench -stress_cache_key -sck_keep_bits=40`:

```
Total cache or DBs size: 32TiB  Writing 925.926 MiB/s or 76.2939TiB/day
Multiply by 9.22337e+18 to correct for simulation losses (but still assume whole file cached)
```

These come from default settings of 2.5M files per day of 32 MB each, and `-sck_keep_bits=40` means that to represent a single file, we are only keeping 40 bits of the 128-bit cache key. With file size of 2**25 contiguous keys (pessimistic), our simulation is about 2**(128-40-25) or about 9 billion billion times more prone to collision than reality.

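The figures in that output can be reproduced by hand. The following is a minimal sketch of the arithmetic, assuming the default settings quoted above (2.5M files per day, 32 MiB per file, 40 kept bits); the helper names are hypothetical and are not part of RocksDB or `cache_bench`.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helpers for illustration only; not RocksDB APIs.

// Factor by which the simulation overstates collision risk. It keeps only
// `keep_bits` of the 128-bit cache key, and it assumes one cache key per
// byte offset in each file (file_size_mib * 2^20 contiguous keys).
double SimulationCorrection(uint32_t keep_bits, uint32_t file_size_mib) {
  assert(keep_bits <= 128);
  assert(file_size_mib > 0);
  const double keys_per_file = file_size_mib * 1024.0 * 1024.0;
  return std::pow(2.0, 128.0 - keep_bits) / keys_per_file;
}

// Simulated write rate in MiB per second.
double WriteRateMiBPerSec(uint32_t files_per_day, uint32_t file_size_mib) {
  // Widen before multiplying so large settings cannot overflow 32 bits.
  const uint64_t mib_per_day = uint64_t{files_per_day} * file_size_mib;
  return mib_per_day / 86400.0;
}

// Simulated write rate in TiB per day.
double WriteRateTiBPerDay(uint32_t files_per_day, uint32_t file_size_mib) {
  const uint64_t mib_per_day = uint64_t{files_per_day} * file_size_mib;
  return mib_per_day / 1024.0 / 1024.0;
}
```

With these assumptions, `SimulationCorrection(40, 32)` evaluates to 2**63, which is the "9 billion billion" factor, and the two rate helpers give the MiB/s and TiB/day values printed on the first output line.
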
More default assumptions, relatively pessimistic:
* 100 DBs in same process (doesn't matter much)
* Re-open DB in same process (new session ID related to old session ID) on average every 100 files generated
* Restart process (all new session IDs unrelated to old) 24 times per day

After enough data, we get a result at the end:

```
(keep 40 bits)  17 collisions after 2 x 90 days, est 10.5882 days between (9.76592e+19 corrected)
```

If we believe the (pessimistic) simulation and the mathematical generalization, we would need to run a billion machines all for 97 billion days to expect a cache key collision. To help verify that our generalization ("corrected") is robust, we can make our simulation more precise with `-sck_keep_bits=41` and `42`, which takes more running time to get enough data:

```
(keep 41 bits)  16 collisions after 4 x 90 days, est 22.5 days between (1.03763e+20 corrected)
(keep 42 bits)  19 collisions after 10 x 90 days, est 47.3684 days between (1.09224e+20 corrected)
```

The generalized prediction still holds. With the `-sck_randomize` option, we can see that we are beating "random" cache keys (except offsets still non-randomized) by a modest amount (roughly 20x less collision prone than random), which should make us reasonably comfortable even in "degenerate" cases:

```
197 collisions after 1 x 90 days, est 0.456853 days between (4.21372e+18 corrected)
```

I've run other tests to validate other conditions behave as expected, never behaving "worse than random" unless we start chopping off structured data.

Reviewed By: zhichao-cao

Differential Revision: D33171746

Pulled By: pdillinger

fbshipit-source-id: f16a57e369ed37be5e7e33525ace848d0537c88f
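
The "est ... days between" and "corrected" values in the result lines of the test plan follow from simple arithmetic on the collision counts. Below is a minimal sketch of that calculation, assuming the 32 MiB default file size; the function names are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helpers for illustration only; not RocksDB APIs.

// Average simulated days between collisions, given the number of completed
// runs, the simulated days per run, and the total collisions observed.
double EstDaysBetweenCollisions(uint32_t runs, uint32_t days_per_run,
                                uint64_t collisions) {
  assert(collisions > 0);
  return 1.0 * runs * days_per_run / collisions;
}

// Scales a simulated estimate up to the full 128-bit key space, still
// assuming every byte offset of every file is cached.
double CorrectedDays(double est_days, uint32_t keep_bits,
                     uint32_t file_size_mib) {
  assert(keep_bits <= 128);
  assert(file_size_mib > 0);
  const double multiplier = std::pow(2.0, 128.0 - keep_bits) /
                            (file_size_mib * 1024.0 * 1024.0);
  return est_days * multiplier;
}
```

For the 40-bit run, `EstDaysBetweenCollisions(2, 90, 17)` is about 10.5882, and passing that to `CorrectedDays(..., 40, 32)` gives about 9.77e+19 days. Dividing by a billion machines leaves roughly 97 billion days, which matches the claim in the message.
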
facebook · Dec 17, 2021 · 0050a73
1 parent 9918e1e
commit 0050a73

Showing 36 changed files with 1,009 additions and 433 deletions.
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -636,6 +636,7 @@ find_package(Threads REQUIRED)
 set(SOURCES
         cache/cache.cc
         cache/cache_entry_roles.cc
+        cache/cache_key.cc
         cache/cache_reservation_manager.cc
         cache/clock_cache.cc
         cache/lru_cache.cc

diff --git a/HISTORY.md b/HISTORY.md
@@ -9,15 +9,18 @@
 * Fixed a bug affecting custom memtable factories which are not registered with the `ObjectRegistry`. The bug could result in failure to save the OPTIONS file.
 * Fixed a bug causing two duplicate entries to be appended to a file opened in non-direct mode and tracked by `FaultInjectionTestFS`.
 * Fixed a bug in TableOptions.prepopulate_block_cache to support block-based filters also.
+* Block cache keys no longer use `FSRandomAccessFile::GetUniqueId()` (previously used when available), so a filesystem recycling unique ids can no longer lead to incorrect result or crash (#7405). For files generated by RocksDB >= 6.24, the cache keys are stable across DB::Open and DB directory move / copy / import / export / migration, etc. Although collisions are still theoretically possible, they are (a) impossible in many common cases, (b) not dependent on environmental factors, and (c) much less likely than a CPU miscalculation while executing RocksDB.
 
 ### Behavior Changes
 * MemTableList::TrimHistory now use allocated bytes when max_write_buffer_size_to_maintain > 0(default in TrasactionDB, introduced in PR#5022) Fix #8371.
+
 ### Public API change
 * Extend WriteBatch::AssignTimestamp and AssignTimestamps API so that both functions can accept an optional `checker` argument that performs additional checking on timestamp sizes.
-* Introduce a new EventListener callback that will be called upon the end of automatic error recovery. 
+* Introduce a new EventListener callback that will be called upon the end of automatic error recovery.
 
 ### Performance Improvements
 * Replaced map property `TableProperties::properties_offsets`  with uint64_t property `external_sst_file_global_seqno_offset` to save table properties's memory.
+* Block cache accesses are faster by RocksDB using cache keys of fixed size (16 bytes).
 
 ### Java API Changes
 * Removed Java API `TableProperties.getPropertiesOffsets()` as it exposed internal details to external users.

diff --git a/TARGETS b/TARGETS
@@ -143,6 +143,7 @@ cpp_library(
     srcs = [
         "cache/cache.cc",
         "cache/cache_entry_roles.cc",
+        "cache/cache_key.cc",
         "cache/cache_reservation_manager.cc",
         "cache/clock_cache.cc",
         "cache/lru_cache.cc",
@@ -472,6 +473,7 @@ cpp_library(
     srcs = [
         "cache/cache.cc",
         "cache/cache_entry_roles.cc",
+        "cache/cache_key.cc",
         "cache/cache_reservation_manager.cc",
         "cache/clock_cache.cc",
         "cache/lru_cache.cc",

diff --git a/cache/cache_bench_tool.cc b/cache/cache_bench_tool.cc
@@ -5,11 +5,14 @@
 
 #ifdef GFLAGS
 #include <cinttypes>
+#include <cstddef>
 #include <cstdio>
 #include <limits>
+#include <memory>
 #include <set>
 #include <sstream>
 
+#include "db/db_impl/db_impl.h"
 #include "monitoring/histogram.h"
 #include "port/port.h"
 #include "rocksdb/cache.h"
@@ -18,6 +21,8 @@
 #include "rocksdb/env.h"
 #include "rocksdb/secondary_cache.h"
 #include "rocksdb/system_clock.h"
+#include "rocksdb/table_properties.h"
+#include "table/block_based/block_based_table_reader.h"
 #include "table/block_based/cachable_entry.h"
 #include "util/coding.h"
 #include "util/gflags_compat.h"
@@ -73,6 +78,36 @@ static class std::shared_ptr<ROCKSDB_NAMESPACE::SecondaryCache> secondary_cache;
 
 DEFINE_bool(use_clock_cache, false, "");
 
+// ## BEGIN stress_cache_key sub-tool options ##
+DEFINE_bool(stress_cache_key, false,
+            "If true, run cache key stress test instead");
+DEFINE_uint32(sck_files_per_day, 2500000,
+              "(-stress_cache_key) Simulated files generated per day");
+DEFINE_uint32(sck_duration, 90,
+              "(-stress_cache_key) Number of days to simulate in each run");
+DEFINE_uint32(
+    sck_min_collision, 15,
+    "(-stress_cache_key) Keep running until this many collisions seen");
+DEFINE_uint32(
+    sck_file_size_mb, 32,
+    "(-stress_cache_key) Simulated file size in MiB, for accounting purposes");
+DEFINE_uint32(sck_reopen_nfiles, 100,
+              "(-stress_cache_key) Re-opens DB average every n files");
+DEFINE_uint32(
+    sck_restarts_per_day, 24,
+    "(-stress_cache_key) Simulated process restarts per day (across DBs)");
+DEFINE_uint32(sck_db_count, 100,
+              "(-stress_cache_key) Parallel DBs in operation");
+DEFINE_uint32(sck_table_bits, 20,
+              "(-stress_cache_key) Log2 number of tracked files");
+DEFINE_uint32(sck_keep_bits, 50,
+              "(-stress_cache_key) Number of cache key bits to keep");
+DEFINE_bool(sck_randomize, false,
+            "(-stress_cache_key) Randomize (hash) cache key");
+DEFINE_bool(sck_footer_unique_id, false,
+            "(-stress_cache_key) Simulate using proposed footer unique id");
+// ## END stress_cache_key sub-tool options ##
+
 namespace ROCKSDB_NAMESPACE {
 
 class CacheBench;
@@ -548,9 +583,195 @@ class CacheBench {
   }
 };
 
+// TODO: better description (see PR #9126 for some info)
+class StressCacheKey {
+ public:
+  void Run() {
+    if (FLAGS_sck_footer_unique_id) {
+      FLAGS_sck_db_count = 1;
+    }
+
+    uint64_t mb_per_day =
+        uint64_t{FLAGS_sck_files_per_day} * FLAGS_sck_file_size_mb;
+    printf("Total cache or DBs size: %gTiB  Writing %g MiB/s or %gTiB/day\n",
+           FLAGS_sck_file_size_mb / 1024.0 / 1024.0 *
+               std::pow(2.0, FLAGS_sck_table_bits),
+           mb_per_day / 86400.0, mb_per_day / 1024.0 / 1024.0);
+    multiplier_ = std::pow(2.0, 128 - FLAGS_sck_keep_bits) /
+                  (FLAGS_sck_file_size_mb * 1024.0 * 1024.0);
+    printf(
+        "Multiply by %g to correct for simulation losses (but still assume "
+        "whole file cached)\n",
+        multiplier_);
+    restart_nfiles_ = FLAGS_sck_files_per_day / FLAGS_sck_restarts_per_day;
+    double without_ejection =
+        std::pow(1.414214, FLAGS_sck_keep_bits) / FLAGS_sck_files_per_day;
+    printf(
+        "Without ejection, expect random collision after %g days (%g "
+        "corrected)\n",
+        without_ejection, without_ejection * multiplier_);
+    double with_full_table =
+        std::pow(2.0, FLAGS_sck_keep_bits - FLAGS_sck_table_bits) /
+        FLAGS_sck_files_per_day;
+    printf(
+        "With ejection and full table, expect random collision after %g "
+        "days (%g corrected)\n",
+        with_full_table, with_full_table * multiplier_);
+    collisions_ = 0;
+
+    for (int i = 1; collisions_ < FLAGS_sck_min_collision; i++) {
+      RunOnce();
+      if (collisions_ == 0) {
+        printf(
+            "No collisions after %d x %u days                              "
+            "                   \n",
+            i, FLAGS_sck_duration);
+      } else {
+        double est = 1.0 * i * FLAGS_sck_duration / collisions_;
+        printf("%" PRIu64
+               " collisions after %d x %u days, est %g days between (%g "
+               "corrected)        \n",
+               collisions_, i, FLAGS_sck_duration, est, est * multiplier_);
+      }
+    }
+  }
+
+  void RunOnce() {
+    const size_t db_count = FLAGS_sck_db_count;
+    dbs_.reset(new TableProperties[db_count]{});
+    const size_t table_mask = (size_t{1} << FLAGS_sck_table_bits) - 1;
+    table_.reset(new uint64_t[table_mask + 1]{});
+    if (FLAGS_sck_keep_bits > 64) {
+      FLAGS_sck_keep_bits = 64;
+    }
+    uint32_t shift_away = 64 - FLAGS_sck_keep_bits;
+    uint32_t shift_away_b = shift_away / 3;
+    uint32_t shift_away_a = shift_away - shift_away_b;
+
+    process_count_ = 0;
+    session_count_ = 0;
+    ResetProcess();
+
+    Random64 r{std::random_device{}()};
+
+    uint64_t max_file_count =
+        uint64_t{FLAGS_sck_files_per_day} * FLAGS_sck_duration;
+    uint64_t file_count = 0;
+    uint32_t report_count = 0;
+    uint32_t collisions_this_run = 0;
+    // Round robin through DBs
+    for (size_t db_i = 0;; ++db_i) {
+      if (db_i >= db_count) {
+        db_i = 0;
+      }
+      if (file_count >= max_file_count) {
+        break;
+      }
+      if (!FLAGS_sck_footer_unique_id && r.OneIn(FLAGS_sck_reopen_nfiles)) {
+        ResetSession(db_i);
+      } else if (r.OneIn(restart_nfiles_)) {
+        ResetProcess();
+      }
+      OffsetableCacheKey ock;
+      dbs_[db_i].orig_file_number += 1;
+      // skip some file numbers, unless 1 DB so that that can simulate
+      // better (DB-independent) unique IDs
+      if (db_count > 1) {
+        dbs_[db_i].orig_file_number += (r.Next() & 3);
+      }
+      BlockBasedTable::SetupBaseCacheKey(&dbs_[db_i], "", 42, 42, &ock);
+      CacheKey ck = ock.WithOffset(0);
+      uint64_t stripped;
+      if (FLAGS_sck_randomize) {
+        stripped = GetSliceHash64(ck.AsSlice()) >> shift_away;
+      } else if (FLAGS_sck_footer_unique_id) {
+        uint32_t a = DecodeFixed32(ck.AsSlice().data() + 4) >> shift_away_a;
+        uint32_t b = DecodeFixed32(ck.AsSlice().data() + 12) >> shift_away_b;
+        stripped = (uint64_t{a} << 32) + b;
+      } else {
+        uint32_t a = DecodeFixed32(ck.AsSlice().data()) << shift_away_a;
+        uint32_t b = DecodeFixed32(ck.AsSlice().data() + 12) >> shift_away_b;
+        stripped = (uint64_t{a} << 32) + b;
+      }
+      if (stripped == 0) {
+        // Unlikely, but we need to exclude tracking this value
+        printf("Hit Zero!                                                  \n");
+        continue;
+      }
+      file_count++;
+      uint64_t h = NPHash64(reinterpret_cast<char*>(&stripped), 8);
+      // Skew lifetimes
+      size_t pos =
+          std::min(Lower32of64(h) & table_mask, Upper32of64(h) & table_mask);
+      if (table_[pos] == stripped) {
+        collisions_this_run++;
+        // To predict probability of no collisions, we have to get rid of
+        // correlated collisions, which this takes care of:
+        ResetProcess();
+      } else {
+        // Replace
+        table_[pos] = stripped;
+      }
+
+      if (++report_count == FLAGS_sck_files_per_day) {
+        report_count = 0;
+        // Estimate fill %
+        size_t incr = table_mask / 1000;
+        size_t sampled_count = 0;
+        for (size_t i = 0; i <= table_mask; i += incr) {
+          if (table_[i] != 0) {
+            sampled_count++;
+          }
+        }
+        // Report
+        printf(
+            "%" PRIu64 " days, %" PRIu64 " proc, %" PRIu64
+            " sess, %u coll, occ %g%%, ejected %g%%   \r",
+            file_count / FLAGS_sck_files_per_day, process_count_,
+            session_count_, collisions_this_run, 100.0 * sampled_count / 1000.0,
+            100.0 * (1.0 - sampled_count / 1000.0 * table_mask / file_count));
+        fflush(stdout);
+      }
+    }
+    collisions_ += collisions_this_run;
+  }
+
+  void ResetSession(size_t i) {
+    dbs_[i].db_session_id = DBImpl::GenerateDbSessionId(nullptr);
+    session_count_++;
+  }
+
+  void ResetProcess() {
+    process_count_++;
+    DBImpl::TEST_ResetDbSessionIdGen();
+    for (size_t i = 0; i < FLAGS_sck_db_count; ++i) {
+      ResetSession(i);
+    }
+    if (FLAGS_sck_footer_unique_id) {
+      dbs_[0].orig_file_number = 0;
+    }
+  }
+
+ private:
+  // Use db_session_id and orig_file_number from TableProperties
+  std::unique_ptr<TableProperties[]> dbs_;
+  std::unique_ptr<uint64_t[]> table_;
+  uint64_t process_count_ = 0;
+  uint64_t session_count_ = 0;
+  uint64_t collisions_ = 0;
+  uint32_t restart_nfiles_ = 0;
+  double multiplier_ = 0.0;
+};
+
 int cache_bench_tool(int argc, char** argv) {
   ParseCommandLineFlags(&argc, &argv, true);
 
+  if (FLAGS_stress_cache_key) {
+    // Alternate tool
+    StressCacheKey().Run();
+    return 0;
+  }
+
   if (FLAGS_threads <= 0) {
     fprintf(stderr, "threads number <= 0\n");
     exit(1);

diff --git a/cache/cache_entry_stats.h b/cache/cache_entry_stats.h
@@ -11,6 +11,7 @@
 #include <mutex>
 
 #include "cache/cache_helpers.h"
+#include "cache/cache_key.h"
 #include "port/lang.h"
 #include "rocksdb/cache.h"
 #include "rocksdb/status.h"
@@ -112,13 +113,7 @@ class CacheEntryStatsCollector {
   // entry in cache until all refs are destroyed.
   static Status GetShared(Cache *cache, SystemClock *clock,
                           std::shared_ptr<CacheEntryStatsCollector> *ptr) {
-    std::array<uint64_t, 3> cache_key_data{
-        {// First 16 bytes == md5 of class name
-         0x7eba5a8fb5437c90U, 0x8ca68c9b11655855U,
-         // Last 8 bytes based on a function pointer to make unique for each
-         // template instantiation
-         reinterpret_cast<uint64_t>(&CacheEntryStatsCollector::GetShared)}};
-    Slice cache_key = GetSlice(&cache_key_data);
+    const Slice &cache_key = GetCacheKey();
 
     Cache::Handle *h = cache->Lookup(cache_key);
     if (h == nullptr) {
@@ -166,6 +161,13 @@ class CacheEntryStatsCollector {
     delete static_cast<CacheEntryStatsCollector *>(value);
   }
 
+  static const Slice &GetCacheKey() {
+    // For each template instantiation
+    static CacheKey ckey = CacheKey::CreateUniqueForProcessLifetime();
+    static Slice ckey_slice = ckey.AsSlice();
+    return ckey_slice;
+  }
+
   std::mutex saved_mutex_;
   Stats saved_stats_;