Tombstones purge from hashtable: theory and practice
Given: a hashtable playing the role of a fixed-size LRU cache: on inserting a new entry, the oldest one is removed. DHash and QHash leave so-called tombstones in place of removed entries (such slots are considered free for insertion, but full for lookup queries and removals). The hashtable therefore gets more and more polluted: free slots slowly run out. An insertion can displace either a free slot (the total free slot count decreases) or a tombstone (the total free slot count doesn't change). The effective load of the hashtable for lookup queries is `(capacity - free slots) / capacity`, not `size / capacity`, so queries become more and more expensive. At some point we should rehash the table to purge it of tombstones.
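For instance, a minimal sketch of the difference between the nominal and the effective load (hypothetical names, not part of DHash/QHash):

```java
// Hypothetical illustration of nominal vs. effective load; not part of DHash/QHash.
public class LoadExample {
    public static void main(String[] args) {
        int capacity = 1024;
        int size = 512;        // live entries (fixed for the LRU cache)
        int tombstones = 256;  // slots left behind by removals
        int freeSlots = capacity - size - tombstones;

        double nominalLoad = (double) size / capacity;                     // 0.5
        double effectiveLoad = (double) (capacity - freeSlots) / capacity; // 0.75

        System.out.printf("nominal load = %.2f, effective load = %.2f%n",
                nominalLoad, effectiveLoad);
    }
}
```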
The problem: at what effective table load should we rehash, depending on the hashtable's target load (the load right after a rehash)?
All the following reasoning applies to the DHash model; the QHash model, however, is not very different.
Let:

- `c` — the hashtable's capacity
- `s` — the size (fixed)
- `l` — the target (and initial) hashtable load, `l = s / c`
- `t` — the current number of tombstone slots
- `f` — the current number of free slots; `c = s + t + f`
- `irp` — the total number of "insertion-removal pairs" since the previous rehash or hashtable construction
The probability of displacing a free slot on insertion of a new entry is `(f / (t + f))^2`.
Solving the differential equation `dt/d(irp) = (f / (t + f))^2 = ((c - s - t) / (c - s))^2` with `t(0) = 0`, we have `t(irp) = (c - s) * irp / (c - s + irp)`. If `irp = 0`, then `t = 0`; as `irp → ∞`, `t → c - s` and `f → 0`.
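The closed form can be sanity-checked with a small simulation of the free-slot/tombstone dynamics described above (a rough sketch of the DHash model, not the benchmark used below):

```java
import java.util.concurrent.ThreadLocalRandom;

// Simulates the free-slot / tombstone dynamics described above and compares the
// simulated tombstone count with the closed form t(irp) = (c - s) * irp / (c - s + irp).
public class TombstoneGrowth {
    public static void main(String[] args) {
        int c = 1 << 20;          // capacity
        int s = c / 2;            // size, load l = 0.5
        int t = 0;                // tombstone slots
        int f = c - s;            // free slots
        ThreadLocalRandom rnd = ThreadLocalRandom.current();

        for (int irp = 1; irp <= 5 * s; irp++) {
            // The insertion lands in a free slot with probability (f / (t + f))^2
            // (DHash model); otherwise it reuses a tombstone slot.
            double pFree = (double) f / (t + f);
            if (rnd.nextDouble() < pFree * pFree) {
                f--;   // the insertion consumed a free slot...
                t++;   // ...and the paired removal left a tombstone
            }
            // otherwise t and f are unchanged: -1 tombstone on insert, +1 on remove

            if (irp % s == 0) {
                double predicted = (double) (c - s) * irp / (c - s + irp);
                System.out.printf("irp = %d: simulated t = %d, predicted t = %.0f%n",
                        irp, t, predicted);
            }
        }
    }
}
```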
Suppose `irp_rehash = k * s`, i. e. our strategy is to rehash every `size * k` insertion-removal pairs. Let `rl = (s + t_rehash) / c` be the "rehash load": the load at which a rehash will occur with our strategy. Then `rl = (l * (k - l + 1)) / (k * l - l + 1)`.
E. g. if `k = 1`, i. e. our strategy is to rehash every `s` insertion-removal pairs, then `rl = 2 * l - l^2`. E. g. if `l = 0.5`, then `rl = 0.75`.
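The rehash-load formula is easy to play with in code (a sketch; the helper names are mine, not part of DHash/QHash):

```java
// Rehash load reached by the "rehash every k * size insertion-removal pairs" strategy,
// per the formula above: rl = l * (k - l + 1) / (k * l - l + 1).
public class RehashLoad {
    static double rehashLoad(double l, double k) {
        return l * (k - l + 1) / (k * l - l + 1);
    }

    public static void main(String[] args) {
        System.out.println(rehashLoad(0.5, 1.0)); // 0.75, as in the example above
        System.out.println(rehashLoad(0.5, 2.5)); // ~0.86, see the table below
    }
}
```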
I made a series of experiments using this benchmark to determine the actually optimal rehash loads for DHash and QHash.
The header column contains the hashtable target loads.
The first two columns show the expected effective loads at rehash if we use the proposed strategy of rehashing every `k * s` insertion-removal pairs; `k = 2.5` seems to produce the closest approximation to the optimal rehash loads.
The 3rd and 4th columns contain the actually optimal effective loads to rehash at, for the given target loads, for DHash and QHash respectively.
| l | Strategy, k = 1 | Strategy, k = 2.5 | DHash, optimal | QHash, optimal |
|---|---|---|---|---|
| .3 | .51 | .66 | .72-.76 | .75-.8 |
| .4 | .64 | .775 | .83 | .81-.85 |
| .5 | .75 | .86 | .84-.87 | .86-.88 |
| .6 | .84 | .915 | .88-.9 | .9 |
| .7 | .91 | .955 | .93-.94 | .93-.95 |
| .8 | .96 | .98 | .955 | .96 |
| .9 | .99 | .995 | .985 | .985 |
Apparently the proposed strategy leads to too frequent purges for low target loads (sparse hashtables) and too rare purges for high loads (dense hashtables).
An approximation of the DHash optimal rehash loads, fitted directly to the 3rd column of the table above: `rl = 0.54 + 0.816 * l - 0.363 * l^2`.
For QHash (4th column): `rl = 0.6 + 0.671 * l - 0.274 * l^2`.
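The two fits can be expressed directly as functions (hypothetical helpers; the coefficients are the ones above):

```java
// Fitted approximations of the optimal rehash loads from the table above.
public class OptimalRehashLoad {
    static double dHashOptimalRehashLoad(double l) {
        return 0.54 + 0.816 * l - 0.363 * l * l;
    }

    static double qHashOptimalRehashLoad(double l) {
        return 0.6 + 0.671 * l - 0.274 * l * l;
    }

    public static void main(String[] args) {
        for (int i = 3; i <= 9; i++) {
            double l = i / 10.0; // target load
            System.out.printf("l = %.1f: DHash rl ~ %.3f, QHash rl ~ %.3f%n",
                    l, dHashOptimalRehashLoad(l), qHashOptimalRehashLoad(l));
        }
    }
}
```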
In hashtables, the cost of any query consists of a constant base cost plus an additional cost multiplied by the number of probes beyond the first. I. e. if a query takes 3 probes, its cost is the base cost plus 2 × the additional cost. Of course, the base and additional costs differ between query types (lookup, insertion, removal).
At low loads, most queries take only one probe; at very high loads the average probe count per query rises to 10-20 and more.
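A minimal sketch of this cost model (hypothetical names, not part of DHash/QHash):

```java
// Hypothetical illustration of the query cost model described above:
// cost = baseCost + (probes - 1) * additionalCost.
public class QueryCostModel {
    static double queryCost(double baseCost, double additionalCost, int probes) {
        return baseCost + (probes - 1) * additionalCost;
    }

    public static void main(String[] args) {
        // A query that takes 3 probes costs the base cost plus 2 * the additional cost.
        System.out.println(queryCost(10.0, 4.0, 3)); // 18.0
    }
}
```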
During a rehash, entries are reinserted into a newly created clean table, therefore even at high loads at least half of the entries are reinserted at the base cost (the reinsertion takes only one probe), while all ordinary queries (including insertions and removals) take many additional probes. At low loads, regular queries and reinsertions both take mostly one probe.
That is why at high loads a rehash (reinsertion of every entry) is relatively cheaper than at low loads. But the proposed strategy doesn't account for the fact that the ordinary query cost / reinsertion cost ratio changes depending on the target load.
Why are the DHash optimal rehash loads lower than QHash's?
It definitely means that QHash reinsertions are more expensive relative to ordinary queries than DHash reinsertions are. But I don't understand why; I expected the opposite, because:
- DHash base costs include two expensive integer divisions, while QHash's include only one. The additional costs of both algorithms don't include integer divisions and are pretty similar. So DHash base costs are more expensive relative to the additional costs than QHash base costs are.
- QHash average probe counts are 10-28% higher than DHash's for typical loads.
Therefore QHash reinsertions should be relatively cheaper, compared to ordinary queries, than DHash reinsertions.
TODO: benchmark DHash vs. QHash purge exclusively.
In real practice, besides insertions and removals there are many lookups, making the rehash cost relatively less significant; therefore the optimal effective loads to rehash at should be lower than in the 3rd and 4th columns of the table above.
On the other hand, an LRU cache keeps track of the oldest entry's slot, making removals very cheap (while in my benchmark removals are ordinary ones).
I failed to obtain, with any precision, optimal rehash loads that account for a share of plain lookups along with insertions and removals, because the measurement results vary too much from run to run. Estimation by sight: `rl_consider_lookups = rl - 0.05 * (1 - l)`. With it, the approximation of the optimal rehash loads
for DHash is `rl = 0.49 + 0.866 * l - 0.363 * l^2`,
for QHash: `rl = 0.55 + 0.721 * l - 0.274 * l^2`.
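Putting this together, the lookup-adjusted fits could drive a purge decision roughly like this (a sketch under the assumptions above; `shouldRehashDHash` is a hypothetical helper, not an actual DHash/QHash method):

```java
// Lookup-adjusted approximations of the optimal rehash loads (the fits above),
// and a hypothetical check for whether the table should be purged now.
public class RehashDecision {
    static double dHashRehashLoadWithLookups(double targetLoad) {
        return 0.49 + 0.866 * targetLoad - 0.363 * targetLoad * targetLoad;
    }

    static double qHashRehashLoadWithLookups(double targetLoad) {
        return 0.55 + 0.721 * targetLoad - 0.274 * targetLoad * targetLoad;
    }

    // effective load = (capacity - free slots) / capacity
    static boolean shouldRehashDHash(int capacity, int freeSlots, double targetLoad) {
        double effectiveLoad = (double) (capacity - freeSlots) / capacity;
        return effectiveLoad >= dHashRehashLoadWithLookups(targetLoad);
    }

    public static void main(String[] args) {
        // For a DHash-style LRU cache with target load 0.5, the threshold is ~0.83:
        System.out.println(shouldRehashDHash(1024, 200, 0.5)); // effective ~0.80 -> false
        System.out.println(shouldRehashDHash(1024, 120, 0.5)); // effective ~0.88 -> true
    }
}
```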