Better caching options #197

Merged
merged 1 commit into from
Mar 12, 2024


@masklinn masklinn commented Mar 1, 2024

  • random replacement
  • fifo
  • LFU
  • general purpose Quick Demotion (QD) decorator (?) FIFO w/ flag + ghost cache
  • QD-LP-FIFO
  • W-TinyLFU
  • sieve

Possible additional candidates:

  • ARC have not looked at it yet but people seem to find it very complex
  • LFU with dynamic aging
  • segmented LRU
  • Clock (FIFO with reinsertion? not entirely clear) with Adaptive Replacement

Closes #143

@masklinn masklinn force-pushed the caches branch 3 times, most recently from 18cee5a to fc1af24 Compare March 2, 2024 12:58
@masklinn masklinn force-pushed the caches branch 5 times, most recently from 317c1e1 to 88de76e Compare March 12, 2024 21:14
Closes ua-parser#143

Memory Shavings
===============

It was the plan all along, but since I worried about the overhead of
caches it made no sense to keep the result objects (which would
compose the cache entries) as dict-instances, so they've been
converted to `__slots__` (manually since `dataclasses` only supports
slots from 3.10).

Sadly this requires adding explicit `__init__` to every dataclass
involved as default values are not compatible with `__slots__`.
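A minimal sketch of that pattern, assuming a hypothetical `UserAgent` result class (the name and fields are illustrative, not the project's actual definitions). The slots are declared by hand and the default value moves into an explicit `__init__`, which `@dataclass` leaves in place rather than generating its own:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UserAgent:  # hypothetical result type, for illustration only
    # Declaring __slots__ by hand removes the per-instance __dict__.
    __slots__ = ("family", "major")

    family: str
    # No "= None" here: a class-level default would clash with the
    # slot descriptor of the same name.
    major: Optional[str]

    # The default lives in an explicit __init__ instead.
    def __init__(self, family: str, major: Optional[str] = None) -> None:
        self.family = family
        self.major = major


ua = UserAgent("Firefox")
print(ua, hasattr(ua, "__dict__"))
```

The generated `__repr__` and `__eq__` still work, since `dataclasses` recognises slot descriptors as "no default" fields.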

Cache Policies
==============

S3Fifo as default
-----------------

Testing on the sample file taught me what cache people have clearly
known for a while: lru is *awful*. You can do worse, but it takes
surprisingly little to be competitive with it.

S3Fifo turns out to have pretty good performance while being
relatively simple. S3 is not perfect: notably, like most CLOCK-type
algorithms, its eviction is O(n), which might be a bit of an issue in
some cases. But until someone complains...

As a result, S3 is now the cache policy for the basic cache (if `re2`
is not available) replacing LRU, and it's also exported as `Cache`
from the package root.

From an implementation perspective, the original exploratory version
(of this and most FIFOs tested) used an ordered dict as an indexed
fifo, but the memory consumption is not great. The final version uses a
single index dict and separate deques for the FIFOs, an idea found in
@cmcaine's s3fifo, which significantly compacts memory requirements
(though it's still a good 50% higher than a SIEVE or OD-based LRU of
the same size).
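To make the "single index dict plus deques" layout concrete, here is a rough sketch of an S3-FIFO cache. It is not the code from this PR: the class name, the `get` / `put` methods, the 10% small-queue split and the use of `freq = -1` to mark ghost entries are all assumptions made for illustration.

```python
from collections import deque


class _Entry:
    __slots__ = ("key", "value", "freq")

    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.freq = 0  # -1 marks a ghost entry (key remembered, value dropped)


class S3FifoSketch:
    """Illustrative S3-FIFO: one index dict, three deques (assumes maxsize >= 1)."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self.small_target = max(1, maxsize // 10)
        self.main_target = maxsize - self.small_target
        self.index = {}       # key -> _Entry, live or ghost
        self.small = deque()  # probationary FIFO
        self.main = deque()   # main FIFO with reinsertion
        self.ghost = deque()  # recently evicted keys

    def get(self, key, default=None):
        e = self.index.get(key)
        if e is None or e.freq < 0:
            return default
        e.freq = min(e.freq + 1, 3)  # a hit only bumps a counter
        return e.value

    def put(self, key, value):
        old = self.index.get(key)
        if old is not None and old.freq >= 0:
            old.value = value
            return
        while len(self.small) + len(self.main) >= self.maxsize:
            if len(self.small) >= self.small_target:
                self._evict_small()
            else:
                self._evict_main()
        e = self.index[key] = _Entry(key, value)
        # Keys still remembered by the ghost queue skip probation.
        (self.main if old is not None else self.small).appendleft(e)

    def _evict_small(self):
        while self.small:
            e = self.small.pop()
            if e.freq > 1:  # re-accessed while on probation: promote
                e.freq = 0
                self.main.appendleft(e)
                continue
            e.value, e.freq = None, -1  # demote to ghost
            self.ghost.appendleft(e)
            if len(self.ghost) > self.main_target:
                g = self.ghost.pop()
                # Stale ghosts (key re-inserted since) must not drop the live entry.
                if self.index.get(g.key) is g:
                    del self.index[g.key]
            return

    def _evict_main(self):
        while self.main:
            e = self.main.pop()
            if e.freq > 0:  # give accessed entries another lap
                e.freq -= 1
                self.main.appendleft(e)
            else:
                del self.index[e.key]
                return


cache = S3FifoSketch(10)
cache.put("hot", "parsed-hot")
cache.get("hot")
cache.get("hot")
for i in range(20):  # a scan of one-hit wonders
    cache.put(f"k{i}", i)
```

In this sketch the entry that was read twice is promoted to the main queue and survives the scan, while the one-hit wonders only churn through the small queue. The eviction loops are where the O(n) worst case mentioned above comes from.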

LFU
---

Matani et al's O(1) LFU had a great showing on hitrates and perfs
(though slightly worse than s3 still), however the implementation
still required the addition of some form of aging, which was not worth
it. Theoretically a straight LFU could work for offline use but...
that's a pretty pointless use, as in that case you can just parse each
unique value once and splat by the entry count.
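The offline shortcut can be sketched as follows. `parse_offline` is a hypothetical helper, and `parse` stands in for whatever parsing callable is in use; neither is part of the library's API.

```python
from collections import Counter


def parse_offline(user_agents, parse):
    """Parse each distinct string once and pair the result with its count."""
    counts = Counter(user_agents)
    return {ua: (parse(ua), n) for ua, n in counts.items()}


calls = []


def fake_parse(ua):  # stand-in parser that records how often it is called
    calls.append(ua)
    return ua.upper()


results = parse_offline(["a", "b", "a", "a"], fake_parse)
```

With the whole input available up front no eviction policy is needed at all, which is why an un-aged LFU buys nothing here.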

W-TinyLFU is the big modern cheese in the field, but I opted to avoid
it for now: it's a lot more complicated than the existing caches
(requiring a bloom filter, a frequency sketch or counting bloom
filter, an SLRU, and an LRU), plus a good implementation clearly
requires a lot of bit twiddling (for the bloom filters / frequency
sketch), which Python is not great at from a performance point of view
(I tried implementing CLOCK using a bytearray for bitmap and it was
crap).

SIEVE
-----

SIEVE is consistently a few percentage points below S3, and it's
lacking a few properties (e.g. scan resistance), however it does have
one interesting property which S3 lacks: at small cache sizes it has
less memory overhead than LRU, despite Python-level linked list and
nodes where LRU gets to use the native-coded OrderedDict, with a
C-level linked list and a bespoke secondary hashmap. And it does that
with the hitrates of an LRU double the size until we get to caches a
significant fraction the size of uniques (5000). It also features a
truly thread-safe unsynchronized cache hit.

Note: while the reference paper uses a doubly linked list, this
implementation uses a singly linked list for the sieve hand. This
means the hand is a pair of pointers but it saves 11% memory on the
nodes (72 -> 64 bytes), which gets significant as the size of the
cache increases.
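A rough sketch of SIEVE over a singly linked list, to show why the hand becomes a pair of pointers. This is not the PR's implementation: the class and method names are illustrative assumptions, and nodes link from oldest to newest so that removing the node under the hand needs a reference to the node just before it.

```python
class _Node:
    __slots__ = ("key", "value", "visited", "next")

    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.visited = False
        self.next = None  # next-newer node, None for the newest


class SieveSketch:
    """Illustrative SIEVE with a singly linked list (assumes maxsize >= 1)."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self.index = {}
        self.tail = None  # oldest node
        self.head = None  # newest node
        self.prev = None  # node just before the hand (needed to unlink)
        self.hand = None  # next eviction candidate, None means "start at tail"

    def get(self, key, default=None):
        node = self.index.get(key)
        if node is None:
            return default
        node.visited = True  # a hit only flips a flag, nothing moves
        return node.value

    def put(self, key, value):
        node = self.index.get(key)
        if node is not None:
            node.value = value
            return
        if len(self.index) >= self.maxsize:
            self._evict()
        node = self.index[key] = _Node(key, value)
        if self.head is None:
            self.tail = node
        else:
            self.head.next = node
        self.head = node

    def _evict(self):
        prev, cur = self.prev, self.hand
        if cur is None:
            prev, cur = None, self.tail
        while cur.visited:  # spare visited nodes, clearing their flag
            cur.visited = False
            prev, cur = cur, cur.next
            if cur is None:  # walked past the newest node: wrap around
                prev, cur = None, self.tail
        if prev is None:
            self.tail = cur.next
        else:
            prev.next = cur.next
        if cur is self.head:
            self.head = prev
        del self.index[cur.key]
        self.prev, self.hand = prev, cur.next


sieve = SieveSketch(3)
for k in "abc":
    sieve.put(k, k.upper())
sieve.get("a")        # marks "a" as visited
sieve.put("d", "D")   # hand skips "a" and evicts "b"
```

Because a hit never relinks nodes, the read path touches only the index dict and one flag, which is what makes the unsynchronized hit plausible.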

Other Caches
------------

A number of simple cache implementations were temporarily
~~embarrassed~~ implemented for testing:

- random
- fifo
- lp-fifo / fifo-reinsertion
- CLOCK (0 to 2), which is a different implementation of the same
  algorithm: a bitmap version was horrible, while an array of counters
  was competitive with lp-fifo using an ordereddict (perf-wise, I had
  yet to start looking at memory use).
- QD-LP-FIFO which is not *really* an algorithm but was an
  intermediate station on the way to S3 (the addition of a fixed-size
  probationary fifo and a ghost cache to an LP-FIFO; S3 is basically a
  more advanced and flexible version)

The trivial caches (RR, fifo) were worse than LRU but very simple, the
others were better than LRU but at the end of the day didn't really
pull their weight compared to alternatives (even if they were easy to
implement).

An interesting note here is that the quick-demotion scheme of S3 can
be put in front of LRU to some success (it does improve hit rates
significantly as the sample trace has a large number of one hit
wonders), but without excellent reasons to use an LRU on the back end
it doesn't seem super useful.

Thread Safety
=============

The `Locking` wrapper has been removed, probably for ever: testing
showed that the perf hit of a lock in GILPython was basically nil (at
least for the amount of work ua-python has to do, on uncontended
locks). Since none of the caches are intrinsically safe anymore (and
the clearing cache's lack of performance was a lot worse than any
synchronisation could be) it's better to just have synchronised
caches.

Thread-local cache support has however been added, and will be
documented, in case it turns out to be of use to the !gil mode (it
basically trades memory and / or hitrate for lower contention).
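As an illustration of the trade-off, a thread-local cache can be sketched with `threading.local`. The `ThreadLocalCache` name, the `factory` parameter and the `get` / `put` methods below are assumptions for this example, not the interface added by this PR.

```python
import threading


class ThreadLocalCache:
    """Illustrative wrapper giving each thread its own private cache."""

    def __init__(self, factory):
        self._factory = factory  # zero-argument callable building one cache
        self._local = threading.local()

    def _cache(self):
        cache = getattr(self._local, "cache", None)
        if cache is None:
            # First use in this thread: build a private cache, no lock needed.
            cache = self._local.cache = self._factory()
        return cache

    def get(self, key, default=None):
        return self._cache().get(key, default)

    def put(self, key, value):
        self._cache()[key] = value


shared = ThreadLocalCache(dict)  # a plain dict stands in for a real cache
shared.put("ua", "parsed")

seen = []
worker = threading.Thread(target=lambda: seen.append(shared.get("ua")))
worker.start()
worker.join()
```

The worker thread misses on a key the main thread has already stored: each thread pays for its own copy (memory) or warms up separately (hit rate), in exchange for never contending on a lock.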

s3fifo implementation notes
===========================

The initial implementation of S3Fifo was done using ordered dicts as
indexed fifos, this was easy but after adding some memory tracking it
turns out to have a lot of overhead, at around 250% the overhead of
Lru (which makes sense, it needs 2 ordered dicts of about the same
size, plus a smaller ordered dict, plus entry objects to track
frequency).

An implementation based on deques is a lot more reasonable, it only
needs a single dict and CPython's deques are implemented as unrolled
linked lists of order 64 (so each link of the list stores 64
elements). It still needs about 150% of the Lru space but that's a lot
more reasonable. At n=5000 after a full run on the sample file the
measurements from tracemalloc indicate 785576 bytes, with
`sys.getsizeof` measurements of the different elements indicating:

- 415152 bytes for the index dict
-   4984 bytes for the small cache deque
-  37720 bytes for the main cache deque
-  38248 bytes for the ghost cache deque
- 280000 bytes for the CacheEntry objects

For LRU this is 500488 bytes of which 498752 are attributed to the
`OrderedDict`.
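For reference, this kind of measurement can be sketched as below. It is not the script used for the numbers above: `traced_allocation` is a hypothetical helper, and the keys are built before tracing starts so that only the container overhead is counted.

```python
import sys
import tracemalloc
from collections import OrderedDict


def traced_allocation(build):
    """Return (object, bytes allocated while building it)."""
    tracemalloc.start()
    try:
        before, _peak = tracemalloc.get_traced_memory()
        obj = build()
        after, _peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return obj, after - before


keys = [f"agent-{i}" for i in range(5000)]  # allocated before tracing starts
od, traced = traced_allocation(lambda: OrderedDict.fromkeys(keys))
shallow = sys.getsizeof(od)
print(f"tracemalloc: {traced} bytes, sys.getsizeof: {shallow} bytes")
```

`sys.getsizeof` is shallow, so it has to be applied to each component (dict, deques, entry objects) separately, whereas tracemalloc reports everything allocated during the traced window.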

It seems difficult to go below: while in theory the ~9500 entries
should fit in a dict of class 14, as the dicts have a lot of traffic
(keys being added and removed) — and possibly because they're never
iterated so this is not a concern (have not checked if this is a
consideration) — cpython uses a dict one size larger to compact less
often[^dict]. However the issue also occurs in the LRU so it's
"fair" (while the OrderedDict has a Python implementation which uses
two maps, it also has a native implementation which uses an internal
ad-hoc hashmap rather than a full-blown dict, so it doesn't quite have
double-hashmap overhead).
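The size-class arithmetic from the `[^dict]` footnote can be written out as a short sketch. The function names are made up for this example, and the defaults (2-byte indices, 2 pointers per entry for string keys) are assumptions that match the size classes discussed here rather than a general model of CPython's dict.

```python
import math


def size_class(length):
    """Smallest class n whose effective capacity holds `length` entries."""
    return math.ceil(math.log2(length * 3 / 2))


def effective_capacity(n):
    """Usable entries in a dict of class n: two thirds of 1 << n."""
    return (1 << n << 1) // 3


def dict_bytes(n, index_bytes=2, entry_pointers=2):
    """Sparse index array plus dense entry array, ignoring metadata."""
    return index_bytes * (1 << n) + entry_pointers * 8 * effective_capacity(n)


for length in (5000, 9500):
    n = size_class(length)
    print(length, n, dict_bytes(n), dict_bytes(n + 1))
```

Under these assumptions class 13 comes to 103760 bytes and class 14 to 207520 bytes, while class 15 comes to 415056 bytes, within a hundred bytes of the 415152 measured for the index dict above.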

Note that this only measures *cache overhead*, so the cache keys are
not counted, and all parses result in a global singleton:

- user agent strings are around 195 bytes on average
- parse results, user agent, and os objects are 72 bytes
- device objects are 56 bytes
- the extracted strings total about 200 bytes on average[^interning]

That's some 600 bytes per cache entry, or 3000000 bytes
for a 5000 entries cache. In view of that, the cache overhead hardly
seems consequential, but still.

[^dict]: Roughly python's dict has power of two size classes, a size
         class `n` leads to a total capacity of `1<<n` and an
         effective capacity of `(1<<n<<1)/3`. The dict object is
         composed of a sparse array of indices sized to the total
         capacity, these indices can be u8, u16, u32, or u64 depending
         on the effective capacity. The dict object is then composed
         of a dense array of entries sized to the effective capacity.
         An entry is generally three pointers (hash, key, value) but
         can be just two as an optimisation e.g. for string keys (as
         strings memoise their own hash). Thus the space needed for a
         dict of class `n` is `sizeof(idx) * (1 << n) + (2|3) * 8 *
         ((1<<n<<1)/3)` (plus a few dozen bytes of various metadata).
         Thus for a dict of length `len`, the way to get the minimum
         class is `ceil(log2(len * 3/2))`. As such a 5000 entries
         string-keyed dict (Lru) should be in size class 13, taking
         about 101kB, and a ~9500 entries dict (S3Fifo index) should
         be in size class 14, taking about 202kB. These are what's
         observed by straight filling dicts to those sizes, but
         churning them a few hundred to thousand times (removing and
         adding keys, keeping their sizes constant) ends up one size
         class above. I've not confirmed it but it's likely because a
         dict of size 14 has an effective capacity of 10922, which
         means every ~1500 removals and insertions the dense array
         would need to be compacted, rehashed, and rewritten. By
         bumping over to class 15, this happens every 12000 cycles
         instead, at the cost of double the memory.

[^interning]: Technically it's around 500, but single-character
              strings are always interned and those are common for the
              version fields of UserAgent and OS (about 56% of them)
              and they account for most of the possible overhead, 2
              and 3 characters strings account for a further 24 and
              17%, though with diminishing returns: 2-char strings
              seems the most promising as 93 of them are represented
              (91 being two-digit numbers) and almost all of them more
              than once (the sample file has only two singleton
              two-char strings, only one of which is a number) by
              comparison all 3-character strings are numbers but 57
              out of 251 are singletons.
@masklinn masklinn merged commit b45380d into ua-parser:master Mar 12, 2024
29 checks passed
@masklinn masklinn deleted the caches branch March 12, 2024 21:27

cmcaine commented Mar 13, 2024

Thanks for the citation/credit :) I'm glad you found my code useful!
