Add S3 and SIEVE, make S3 the default, remove clearing and locking
Closes ua-parser#143

Memory Shavings
===============

It was the plan all along, but since I worried about the overhead of caches it made no sense to keep the result objects (which would compose the cache entries) as dict-instances, so they've been converted to `__slots__` (manually, since `dataclasses` only supports slots from 3.10). Sadly this requires adding an explicit `__init__` to every dataclass involved, as default values are not compatible with `__slots__`.

Cache Policies
==============

S3Fifo as default
-----------------

Testing on the sample file taught me what cache people have clearly known for a while: LRU is *awful*. You can do worse, but it takes surprisingly little to be competitive with it. S3Fifo turns out to have pretty good performance while being relatively simple.

S3 is not perfect: notably, like most CLOCK-type algorithms its eviction is O(n), which might be a bit of an issue in some cases. But until someone complains...

As a result, S3 is now the cache policy for the basic cache (if `re2` is not available), replacing LRU, and it's also exported as `Cache` from the package root.

From an implementation perspective, the original exploratory version (of this and most FIFOs tested) used an ordered dict as an indexed fifo, but the memory consumption is not great. The final version uses a single index dict and separate deques for the FIFOs, an idea found in @cmcaine's s3fifo which significantly compacts memory requirements (though it's still a good 50% higher than a SIEVE or OD-based LRU of the same size).

LFU
---

Matani et al's O(1) LFU had a great showing on hit rates and perf (though still slightly worse than S3), however the implementation still required the addition of some form of aging, which was not worth it. Theoretically a straight LFU could work for offline use but... that's a pretty pointless use, as in that case you can just parse each unique value once and splat by the entry count.

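
The "parse each unique value once" shortcut for offline use can be sketched roughly as follows. This is only an illustration: `parse_offline` and `fake_parse` are hypothetical names, not part of this commit or of the package, and any real parser callable could be passed in instead.

```python
from collections import Counter
from typing import Callable, Iterable, Iterator, Tuple, TypeVar

R = TypeVar("R")


def parse_offline(
    lines: Iterable[str], parse: Callable[[str], R]
) -> Iterator[Tuple[R, int]]:
    """Yield (result, occurrences), calling `parse` once per distinct string."""
    counts = Counter(lines)  # distinct user agent -> number of occurrences
    for ua, n in counts.items():
        yield parse(ua), n


# Hypothetical stand-in for a real parser: keeps the text before the first "/".
def fake_parse(ua: str) -> str:
    return ua.split("/", 1)[0]


sample = ["curl/8.0", "Mozilla/5.0", "curl/8.0", "curl/8.0"]
results = list(parse_offline(sample, fake_parse))
```

With the whole input available up front, no eviction policy is involved at all, which is why an un-aged LFU has little to offer in that setting.
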
W-TinyLFU is the big modern cheese in the field, but I opted to avoid it for now: it's a lot more complicated than the existing caches (requiring a bloom filter, a frequency sketch or counting bloom filter, an SLRU, and an LRU), plus a good implementation clearly requires a lot of bit twiddling (for the bloom filters / frequency sketch), which Python is not great at from a performance point of view (I tried implementing CLOCK using a bytearray for bitmap and it was crap).

SIEVE
-----

SIEVE is consistently a few percentage points below S3, and it's lacking a few properties (e.g. scan resistance), however it does have one interesting property which S3 lacks: at small cache sizes it has less memory overhead than LRU, despite using a Python-level linked list and nodes where LRU gets to use the native-coded OrderedDict, with a C-level linked list and a bespoke secondary hashmap. And it does that with the hit rates of an LRU double the size, until we get to caches that are a significant fraction of the number of uniques (5000). It also features a truly thread-safe unsynchronized cache hit.

Note: while the reference paper uses a doubly linked list, this implementation uses a singly linked list for the sieve hand. This means the hand is a pair of pointers, but it saves 11% memory on the nodes (72 -> 64 bytes), which gets significant as the size of the cache increases.

Other Caches
------------

A number of simple cache implementations were temporarily ~~embarrassed~~ implemented for testing:

- random
- fifo
- lp-fifo / fifo-reinsertion
- CLOCK (0 to 2), which is a different implementation of the same algorithm. I tried a bitmap and it was horrible; an array of counters was competitive with lp-fifo using an ordereddict (perf-wise, I had yet to start looking at memory use).
- QD-LP-FIFO, which is not *really* an algorithm but was an intermediate station on the way to S3 (the addition of a fixed-size probationary fifo and a ghost cache to an LP-FIFO; S3 is basically a more advanced and flexible version)

The trivial caches (RR, fifo) were worse than LRU but very simple; the others were better than LRU but at the end of the day didn't really pull their weight compared to the alternatives (even if they were easy to implement).

An interesting note here is that the quick-demotion scheme of S3 can be put in front of LRU to some success (it does improve hit rates significantly, as the sample trace has a large number of one-hit wonders), but without excellent reasons to use an LRU on the back end it doesn't seem super useful.

Thread Safety
=============

The `Locking` wrapper has been removed, probably forever: testing showed that the perf hit of a lock in GILPython was basically nil (at least for the amount of work ua-python has to do, on uncontended locks). Since none of the caches are intrinsically safe anymore (and the clearing cache's lack of performance was a lot worse than any synchronisation could be) it's better to just have synchronised caches.

Thread-local cache support has however been added, and will be documented, in case it turns out to be of use to the !gil mode (it basically trades memory and / or hit rate for lower contention).

s3fifo implementation notes
===========================

The initial implementation of S3Fifo was done using ordered dicts as indexed fifos. This was easy, but after adding some memory tracking it turned out to have a lot of overhead, at around 250% the overhead of Lru (which makes sense: it needs 2 ordered dicts of about the same size, plus a smaller ordered dict, plus entry objects to track frequency).

An implementation based on deques is a lot more reasonable: it only needs a single dict, and CPython's deques are implemented as unrolled linked lists of order 64 (so each link of the list stores 64 elements).
It still needs about 150% of the Lru space but that's a lot more reasonable.

At n=5000, after a full run on the sample file, the measurements from tracemalloc indicate 785576 bytes, with `sys.getsizeof` measurements of the different elements indicating:

- 415152 bytes for the index dict
- 4984 bytes for the small cache deque
- 37720 bytes for the main cache deque
- 38248 bytes for the ghost cache deque
- 280000 bytes for the CacheEntry objects

For LRU this is 500488 bytes, of which 498752 are attributed to the `OrderedDict`.

It seems difficult to go below that: in theory the ~9500 entries should fit in a dict of class 14, but because the dicts have a lot of traffic (keys being added and removed), and possibly because they're never iterated (I have not checked whether that is a consideration), CPython uses a dict one size larger to compact less often[^dict]. However the issue also occurs in the LRU so it's "fair" (while the OrderedDict has a Python implementation which uses two maps, it also has a native implementation which uses an internal ad-hoc hashmap rather than a full-blown dict, so it doesn't quite have double-hashmap overhead).

Note that this only measures *cache overhead*, so the cache keys are not counted, and all parses result in a global singleton:

- user agent strings are around 195 bytes on average
- parse results, user agent, and os objects are 72 bytes
- device objects are 56 bytes
- the extracted strings total about 200 bytes on average[^interning]

That's some 600 bytes per cache entry, or 3000000 bytes for a 5000-entry cache. In view of that, the cache overhead hardly seems consequential, but still.

[^dict]: Roughly, Python's dict has power-of-two size classes: a size class `n` leads to a total capacity of `1<<n` and an effective capacity of `(1<<n<<1)/3`. The dict object is composed of a sparse array of indices sized to the total capacity; these indices can be u8, u16, u32, or u64 depending on the effective capacity.
The dict object then also holds a dense array of entries sized to the effective capacity. An entry is generally three pointer-sized words (hash, key, value) but can be just two as an optimisation, e.g. for string keys (as strings memoise their own hash). Thus the space needed for a dict of class `n` is `sizeof(idx) * (1 << n) + (2|3) * 8 * ((1<<n<<1)/3)` (plus a few dozen bytes of various metadata). Thus for a dict holding `len` entries, the minimum class is `ceil(log2(len * 3/2))`. As such a 5000-entry string-keyed dict (Lru) should be in class 13, taking about 101kB, and a ~9500-entry dict (S3Fifo index) should be in class 14, taking about 202kB. These are what's observed by straight filling dicts to those sizes, but churning them a few hundred to a few thousand times (removing and adding keys, keeping their sizes constant) ends up one size class above. I've not confirmed it, but it's likely because a dict of class 14 has an effective capacity of 10922, which means every ~1500 removals and insertions the dense array would need to be compacted, rehashed, and rewritten. By bumping over to class 15, this happens every 12000 cycles instead, at the cost of double the memory.

[^interning]: Technically it's around 500, but single-character strings are always interned and those are common for the version fields of UserAgent and OS (about 56% of them), and they account for most of the possible overhead. 2- and 3-character strings account for a further 24 and 17%, though with diminishing returns: 2-char strings seem the most promising, as 93 of them are represented (91 being two-digit numbers) and almost all of them more than once (the sample file has only two singleton two-char strings, only one of which is a number); by comparison all 3-character strings are numbers, but 57 out of 251 are singletons.
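
The size-class arithmetic from the `[^dict]` footnote can be checked with a few lines of Python. This is a sketch of the footnote's formulas only, not of CPython internals: the function names are made up for illustration, the index width is picked from the effective capacity as the footnote describes (an approximation), and the "few dozen bytes" of metadata are left out.

```python
import math


def min_size_class(entries: int) -> int:
    """Smallest class n such that 2/3 of (1 << n) can hold `entries`."""
    return math.ceil(math.log2(entries * 3 / 2))


def effective_capacity(n: int) -> int:
    """Usable entries in a dict of class n, per the footnote: (1<<n<<1)/3."""
    return (1 << n << 1) // 3


def index_width(n: int) -> int:
    """Bytes per sparse index (u8/u16/u32/u64); thresholds are an assumption."""
    cap = effective_capacity(n)
    for width in (1, 2, 4):
        if cap < 1 << (8 * width):
            return width
    return 8


def estimated_bytes(n: int, words_per_entry: int = 2) -> int:
    """Sparse index array plus dense entry array; 2 words for string keys."""
    return index_width(n) * (1 << n) + words_per_entry * 8 * effective_capacity(n)


lru_class = min_size_class(5000)     # 13
index_class = min_size_class(9500)   # 14
lru_bytes = estimated_bytes(13)      # 103760, about 101 KiB
index_bytes = estimated_bytes(14)    # 207520, about 202 KiB
bumped_bytes = estimated_bytes(15)   # 415056
# Free dense-array slots left before a compaction with ~9500 live entries.
headroom_14 = effective_capacity(14) - 9500  # 1422
headroom_15 = effective_capacity(15) - 9500  # 12345
```

The class 15 estimate of 415056 bytes lands within about a hundred bytes of the 415152 bytes measured for the index dict, which is consistent with the dict having been bumped one class up.
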