From 9b24838f3a3b57873eee0459ca161d5d8b6ddf16 Mon Sep 17 00:00:00 2001 From: masklinn Date: Fri, 1 Mar 2024 20:10:29 +0100 Subject: [PATCH] Add S3 and SIEVE, make S3 the default, remove clearing and locking MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Closes #143 Memory Shavings =============== It was the plan all along, but since I worried about the overhead of caches it made no sense to keep the result objects (which would compose the cache entries) as dict-instances, so they've been converted to `__slots__` (manually since `dataclasses` only supports slots from 3.10). Sadly this requires adding explicit `__init__` to every dataclass involved as default values are not compatible with `__slots__`. Cache Policies ============== S3Fifo as default ----------------- Testing on the sample file taught me what cache people have clearly known for a while: lru is *awful*. You can do worse, but it takes surprisingly little to be competitive with it. S3Fifo turns out to have pretty good performances while being relatively simple. S3 is not perfect, notably like most CLOCK-type algorithm its eviction is O(n) which might be a bit of an issue in some cases. But until someone complains... As a result, S3 is now the cache policy for the basic cache (if `re2` is not available) replacing LRU, and it's also exported as `Cache` from the package root. From an implementation perspective, the original exploratory version (of this and most FIFOs tested) used an ordered dict as an indexed fifo but the memory consumption is not great, the final version uses a single index dict and separate deques for the FIFOs, an idea found in @cmcaine's s3fifo which significantly compacts memory requirements (though it's still a good 50% higher than a SIEVE or OD-based LRU of the same size). LFU --- Matani et al's O(1) LFU had a great showing on hitrates and perfs (though slightly worse than s3 still), however the implementation still required the addition of some form of aging, which was not worth it. Theoretically a straight LFU could work for offline use but... that's a pretty pointles use as in that case you can just parse each unique value once and splat by the entry count. W-TinyLFU is the big modern cheese in the field, but I opted to avoid it for now: it's a lot more complicated than the existing caches (requiring a bloom filter, a frequency sketch or counting bloom filter, an SLRU, and an LRU), plus a good implementation clearly requires a lot of bit twiddling (for the bloom filters / frequency sketch), which Python is not great at from a performance point of view (I tried implementing CLOCK using a bytearray for bitmap and it was crap). SIEVE ----- SIEVE is consistently a few percentage point below S3, and it's lacking a few properties (e.g. scan resistance), however it does have one interesting property which S3 lacks: at small cache sizes it has less memory overhead than LRU, despite Python-level linked list and nodes where LRU gets to use the native-coded OrderedDict, with a C-level linked list and a bespoke secondary hashmap. And it does that with the hitrates of an LRU double the size until we get to caches a significant fraction the size of uniques (5000). It also features a truly thread-safe unsynchronized cache hit. Note: while the reference paper uses a doubly linked list, this implementation uses a singly linked list for the sieve hand. This means the hand is a pair of pointers but it saves 11% memory on the nodes (72 -> 64 bytes), which gets significant as the size of the cache increases. Other Caches ------------ A number of simple cache implementations were temporarily ~~embarassed~~ implemented for testing: - random - fifo - lp-fifo / fifo-reinsertion - CLOCK (0 to 2), which is a different implementation of the same algorithm, tried a bitmap, it was horrible, an array of counters was competitive with lp-fifo using an ordereddict (perf-wise, I had yet to start looking at memory use). - QD-LP-FIFO which is not *really* an algorithm but was an intermediate stations to S3 (the addition of a fixed-size probationary fifo and a ghost cache to an LP-FIFO, S3 is basically a more advanced and flexible version) The trivial caches (RR, fifo) were worse than LRU but very simple, the others were better than LRU but at the end of the day didn't really pull their weight compared to alternatives (even if they were easy to implement). An interesting note here is that the quick-demotion scheme of S3 can be put in front of LRU to some success (it does improve hit rates significantly as the sample trace has a large number of one hit wonders), but without excellent reasons to use an LRU on the back end it doesn't seem super useful. Thread Safety ============= The `Locking` wrapper has been removed, probably for ever: testing showed that the perf hit of a lock in GILPython was basically nil (at least for the amount of work ua-python has to do, on uncontended locks). Since none of the caches are intrinsically safe anymore (and the clearing cache's lack of performance was a lot worse than any synchronisation could be) it's better to just have synchronised caches. Thread-local cache support has however been added in case, and will be documented, in case it turns out to be of use to the !gil mode (it basically trades memory and / or hitrate for lower contention). s3fifo implementation notes =========================== The initial implementation of S3Fifo was done using ordered dicts as indexed fifos, this was easy but after adding some memory tracking it turns out to have a lot of overhead, at around 250% the overhead of Lru (which makes sense, it needs 2 ordered dicts of about the same size, plus a smaller ordered dict, plus entry objects to track frequency). An implementation based on deques is a lot more reasonable, it only needs a single dict and CPython's deques are implemented as unrolled linked lists of order 64 (so each link of the list stores 64 elements). It still needs about 150% of the Lru space but that's a lot more reasonable. At n=5000 after a full run on the sample file the measurements from tracemalloc indicates 785576 bytes, with `sys.getsizeof` measurements of the different elements indicating: - 415152 bytes for the index dict - 4984 bytes for the small cache deque - 37720 bytes for the main cache deque - 38248 bytes for the ghost cache deque - 280000 bytes for the CacheEntry objects For LRU this is 500488 bytes of which 498752 are attributed to the `OrderedDict`. It seems difficult to go below: while in theory the ~9500 entries should fit in a dict of class 14, as the dicts have a lot of traffic (keys being added and removed) — and possibly because they're never iterated so this is not a concern (have not checked if this is a consideration) — cpython uses a dict one size larger to compact less often[^dict]. However the issue also occurs in the LRU so it's "fair" (while the OrderedDict has a Python implementation which uses two maps, it also has a native implementation which uses an internal ad-hoc hashmap rather than a full blow dict, so it doesn't quite have double-hashmap overhead). Note that this only measures *cache overhead*, so the cache keys are not counted, and all parses result in a global singleton: - user agent strings are around 195 bytes on average - parse results, user agent, and os objects are 72 bytes - device objects are 56 bytes - the extracted strings total about 200 bytes on average[^interning] That's some 600 bytes per cache entry, or 3000000 bytes for a 5000 entries cache. In view of that, the cache overhead hardly seems consequential, but still. [^dict]: Roughly python's dict has power of two size classes, a size class `n` leads to a total capacity of `1< Parser: return cls( CachingResolver( BasicResolver(m), - Locking(LRU(200)), + Cache(200), ) ) diff --git a/src/ua_parser/__main__.py b/src/ua_parser/__main__.py index db4d9a3..92ef72d 100644 --- a/src/ua_parser/__main__.py +++ b/src/ua_parser/__main__.py @@ -1,5 +1,6 @@ import argparse import csv +import gc import io import itertools import math @@ -8,19 +9,29 @@ import sys import threading import time -from typing import Any, Callable, Iterable, List, Optional, Sequence, Tuple, Union +import tracemalloc +from typing import ( + Any, + Callable, + Dict, + Iterable, + List, + Optional, + Sequence, + Tuple, + Union, + cast, +) from . import ( BasicResolver, CachingResolver, - Clearing, Domain, - Locking, - LRU, Matchers, Parser, PartialParseResult, Resolver, + caching, ) from .caching import Cache, Local from .loaders import load_builtins, load_yaml @@ -34,6 +45,17 @@ } +CACHES: Dict[str, Optional[Callable[[int], Cache]]] = {"none": None} +CACHES.update( + (cache.__name__.lower(), cache) + for cache in [ + cast(Callable[[int], Cache], caching.Lru), + caching.S3Fifo, + caching.Sieve, + ] +) + + def get_rules(parsers: List[str], regexes: Optional[io.IOBase]) -> Matchers: if regexes: if not load_yaml: @@ -156,18 +178,13 @@ def get_parser( else: sys.exit(f"unknown parser {parser!r}") - c: Callable[[int], Cache] - if cache == "none": - return Parser(r).parse - elif cache == "clearing": - c = Clearing - elif cache == "lru": - c = LRU - elif cache == "lru-threadsafe": - c = lambda size: Locking(LRU(size)) # noqa: E731 - else: + if cache not in CACHES: sys.exit(f"unknown cache algorithm {cache!r}") + c = CACHES.get(cache) + if c is None: + return Parser(r).parse + return Parser(CachingResolver(r, c(cachesize))).parse @@ -182,14 +199,16 @@ def run( def run_hitrates(args: argparse.Namespace) -> None: - def noop(ua: str, domains: Domain, /) -> PartialParseResult: - return PartialParseResult( - domains=domains, - string=ua, - user_agent=None, - os=None, - device=None, - ) + r = PartialParseResult( + domains=Domain.ALL, + string="", + user_agent=None, + os=None, + device=None, + ) + + def noop(_ua: str, _domains: Domain, /) -> PartialParseResult: + return r class Counter: def __init__(self, parser: Resolver) -> None: @@ -206,19 +225,25 @@ def __call__(self, ua: str, domains: Domain, /) -> PartialParseResult: print(total, "lines", uniques, "uniques") print(f"ideal hit rate: {(total - uniques)/total:.0%}") print() - caches: List[Callable[[int], Cache]] = [Clearing, LRU] + w = int(math.log10(max(args.cachesizes)) + 1) + tracemalloc.start() for cache, cache_size in itertools.product( - caches, + filter(None, CACHES.values()), args.cachesizes, ): misses = Counter(noop) + gc.collect() + before = tracemalloc.take_snapshot() parser = Parser(CachingResolver(misses, cache(cache_size))) for line in lines: parser.parse(line) - + gc.collect() + after = tracemalloc.take_snapshot() + diff = sum(s.size_diff for s in after.compare_to(before, "filename")) print( - f"{cache.__name__.lower()}({cache_size}): {(total - misses.count)/total:.0%} hit rate" + f"{cache.__name__.lower():8}({cache_size:{w}}): {(total - misses.count)/total*100:2.0f}% hit rate, {diff:9} bytes" ) + del misses, parser CACHESIZE = 1000 @@ -242,9 +267,8 @@ def run_threaded(args: argparse.Namespace) -> None: lines = list(args.file) basic = BasicResolver(load_builtins()) resolvers: List[Tuple[str, Resolver]] = [ - ("clearing", CachingResolver(basic, Clearing(CACHESIZE))), - ("locking-lru", CachingResolver(basic, Locking(LRU(CACHESIZE)))), - ("local-lru", CachingResolver(basic, Local(lambda: LRU(CACHESIZE)))), + ("locking-lru", CachingResolver(basic, caching.Lru(CACHESIZE))), + ("local-lru", CachingResolver(basic, Local(lambda: caching.Lru(CACHESIZE)))), ("re2", Re2Resolver(load_builtins())), ] for name, resolver in resolvers: @@ -367,8 +391,8 @@ def __call__( bench.add_argument( "--caches", nargs="+", - choices=["none", "clearing", "lru", "lru-threadsafe"], - default=["none", "clearing", "lru", "lru-threadsafe"], + choices=list(CACHES), + default=list(CACHES), help="""Cache implementations to test. `clearing` completely clears the cache when full, `lru` uses a least-recently-eviction policy. `lru` is not thread-safe, so `lru-threadsafe` adds a mutex diff --git a/src/ua_parser/caching.py b/src/ua_parser/caching.py index d112b68..57cef5a 100644 --- a/src/ua_parser/caching.py +++ b/src/ua_parser/caching.py @@ -1,17 +1,27 @@ +from __future__ import annotations + import abc +import dataclasses import threading -from collections import OrderedDict +from collections import OrderedDict, deque from contextvars import ContextVar -from typing import Callable, Dict, Optional, Protocol +from typing import ( + Callable, + Deque, + Dict, + Optional, + Protocol, + Union, +) from .core import Domain, PartialParseResult, Resolver __all__ = [ - "CachingResolver", "Cache", - "Clearing", - "Locking", - "LRU", + "CachingResolver", + "Lru", + "S3Fifo", + "Sieve", ] @@ -34,36 +44,7 @@ def __getitem__(self, key: str) -> Optional[PartialParseResult]: ... -class Clearing: - """A clearing cache, if the cache is full, just remove all the entries - and re-fill from scratch. - - This can also be used as a cheap permanent cache by setting the - ``maxsize`` to infinity (or at least some very large value), - however this is probably a bad idea as it *will* lead to an - ever-growing memory allocation, until every possible user agent - string has been seen. - - Thread-safety: thread-safe, although concurrent insertion may - cause over-clearing of the cache. - - """ - - def __init__(self, maxsize: int): - self.maxsize = maxsize - self.cache: Dict[str, PartialParseResult] = {} - - def __getitem__(self, key: str) -> Optional[PartialParseResult]: - return self.cache.get(key) - - def __setitem__(self, key: str, value: PartialParseResult) -> None: - if key not in self.cache and len(self.cache) >= self.maxsize: - self.cache.clear() - - self.cache[key] = value - - -class LRU: +class Lru: """Cache following a least-recently used replacement policy: when there is no more room in the cache, whichever entry was last seen the least recently is removed. @@ -90,37 +71,162 @@ class LRU: def __init__(self, maxsize: int): self.maxsize = maxsize self.cache: OrderedDict[str, PartialParseResult] = OrderedDict() + self.lock = threading.Lock() def __getitem__(self, key: str) -> Optional[PartialParseResult]: - e = self.cache.get(key) - if e: - self.cache.move_to_end(key) - return e + with self.lock: + e = self.cache.get(key) + if e: + self.cache.move_to_end(key) + return e def __setitem__(self, key: str, value: PartialParseResult) -> None: - self.cache[key] = value - self.cache.move_to_end(key) - while len(self.cache) > self.maxsize: - self.cache.popitem(last=False) + with self.lock: + self.cache[key] = value + self.cache.move_to_end(key) + while len(self.cache) > self.maxsize: + self.cache.popitem(last=False) -class Locking: - """Locking cache decorator. Takes a non-thread-safe cache and - ensures retrieving and setting entries is protected by a mutex. +@dataclasses.dataclass +class CacheEntry: + __slots__ = ["key", "value", "freq"] + key: str + value: PartialParseResult + freq: int - """ - def __init__(self, cache: Cache): - self.cache: Cache = cache +class S3Fifo: + def __init__(self, maxsize: int): + self.maxsize = maxsize + self.index: Dict[str, Union[CacheEntry, str]] = {} + self.small_target = max(1, int(maxsize / 10)) + self.small: Deque[CacheEntry] = deque() + self.main_target = maxsize - self.small_target + self.main: Deque[CacheEntry] = deque() + self.ghost: Deque[str] = deque() self.lock = threading.Lock() def __getitem__(self, key: str) -> Optional[PartialParseResult]: + e = self.index.get(key) + if e and isinstance(e, CacheEntry): + # small race here, we could bump the freq above the limit + e.freq = min(e.freq + 1, 3) + return e.value + + return None + + def __setitem__(self, key: str, r: PartialParseResult) -> None: with self.lock: - return self.cache[key] + if len(self.small) + len(self.main) >= self.maxsize: + # if main is not overcapacity, resize small + if len(self.main) < self.main_target: + self._evict_small() + # evict_small could have moved every entry to main, in + # which case we now need to evict from main + if len(self.small) + len(self.main) >= self.maxsize: + self._evict_main() + + entry = CacheEntry(key, r, 0) + if isinstance(self.index.get(key), str): + self.main.appendleft(entry) + else: + self.small.appendleft(entry) + self.index[key] = entry + + def _evict_main(self) -> None: + while True: + e = self.main.pop() + if e.freq: + e.freq -= 1 + self.main.appendleft(e) + else: + del self.index[e.key] + return + + def _evict_small(self) -> None: + while self.small: + e = self.small.pop() + if e.freq: + e.freq = 0 + self.main.appendleft(e) + else: + g = self.index[e.key] = e.key + self.ghost.appendleft(g) + while len(self.ghost) > self.main_target: + g = self.ghost.pop() + if self.index.get(g) is g: + del self.index[g] + return + + +@dataclasses.dataclass +class SieveNode: + __slots__ = ("key", "value", "visited", "next") + key: str + value: PartialParseResult + visited: bool + next: Optional[SieveNode] + + +class Sieve: + def __init__(self, maxsize: int) -> None: + self.maxsize = maxsize + self.cache: Dict[str, SieveNode] = {} + self.head: Optional[SieveNode] = None + self.tail: Optional[SieveNode] = None + self.hand: Optional[SieveNode] = None + self.prev: Optional[SieveNode] = None + self.lock = threading.Lock() + + def __getitem__(self, key: str) -> Optional[PartialParseResult]: + entry = self.cache.get(key) + if entry: + entry.visited = True + return entry.value + + return None def __setitem__(self, key: str, value: PartialParseResult) -> None: with self.lock: - self.cache[key] = value + if len(self.cache) >= self.maxsize: + self._evict() + + node = self.cache[key] = SieveNode(key, value, False, None) + if self.head: + self.head.next = node + self.head = node + if self.tail is None: + self.tail = node + + def _evict(self) -> None: + obj: Optional[SieveNode] + if self.hand: + obj, pobj = self.hand, self.prev + else: + obj, pobj = self.tail, None + + while obj and obj.visited: + obj.visited = False + if obj.next: + obj, pobj = obj.next, obj + else: + obj, pobj = self.tail, None + + if not obj: + return + + self.hand = obj.next + self.prev = pobj + + del self.cache[obj.key] + if not obj.next: + self.head = pobj + + if pobj: + pobj.next = obj.next + else: + self.tail = obj.next class Local: @@ -129,7 +235,8 @@ class Local: This means the cache capacity and memory consumption is figuratively multiplied by however many threads the cache is used - from, but those threads don't share their caching. + from, but those threads don't share their caching, and thus don't + contend on cache use. """ diff --git a/src/ua_parser/core.py b/src/ua_parser/core.py index 427d1b7..615b9dc 100644 --- a/src/ua_parser/core.py +++ b/src/ua_parser/core.py @@ -22,31 +22,72 @@ class UserAgent: information parsed from the user agent string. """ - family: str = "Other" - major: Optional[str] = None - minor: Optional[str] = None - patch: Optional[str] = None - patch_minor: Optional[str] = None + __slots__ = ("family", "major", "minor", "patch", "patch_minor") + family: str + major: Optional[str] + minor: Optional[str] + patch: Optional[str] + patch_minor: Optional[str] + + def __init__( + self, + family: str = "Other", + major: Optional[str] = None, + minor: Optional[str] = None, + patch: Optional[str] = None, + patch_minor: Optional[str] = None, + ) -> None: + object.__setattr__(self, "family", family) + object.__setattr__(self, "major", major) + object.__setattr__(self, "minor", minor) + object.__setattr__(self, "patch", patch) + object.__setattr__(self, "patch_minor", patch_minor) @dataclass(frozen=True) class OS: """OS information parsed from the user agent string.""" - family: str = "Other" - major: Optional[str] = None - minor: Optional[str] = None - patch: Optional[str] = None - patch_minor: Optional[str] = None + __slots__ = ("family", "major", "minor", "patch", "patch_minor") + family: str + major: Optional[str] + minor: Optional[str] + patch: Optional[str] + patch_minor: Optional[str] + + def __init__( + self, + family: str = "Other", + major: Optional[str] = None, + minor: Optional[str] = None, + patch: Optional[str] = None, + patch_minor: Optional[str] = None, + ) -> None: + object.__setattr__(self, "family", family) + object.__setattr__(self, "major", major) + object.__setattr__(self, "minor", minor) + object.__setattr__(self, "patch", patch) + object.__setattr__(self, "patch_minor", patch_minor) @dataclass(frozen=True) class Device: """Device information parsed from the user agent string.""" - family: str = "Other" - brand: Optional[str] = None - model: Optional[str] = None + __slots__ = ("family", "brand", "model") + family: str + brand: Optional[str] + model: Optional[str] + + def __init__( + self, + family: str = "Other", + brand: Optional[str] = None, + model: Optional[str] = None, + ) -> None: + object.__setattr__(self, "family", family) + object.__setattr__(self, "brand", brand) + object.__setattr__(self, "model", model) class Domain(Flag): @@ -128,6 +169,7 @@ class PartialParseResult: If the flag is unset, the field has not been looked up yet. """ + __slots__ = ("domains", "user_agent", "os", "device", "string") domains: Domain user_agent: Optional[UserAgent] os: Optional[OS] diff --git a/tests/test_caches.py b/tests/test_caches.py index e41d978..ccce4de 100644 --- a/tests/test_caches.py +++ b/tests/test_caches.py @@ -3,44 +3,22 @@ from ua_parser import ( BasicResolver, CachingResolver, - Clearing, Device, Domain, - LRU, OS, Parser, PartialParseResult, UserAgent, ) +from ua_parser.caching import Lru from ua_parser.matchers import DeviceMatcher, OSMatcher, UserAgentMatcher -def test_clearing(): - """Tests that the cache correctly gets cleared to make room for new - entries. - """ - cache = Clearing(2) - p = Parser(CachingResolver(BasicResolver(([], [], [])), cache)) - - p.parse("a") - p.parse("b") - - assert cache.cache == { - "a": PartialParseResult(Domain.ALL, None, None, None, "a"), - "b": PartialParseResult(Domain.ALL, None, None, None, "b"), - } - - p.parse("c") - assert cache.cache == { - "c": PartialParseResult(Domain.ALL, None, None, None, "c"), - } - - def test_lru(): """Tests that the cache entries do get moved when accessed, and are popped LRU-first. """ - cache = LRU(2) + cache = Lru(2) p = Parser(CachingResolver(BasicResolver(([], [], [])), cache)) p.parse("a") @@ -67,7 +45,7 @@ def test_backfill(): """Tests that caches handle partial parsing correctly, by updating the existing entry when new parts get parsed. """ - cache = Clearing(2) + cache = Lru(2) p = Parser( CachingResolver( BasicResolver( diff --git a/tests/test_core.py b/tests/test_core.py index 2c89ba5..c2d874d 100644 --- a/tests/test_core.py +++ b/tests/test_core.py @@ -30,7 +30,6 @@ Parser, ParseResult, UserAgent, - caching, load_builtins, load_lazy_builtins, ) @@ -42,24 +41,6 @@ PARSERS = [ pytest.param(Parser(BasicResolver(load_builtins())), id="basic"), pytest.param(Parser(BasicResolver(load_lazy_builtins())), id="lazy"), - pytest.param( - Parser( - caching.CachingResolver( - BasicResolver(load_builtins()), - caching.Clearing(10), - ) - ), - id="clearing", - ), - pytest.param( - Parser( - caching.CachingResolver( - BasicResolver(load_builtins()), - caching.LRU(10), - ) - ), - id="lru", - ), ] try: from ua_parser import re2