
_lookup() generates KeyError exception #132

Closed
michael-a-green opened this issue Aug 23, 2022 · 10 comments

@michael-a-green

The method user_agent_parser._lookup() raises a KeyError exception at line 237 of uap-python/ua_parser/user_agent_parser.py, and not all callers of this method can handle the exception.

It would be better if line 237 were this:

    _PARSE_CACHE.pop(next(iter(_PARSE_CACHE)), "UNKNOWN")

I'll send a pull request. Please let me know if you need more info.
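
For context, here is a minimal sketch of the eviction logic in question (simplified; `_PARSE_CACHE` and `MAX_CACHE_SIZE` are the module's names, `parse_ua` is a hypothetical stand-in for the real parse call):

    _PARSE_CACHE = {}
    MAX_CACHE_SIZE = 20  # illustrative value, not the module's actual setting

    def _lookup(ua):
        entry = _PARSE_CACHE.get(ua)
        if entry is not None:
            return entry
        if len(_PARSE_CACHE) >= MAX_CACHE_SIZE:
            # line 237: without a default, pop() raises KeyError if the
            # key disappears between next(iter(...)) and the pop itself
            _PARSE_CACHE.pop(next(iter(_PARSE_CACHE)))
        entry = parse_ua(ua)  # hypothetical stand-in for the real parse
        _PARSE_CACHE[ua] = entry
        return entry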

@michael-a-green michael-a-green changed the title _lookup() generates KeyError exception _lookup() generates KeyError exception Aug 23, 2022
@masklinn
Contributor

I'm a bit confused as to how you hit this error, and neither this nor #133 explains the error conditions.

The code should be popping the first key found in the map, and said map should have items, otherwise the condition shouldn't trigger. Are you setting MAX_CACHE_SIZE to a value below 1?

@michael-a-green
Author

michael-a-green commented Aug 26, 2022

Hi @masklinn ,

No, I did not modify MAX_CACHE_SIZE from what it's set to in the original code.

It's clear from the definition of the pop() method on Python dictionaries that, without a default value, an exception is raised if the key doesn't exist in the dictionary. So I figured this just increases the robustness of the code, irrespective of how the condition came about.

Please let me know if you need a test case showing the exception being thrown. Thanks.
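
For illustration of that behaviour:

    d = {"a": 1}
    d.pop("missing", None)  # with a default: returns None, no exception
    d.pop("missing")        # without one: raises KeyError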

@mattrobenolt
Member

mattrobenolt commented Aug 26, 2022

It could theoretically be triggered by a race condition between multiple threads. I would just personally use None as the default value tho. Or a try..except KeyError, but it doesn't particularly matter.

It's just 100% possible to trigger this with multithreading.
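
Either variant would look roughly like this (sketches, not a concrete patch):

    # defaulted pop: returns None instead of raising if the key is gone
    _PARSE_CACHE.pop(next(iter(_PARSE_CACHE)), None)

    # or an explicit try/except around the eviction
    try:
        del _PARSE_CACHE[next(iter(_PARSE_CACHE))]
    except (StopIteration, KeyError):
        pass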

@michael-a-green
Author

michael-a-green commented Aug 26, 2022

OK, should I change it to return None? More context: the exception was triggered while running this method via apply() on a Dask dataframe, which runs multi-threaded. It doesn't trigger when I run the exact same thing on a pandas dataframe, which is single-threaded only.
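
Roughly what the triggering setup looked like (a hypothetical sketch, not my exact code; the column name, UA string, and partition count are made up):

    import pandas as pd
    import dask.dataframe as dd
    from ua_parser import user_agent_parser

    pdf = pd.DataFrame({"ua": ["Mozilla/5.0 (X11; Linux x86_64)"] * 10000})
    ddf = dd.from_pandas(pdf, npartitions=8)

    # Parse() goes through the module-level cache; running it from
    # multiple threads at once is what exposes the race in _lookup()
    parsed = ddf["ua"].apply(user_agent_parser.Parse, meta=("ua", "object"))
    parsed.compute(scheduler="threads")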

@mattrobenolt
Member

Right, the race condition is between iterating (reading) the keys of the dictionary and deleting. So if two threads are using this library at the same time, then once you reach capacity, the idea of "removing the top one" means both threads contend for that spot and both try to remove the same top entry.

Overall, I think using a dict here is less than ideal because of this. The max size can't really be maintained well, since if a pop here fails, the dict grows beyond the intended size.

A lock/mutex could solve this, though; I just don't know what the author's intent was. I did not write this.

The issue is that if the pop() fails because of a race condition, this dict is effectively going to be oversized forever, unless the pop happens within a while loop whose condition is that the dict is still at or above the max size. Otherwise, this dict can keep growing indefinitely.

The naive approach I originally had, a trick stolen from the stdlib re package, was to just clear the entire dict once it exceeded the size limit, which doesn't have this issue. It's also less than ideal, but I think leveraging the stdlib lru_cache is probably the move here if we're trying to do anything more complicated.

This current implementation is just unsafe and will leak memory.
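
A sketch of the lock-plus-while-loop idea (hypothetical; `_CACHE_LOCK` and `_evict_if_full` are made-up names, not anything in the library):

    import threading

    _CACHE_LOCK = threading.Lock()

    def _evict_if_full(cache, max_size):
        # evict under a lock until the cache is back under its limit,
        # so concurrent evictions can't double-pop or leave it oversized
        with _CACHE_LOCK:
            while len(cache) >= max_size:
                cache.pop(next(iter(cache)), None)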

@mattrobenolt
Member

I guess I didn't answer your question directly: it doesn't really matter, since this fix isn't fixing the underlying problem, so it's relatively insignificant. It'll stop the KeyError, but instead you're just going to consume potentially unbounded memory.

@masklinn
Contributor

masklinn commented Aug 26, 2022

> It could theoretically be triggered by a race condition between multiple threads. I would just personally use None as the default value tho. Or a try..except KeyError, but it doesn't particularly matter.
>
> It's just 100% possible to trigger this with multithreading.

Good point, I guess I just didn't really consider that factor.

> The naive approach I originally had, a trick stolen from the stdlib re package, was to just clear the entire dict once it exceeded the size limit, which doesn't have this issue.

This version is actually also stolen from the stdlib, though the stdlib uses del rather than pop (which saves one opcode, maybe?) and does wrap the entire thing in a try/except:

    if len(_cache) >= _MAXCACHE:
        # Drop the oldest item
        try:
            del _cache[next(iter(_cache))]
        except (StopIteration, RuntimeError, KeyError):
            pass

> It's also less than ideal, but I think leveraging the stdlib lru_cache is probably the move here if we're trying to do anything more complicated.

That is very much something I'd want to do, but 0.x still supports 2.x so it wasn't an option.

I definitely should have used odict though (similar to the previous version of the stdlib, between bpo-28293 and bpo-30215), and I can't say why I didn't (or why I felt clever enough not to just straight copy from the stdlib); that's definitely on me.
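
For reference, the odict variant would have looked something like this (a sketch, not anything that shipped; `_cache_put` is a made-up helper name, `MAX_CACHE_SIZE` is the module's constant):

    from collections import OrderedDict

    _PARSE_CACHE = OrderedDict()

    def _cache_put(key, value, max_size=MAX_CACHE_SIZE):
        if len(_PARSE_CACHE) >= max_size:
            # drop the oldest insertion (FIFO); works the same on 2.7
            _PARSE_CACHE.popitem(last=False)
        _PARSE_CACHE[key] = value

It still wouldn't prevent double-pops or double-inserts on its own; it just makes the eviction order explicit on pre-3.6 Pythons.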

@mattrobenolt
Member

> This version is actually also stolen from the stdlib, though the stdlib uses del rather than pop (which saves one opcode, maybe?) and does wrap the entire thing in a try/except

Yeah, that's interesting. I didn't realize it changed in the stdlib, but yes, I'd agree that del is likely faster than pop when the operation is more likely to succeed than to fail, which in this case it is.

> That is very much something I'd want to do, but 0.x still supports 2.x so it wasn't an option.

I'd, uh, look into dropping that, or bump rev and drop it. 2.x has been officially EOL'd for a while now.

> I definitely should have used odict though (similar to the previous version of the stdlib, between bpo-28293 and bpo-30215), and I can't say why I didn't (or why I felt clever enough not to just straight copy from the stdlib); that's definitely on me.

Yeah, I think without OrderedDict, since you're trying to support 2.x, this actually is a less than ideal implementation. Being able to leverage the naturally ordered dict in 3.6+ is a nice benefit of what the stdlib is doing.

I don't think I'd try to accommodate less than 3.6 with changes around here.

I'd suggest, if you truly need to support 2.x, putting it back how it was and just clearing the whole dict, since it's not really deterministic otherwise.

If you want to rev and support 3.6+, just use lru_cache. The stdlib is still choosing not to use lru_cache for that specific case because of the additional complexity it imposes on the caller. They discussed this when it was implemented. Otherwise, there's no reason not to use the stdlib lru_cache instead and avoid all of this.

I'd still vote to simply revert this here back to how it was, then move forward with a 3.6+ compatible version, which can probably clean up other code and be only 3.x compatible moving forward.

I think there are too many downsides for little gain this way on 2.x.
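
On 3.6+ the whole thing could collapse to something like this (sketch; `parse_ua` is a made-up stand-in for the real parse entry point):

    from functools import lru_cache

    @lru_cache(maxsize=MAX_CACHE_SIZE)
    def _lookup(ua):
        # lru_cache handles eviction and locks its own bookkeeping,
        # and _lookup.cache_info() gives hit/miss stats for free
        return parse_ua(ua)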

@mattrobenolt
Member

Or alternatively, do a different cache implementation based on checking the Python version: keep it how it was for 2.x, and use lru_cache if it's available.

Something like:

    try:
        from functools import lru_cache
        has_lru_cache = True
    except ImportError:
        has_lru_cache = False

And then branch on that, along the lines of the sketch below.
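
The branch itself might look roughly like this (hypothetical sketch; `parse_ua` and `cached_parse` are illustrative names, not the module's actual API):

    if has_lru_cache:
        cached_parse = lru_cache(maxsize=MAX_CACHE_SIZE)(parse_ua)
    else:
        _PARSE_CACHE = {}

        def cached_parse(ua):
            try:
                return _PARSE_CACHE[ua]
            except KeyError:
                pass
            if len(_PARSE_CACHE) >= MAX_CACHE_SIZE:
                # 2.x fallback: reset the whole cache, like the old
                # stdlib re trick, so there's no eviction to race over
                _PARSE_CACHE.clear()
            result = _PARSE_CACHE[ua] = parse_ua(ua)
            return result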

@masklinn
Contributor

masklinn commented Aug 27, 2022

> I'd suggest, if you truly need to support 2.x, putting it back how it was and just clearing the whole dict, since it's not really deterministic otherwise.

Yeah, I'm going to revert to just clearing the entire thing. Without a good dataset to bench on, it's hard to know what the tradeoff is between the potential extra misses from that, the potential extra work from a big lock, and the "overflow" from not synchronising the cache (which was always an issue, but still).

masklinn added a commit to masklinn/uap-python that referenced this issue Aug 27, 2022
Amongst other changes, ua-parser#113 switched the cache to a FIFO inspired by
the standard library's re module; however, it didn't really take
concurrency into account, and so didn't consider that double-pops
are possible (probably why the stdlib ignores a bunch of errors),
which can cause a KeyError during lookup (two workers try to clear
the first key, one succeeds, and the other doesn't find the key and
fails).

It also has a few other less major issues:

- double-inserts are possible, which can cause the cache to exceed its
  set capacity permanently, by up to the number of concurrent workers
- the stdlib's method only works properly with Python 3.6's naturally
  ordered `dict`, but I'd rather not drop 2.7 compatibility from 0.x
  unless there are very good causes to, as despite 2.7 having been
  EOL'd in 2020 it still accounts for more downloads than 3.10
  (according to pypistats)

Using an ordered dict would solve the ordering issue, and allow using
an LRU rather than a FIFO, but it would not actually prevent
double-pops or double-inserts; that would require a proper lock on
lookup. Such a lock might not be that expensive, but given the lack of
a good dataset to bench with, it seems like a lot of additional
complexity for something we've got no visibility on. That can be
reconsidered if someone reports a serious performance regression from
this.

So for now just revert to a "reset" cache replacement policy. If /
when we drop older versions we can switch to `functools.lru_cache` and
let the stdlib take care of this (and possibly have cache
stats). Alternatively if we get a good testing dataset one day we can
bench cache replacement policies or even provide pluggable policies.

Anyway fixes ua-parser#132, closes ua-parser#133
masklinn added a commit to masklinn/uap-python that referenced this issue Aug 27, 2022
masklinn added a commit that referenced this issue Aug 27, 2022
Fix for #132 on 0.15
masklinn added a commit to masklinn/uap-python that referenced this issue Aug 27, 2022
Merge fix of ua-parser#132 from 0.15.x