
_lookup() generates KeyError exception #132

Closed
michael-a-green opened this issue Aug 23, 2022 · 10 comments

@michael-a-green

The method user_agent_parser._lookup() raises a KeyError exception at line 237 of uap-python/ua_parser/user_agent_parser.py, and not all callers of this method can handle the exception.

It would be better if line 237 were this:

    _PARSE_CACHE.pop(next(iter(_PARSE_CACHE)), "UNKNOWN")

I'll send a pull request. Please let me know if you need more info.
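
For context, here is a minimal sketch of the eviction logic in question (simplified; `_PARSE_CACHE` and `MAX_CACHE_SIZE` are the module's names, `parse_ua` is a hypothetical stand-in for the real parse call):

    _PARSE_CACHE = {}
    MAX_CACHE_SIZE = 20  # illustrative value, not the module's actual setting

    def _lookup(ua):
        entry = _PARSE_CACHE.get(ua)
        if entry is not None:
            return entry
        if len(_PARSE_CACHE) >= MAX_CACHE_SIZE:
            # line 237: without a default, pop() raises KeyError if the
            # key disappears between next(iter(...)) and the pop itself
            _PARSE_CACHE.pop(next(iter(_PARSE_CACHE)))
        entry = parse_ua(ua)  # hypothetical stand-in for the real parse
        _PARSE_CACHE[ua] = entry
        return entry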

@michael-a-green michael-a-green changed the title _lookup() generates KeyError exception _lookup() generates KeyError exception Aug 23, 2022
@masklinn
Contributor

I'm a bit confused as to how you hit this error, and neither this nor #133 explains the error conditions.

The code should be popping the first key found in the map, and said map should have items, otherwise the condition shouldn't trigger. Are you setting MAX_CACHE_SIZE to a value below 1?

@michael-a-green
Author

michael-a-green commented Aug 26, 2022

Hi @masklinn ,

No, I did not modify MAX_CACHE_SIZE from what it's set to in the original code.

It's clear from the definition of the pop() method on Python dictionaries that, without a default value, an exception is raised if the key doesn't exist in the dictionary. So I figured this just increases the robustness of the code, irrespective of how the condition came about.

Please let me know if you need a test case showing the exception being thrown. Thanks.
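
For illustration of that behaviour:

    d = {"a": 1}
    d.pop("missing", None)  # with a default: returns None, no exception
    d.pop("missing")        # without one: raises KeyError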

@mattrobenolt
Member

mattrobenolt commented Aug 26, 2022

It could theoretically be triggered by a race condition between multiple threads. I would just personally use None as the default value tho. Or a try..except KeyError, but it doesn't particularly matter.

It's just 100% possible to trigger this with multithreading.
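
Either variant would look roughly like this (sketches, not a concrete patch):

    # defaulted pop: returns None instead of raising if the key is gone
    _PARSE_CACHE.pop(next(iter(_PARSE_CACHE)), None)

    # or an explicit try/except around the eviction
    try:
        del _PARSE_CACHE[next(iter(_PARSE_CACHE))]
    except (StopIteration, KeyError):
        pass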

@michael-a-green
Author

michael-a-green commented Aug 26, 2022

OK, should I change it to return None? More context: the exception was triggered while running this method via apply() on a Dask dataframe, which runs multi-threaded. It doesn't trigger when I run the exact same thing on a pandas dataframe, which is single-threaded only.
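
Roughly what the triggering setup looked like (a hypothetical sketch, not my exact code; the column name, UA string, and partition count are made up):

    import pandas as pd
    import dask.dataframe as dd
    from ua_parser import user_agent_parser

    pdf = pd.DataFrame({"ua": ["Mozilla/5.0 (X11; Linux x86_64)"] * 10000})
    ddf = dd.from_pandas(pdf, npartitions=8)

    # Parse() goes through the module-level cache; running it from
    # multiple threads at once is what exposes the race in _lookup()
    parsed = ddf["ua"].apply(user_agent_parser.Parse, meta=("ua", "object"))
    parsed.compute(scheduler="threads")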

@mattrobenolt
Member

Right, the race condition is between iterating (reading) the keys of the dictionary and deleting. So if two threads are using this library at the same time, then once you reach capacity, the idea of "removing the top one" means both threads contend for that spot and both try to remove the same top entry.

Overall, I think using a dict here is less than ideal because of this. The max size can't really be maintained well, since if a pop here fails, the dict grows beyond the intended size.

A lock/mutex could solve this, though; I just don't know what the author's intent was. I did not write this.

The issue is that if the pop() fails because of a race condition, this dict is effectively going to be oversized forever, unless the pop happens within a while loop whose condition is that the dict is still at or above the max size. Otherwise, this dict can keep growing indefinitely.

The naive approach I originally had, a trick stolen from the stdlib re package, was to just clear the entire dict once it exceeded the size limit, which doesn't have this issue. It's also less than ideal, but I think leveraging the stdlib lru_cache is probably the move here if we're trying to do anything more complicated.

This current implementation is just unsafe and will leak memory.
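
A sketch of the lock-plus-while-loop idea (hypothetical; `_CACHE_LOCK` and `_evict_if_full` are made-up names, not anything in the library):

    import threading

    _CACHE_LOCK = threading.Lock()

    def _evict_if_full(cache, max_size):
        # evict under a lock until the cache is back under its limit,
        # so concurrent evictions can't double-pop or leave it oversized
        with _CACHE_LOCK:
            while len(cache) >= max_size:
                cache.pop(next(iter(cache)), None)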

@mattrobenolt
Member

I guess I didn't answer your question directly: it doesn't really matter, since this fix isn't fixing the underlying problem, so it's relatively insignificant. It'll stop the KeyError, but instead you're just going to consume potentially unbounded memory.

@masklinn
Contributor

masklinn commented Aug 26, 2022

> It could theoretically be triggered by a race condition between multiple threads. I would just personally use None as the default value tho. Or a try..except KeyError, but it doesn't particularly matter.
>
> It's just 100% possible to trigger this with multithreading.

Good point, I guess I just didn't really consider that factor.

> The naive approach I originally had, a trick stolen from the stdlib re package, was to just clear the entire dict once it exceeded the size limit, which doesn't have this issue.

This version is actually also stolen from the stdlib, though the stdlib uses del rather than pop (which saves one opcode, maybe?) and does wrap the entire thing in a try/except:

    if len(_cache) >= _MAXCACHE:
        # Drop the oldest item
        try:
            del _cache[next(iter(_cache))]
        except (StopIteration, RuntimeError, KeyError):
            pass

> It's also less than ideal, but I think leveraging the stdlib lru_cache is probably the move here if we're trying to do anything more complicated.

That is very much something I'd want to do, but 0.x still supports 2.x so it wasn't an option.

I definitely should have used odict though (similar to the previous version of the stdlib, between bpo-28293 and bpo-30215), and I can't say why I didn't (or why I felt clever enough not to just straight copy from the stdlib); that's definitely on me.
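
For reference, the odict variant would have looked something like this (a sketch, not anything that shipped; `_cache_put` is a made-up helper name, `MAX_CACHE_SIZE` is the module's constant):

    from collections import OrderedDict

    _PARSE_CACHE = OrderedDict()

    def _cache_put(key, value, max_size=MAX_CACHE_SIZE):
        if len(_PARSE_CACHE) >= max_size:
            # drop the oldest insertion (FIFO); works the same on 2.7
            _PARSE_CACHE.popitem(last=False)
        _PARSE_CACHE[key] = value

It still wouldn't prevent double-pops or double-inserts on its own; it just makes the eviction order explicit on pre-3.6 Pythons.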

@mattrobenolt
Member

> This version is actually also stolen from the stdlib, though the stdlib uses del rather than pop (which saves one opcode, maybe?) and does wrap the entire thing in a try/except

Yeah, that's interesting. I didn't realize it changed in the stdlib, but yes, I'd agree that del is likely faster than pop when the operation is more likely to succeed than to fail, which in this case it is.

> That is very much something I'd want to do, but 0.x still supports 2.x so it wasn't an option.

I'd, uh, look into dropping that, or bump rev and drop it. 2.x has been officially EOL'd for a while now.

> I definitely should have used odict though (similar to the previous version of the stdlib, between bpo-28293 and bpo-30215), and I can't say why I didn't (or why I felt clever enough not to just straight copy from the stdlib); that's definitely on me.

Yeah, I think without OrderedDict, since you're trying to support 2.x, this actually is a less than ideal implementation. Being able to leverage the naturally ordered dict in 3.6+ is a nice benefit of what the stdlib is doing.

I don't think I'd try to accommodate less than 3.6 with changes around here.

I'd suggest, if you truly need to support 2.x, putting it back how it was and just clearing the whole dict, since it's not really deterministic otherwise.

If you want to rev and support 3.6+, just use lru_cache. The stdlib is still choosing not to use lru_cache for that specific case because of the additional complexity it imposes on the caller. They discussed this when it was implemented. Otherwise, there's no reason not to use the stdlib lru_cache instead and avoid all of this.

I'd still vote to simply revert this here back to how it was, then move forward with a 3.6+ compatible version, which can probably clean up other code and be only 3.x compatible moving forward.

I think there are too many downsides for little gain this way on 2.x.
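
On 3.6+ the whole thing could collapse to something like this (sketch; `parse_ua` is a made-up stand-in for the real parse entry point):

    from functools import lru_cache

    @lru_cache(maxsize=MAX_CACHE_SIZE)
    def _lookup(ua):
        # lru_cache handles eviction and locks its own bookkeeping,
        # and _lookup.cache_info() gives hit/miss stats for free
        return parse_ua(ua)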

@mattrobenolt
Member

Or alternatively, do a different cache implementation based on checking the Python version: keep it how it was for 2.x, and use lru_cache if it's available.

Something like:

    try:
        from functools import lru_cache
        has_lru_cache = True
    except ImportError:
        has_lru_cache = False

And then branch on that, along the lines of the sketch below.
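
The branch itself might look roughly like this (hypothetical sketch; `parse_ua` and `cached_parse` are illustrative names, not the module's actual API):

    if has_lru_cache:
        cached_parse = lru_cache(maxsize=MAX_CACHE_SIZE)(parse_ua)
    else:
        _PARSE_CACHE = {}

        def cached_parse(ua):
            try:
                return _PARSE_CACHE[ua]
            except KeyError:
                pass
            if len(_PARSE_CACHE) >= MAX_CACHE_SIZE:
                # 2.x fallback: reset the whole cache, like the old
                # stdlib re trick, so there's no eviction to race over
                _PARSE_CACHE.clear()
            result = _PARSE_CACHE[ua] = parse_ua(ua)
            return result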

@masklinn
Contributor

masklinn commented Aug 27, 2022

> I'd suggest, if you truly need to support 2.x, putting it back how it was and just clearing the whole dict, since it's not really deterministic otherwise.

Yeah, I'm going to revert to just clearing the entire thing. Without a good dataset to bench on, it's hard to know what the tradeoff is between the potential extra misses from that, the potential extra work from a big lock, and the "overflow" from not synchronising the cache (which was always an issue, but still).

masklinn added a commit to masklinn/uap-python that referenced this issue Aug 27, 2022
Amongst other changes, ua-parser#113 switched the cache to a FIFO inspired by
the standard library's re module; however, it didn't really take
concurrency into account, and so didn't consider that double-pops
are possible (probably why the stdlib ignores a bunch of errors),
which can cause a KeyError during lookup (two workers try to clear
the first key, one succeeds, and the other doesn't find the key and
fails).

It also has a few other less major issues:

- double-inserts are possible, which can cause the cache to exceed its
  set capacity permanently, by up to the number of concurrent workers
- the stdlib's method only works properly with Python 3.6's naturally
  ordered `dict`, but I'd rather not drop 2.7 compatibility from 0.x
  unless there are very good causes to, as despite 2.7 having been
  EOL'd in 2020 it still accounts for more downloads than 3.10
  (according to pypistats)

Using an ordered dict would solve the ordering issue, and allow using
an LRU rather than a FIFO, but it would not actually prevent
double-pops or double-inserts; that would require a proper lock on
lookup. Such a lock might not be that expensive, but given the lack of
a good dataset to bench with, it seems like a lot of additional
complexity for something we've got no visibility on. That can be
reconsidered if someone reports a serious performance regression from
this.

So for now just revert to a "reset" cache replacement policy. If /
when we drop older versions we can switch to `functools.lru_cache` and
let the stdlib take care of this (and possibly have cache
stats). Alternatively if we get a good testing dataset one day we can
bench cache replacement policies or even provide pluggable policies.

Anyway fixes ua-parser#132, closes ua-parser#133
masklinn added a commit to masklinn/uap-python that referenced this issue Aug 27, 2022
masklinn added a commit that referenced this issue Aug 27, 2022
Fix for #132 on 0.15
masklinn added a commit to masklinn/uap-python that referenced this issue Aug 27, 2022
Merge fix of ua-parser#132 from 0.15.x