_lookup() generates KeyError exception #132
Comments
I'm a bit confused as to how you hit this error, and neither this nor #133 explains the error conditions. The code should be popping the first key found in the map, and said map should have items, otherwise the condition shouldn't trigger. Are you setting …?
Hi @masklinn, no, I did not modify …. It's clear based on the definition of the …. Please let me know if you need a test case showing the exception being thrown. Thanks.
It could. It's just 100% possible to trigger this with multithreading.
OK, should I change it to return … instead?
Right, the race condition is between iterating (reading) the keys of the dictionary and deleting. So if two threads are using this library at the same time, once you reach capacity, the idea of "removing the top one" is going to be contending for that spot, and both threads are going to be trying to remove the top.

Overall, I think using a dict here is less than ideal because of this. The max size can't really be maintained well, since if a pop here fails, the cache will grow beyond the intended size. A lock/mutex can solve this, though; I just don't know what the author's intent is.

I did not write this. The issue is that if the …

The naive approach, which I had (the trick was stolen from the stdlib re package), was to just clear the entire dict once it exceeded size, which doesn't have this issue. It's also less than ideal, but I think leveraging the stdlib lru_cache is probably the move here if we're trying to do anything more complicated. This current implementation is just unsafe and will leak memory.
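To make the race concrete, here's a minimal sketch — hypothetical names, not the library's actual code — of the unsafe check-then-evict pattern beside a lock-protected variant:

```python
import threading

MAX_CACHE_SIZE = 200      # hypothetical capacity
_cache = {}               # shared dict used as a FIFO cache
_lock = threading.Lock()

def cache_put_unsafe(key, value):
    # Two threads can both see the cache as full, both read the same
    # "first" key, and the loser's `del` raises KeyError because the
    # entry is already gone.
    if len(_cache) >= MAX_CACHE_SIZE:
        del _cache[next(iter(_cache))]   # check + evict is not atomic
    _cache[key] = value

def cache_put_locked(key, value):
    # The lock makes check-and-evict atomic, at the cost of contention
    # on every insert.
    with _lock:
        if len(_cache) >= MAX_CACHE_SIZE:
            del _cache[next(iter(_cache))]
        _cache[key] = value
```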
I guess I didn't answer your question directly: it doesn't really matter, since this fix isn't fixing the underlying problem, so it's relatively insignificant. It'll stop the KeyError, but instead you're just going to consume potentially unbounded memory.
Good point, I guess I just didn't really consider that factor.
This version is actually also stolen from the stdlib, though the stdlib uses:

```python
if len(_cache) >= _MAXCACHE:
    # Drop the oldest item
    try:
        del _cache[next(iter(_cache))]
    except (StopIteration, RuntimeError, KeyError):
        pass
```
That is very much something I'd want to do, but 0.x still supports 2.x, so it wasn't an option. I definitely should have used odict though (similar to the previous version of the stdlib, between bpo-28293 and bpo-30215), and I can't say why I didn't (or felt clever enough not to just straight copy from the stdlib); that's definitely on me.
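For reference, a minimal sketch of the OrderedDict-based LRU being alluded to (hypothetical names; pop-and-reinsert is used because `move_to_end` doesn't exist on 2.7):

```python
from collections import OrderedDict  # available on both 2.7 and 3.x

MAX_CACHE_SIZE = 200  # hypothetical capacity
_cache = OrderedDict()

def lru_get(key):
    # Re-inserting on access keeps recently used entries at the end,
    # so eviction from the front drops the least recently used one.
    value = _cache.pop(key)  # raises KeyError on a miss
    _cache[key] = value
    return value

def lru_put(key, value):
    if key in _cache:
        _cache.pop(key)
    elif len(_cache) >= MAX_CACHE_SIZE:
        _cache.popitem(last=False)  # evict the least recently used entry
    _cache[key] = value
```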
Yeah, that's interesting. I didn't realize it changed in the stdlib, but yes, I'd agree that …
I'd, uhh, look into dropping that, or bump the rev and drop it. 2.x has been officially EOL'd for a while now.
Yeah, I think without ordereddict, since you're trying to support 2.x, this actually is a less than ideal implementation. Being able to leverage the naturally ordered dict in 3.6+ is a nice benefit of what the stdlib is doing. I don't think I'd try to accommodate less than 3.6 with changes around here.

I'd suggest, if you truly need to support 2.x, putting it back how it was and just clearing the whole dict, since it's not really deterministic otherwise. If you want to rev and support 3.6+, just use lru_cache (see the sketch below). The stdlib is still choosing not to use lru_cache for that specific instance because of the additional complexity it places on the caller; they discussed this when it was implemented. Otherwise, there's no reason not to use the stdlib lru_cache instead and avoid all of this.

I'd still vote to simply revert this back to how it was, then move forward with a 3.6+ compatible version, which can probably clean up other code and be only 3.x compatible moving forward. I think there are too many downsides for little gain this way on 2.x.
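A sketch of what the lru_cache route could look like on 3.2+ — the function name, body, and maxsize here are stand-ins, not the library's actual API:

```python
from functools import lru_cache

@lru_cache(maxsize=200)  # thread-safe, bounded LRU maintained by the stdlib
def lookup(ua_string):
    # ... the expensive user agent parse would go here ...
    return ua_string.lower()  # placeholder body

# The decorator also exposes cache statistics for free:
print(lookup.cache_info())  # CacheInfo(hits=0, misses=0, maxsize=200, currsize=0)
```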
Or alternatively, do a different cache implementation based on checking the Python version: do it how it was for 2.x, and use lru_cache if it's available. Something like:

```python
try:
    from functools import lru_cache
    has_lru_cache = True
except ImportError:
    has_lru_cache = False
```

And do whatever differently.
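Building on that import probe, the flag could then gate two strategies; in this sketch, `parse`, `memoized`, and the capacity of 200 are all hypothetical:

```python
from functools import wraps

def memoized(func):
    # 2.x fallback mirroring the pre-FIFO behaviour: memoize into a dict
    # and reset it wholesale once it reaches capacity.
    cache = {}

    @wraps(func)
    def wrapper(arg):
        if arg not in cache:
            if len(cache) >= 200:  # hypothetical capacity
                cache.clear()
            cache[arg] = func(arg)
        return cache[arg]
    return wrapper

def parse(ua):
    return ua.lower()  # stand-in for the real parsing work

# `has_lru_cache` / `lru_cache` come from the import probe above.
if has_lru_cache:
    lookup = lru_cache(maxsize=200)(parse)  # stdlib-managed, thread-safe LRU
else:
    lookup = memoized(parse)                # reset-on-full dict cache
```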
Yeah, I'm going to revert to just clearing the entire thing. Without a good dataset to bench on, it's hard to know what the tradeoff is between the potential extra misses from that, the potential extra work from a big lock, and the "overflow" from not synchronising the cache (which was always an issue, but still).
Amongst other changes, ua-parser#113 switched the cache to a FIFO inspired by the standard library's re module; however, it didn't really take concurrency into account, so it didn't consider that double-pops are possible (probably why the stdlib ignores a bunch of errors), which can cause a KeyError during lookup (two workers try to clear the first key: one succeeds, and the other doesn't find the key and fails).

It also has a few other, less major issues:

- double-inserts are possible, which can cause the cache to exceed the set capacity permanently by the number of concurrent workers
- the stdlib's method only works properly with Python 3.6's naturally ordered `dict`, but I'd rather not drop 2.7 compatibility from 0.x unless there are very good causes to, as despite 2.7 having been EOL'd in 2020, it still accounts for more downloads than 3.10 (according to pypistats)

Using an ordered dict would solve the last point, and allow using an LRU rather than a FIFO, but it would not actually prevent double-pops or double-inserts; that would require a proper lock on lookup. That might not be that expensive, but given the lack of a good dataset to bench with, it seems like a lot of additional complexity for something we've got no visibility on. It can be considered if someone reports a serious performance regression from this.

So for now, just revert to a "reset" cache replacement policy. If / when we drop older versions, we can switch to `functools.lru_cache` and let the stdlib take care of this (and possibly get cache stats). Alternatively, if we get a good testing dataset one day, we can bench cache replacement policies or even provide pluggable policies.

Anyway, fixes ua-parser#132, closes ua-parser#133
Fix for ua-parser#132 on 0.15
Merge fix of ua-parser#132 from 0.15.x
Method `user_agent_parser::_lookup()` generates a `KeyError` exception at line 237 in `uap-python/ua_parser/user_agent_parser.py`. Not all callers of this method can handle the exception. It would be better if line 237 was this: …

I'll send a pull request. Please let me know if you need more info.
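The reporter's exact proposed replacement for line 237 isn't preserved in this thread, but the general shape of such a defensive change — replacing a bare `del`/indexing with `dict.pop` and a default so a missing key no longer raises — is, for illustration only:

```python
_cache = {"a": 1}

# A bare `del` (or indexing) raises if another thread removed the key first:
# del _cache["gone"]          # KeyError: 'gone'

# `pop` with a default returns the default instead of raising:
value = _cache.pop("gone", None)
print(value)  # None
```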