
Use a wrapper class for cache hash to prevent cache serialization #620

Merged (8 commits) into python-attrs:master on Feb 10, 2020

Conversation

@pganssle (Member) commented Jan 27, 2020

When reviewing #615, @godlygeek came up with this brilliant alternative suggestion: use an int subclass that always pickles as None, so that unless someone goes out of their way to capture the cache value and include it in a __reduce__ or __setstate__, the cache will be cleared on any copy operation.

My main fears (both of which turned out to be unsubstantiated) were:

  1. This might slow down hash() in exchange for serialization/deserialization correctness that only matters to people implementing their own __reduce__ functions, which seems like a very niche concern.
  2. Returning a _CacheHashWrapper from __hash__ might cause other problems, e.g. if someone were taking the hash value of something and pickling or copying it for some deliberate reason (i.e. not copying the object it's stored on, but copying the return value of hash(x)).

In terms of performance, it seems like the only performance impact this has is on the first call to hash(), with:

import attr
@attr.s(frozen=True, cache_hash=True)
class A:
    x = attr.ib()

For %timeit hash(A(1)), I get around 800ns with the current version of attrs and ~1-1.2μs with this change, so about a 300-400ns cost on cache misses. It's a decently high percentage of the hash for this class, but it is kinda stupid to use cache_hash for this class if you're mostly getting cache misses anyway...

For a = A(1), %timeit hash(a), I get around 220-250ns per call with the old and new versions.

Regarding point 2, returning an integer subclass here instead of a plain integer doesn't seem to be a problem, because hash() coerces the return value of __hash__ to an exact int anyway, at least on CPython.
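This is easy to check directly on CPython (class names below are illustrative):

```python
class _IntSubclass(int):
    pass

class Thing:
    def __hash__(self):
        # Return an int *subclass* from __hash__ ...
        return _IntSubclass(42)

h = hash(Thing())
assert h == 42
# ... but hash() coerces the result to an exact int, so the subclass never
# leaks out to callers of hash() and can't be accidentally pickled from there.
assert type(h) is int
```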

One alternate implementation I tried out was to use a 1-slot slots class instead of an integer subclass for _CacheHashWrapper, unwrapping the value on every call to __hash__. This was a bit slower than the integer subclass version in both the cache-miss and cache-hit cases: around 275ns for a cache hit instead of 220-250ns.

One last thing to note: There is one use case that this is likely to break, which is people who are creating an attrs class with one version of attrs, pickling it, and then unpickling it in another version. Pickles made on Python 3 even with the same attrs version cannot be unpickled in Python 2 (though Python 2 pickles will probably work in Python 3). Since using pickle other than to send data between identical Python environments is unsupported and a bad idea, I think we can discount this concern.

I think this is probably the best approach. If you give me the green light I'll tweak whatever documentation needs to be tweaked and add a changelog. I actually don't expect many if any documentation changes with this version since there should be nothing to caution people about - we would just need to remove any language about #494.

This PR includes the additional tests added in #615.

Fixes #613.
Fixes #494.

Pull Request Check List

  • Added tests for changed code.
  • Updated documentation for changed code.
  • Documentation in .rst files is written using semantic newlines.
  • Changes (and possible deprecations) have news fragments in changelog.d.

@gabbard (Member) commented Jan 27, 2020

This is a clever solution!

Review comment on src/attr/_make.py (outdated):
foo_value = attr.ib()


class IncrementingHasher(object):
A member commented:

This is a nice solution to the testing need below.

@llllllllll left a comment:

This all looks good from a usage of pickle standpoint. I really like the solution here. Making the cached hash value itself special makes it much easier to use default pickle/copy behavior and reduces the number of things someone needs to worry about when implementing a custom reduce.

Regarding the performance: I think this is close to as low overhead as possible. I don't really have any ideas to make it faster.

    # ...we don't actually need a constructor for None objects, we just need
    # any available function that returns None.
    def __reduce__(self):
        return (getattr, (0, "", None))


It might be clearer to create a module-scope function like:

def return_none():
    return None

and then, in both Python 2 and 3, make __reduce__ return return_none, ().

A contributor replied:

I'd be curious to know how that compares from a performance point of view.

@llllllllll replied on Jan 28, 2020:

Here is a not super scientific check:

In [64]: class C:
    ...:     def __reduce__(self):
    ...:         return getattr, (0, '', None)

In [65]: class D:
    ...:     def __reduce__(self):
    ...:         return type(None), ()

In [66]: def return_none():
    ...:     return None
    ...: class E:
    ...:     def __reduce__(self):
    ...:         return return_none, ()

In [67]: len(pickle.dumps(C()))                                                            
Out[67]: 39

In [68]: len(pickle.dumps(D()))                                                            
Out[68]: 31

In [69]: len(pickle.dumps(E()))                                                            
Out[69]: 31

In [70]: %timeit pickle.loads(pickle.dumps(C()))                                           
3.33 µs ± 14.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [71]: %timeit pickle.loads(pickle.dumps(D()))                                           
3.12 µs ± 14.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [72]: %timeit pickle.loads(pickle.dumps(E()))                                           
2.62 µs ± 39.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

A member replied:

Looks like return_none is almost 20% faster? Which makes sense, since it doesn't do any function calls itself. 🤔

A member replied:

Could you try to additionally define the method in the class scope please? Global lookups are relatively slow.

A contributor replied:

I don't think defining it in class scope will work - at least in Python 2, you can't use a method defined on the class as the constructor, and you'll get something like:

> pickle.PicklingError: Can't pickle <function _return_none at 0x7f7b13d250c8>: it's not found as __main__._return_none

But, this works to avoid the global lookup:

def _return_none(): 
    return None 
class E: 
    def __reduce__(self, _return_none=_return_none): 
        return _return_none, () 

@pganssle (author) replied:

I think we probably shouldn't be optimizing based on the speed of pickle.loads(pickle.dumps()) - if you're using pickle, you're almost always using it for an RPC or writing it to a file. Maybe on-device RPC calls through a unix socket or something are fast enough that the speed of __reduce__ calls matters, but if we're optimizing for speed then I think copy.copy() is a better metric. When I time copy.copy, all of these classes give me approximately the same speed.

If we're really micro-optimizing this function, I'd say that copy.copy should be the benchmark for speed and len(pickle.dumps(x)) should be the benchmark for size.

That said, it would be unusual to use cache_hash for a particularly small object in the first place. For a big enough object that calculating the hash is expensive, you'd also expect it to take a while to pickle it, so an extra 200ns is going to be noise on the process.
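Under that framing, the comparison might look like the following sketch (assuming the candidate `__reduce__` implementations from this thread; class names are illustrative):

```python
import copy
import timeit

class ViaGetattr:
    def __reduce__(self):
        return (getattr, (0, "", None))

class ViaNoneType:
    def __reduce__(self):
        return (type(None), ())

# copy.copy() also goes through __reduce_ex__/__reduce__, so it exercises
# the same reconstruction path as pickling, minus the serialization round-trip.
assert copy.copy(ViaGetattr()) is None   # getattr(0, "", None) -> None
assert copy.copy(ViaNoneType()) is None  # type(None)() -> None

for cls in (ViaGetattr, ViaNoneType):
    t = timeit.timeit(lambda: copy.copy(cls()), number=100_000)
    print(cls.__name__, "%.3fs" % t)
```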

> It may be more clear to create a module scope function like:

I don't know what @hynek's plans are for Python 2 support, but I assume that it won't be too long before Python 2 is gone. I feel like it would be preferable to use the built-in NoneType in Python 3, since that's already a builtin that does the exact same thing. I like the getattr hack because it doesn't leave any compatibility functions lying around as critical components of the system, at the cost of needing a comment and having the pickle be a little bigger.

@godlygeek (Contributor) replied on Jan 28, 2020:

The biggest downside I see to the def _return_none() approach is that the pickled object will contain a reference to attr._make._return_none, meaning it'll be larger than necessary and unnecessarily fragile if someone pickles with one version of attr and unpickles with a different one (should this ever get refactored in the future). I think the best balance of maintainability, performance, and future-proofing might be something like:

class UnpicklableInt(int):
    if six.PY2:
        def __reduce__(self, _none_constructor=getattr, _args=(0, "", None)):
            return _none_constructor, _args
    else:
        def __reduce__(self, _none_constructor=type(None), _args=()):
            return _none_constructor, _args
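A quick Python-3-only check of the branch above bears this out: the pickled payload contains no reference to anything in attr, and the value vanishes on a round-trip.

```python
import pickle

class UnpicklableInt(int):
    # Python 3 branch of the snippet above; the default arguments bind the
    # constructor at function-definition time, so __reduce__ itself performs
    # no global lookups.
    def __reduce__(self, _none_constructor=type(None), _args=()):
        return _none_constructor, _args

v = UnpicklableInt(1234)
assert v * 2 == 2468                  # still an ordinary int everywhere else
payload = pickle.dumps(v)
assert b"attr" not in payload         # no module-qualified helper reference
assert pickle.loads(payload) is None  # the value disappears on round-trip
```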

@pganssle (author) replied:

I like @godlygeek's version, so I updated the code to use that. I was hoping to make _none_constructor and _args keyword-only arguments in Python 3, but that's not possible in 2/3 compatible code because it's a syntax error even in the "dead code" branches. Alas.

A member replied:

> I don't know what @hynek's plans are for Python 2 support, but I assume that it won't be too long before Python 2 is gone.

I wish: https://www.attrs.org/en/stable/python-2.html

However the rule is: optimize for Python 3. Make Python 2 work sOmEhOw.

Review comment on src/attr/_make.py (outdated, resolved).
@pganssle (author) commented:

So just to throw one potential monkey wrench into the mix here. It turns out that copy.copy() only calls __reduce__ on the instance - it does not recursively call __reduce__ on all the attributes, you need to call copy.deepcopy() in order for that to happen.

I tend to think that this is fine, since we're mainly worried about the hash cache persisting between different interpreter runs, and if a shallow copy preserves the hash cache that shouldn't be so bad, particularly because cache_hash=True doesn't work on classes with a custom hash or where the auto-generated hash is based on the id of the object. Even if the auto-generated hash relies on the id of the attributes, a copy.copy call won't change that anyway, so I think the behavior is still sane.
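A quick illustration of the distinction (hypothetical names; the wrapper reduces to None as in this PR):

```python
import copy
import pickle

class ReducesToNone(int):
    def __reduce__(self):
        return (type(None), ())

class Holder:
    def __init__(self):
        self.cache = ReducesToNone(99)

h = Holder()
assert copy.copy(h).cache == 99        # shallow copy never runs the attribute's __reduce__
assert copy.deepcopy(h).cache is None  # deepcopy recurses, so the cache is dropped
assert pickle.loads(pickle.dumps(h)).cache is None  # pickle drops it too
```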

@hynek (Member) commented Feb 1, 2020

As far as my limited understanding of the whole topic goes, this looks mostly good to me – modulo the missing newsfragment. Is it still “WIP”? Does anyone have any quibbles left?

@pganssle pganssle changed the title WIP: Use a wrapper class for cache hash to prevent cache serialization Use a wrapper class for cache hash to prevent cache serialization Feb 3, 2020
@pganssle (author) commented Feb 3, 2020

@hynek I'm going to give the documentation a once-over to make sure that this doesn't invalidate any existing documentation and to see if it makes sense to add some documentation for this behavior, but I'll remove the WIP since the "to-do" part of it is captured adequately by the checklist.

@pganssle (author) commented Feb 5, 2020

@hynek I've added a changelog entry. Looking through the docs, I didn't see anywhere that the old __setstate__ + cache_hash=True = exception behavior was documented, or any other places that would obviously need updating with this change. Let me know if it's too long or if you want any other adjustments.

I wasn't sure how much of this new behavior you want me to officially document outside the changelog. Should I add a .. versionchanged:: 20.1.0 to cache_hash? Should I maybe add some information about this to the hash code caching page? Up to you how you want to do it. We can also kick the can down the road by merging this and then spawning a new issue for improving the documentation around cache hashing.

Barring any documentation updates, I think this is ready to go.

@hynek (Member) left a comment:

I'm trusting the others with the technical review, just a bit of nitpicking from my side – sorry.

Review comments on changelog.d/620.change.rst and src/attr/_make.py (outdated, resolved).
pganssle and others added 8 commits February 10, 2020 10:17
  • Rather than attempting to remove the hash cache from the object state on serialization or deserialization, store the hash cache in an object that reduces to None, thus clearing itself when pickled or copied. This fixes GH python-attrs#494 and python-attrs#613. (Co-authored-by: Matt Wozniski <[email protected]>)
  • I couldn't think of any way to make a useful and meaningful class that has no state and also has no custom __reduce__ method, so I went minimalist with it.
  • Previously, there was some minuscule risk of hash collision, and it also relied on an implementation detail of pickle (the assumption that hash() is never called as part of pickle.loads). (Co-Authored-By: Ryan Gabbard <[email protected]>)
  • Since the cached hash value is not actually serialized in __getstate__, __setstate__ is not actually "clearing" it on deserialization; it's initializing the value to None.
  • This change was overshadowed by a more fundamental change in python-attrs#620.
@pganssle (author) commented:

@hynek Thanks for the nitpicking, I admit I'm often a bit sloppy about stylistic issues (especially with black allowing me to throw caution to the wind in like 95% of situations), so it's helpful to get feedback about the conventions of the project.

Your comments prompted me to actually finish reading through the "semantic newlines" document, and I realized I was doing it wrong (plus I decided to reword the changelog to read a bit more naturally) - let me know if I've still got it wrong.

@hynek (Member) commented Feb 10, 2020

I think we're good! Except that codecov is acting up once again but hey that's rock'n'roll. Thanks everybody involved!

@hynek hynek merged commit d6a65fb into python-attrs:master Feb 10, 2020
Labels: none · Projects: none · 6 participants