-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-23159][PYTHON] Update cloudpickle to v0.4.3 #20373
[SPARK-23159][PYTHON] Update cloudpickle to v0.4.3 #20373
Conversation
object.__new__: _get_object_new, | ||
} | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MAINT: Handle builtin type new attrs: cloudpipe/cloudpickle@f0d2011
python/pyspark/cloudpickle.py
Outdated
msg = "Could not serialize object: %s: %s" % (e.__class__.__name__, emsg) | ||
print_exec(sys.stderr) | ||
raise pickle.PicklingError(msg) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This exception handling is Spark specific, it has been moved to serializers.py CloudPickleSerializer.dumps
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm glad this is moved, should make the next update easier.
"""Fallback to save_string""" | ||
Pickler.save_string(self, str(obj)) | ||
self.save(obj.tobytes()) | ||
dispatch[memoryview] = save_memoryview |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some cleanups, fix memoryview support cloudpipe/cloudpickle@f8187e9
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So without the new line between these my brain doesn't parse this right on the first read. What do you think of adding a new line here + back to cloud pickle (can be a follow up)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so you mean change to this
def save_memoryview(self, obj):
self.save(obj.tobytes())
dispatch[memoryview] = save_memoryview
That format is done in a few places and I saw a couple other formatting issues.. so yeah I can submit a PR to do those
# If the function we've received is in that cache, we just | ||
# serialize it as a lookup into the cache. | ||
return self.save_reduce(_BUILTIN_TYPE_CONSTRUCTORS[obj], (), obj=obj) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BUG: Hit the builtin type cache for any function cloudpipe/cloudpickle@d84980c
# In Python 2, we can't set this attribute after construction. | ||
__dict__ = clsdict.pop('__dict__', None) | ||
if isinstance(__dict__, property): | ||
type_kwargs['__dict__'] = __dict__ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BUG: Fix bug pickling namedtuple cloudpipe/cloudpickle@28070bb
} | ||
if hasattr(func, '__qualname__'): | ||
state['qualname'] = func.__qualname__ | ||
save(state) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Preserve func.qualname when defined cloudpipe/cloudpickle@14b38a3
python/pyspark/cloudpickle.py
Outdated
except Exception: | ||
if obj.__module__ == "__builtin__" or obj.__module__ == "builtins": | ||
if obj in _BUILTIN_TYPE_NAMES: | ||
return self.save_reduce(_builtin_type, (_BUILTIN_TYPE_NAMES[obj],), obj=obj) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some cleanups, fix memoryview support cloudpipe/cloudpickle@f8187e9
@@ -709,12 +702,7 @@ def save_property(self, obj): | |||
dispatch[property] = save_property | |||
|
|||
def save_classmethod(self, obj): | |||
try: | |||
orig_func = obj.__func__ | |||
except AttributeError: # Python 2.6 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
support for Python 2.6 removed
if sys.version_info < (2,7): # 2.7 supports partial pickling | ||
dispatch[partial] = save_partial | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove save_reduce() override: It is the exactly the same code as in Python 2's Pickler class.
cloudpipe/cloudpickle@2da4c24
def inject_addons(self): | ||
"""Plug in system. Register additional pickling functions if modules already loaded""" | ||
pass | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Further cleanups cloudpipe/cloudpickle@c91aaf1
cp.dump(obj) | ||
return file.getvalue() | ||
finally: | ||
file.close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Close StringIO timely on exception cloudpipe/cloudpickle@ca4661b
def _fill_function(*args): | ||
"""Fills in the rest of function data into the skeleton function object | ||
|
||
The skeleton itself is create by _make_skel_func(). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Restore compatibility with functions pickled with 0.4.0 (#128)
cloudpipe/cloudpickle@7d8c670
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I think we need this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's more of an issue if using pickles stored on disk or if nodes in the cluster are on different versions. Is that likely for Spark?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we made the guarantee on it but the best is always to stay safer. Others seem bug fixes or improvements but this one could be a regression fix (about the support we haven't guaranteed). It's the part of 0.4.2v anyway.
(I pointed out cloudpipe/cloudpickle#145 too for the same reason. IIUC, this could be a regression)
if 'module' in state: | ||
func.__module__ = state['module'] | ||
if 'qualname' in state: | ||
func.__qualname__ = state['qualname'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Preserve func.qualname when defined cloudpipe/cloudpickle@14b38a3
""" | ||
from collections import namedtuple | ||
return namedtuple(name, fields) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This didn't seem necessary anymore after the fix for namedtuples
@holdenk @HyukjinKwon it seemed like mostly straightforward fixes/cleanups to match cloudpickle 0.4.2 but you two are way more experienced here than me. Are there any concerns over these updates or additional tests to run? I did test that namedtuple pickling works with the new fix in cloudpickle, but since the standard pickle still fails we still need the hijack workaround in Spark. |
Test build #86553 has finished for PR 20373 at commit
|
Thank you for going through and documentation the related changes back to the cloudpickle changes :) |
Whoa, nice efforts! Will take a close look within few days. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this! Not a full review but a quick pass before my talk with minor feedback. I'll look through more later.
python/pyspark/cloudpickle.py
Outdated
msg = "Could not serialize object: %s: %s" % (e.__class__.__name__, emsg) | ||
print_exec(sys.stderr) | ||
raise pickle.PicklingError(msg) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm glad this is moved, should make the next update easier.
"""Fallback to save_string""" | ||
Pickler.save_string(self, str(obj)) | ||
self.save(obj.tobytes()) | ||
dispatch[memoryview] = save_memoryview |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So without the new line between these my brain doesn't parse this right on the first read. What do you think of adding a new line here + back to cloud pickle (can be a follow up)?
raise pickle.PicklingError("Can't pickle %r" % obj) | ||
else: | ||
rv = obj.__reduce_ex__(self.proto) | ||
rv = obj.__reduce_ex__(self.proto) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks reasonable since Py 3.4+ only anyways.
So any reason for the WIP tag? |
I wasn't sure if the named tuple hijack issue from https://issues.apache.org/jira/browse/SPARK-22674 could be fixed here, but it looks like that would require more outside of the scope of this since the problem is with the standard pickling too, right? |
Wait, so we left out cloudpickle#113 even though its in 0.4.2? |
hmm sorry nvm. So not for this time, but maybe next time we could also copy the cloudpickle_test file over as well. |
Yup, I think so. |
That patch is in here and this exactly matches 0.4.2. I also manually verified that cloudpickle will pickle named tuples with and without the hijack in |
@holdenk and @HyukjinKwon , is there any further testing you guys can think that needs to be done to verify this is ok? |
I took a quick look for the commits and seems we should backport cloudpipe/cloudpickle#145 too as looks introduced from cloudpipe/cloudpickle#113. Let me try to backport it to cloudpickle and let's hear their opinion, if I didn't misunderstand. |
@HyukjinKwon would it be good to update this PR to match the upcoming 4.3 release you are working on? If the code is the same, then just updating the title/description so it is clear |
Yup, now the codes of branch "0.4.x" in cloudpickle is the same with the current PR. Was thinking of letting you know after 0.4.3. Please give me few days ... :-). |
Sounds good! No rush, I'll keep an eye out for the release |
@BryanCutler, I just released 0.4.3 - https://github.com/cloudpipe/cloudpickle/releases/tag/v0.4.3. Would you mind if I ask to fix PR and JIRA accordingly? |
Great, thanks @HyukjinKwon! The 0.4.3 code matches this exactly, so I will just adjust the descriptions. |
…oudpickle-42-SPARK-23159
Does the hijacking of the namedtuple still cause problems on Python 3.6? |
Test build #87419 has finished for PR 20373 at commit
|
I think it's fine in cloudpickle but Spark has the hijacking for regular pickling. I was thinking of a possibility for a deduplicated fix but might have to be investigated separately. Let's hold this on a bit until the release of 2.3.0 as it's going to go into master anyway (I think). Seems it's been delayed unexpectedly and we better keep the diff small between master and branch-2.3 for now. Will keep my eyes on this PR anyway. |
I'm not too familiar with the history of this, but I ran PySpark tests that cover namedtuples with 3.6.3 and all passed. |
So it looks like the 2.3 release is probably going to go out but Jenkins thinks this can't be merged with master. So lets do a jenkins retest this please and I'll try and take some review cycles this week :) |
Test build #87682 has finished for PR 20373 at commit
|
retest this please |
Test build #87939 has finished for PR 20373 at commit
|
@holdenk, have you had a chance to take a look for this one? |
Let me merge with master since it has been sitting a while |
…oudpickle-42-SPARK-23159
Test build #87972 has finished for PR 20373 at commit
|
LGTM |
Merged to master. |
Thanks @HyukjinKwon @holdenk and @rgbkrk ! |
Woohoo! |
What changes were proposed in this pull request?
The version of cloudpickle in PySpark was close to version 0.4.0 with some additional backported fixes and some minor additions for Spark related things. This update removes Spark related changes and matches cloudpickle v0.4.3:
Changes by updating to 0.4.3 include:
How was this patch tested?
Existing pyspark.tests using python 2.7.14, 3.5.2, 3.6.3