Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-23159][PYTHON] Update cloudpickle to v0.4.3 #20373

Conversation

BryanCutler
Copy link
Member

@BryanCutler BryanCutler commented Jan 23, 2018

What changes were proposed in this pull request?

The version of cloudpickle in PySpark was close to version 0.4.0 with some additional backported fixes and some minor additions for Spark related things. This update removes Spark related changes and matches cloudpickle v0.4.3:

Changes by updating to 0.4.3 include:

How was this patch tested?

Existing pyspark.tests using python 2.7.14, 3.5.2, 3.6.3

object.__new__: _get_object_new,
}


Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MAINT: Handle builtin type new attrs: cloudpipe/cloudpickle@f0d2011

msg = "Could not serialize object: %s: %s" % (e.__class__.__name__, emsg)
print_exec(sys.stderr)
raise pickle.PicklingError(msg)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This exception handling is Spark specific, it has been moved to serializers.py CloudPickleSerializer.dumps

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm glad this is moved, should make the next update easier.

"""Fallback to save_string"""
Pickler.save_string(self, str(obj))
self.save(obj.tobytes())
dispatch[memoryview] = save_memoryview
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some cleanups, fix memoryview support cloudpipe/cloudpickle@f8187e9

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So without the new line between these my brain doesn't parse this right on the first read. What do you think of adding a new line here + back to cloud pickle (can be a follow up)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so you mean change to this

def save_memoryview(self, obj):
        self.save(obj.tobytes())

dispatch[memoryview] = save_memoryview

That format is done in a few places and I saw a couple other formatting issues.. so yeah I can submit a PR to do those

# If the function we've received is in that cache, we just
# serialize it as a lookup into the cache.
return self.save_reduce(_BUILTIN_TYPE_CONSTRUCTORS[obj], (), obj=obj)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BUG: Hit the builtin type cache for any function cloudpipe/cloudpickle@d84980c

# In Python 2, we can't set this attribute after construction.
__dict__ = clsdict.pop('__dict__', None)
if isinstance(__dict__, property):
type_kwargs['__dict__'] = __dict__
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BUG: Fix bug pickling namedtuple cloudpipe/cloudpickle@28070bb

}
if hasattr(func, '__qualname__'):
state['qualname'] = func.__qualname__
save(state)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preserve func.qualname when defined cloudpipe/cloudpickle@14b38a3

except Exception:
if obj.__module__ == "__builtin__" or obj.__module__ == "builtins":
if obj in _BUILTIN_TYPE_NAMES:
return self.save_reduce(_builtin_type, (_BUILTIN_TYPE_NAMES[obj],), obj=obj)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some cleanups, fix memoryview support cloudpipe/cloudpickle@f8187e9

@@ -709,12 +702,7 @@ def save_property(self, obj):
dispatch[property] = save_property

def save_classmethod(self, obj):
try:
orig_func = obj.__func__
except AttributeError: # Python 2.6
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

support for Python 2.6 removed

if sys.version_info < (2,7): # 2.7 supports partial pickling
dispatch[partial] = save_partial


Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove save_reduce() override: It is the exactly the same code as in Python 2's Pickler class.
cloudpipe/cloudpickle@2da4c24

def inject_addons(self):
"""Plug in system. Register additional pickling functions if modules already loaded"""
pass

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cp.dump(obj)
return file.getvalue()
finally:
file.close()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Close StringIO timely on exception cloudpipe/cloudpickle@ca4661b

def _fill_function(*args):
"""Fills in the rest of function data into the skeleton function object

The skeleton itself is create by _make_skel_func().
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restore compatibility with functions pickled with 0.4.0 (#128)
cloudpipe/cloudpickle@7d8c670

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I think we need this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's more of an issue if using pickles stored on disk or if nodes in the cluster are on different versions. Is that likely for Spark?

Copy link
Member

@HyukjinKwon HyukjinKwon Jan 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we made the guarantee on it but the best is always to stay safer. Others seem bug fixes or improvements but this one could be a regression fix (about the support we haven't guaranteed). It's the part of 0.4.2v anyway.

(I pointed out cloudpipe/cloudpickle#145 too for the same reason. IIUC, this could be a regression)

if 'module' in state:
func.__module__ = state['module']
if 'qualname' in state:
func.__qualname__ = state['qualname']
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preserve func.qualname when defined cloudpipe/cloudpickle@14b38a3

"""
from collections import namedtuple
return namedtuple(name, fields)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This didn't seem necessary anymore after the fix for namedtuples

@BryanCutler
Copy link
Member Author

@holdenk @HyukjinKwon it seemed like mostly straightforward fixes/cleanups to match cloudpickle 0.4.2 but you two are way more experienced here than me. Are there any concerns over these updates or additional tests to run?

I did test that namedtuple pickling works with the new fix in cloudpickle, but since the standard pickle still fails we still need the hijack workaround in Spark.

@SparkQA
Copy link

SparkQA commented Jan 24, 2018

Test build #86553 has finished for PR 20373 at commit c362df8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Copy link
Contributor

holdenk commented Jan 24, 2018

Thank you for going through and documentation the related changes back to the cloudpickle changes :)

@HyukjinKwon
Copy link
Member

Whoa, nice efforts! Will take a close look within few days.

Copy link
Contributor

@holdenk holdenk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this! Not a full review but a quick pass before my talk with minor feedback. I'll look through more later.

msg = "Could not serialize object: %s: %s" % (e.__class__.__name__, emsg)
print_exec(sys.stderr)
raise pickle.PicklingError(msg)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm glad this is moved, should make the next update easier.

"""Fallback to save_string"""
Pickler.save_string(self, str(obj))
self.save(obj.tobytes())
dispatch[memoryview] = save_memoryview
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So without the new line between these my brain doesn't parse this right on the first read. What do you think of adding a new line here + back to cloud pickle (can be a follow up)?

raise pickle.PicklingError("Can't pickle %r" % obj)
else:
rv = obj.__reduce_ex__(self.proto)
rv = obj.__reduce_ex__(self.proto)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable since Py 3.4+ only anyways.

@holdenk
Copy link
Contributor

holdenk commented Jan 26, 2018

So any reason for the WIP tag?

@BryanCutler
Copy link
Member Author

I wasn't sure if the named tuple hijack issue from https://issues.apache.org/jira/browse/SPARK-22674 could be fixed here, but it looks like that would require more outside of the scope of this since the problem is with the standard pickling too, right?

@BryanCutler BryanCutler changed the title [WIP][SPARK-23159][PYTHON] Update cloudpickle to match 0.4.2 [SPARK-23159][PYTHON] Update cloudpickle to match 0.4.2 Jan 26, 2018
@holdenk
Copy link
Contributor

holdenk commented Jan 26, 2018

Wait, so we left out cloudpickle#113 even though its in 0.4.2?

@holdenk
Copy link
Contributor

holdenk commented Jan 26, 2018

hmm sorry nvm. So not for this time, but maybe next time we could also copy the cloudpickle_test file over as well.

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Jan 26, 2018

it looks like that would require more outside of the scope of this since the problem is with the standard pickling too, right?

Yup, I think so.

@BryanCutler
Copy link
Member Author

Wait, so we left out cloudpickle#113 even though its in 0.4.2?

That patch is in here and this exactly matches 0.4.2. I also manually verified that cloudpickle will pickle named tuples with and without the hijack in types.py

@BryanCutler
Copy link
Member Author

@holdenk and @HyukjinKwon , is there any further testing you guys can think that needs to be done to verify this is ok?

@HyukjinKwon
Copy link
Member

I took a quick look for the commits and seems we should backport cloudpipe/cloudpickle#145 too as looks introduced from cloudpipe/cloudpickle#113. Let me try to backport it to cloudpickle and let's hear their opinion, if I didn't misunderstand.

@BryanCutler
Copy link
Member Author

@HyukjinKwon would it be good to update this PR to match the upcoming 4.3 release you are working on? If the code is the same, then just updating the title/description so it is clear

@HyukjinKwon
Copy link
Member

Yup, now the codes of branch "0.4.x" in cloudpickle is the same with the current PR. Was thinking of letting you know after 0.4.3. Please give me few days ... :-).

@BryanCutler
Copy link
Member Author

Sounds good! No rush, I'll keep an eye out for the release

@HyukjinKwon
Copy link
Member

@BryanCutler, I just released 0.4.3 - https://github.com/cloudpipe/cloudpickle/releases/tag/v0.4.3. Would you mind if I ask to fix PR and JIRA accordingly?

@BryanCutler
Copy link
Member Author

Great, thanks @HyukjinKwon! The 0.4.3 code matches this exactly, so I will just adjust the descriptions.

@BryanCutler BryanCutler changed the title [SPARK-23159][PYTHON] Update cloudpickle to v0.4.2 plus useful backport fixes [SPARK-23159][PYTHON] Update cloudpickle to v0.4.3 plus useful backport fixes Feb 13, 2018
@BryanCutler BryanCutler changed the title [SPARK-23159][PYTHON] Update cloudpickle to v0.4.3 plus useful backport fixes [SPARK-23159][PYTHON] Update cloudpickle to v0.4.3 Feb 13, 2018
@rgbkrk
Copy link
Contributor

rgbkrk commented Feb 13, 2018

Does the hijacking of the namedtuple still cause problems on Python 3.6?

@SparkQA
Copy link

SparkQA commented Feb 13, 2018

Test build #87419 has finished for PR 20373 at commit 2d19f0a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

I think it's fine in cloudpickle but Spark has the hijacking for regular pickling. I was thinking of a possibility for a deduplicated fix but might have to be investigated separately.

Let's hold this on a bit until the release of 2.3.0 as it's going to go into master anyway (I think). Seems it's been delayed unexpectedly and we better keep the diff small between master and branch-2.3 for now. Will keep my eyes on this PR anyway.

@BryanCutler
Copy link
Member Author

Does the hijacking of the namedtuple still cause problems on Python 3.6?

I'm not too familiar with the history of this, but I ran PySpark tests that cover namedtuples with 3.6.3 and all passed.

@holdenk
Copy link
Contributor

holdenk commented Feb 26, 2018

So it looks like the 2.3 release is probably going to go out but Jenkins thinks this can't be merged with master. So lets do a jenkins retest this please and I'll try and take some review cycles this week :)

@SparkQA
Copy link

SparkQA commented Feb 26, 2018

Test build #87682 has finished for PR 20373 at commit 2d19f0a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Mar 4, 2018

Test build #87939 has finished for PR 20373 at commit 2d19f0a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

@holdenk, have you had a chance to take a look for this one?

@BryanCutler
Copy link
Member Author

Let me merge with master since it has been sitting a while

@SparkQA
Copy link

SparkQA commented Mar 5, 2018

Test build #87972 has finished for PR 20373 at commit 7d265f5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

LGTM

@HyukjinKwon
Copy link
Member

Merged to master.

@asfgit asfgit closed this in 9bb239c Mar 8, 2018
@BryanCutler
Copy link
Member Author

Thanks @HyukjinKwon @holdenk and @rgbkrk !

@BryanCutler BryanCutler deleted the pyspark-update-cloudpickle-42-SPARK-23159 branch March 8, 2018 17:40
@rgbkrk
Copy link
Contributor

rgbkrk commented Mar 8, 2018

Woohoo!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants