[SPARK-32094][PYTHON] Update cloudpickle to v1.4.1 #28950

codesue · 2020-06-29T16:39:37Z

What changes were proposed in this pull request?

Update cloudpickle in PySpark to cloudpickle v1.4.1 (See https://github.com/cloudpipe/cloudpickle/blob/v1.4.1/cloudpickle/cloudpickle.py)

This is currently the highest version of cloudpickle, and it does not support Python 2.

Port cloudpickle_fast.py to leverage C pickle in Python 3.8+ (See https://github.com/cloudpipe/cloudpickle/blob/v1.4.1/cloudpickle/cloudpickle_fast.py)

Why are the changes needed?

PySpark's cloudpickle.py and versions of cloudpickle below 1.3.0 interfere with dill unpickling because they define types.ClassType, which is undefined in dill. This results in the following error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/apache_beam/internal/pickler.py", line 279, in loads
    return dill.loads(s)
  File "/usr/local/lib/python3.6/site-packages/dill/_dill.py", line 317, in loads
    return load(file, ignore)
  File "/usr/local/lib/python3.6/site-packages/dill/_dill.py", line 305, in load
    obj = pik.load()
  File "/usr/local/lib/python3.6/site-packages/dill/_dill.py", line 577, in _load_type
    return _reverse_typemap[name]
KeyError: 'ClassType'

(See cloudpipe/cloudpickle#82)

This was fixed for cloudpickle 1.3.0+ (cloudpipe/cloudpickle#337), but PySpark's cloudpickle.py doesn't have this change yet.

Does this PR introduce any user-facing change?

No

How was this patch tested?

python/run-tests

viirya · 2020-06-29T16:45:15Z

@HyukjinKwon Do we want to drop Python 2 in Spark 3.1.0?

Btw, if we want to backport this to 3.0 branch, we also cannot drop Python 2.

codesue · 2020-06-29T16:47:12Z

In that case, cloudpickle v1.3.0 might be a better option.

dongjoon-hyun · 2020-06-29T16:49:23Z

Thank you for your first contribution, @codesue .

Also, since this is related to Python 2.x, ping @holdenk , @mengxr , @gatorsmile , @srowen , @BryanCutler .

holdenk · 2020-06-29T23:08:10Z

Thanks for the ping @dongjoon-hyun & thanks for working on this PR @codesue, I've been meaning to take a look at cloudpickle's updates. @viirya I think backporting cloudpickle changes is a bit risky given how core it is. I'd rather target the latest cloud pickle to Spark 3.1 if that sounds good to folks?

dongjoon-hyun · 2020-06-30T02:37:27Z

+1 for @holdenk 's advice.

HyukjinKwon · 2020-06-30T03:47:18Z

Yeah, to upgrade we should drop Python 2. I target to drop it in Spark 3.1. I will make a PR to officially drop first.

HyukjinKwon

@codesue, BTW is it a clean matching? We should also port https://github.com/cloudpipe/cloudpickle/blob/master/cloudpickle/cloudpickle_fast.py too to leverage C pickle in Python 3.8+

HyukjinKwon · 2020-06-30T03:50:41Z

ok to test

viirya · 2020-06-30T16:20:44Z

ok to test

SparkQA · 2020-06-30T16:33:05Z

Test build #124677 has finished for PR 28950 at commit 164764a.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-06-30T16:55:09Z

Test build #124680 has finished for PR 28950 at commit 5f137d3.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2020-06-30T17:00:36Z

python/pyspark/cloudpickle_fast.py

@@ -0,0 +1,557 @@
+"""


You could add cloudpickle_fast.py to dev/.rat-excludes to avoid the RAT test error.

SparkQA · 2020-06-30T17:24:03Z

Test build #124683 has finished for PR 28950 at commit 23933eb.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2020-06-30T19:36:11Z

Do the docs build locally for you @codesue? It's possible we need to file an Spark JIRA ticket to ask Shane to update the version of our doc tooling.

HyukjinKwon · 2020-07-01T03:49:32Z

Ah .. yeah, if this is the Python version issue in the Jenkins, let's file a JIRA under https://issues.apache.org/jira/browse/SPARK-29909. Seems like there are some more works that block this PR ...

codesue · 2020-07-02T03:22:44Z

@holdenk I'm able to build the python docs when I run jekyll build. I don't have R installed, so I get Cannot find 'R_HOME'. Please specify 'R_HOME' or make sure R is properly installed. When I run SKIP_RDOC=1 jekyll build, I get

[error] (streaming-kafka-0-10-assembly/*:projectDescriptors) Could not create directory /Users/codesue/oss/spark/external/kafka-0-10-assembly/target/streams/$global/projectDescriptors/$global/streams
[error] (sketch/*:projectDescriptors) Could not create file /Users/codesue/oss/spark/common/sketch/target/streams/$global/projectDescriptors/$global/streams/outjava.io.IOException: No such file or directory
[error] Total time: 1 s, completed Jul 1, 2020 11:15:12 PM

HyukjinKwon · 2020-07-02T03:33:05Z

I double checked we can build the documentation with Python 3.7 too. R build is fine. It's not relevant.

Then seems to be it's the problem of the Python version installed in Jenkins machine for documentation build in Sphinx. I will file a separate ticket soon for that. If you want to at least test though, you can disable the documentation build for now with the changes d74aa53 and https://github.com/apache/spark/pull/28957/files#diff-871d87c62d4e9228a47145a8894b6694R161.

I am pushing to drop Python 2, 3.4 and 3.5 so this PR can be unblocked at #28957. Feel free to give more feedback on [DISCUSS] Drop Python 2, 3.4 and 3.5.

codesue · 2020-07-02T21:29:14Z

@HyukjinKwon, I'll wait for #28957 to be merged.

### What changes were proposed in this pull request? This PR aims to drop Python 2.7, 3.4 and 3.5. Roughly speaking, it removes all the widely known Python 2 compatibility workarounds such as `sys.version` comparison, `__future__`. Also, it removes the Python 2 dedicated codes such as `ArrayConstructor` in Spark. ### Why are the changes needed? 1. Unsupport EOL Python versions 2. Reduce maintenance overhead and remove a bit of legacy codes and hacks for Python 2. 3. PyPy2 has a critical bug that causes a flaky test, SPARK-28358 given my testing and investigation. 4. Users can use Python type hints with Pandas UDFs without thinking about Python version 5. Users can leverage one latest cloudpickle, #28950. With Python 3.8+ it can also leverage C pickle. ### Does this PR introduce _any_ user-facing change? Yes, users cannot use Python 2.7, 3.4 and 3.5 in the upcoming Spark version. ### How was this patch tested? Manually tested and also tested in Jenkins. Closes #28957 from HyukjinKwon/SPARK-32138. Authored-by: HyukjinKwon <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>

holdenk · 2020-07-14T16:35:08Z

It looks like it's accidentally picking up some unrelated changes. When I have this happen to me I normally just re-create the PR from a fresh branch (I know there's other ways to clean it up, but my git magic is not strong enough).

codesue · 2020-07-14T16:52:45Z

Thanks, @holdenk! I'll do that.

SparkQA · 2020-07-14T19:13:38Z

Test build #125844 has finished for PR 28950 at commit d731f63.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-07-14T23:57:54Z

Looks something went wrong during the rebase.

HyukjinKwon · 2020-07-15T00:45:29Z

I just resolved it and pushed by myself

HyukjinKwon · 2020-07-15T00:50:14Z

python/pyspark/cloudpickle.py

@@ -42,27 +42,42 @@
 """


@codesue, we should make it as a proper package so cloudpickle_fast can be properly used as well. Seems like v.1.5 was release too .. we should better match it as well.

Let me create a PR by myself to upgrade to 1.5, and co-author you together.

That sounds great! Thank you!!

SparkQA · 2020-07-15T00:51:44Z

Test build #125862 has finished for PR 28950 at commit b90ea32.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
skeleton_class = types.new_class(
class CloudPickler(Pickler):
function/class builtin-saving method (save_global), to serialize dynamic
is_anyclass = issubclass(t, type)
except TypeError: # t is not a class (old Boost; see SF #502085)

HyukjinKwon · 2020-07-15T01:16:40Z

Let's track it in #29114.

probot-autolabeler bot added CORE PYTHON labels Jun 29, 2020

HyukjinKwon reviewed Jun 30, 2020

View reviewed changes

viirya reviewed Jun 30, 2020

View reviewed changes

probot-autolabeler bot added the BUILD label Jun 30, 2020

HyukjinKwon mentioned this pull request Jul 10, 2020

[SPARK-32138] Drop Python 2.7, 3.4 and 3.5 #28957

Closed

probot-autolabeler bot added INFRA WEB UI labels Jul 14, 2020

Update cloudpickle to v1.4.1

b90ea32

HyukjinKwon reviewed Jul 15, 2020

View reviewed changes

HyukjinKwon closed this Jul 15, 2020

HyukjinKwon mentioned this pull request Jul 15, 2020

[SPARK-32094][PYTHON] Update cloudpickle to v1.5.0 #29114

Closed

codesue deleted the SPARK-32094-update-cloudpickle branch July 15, 2020 01:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-32094][PYTHON] Update cloudpickle to v1.4.1 #28950

[SPARK-32094][PYTHON] Update cloudpickle to v1.4.1 #28950

codesue commented Jun 29, 2020 •

edited

Loading

viirya commented Jun 29, 2020

codesue commented Jun 29, 2020

dongjoon-hyun commented Jun 29, 2020

holdenk commented Jun 29, 2020

dongjoon-hyun commented Jun 30, 2020

HyukjinKwon commented Jun 30, 2020

HyukjinKwon left a comment

HyukjinKwon commented Jun 30, 2020

viirya commented Jun 30, 2020

SparkQA commented Jun 30, 2020

SparkQA commented Jun 30, 2020

viirya Jun 30, 2020

SparkQA commented Jun 30, 2020

holdenk commented Jun 30, 2020

HyukjinKwon commented Jul 1, 2020

codesue commented Jul 2, 2020

HyukjinKwon commented Jul 2, 2020

codesue commented Jul 2, 2020

holdenk commented Jul 14, 2020

codesue commented Jul 14, 2020

SparkQA commented Jul 14, 2020

HyukjinKwon commented Jul 14, 2020

HyukjinKwon commented Jul 15, 2020

HyukjinKwon Jul 15, 2020 •

edited

Loading

codesue Jul 15, 2020

SparkQA commented Jul 15, 2020

HyukjinKwon commented Jul 15, 2020

[SPARK-32094][PYTHON] Update cloudpickle to v1.4.1 #28950

[SPARK-32094][PYTHON] Update cloudpickle to v1.4.1 #28950

Conversation

codesue commented Jun 29, 2020 • edited Loading

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

viirya commented Jun 29, 2020

codesue commented Jun 29, 2020

dongjoon-hyun commented Jun 29, 2020

holdenk commented Jun 29, 2020

dongjoon-hyun commented Jun 30, 2020

HyukjinKwon commented Jun 30, 2020

HyukjinKwon left a comment

Choose a reason for hiding this comment

HyukjinKwon commented Jun 30, 2020

viirya commented Jun 30, 2020

SparkQA commented Jun 30, 2020

SparkQA commented Jun 30, 2020

viirya Jun 30, 2020

Choose a reason for hiding this comment

SparkQA commented Jun 30, 2020

holdenk commented Jun 30, 2020

HyukjinKwon commented Jul 1, 2020

codesue commented Jul 2, 2020

HyukjinKwon commented Jul 2, 2020

codesue commented Jul 2, 2020

holdenk commented Jul 14, 2020

codesue commented Jul 14, 2020

SparkQA commented Jul 14, 2020

HyukjinKwon commented Jul 14, 2020

HyukjinKwon commented Jul 15, 2020

HyukjinKwon Jul 15, 2020 • edited Loading

Choose a reason for hiding this comment

codesue Jul 15, 2020

Choose a reason for hiding this comment

SparkQA commented Jul 15, 2020

HyukjinKwon commented Jul 15, 2020

codesue commented Jun 29, 2020 •

edited

Loading

HyukjinKwon Jul 15, 2020 •

edited

Loading