-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-32094][PYTHON] Update cloudpickle to v1.4.1 #28950
Conversation
@HyukjinKwon Do we want to drop Python 2 in Spark 3.1.0? Btw, if we want to backport this to 3.0 branch, we also cannot drop Python 2. |
In that case, cloudpickle v1.3.0 might be a better option. |
Thank you for your first contribution, @codesue . Also, since this is related to Python 2.x, ping @holdenk , @mengxr , @gatorsmile , @srowen , @BryanCutler . |
Thanks for the ping @dongjoon-hyun & thanks for working on this PR @codesue, I've been meaning to take a look at cloudpickle's updates. @viirya I think backporting cloudpickle changes is a bit risky given how core it is. I'd rather target the latest cloud pickle to Spark 3.1 if that sounds good to folks? |
+1 for @holdenk 's advice. |
Yeah, to upgrade we should drop Python 2. I target to drop it in Spark 3.1. I will make a PR to officially drop first. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@codesue, BTW is it a clean matching? We should also port https://github.com/cloudpipe/cloudpickle/blob/master/cloudpickle/cloudpickle_fast.py too to leverage C pickle in Python 3.8+
ok to test |
1 similar comment
ok to test |
Test build #124677 has finished for PR 28950 at commit
|
Test build #124680 has finished for PR 28950 at commit
|
@@ -0,0 +1,557 @@ | |||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could add cloudpickle_fast.py to dev/.rat-excludes
to avoid the RAT test error.
Test build #124683 has finished for PR 28950 at commit
|
Do the docs build locally for you @codesue? It's possible we need to file an Spark JIRA ticket to ask Shane to update the version of our doc tooling. |
Ah .. yeah, if this is the Python version issue in the Jenkins, let's file a JIRA under https://issues.apache.org/jira/browse/SPARK-29909. Seems like there are some more works that block this PR ... |
@holdenk I'm able to build the python docs when I run
|
I double checked we can build the documentation with Python 3.7 too. R build is fine. It's not relevant. Then seems to be it's the problem of the Python version installed in Jenkins machine for documentation build in Sphinx. I will file a separate ticket soon for that. If you want to at least test though, you can disable the documentation build for now with the changes d74aa53 and https://github.com/apache/spark/pull/28957/files#diff-871d87c62d4e9228a47145a8894b6694R161. I am pushing to drop Python 2, 3.4 and 3.5 so this PR can be unblocked at #28957. Feel free to give more feedback on [DISCUSS] Drop Python 2, 3.4 and 3.5. |
@HyukjinKwon, I'll wait for #28957 to be merged. |
### What changes were proposed in this pull request? This PR aims to drop Python 2.7, 3.4 and 3.5. Roughly speaking, it removes all the widely known Python 2 compatibility workarounds such as `sys.version` comparison, `__future__`. Also, it removes the Python 2 dedicated codes such as `ArrayConstructor` in Spark. ### Why are the changes needed? 1. Unsupport EOL Python versions 2. Reduce maintenance overhead and remove a bit of legacy codes and hacks for Python 2. 3. PyPy2 has a critical bug that causes a flaky test, SPARK-28358 given my testing and investigation. 4. Users can use Python type hints with Pandas UDFs without thinking about Python version 5. Users can leverage one latest cloudpickle, #28950. With Python 3.8+ it can also leverage C pickle. ### Does this PR introduce _any_ user-facing change? Yes, users cannot use Python 2.7, 3.4 and 3.5 in the upcoming Spark version. ### How was this patch tested? Manually tested and also tested in Jenkins. Closes #28957 from HyukjinKwon/SPARK-32138. Authored-by: HyukjinKwon <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
It looks like it's accidentally picking up some unrelated changes. When I have this happen to me I normally just re-create the PR from a fresh branch (I know there's other ways to clean it up, but my git magic is not strong enough). |
Thanks, @holdenk! I'll do that. |
Test build #125844 has finished for PR 28950 at commit
|
Looks something went wrong during the rebase. |
I just resolved it and pushed by myself |
@@ -42,27 +42,42 @@ | |||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@codesue, we should make it as a proper package so cloudpickle_fast can be properly used as well. Seems like v.1.5 was release too .. we should better match it as well.
Let me create a PR by myself to upgrade to 1.5, and co-author you together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds great! Thank you!!
Test build #125862 has finished for PR 28950 at commit
|
Let's track it in #29114. |
What changes were proposed in this pull request?
Update cloudpickle in PySpark to cloudpickle v1.4.1 (See https://github.com/cloudpipe/cloudpickle/blob/v1.4.1/cloudpickle/cloudpickle.py)
This is currently the highest version of cloudpickle, and it does not support Python 2.
Port cloudpickle_fast.py to leverage C pickle in Python 3.8+ (See https://github.com/cloudpipe/cloudpickle/blob/v1.4.1/cloudpickle/cloudpickle_fast.py)
Why are the changes needed?
PySpark's cloudpickle.py and versions of cloudpickle below 1.3.0 interfere with dill unpickling because they define types.ClassType, which is undefined in dill. This results in the following error:
(See cloudpipe/cloudpickle#82)
This was fixed for cloudpickle 1.3.0+ (cloudpipe/cloudpickle#337), but PySpark's cloudpickle.py doesn't have this change yet.
Does this PR introduce any user-facing change?
No
How was this patch tested?
python/run-tests