[SPARK-1267][SPARK-18129] Allow PySpark to be pip installed #15659
Conversation
… list of jars, fix extras_require decl, etc.)
…ject, so create symlinks so we can package the JARs with it
…x symlink farm issue, fix scripts issue, TODO: fix SPARK_HOME and find out why JARs aren't ending up in the install
…add pyspark.bin and pyspark.jars packages and set their package dirs as desired, make the spark scripts check whether they are in a pip installed environment and, if SPARK_HOME is unset, resolve it with Python [otherwise use the current behaviour]
…d spark home finder
…d_spark_home.py around
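For reference, here is a rough sketch of the kind of SPARK_HOME resolution described above (illustrative only; the actual helper added by this PR is find_spark_home.py, and its details may differ):

```python
#!/usr/bin/env python
# Illustrative sketch: resolve SPARK_HOME when PySpark was pip installed and
# the environment variable is not set. Assumes the jars ship inside the
# installed pyspark package, as this PR arranges.
from __future__ import print_function

import os
import sys


def _find_spark_home():
    """Prefer an explicitly set SPARK_HOME, then fall back to the package dir."""
    if "SPARK_HOME" in os.environ:
        return os.environ["SPARK_HOME"]
    try:
        import pyspark
        # Use the directory the pyspark package was installed into.
        return os.path.dirname(os.path.realpath(pyspark.__file__))
    except ImportError:
        print("Could not find valid SPARK_HOME while searching", file=sys.stderr)
        sys.exit(-1)


if __name__ == "__main__":
    print(_find_spark_home())
```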
Test build #68620 has finished for PR 15659 at commit
Test build #68619 has finished for PR 15659 at commit
Ah, so the idea is that we'll make separate PyPI artifacts for each Hadoop version and want that to be reflected in the version shown in Python?
Test build #68638 has finished for PR 15659 at commit
@JoshRosen - yes, since we ship the jars with them, we want people to be able to install the correct package for the Hadoop distribution they are running with/against.
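To make that concrete, here is a minimal sketch of the kind of PEP 440 local version label involved (the names are illustrative, not the PR's actual variables):

```python
# Illustrative only: a local version label that records which Hadoop build
# the bundled jars came from, matching names like pyspark-2.1.0+hadoop2.7.tar.gz.
base_version = "2.1.0"        # assumed Spark release version
hadoop_profile = "hadoop2.7"  # assumed Hadoop build profile
version = "{0}+{1}".format(base_version, hadoop_profile)
print(version)  # -> 2.1.0+hadoop2.7
```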
Ping @JoshRosen / @rxin :)
graft deps/bin
recursive-include deps/examples *.py
recursive-include lib *.zip
include README.md
@holdenk - When you look at the packaged files, do you see a bunch of cruft like .pyc files and the like in there? If so, you may want to add something like this here:
global-exclude *.py[cod] __pycache__ .DS_Store
Just checking since it's a common problem with Python packaging. I can check this for you myself later today, if you want.
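If it helps, here is a quick way such a check could be scripted (a minimal sketch; the sdist filename is an assumption, adjust to whatever lands in dist/):

```python
# Illustrative check for packaging cruft in a built sdist.
import fnmatch
import tarfile

with tarfile.open("dist/pyspark-2.1.0.tar.gz") as sdist:  # assumed path
    cruft = [name for name in sdist.getnames()
             if fnmatch.fnmatch(name, "*.py[cod]")
             or "__pycache__" in name
             or name.endswith(".DS_Store")]

print(cruft if cruft else "no cruft found")
```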
So it wouldn't happen with the make release scripts since they use a fresh copy of the source, but if we were making the packages by hand those could certainly show up. I'll add the exclusion rule since it shouldn't break anything.
Actually, even then it shouldn't happen "normally" (since we use recursive-include *.py as the inclusion rule for the python directory, and the only directory we graft is bin). But it's still better to have the exclusion rule in case someone has pyc files in bin and is rolling their own package. Thanks for the suggestion :)
…sembly jar scala versions
Test build #68669 has finished for PR 15659 at commit
I've added the extra exclusion rules for safety in packaging :) cc @rxin - does this look good to merge?
pip install dist/*.tar.gz"""

# Figure out where the jars are we need to package with PySpark.
JARS_PATH = glob.glob(os.path.join(SPARK_HOME, "assembly/target/scala-*/jars/")) |
Should we respect and use $SPARK_SCALA_VERSION here if defined? We do that in bin/spark-class.
It might not be defined if someone is just building their own sdist or manually installing from source rather than using the packaging scripts, so I'd rather avoid assuming $SPARK_SCALA_VERSION is present.
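For anyone reading along, a hedged sketch of what optionally honouring the variable might look like, falling back to the existing glob when it is unset (illustrative only, not the exact setup.py code):

```python
# Illustrative only: prefer $SPARK_SCALA_VERSION when the caller exported it,
# otherwise glob across Scala versions as setup.py currently does.
import glob
import os

SPARK_HOME = os.path.abspath("../")  # assumption: run from the python/ directory
scala_version = os.environ.get("SPARK_SCALA_VERSION")
if scala_version:
    JARS_PATH = [os.path.join(
        SPARK_HOME, "assembly/target/scala-{0}/jars/".format(scala_version))]
else:
    JARS_PATH = glob.glob(os.path.join(SPARK_HOME, "assembly/target/scala-*/jars/"))
```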
cc @jkbradley / @davies / @JoshRosen / @rxin - it seems like this should be in an OK state to merge if one of you has the bandwidth? I'd really like to get this in before we cut RC1.
I'm going to merge this to master now and will look into cherry-picking to branch-2.1.
🙇 Thank you everyone!
@JoshRosen thank you so much! That's awesome. I've had some offline chats about why I think we should merge this for 2.1 as well (hopefully @rxin can chime in here with his thoughts). The general argument for its inclusion in 2.1 is that it seems to follow both the official and de facto conventions for inclusion. If we want to publish to PyPI for 2.2 (which I think would be awesome), it would probably be best for us to have the pip installable artifacts available in a prior release, so that we can work out any unexpected kinks before going ahead with full PyPI publishing, which might reach a different audience. Thank you so much to everyone involved @rgbkrk, @nchammas, @jhlch, @felixcheung, @viirya, @davies, @minrk, @mariusvniekerk - this is the PR I'm most excited about getting in recently 💃 😄 📦 👯‍♀️ 🍀 👍 🍵 🐱 🐈 😹 😸 🐍 #2ammytimesorryfortheemoji #pip4life
Agreed, so I'm going to cherry-pick this into branch-2.1.
## What changes were proposed in this pull request?

This PR aims to provide a pip installable PySpark package. This does a bunch of work to copy the jars over and package them with the Python code (to prevent challenges from trying to use different versions of the Python code with different versions of the JAR). It does not currently publish to PyPI but that is the natural follow up (SPARK-18129).

Done:
- pip installable on conda [manually tested]
- setup.py installed on a non-pip managed system (RHEL) with YARN [manually tested]
- Automated testing of this (virtualenv)
- packaging and signing with release-build*

Possible follow up work:
- release-build update to publish to PyPI (SPARK-18128)
- figure out who owns the pyspark package name on prod PyPI (is it someone within the project, or should we ask PyPI, or should we choose a different name to publish with like ApachePySpark?)
- Windows support and/or testing (SPARK-18136)
- investigate details of wheel caching and see if we can avoid cleaning the wheel cache during our test
- consider how we want to number our dev/snapshot versions

Explicitly out of scope:
- Using pip installed PySpark to start a standalone cluster
- Using pip installed PySpark for non-Python Spark programs

*I've done some work to test release-build locally but as a non-committer I've just done local testing.

## How was this patch tested?

Automated testing with virtualenv, manual testing with conda, a system wide install, and YARN integration. release-build changes tested locally as a non-committer (no testing of upload artifacts to Apache staging websites)

Author: Holden Karau <[email protected]>
Author: Juliet Hougland <[email protected]>
Author: Juliet Hougland <[email protected]>

Closes #15659 from holdenk/SPARK-1267-pip-install-pyspark.
Yay! :)
Awesome, thanks @holdenk!
@holdenk why are we introducing so many tarballs for Python?
Can we just have one with the latest Hadoop version?
For the same reasons we have this in the current download; we could of course limit this to a smaller subset if you want.
…On Mon, Nov 28, 2016 at 5:29 PM Reynold Xin ***@***.***> wrote:
@holdenk <https://github.com/holdenk> why are we introducing so many
tarballs for Python?
[ ] pyspark-2.1.0+hadoop2.3.tar.gz 2016-11-28 20:17 165M
[TXT] pyspark-2.1.0+hadoop2.3.tar.gz.asc 2016-11-28 20:17 490
[ ] pyspark-2.1.0+hadoop2.3.tar.gz.md5 2016-11-28 20:17 81
[ ] pyspark-2.1.0+hadoop2.3.tar.gz.sha 2016-11-28 20:17 272
[ ] pyspark-2.1.0+hadoop2.4.tar.gz 2016-11-28 20:17 166M
[TXT] pyspark-2.1.0+hadoop2.4.tar.gz.asc 2016-11-28 20:17 490
[ ] pyspark-2.1.0+hadoop2.4.tar.gz.md5 2016-11-28 20:17 81
[ ] pyspark-2.1.0+hadoop2.4.tar.gz.sha 2016-11-28 20:17 272
[ ] pyspark-2.1.0+hadoop2.6.tar.gz 2016-11-28 20:17 170M
[TXT] pyspark-2.1.0+hadoop2.6.tar.gz.asc 2016-11-28 20:17 490
[ ] pyspark-2.1.0+hadoop2.6.tar.gz.md5 2016-11-28 20:17 81
[ ] pyspark-2.1.0+hadoop2.6.tar.gz.sha 2016-11-28 20:17 272
[ ] pyspark-2.1.0+hadoop2.7.tar.gz 2016-11-28 20:17 172M
[TXT] pyspark-2.1.0+hadoop2.7.tar.gz.asc 2016-11-28 20:17 490
[ ] pyspark-2.1.0+hadoop2.7.tar.gz.md5 2016-11-28 20:17 81
[ ] pyspark-2.1.0+hadoop2.7.tar.gz.sha 2016-11-28 20:17 272
[ ] pyspark-2.1.0+without.hadoop.tar.gz 2016-11-28 20:17 103M
[TXT] pyspark-2.1.0+without.hadoop.tar.gz.asc 2016-11-28 20:17 490
[ ] pyspark-2.1.0+without.hadoop.tar.gz.md5 2016-11-28 20:17 123
[ ] pyspark-2.1.0+without.hadoop.tar.gz.sha 2016-11-28 20:17 292
Can we just have one with the latest Hadoop version?
The use case I intended for this one was to allow installing Spark on a laptop against some local file system, not using pip as a distribution mechanism across a cluster. For that use case I think limiting this to one would make more sense. This now doubles the release preparation time, which already takes a while.
I think pip installing locally to connect to a cluster is something people would want to use - but if people disagree then cutting down the artifacts makes sense, and I'm happy to do that in a follow-up PR.
## What changes were proposed in this pull request?
Fix the flags used to specify the Hadoop version.

## How was this patch tested?
Manually tested as part of apache#15659 by having the build succeed.

cc joshrosen

Author: Holden Karau <[email protected]>

Closes apache#15860 from holdenk/minor-fix-release-build-script.
What changes were proposed in this pull request?
This PR aims to provide a pip installable PySpark package. This does a bunch of work to copy the jars over and package them with the Python code (to prevent challenges from trying to use different versions of the Python code with different versions of the JAR). It does not currently publish to PyPI but that is the natural follow up (SPARK-18129).
Done:
- pip installable on conda [manually tested]
- setup.py installed on a non-pip managed system (RHEL) with YARN [manually tested]
- Automated testing of this (virtualenv)
- packaging and signing with release-build*

Possible follow up work:
- release-build update to publish to PyPI (SPARK-18128)
- figure out who owns the pyspark package name on prod PyPI (is it someone within the project, or should we ask PyPI, or should we choose a different name to publish with like ApachePySpark?)
- Windows support and/or testing (SPARK-18136)
- investigate details of wheel caching and see if we can avoid cleaning the wheel cache during our test
- consider how we want to number our dev/snapshot versions

Explicitly out of scope:
- Using pip installed PySpark to start a standalone cluster
- Using pip installed PySpark for non-Python Spark programs
*I've done some work to test release-build locally but as a non-committer I've just done local testing.
How was this patch tested?
Automated testing with virtualenv, manual testing with conda, a system wide install, and YARN integration.
release-build changes tested locally as a non-committer (no testing of upload artifacts to Apache staging websites)
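As a final illustration of the intended workflow, here is a minimal sketch of using a pip-installed PySpark in local mode (assumes something like `pip install dist/pyspark-*.tar.gz` has already been run, so SPARK_HOME resolves from the installed package):

```python
# Minimal smoke test of a pip-installed PySpark in local mode.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("pip-install-smoke-test")
         .getOrCreate())

print(spark.range(10).count())  # sanity check that the JVM side starts
spark.stop()
```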