
[SPARK-1267][SPARK-18129] Allow PySpark to be pip installed #15659

Closed · 109 commits

Conversation

@holdenk (Contributor) commented Oct 27, 2016

What changes were proposed in this pull request?

This PR aims to provide a pip-installable PySpark package. It does a bunch of work to copy the jars over and package them with the Python code (to prevent the mismatches that come from using one version of the Python code with a different version of the JARs). It does not currently publish to PyPI, but that is the natural follow-up (SPARK-18129).

Done:

  • pip installable on conda [manually tested]
  • setup.py install on a non-pip-managed system (RHEL) with YARN [manually tested]
  • Automated testing of this (virtualenv)
  • packaging and signing with release-build*

Possible follow up work:

  • release-build update to publish to PyPI (SPARK-18128)
  • figure out who owns the pyspark package name on prod PyPI (is it someone within the project, should we ask PyPI, or should we choose a different name to publish under, like ApachePySpark?)
  • Windows support and/or testing (SPARK-18136)
  • investigate details of wheel caching and see if we can avoid cleaning the wheel cache during our test
  • consider how we want to number our dev/snapshot versions

Explicitly out of scope:

  • Using pip installed PySpark to start a standalone cluster
  • Using pip installed PySpark for non-Python Spark programs

*I've done some work to test release-build locally but as a non-committer I've just done local testing.

How was this patch tested?

Automated testing with virtualenv, manual testing with conda, a system-wide install, and YARN integration.

release-build changes were tested locally as a non-committer (no testing of uploading artifacts to Apache staging websites).
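
For a sense of what the virtualenv/conda checks exercise, a minimal local-mode smoke test might look like the sketch below. This is purely illustrative (it assumes only that pyspark and a JRE are available on the machine); it is not the test script added by this PR.

from pyspark.sql import SparkSession

# Start a tiny local-mode session from the pip-installed package and run one job.
spark = (SparkSession.builder
         .master("local[2]")
         .appName("pip-install-smoke-test")
         .getOrCreate())
print(spark.range(10).count())  # expect 10
spark.stop()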

Juliet Hougland and others added 30 commits October 11, 2016 22:44
… list of jars, fix extras_require decl, etc.)
…ject, so create symlinks so we can package the JARs with it
…x symlink farm issue, fix scripts issue, TODO: fix SPARK_HOME and find out why JARs aren't ending up in the install
…add pyspark.bin and pyspark.jars packages and set their package dirs as desired, make the spark scripts check whether they are in a pip-installed environment and, if SPARK_HOME is unset, resolve it with Python [otherwise use the current behaviour]
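
The last commit message above mentions resolving SPARK_HOME with Python when it is unset; a minimal sketch of that idea follows. The function name and fallback logic are illustrative assumptions, not the exact helper added by the PR.

import os

def resolve_spark_home():
    # Prefer an explicitly set SPARK_HOME, matching the existing behaviour.
    if "SPARK_HOME" in os.environ:
        return os.environ["SPARK_HOME"]
    # In a pip-installed layout the jars and scripts ship inside the package,
    # so the package directory itself can stand in for SPARK_HOME.
    import pyspark
    return os.path.dirname(os.path.abspath(pyspark.__file__))

print(resolve_spark_home())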
@SparkQA commented Nov 14, 2016

Test build #68620 has finished for PR 15659 at commit df5a3f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 14, 2016

Test build #68619 has finished for PR 15659 at commit dd243a2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen (Contributor) commented:

> I think we will want to keep the release-build tagging so that the artifacts are correct for the different Hadoop versions.

Ah, so the idea is that we'll make separate PyPI artifacts for each Hadoop version and want that to be reflected in the version shown in Python?

@SparkQA commented Nov 15, 2016

Test build #68638 has finished for PR 15659 at commit d753d80.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk (Contributor, Author) commented Nov 15, 2016

@JoshRosen - yes; since we ship the jars with the packages, we want people to be able to install the correct package for the Hadoop distribution they are running with/against.
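
As an aside, PEP 440 local version labels are one way such per-Hadoop builds can surface in the package version; the artifact names later in this thread (e.g. pyspark-2.1.0+hadoop2.7) use that form. The snippet below is a hypothetical illustration only; the HADOOP_FLAVOR variable and the hard-coded base version are assumptions, not the PR's setup.py logic.

import os

base_version = "2.1.0"  # assumed base version, for illustration only
hadoop_flavor = os.environ.get("HADOOP_FLAVOR")  # hypothetical variable name
# Append a PEP 440 local version label such as "+hadoop2.7" when a flavor is set.
version = base_version + ("+" + hadoop_flavor if hadoop_flavor else "")
print(version)  # e.g. "2.1.0+hadoop2.7"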

@holdenk (Contributor, Author) commented Nov 15, 2016

Ping @JoshRosen / @rxin :)

graft deps/bin
recursive-include deps/examples *.py
recursive-include lib *.zip
include README.md
A reviewer (Contributor) commented on these packaging manifest rules:

@holdenk - When you look at the packaged files, do you see a bunch of cruft like .pyc files and the like in there?

If so, you may want to add something like this here:

global-exclude *.py[cod] __pycache__ .DS_Store

Just checking since it's a common problem with Python packaging. I can check this for you myself later today, if you want.

@holdenk (Contributor, Author) replied on Nov 15, 2016:

So it wouldn't happen with the make release scripts since they use a fresh copy of the source, but if we were making the packages by hand those could certainly show up. I'll add the exclusion rule since it shouldn't break anything.

@holdenk (Contributor, Author) added:

Actually, even then it shouldn't happen "normally" (since we use recursive-include *.py as the inclusion rule for the python directory, and the only directory we graft is the bin directory). But it's still better to have the exclusion rule in case someone has .pyc files in bin and is rolling their own package. Thanks for the suggestion :)
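
Putting the pieces from this review thread together, the manifest could end up looking roughly like the sketch below. Only the directives quoted above are shown; the real file may differ.

graft deps/bin
recursive-include deps/examples *.py
recursive-include lib *.zip
include README.md
# Guard against bytecode and OS cruft if someone rolls their own package
global-exclude *.py[cod] __pycache__ .DS_Store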

@SparkQA commented Nov 15, 2016

Test build #68669 has finished for PR 15659 at commit e139855.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk (Contributor, Author) commented Nov 16, 2016

I've added the extra exclusion rules for extra safety in packaging :) cc @rxin - does this look good to merge?

pip install dist/*.tar.gz"""

# Figure out where the jars are we need to package with PySpark.
JARS_PATH = glob.glob(os.path.join(SPARK_HOME, "assembly/target/scala-*/jars/"))
A reviewer (Member) commented on this line:

Should we respect and use $SPARK_SCALA_VERSION here if defined? We do that in bin/spark-class.

@holdenk (Contributor, Author) replied:

It might not be defined if someone is just building their own sdist or manually installing from source rather than using the packaging scripts, so I'd rather avoid assuming $SPARK_SCALA_VERSION is present.
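
For reference, the reviewer's suggestion could look something like the sketch below, with the glob kept as the fallback. This is a hypothetical alternative, not what the PR's setup.py does (the PR keeps the glob-only approach).

import glob
import os

def find_jars_path(spark_home):
    # Honour $SPARK_SCALA_VERSION when it is set, as bin/spark-class does.
    scala_version = os.environ.get("SPARK_SCALA_VERSION")
    if scala_version:
        candidate = os.path.join(spark_home, "assembly/target/scala-%s/jars/" % scala_version)
        if os.path.isdir(candidate):
            return [candidate]
    # Otherwise fall back to globbing across whatever Scala versions were built.
    return glob.glob(os.path.join(spark_home, "assembly/target/scala-*/jars/"))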

@holdenk (Contributor, Author) commented Nov 16, 2016

cc @jkbradley / @davies / @JoshRosen / @rxin - this seems to be in an OK state to merge if one of you has the bandwidth. I'd really like to get this in before we cut RC1.

@JoshRosen (Contributor) commented:
I'm going to merge this to master now and will look into cherry-picking to branch-2.1.

@asfgit closed this in a36a76a on Nov 16, 2016
@rgbkrk (Contributor) commented Nov 16, 2016

🙇 Thank you everyone!

@holdenk (Contributor, Author) commented Nov 17, 2016

@JoshRosen thank you so much! That's awesome. I've had some offline chats about why I think we should merge this for 2.1 as well (hopefully @rxin can chime in here with his thoughts), but the general argument is:

I'd like to argue for its inclusion in 2.1, since it seems to follow both the official and de facto conventions for inclusion: namely, this PR existed before the branch cut and is primarily additive, and the idea itself has been proposed in various forms since 2014. Also, de facto, we added much of the streaming functionality between RCs during 2.0, which was even larger than this.

If we want to publish to PyPI for 2.2 (which I think would be awesome), it would probably be best to have the pip-installable artifacts available in a prior release, so we can work out any unexpected kinks before going ahead with full PyPI publishing, which might reach a different audience.

Thank you so much to everyone involved @rgbkrk, @nchammas, @jhlch, @felixcheung, @viirya, @davies, @minrk, @mariusvniekerk - this is the PR I'm most excited about getting in recently 💃 😄 📦 👯‍♀️ 🍀 👍 🍵 🐱 🐈 😹 😸 🐍 #2ammytimesorryfortheemoji #pip4life

@JoshRosen (Contributor) commented:
Agreed, so I'm going to cherry-pick this into branch-2.1.

asfgit pushed a commit that referenced this pull request Nov 17, 2016
(The commit message repeats the PR description quoted at the top of this page.)

Author: Holden Karau <[email protected]>
Author: Juliet Hougland <[email protected]>
Author: Juliet Hougland <[email protected]>

Closes #15659 from holdenk/SPARK-1267-pip-install-pyspark.
@holdenk (Contributor, Author) commented Nov 17, 2016

Yay! :)

@minrk commented Nov 17, 2016

Awesome, thanks @holdenk!

@rxin (Contributor) commented Nov 29, 2016

@holdenk why are we introducing so many tarballs for Python?


pyspark-2.1.0+hadoop2.3.tar.gz           2016-11-28 20:17   165M
pyspark-2.1.0+hadoop2.3.tar.gz.asc       2016-11-28 20:17   490
pyspark-2.1.0+hadoop2.3.tar.gz.md5       2016-11-28 20:17   81
pyspark-2.1.0+hadoop2.3.tar.gz.sha       2016-11-28 20:17   272
pyspark-2.1.0+hadoop2.4.tar.gz           2016-11-28 20:17   166M
pyspark-2.1.0+hadoop2.4.tar.gz.asc       2016-11-28 20:17   490
pyspark-2.1.0+hadoop2.4.tar.gz.md5       2016-11-28 20:17   81
pyspark-2.1.0+hadoop2.4.tar.gz.sha       2016-11-28 20:17   272
pyspark-2.1.0+hadoop2.6.tar.gz           2016-11-28 20:17   170M
pyspark-2.1.0+hadoop2.6.tar.gz.asc       2016-11-28 20:17   490
pyspark-2.1.0+hadoop2.6.tar.gz.md5       2016-11-28 20:17   81
pyspark-2.1.0+hadoop2.6.tar.gz.sha       2016-11-28 20:17   272
pyspark-2.1.0+hadoop2.7.tar.gz           2016-11-28 20:17   172M
pyspark-2.1.0+hadoop2.7.tar.gz.asc       2016-11-28 20:17   490
pyspark-2.1.0+hadoop2.7.tar.gz.md5       2016-11-28 20:17   81
pyspark-2.1.0+hadoop2.7.tar.gz.sha       2016-11-28 20:17   272
pyspark-2.1.0+without.hadoop.tar.gz      2016-11-28 20:17   103M
pyspark-2.1.0+without.hadoop.tar.gz.asc  2016-11-28 20:17   490
pyspark-2.1.0+without.hadoop.tar.gz.md5  2016-11-28 20:17   123
pyspark-2.1.0+without.hadoop.tar.gz.sha  2016-11-28 20:17   292

Can we just have one with the latest Hadoop version?

@holdenk (Contributor, Author) commented Nov 29, 2016 via email

@rxin (Contributor) commented Nov 29, 2016

The use case I intended for this was installing Spark on a laptop against some local file system, not using pip as a distribution mechanism across a cluster. For that use case, I think limiting this to one tarball would make more sense. This now doubles the release preparation time, which already takes a while.

@holdenk (Contributor, Author) commented Nov 29, 2016 via email

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?

Fix the flags used to specify the hadoop version

## How was this patch tested?

Manually tested as part of apache#15659 by having the build succeed.

cc joshrosen

Author: Holden Karau <[email protected]>

Closes apache#15860 from holdenk/minor-fix-release-build-script.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
(The commit message repeats the PR description quoted at the top of this page.)

Author: Holden Karau <[email protected]>
Author: Juliet Hougland <[email protected]>
Author: Juliet Hougland <[email protected]>

Closes apache#15659 from holdenk/SPARK-1267-pip-install-pyspark.