
[SPARK-1267][SPARK-18129] Allow PySpark to be pip installed #15659

Closed · 109 commits

Conversation

@holdenk (Contributor) commented Oct 27, 2016

What changes were proposed in this pull request?

This PR aims to provide a pip-installable PySpark package. It does a bunch of work to copy the jars over and package them with the Python code (to prevent the mismatches that come from using one version of the Python code with a different version of the JARs). It does not currently publish to PyPI, but that is the natural follow-up (SPARK-18129).

Done:

  • pip installable on conda [manually tested]
  • setup.py install on a non-pip-managed system (RHEL) with YARN [manually tested]
  • Automated testing of this (virtualenv)
  • packaging and signing with release-build*

Possible follow up work:

  • release-build update to publish to PyPI (SPARK-18128)
  • figure out who owns the pyspark package name on prod PyPI (is it someone within the project, should we ask PyPI, or should we choose a different name to publish under, like ApachePySpark?)
  • Windows support and/or testing (SPARK-18136)
  • investigate details of wheel caching and see if we can avoid cleaning the wheel cache during our test
  • consider how we want to number our dev/snapshot versions

Explicitly out of scope:

  • Using pip installed PySpark to start a standalone cluster
  • Using pip installed PySpark for non-Python Spark programs

*I've done some work to test release-build locally but as a non-committer I've just done local testing.

How was this patch tested?

Automated testing with virtualenv, manual testing with conda, a system-wide install, and YARN integration.

release-build changes were tested locally as a non-committer (no testing of uploading artifacts to Apache staging websites).
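
For a sense of what the virtualenv/conda checks exercise, a minimal local-mode smoke test might look like the sketch below. This is purely illustrative (it assumes only that pyspark and a JRE are available on the machine); it is not the test script added by this PR.

from pyspark.sql import SparkSession

# Start a tiny local-mode session from the pip-installed package and run one job.
spark = (SparkSession.builder
         .master("local[2]")
         .appName("pip-install-smoke-test")
         .getOrCreate())
print(spark.range(10).count())  # expect 10
spark.stop()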

Juliet Hougland and others added 30 commits October 11, 2016 22:44
… list of jars, fix extras_require decl, etc.)
…ject, so create symlinks so we can package the JARs with it
…x symlink farm issue, fix scripts issue, TODO: fix SPARK_HOME and find out why JARs aren't ending up in the install
…add pyspark.bin and pyspark.jars packages and set their package dirs as desired, make the spark scripts check whether they are in a pip-installed environment and, if SPARK_HOME is unset, resolve it with Python [otherwise use the current behaviour]
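
The last commit message above mentions resolving SPARK_HOME with Python when it is unset; a minimal sketch of that idea follows. The function name and fallback logic are illustrative assumptions, not the exact helper added by the PR.

import os

def resolve_spark_home():
    # Prefer an explicitly set SPARK_HOME, matching the existing behaviour.
    if "SPARK_HOME" in os.environ:
        return os.environ["SPARK_HOME"]
    # In a pip-installed layout the jars and scripts ship inside the package,
    # so the package directory itself can stand in for SPARK_HOME.
    import pyspark
    return os.path.dirname(os.path.abspath(pyspark.__file__))

print(resolve_spark_home())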
@SparkQA commented Nov 14, 2016

Test build #68620 has finished for PR 15659 at commit df5a3f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 14, 2016

Test build #68619 has finished for PR 15659 at commit dd243a2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen (Contributor) commented:

> I think we will want to keep the release-build tagging so that the artifacts are correct for the different Hadoop versions.

Ah, so the idea is that we'll make separate PyPI artifacts for each Hadoop version and want that to be reflected in the version shown in Python?

@SparkQA commented Nov 15, 2016

Test build #68638 has finished for PR 15659 at commit d753d80.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk (Contributor, Author) commented Nov 15, 2016

@JoshRosen - yes; since we ship the jars with the packages, we want people to be able to install the correct package for the Hadoop distribution they are running with/against.
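
As an aside, PEP 440 local version labels are one way such per-Hadoop builds can surface in the package version; the artifact names later in this thread (e.g. pyspark-2.1.0+hadoop2.7) use that form. The snippet below is a hypothetical illustration only; the HADOOP_FLAVOR variable and the hard-coded base version are assumptions, not the PR's setup.py logic.

import os

base_version = "2.1.0"  # assumed base version, for illustration only
hadoop_flavor = os.environ.get("HADOOP_FLAVOR")  # hypothetical variable name
# Append a PEP 440 local version label such as "+hadoop2.7" when a flavor is set.
version = base_version + ("+" + hadoop_flavor if hadoop_flavor else "")
print(version)  # e.g. "2.1.0+hadoop2.7"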

@holdenk (Contributor, Author) commented Nov 15, 2016

Ping @JoshRosen / @rxin :)

graft deps/bin
recursive-include deps/examples *.py
recursive-include lib *.zip
include README.md
A reviewer (Contributor) commented on these packaging manifest rules:

@holdenk - When you look at the packaged files, do you see a bunch of cruft like .pyc files and the like in there?

If so, you may want to add something like this here:

global-exclude *.py[cod] __pycache__ .DS_Store

Just checking since it's a common problem with Python packaging. I can check this for you myself later today, if you want.

@holdenk (Contributor, Author) replied on Nov 15, 2016:

So it wouldn't happen with the make release scripts since they use a fresh copy of the source, but if we were making the packages by hand those could certainly show up. I'll add the exclusion rule since it shouldn't break anything.

@holdenk (Contributor, Author) added:

Actually, even then it shouldn't happen "normally" (since we use recursive-include *.py as the inclusion rule for the python directory, and the only directory we graft is the bin directory). But it's still better to have the exclusion rule in case someone has .pyc files in bin and is rolling their own package. Thanks for the suggestion :)
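
Putting the pieces from this review thread together, the manifest could end up looking roughly like the sketch below. Only the directives quoted above are shown; the real file may differ.

graft deps/bin
recursive-include deps/examples *.py
recursive-include lib *.zip
include README.md
# Guard against bytecode and OS cruft if someone rolls their own package
global-exclude *.py[cod] __pycache__ .DS_Store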

@SparkQA commented Nov 15, 2016

Test build #68669 has finished for PR 15659 at commit e139855.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk (Contributor, Author) commented Nov 16, 2016

I've added the extra exclusion rules for extra safety in packaging :) cc @rxin - does this look good to merge?

pip install dist/*.tar.gz"""

# Figure out where the jars are we need to package with PySpark.
JARS_PATH = glob.glob(os.path.join(SPARK_HOME, "assembly/target/scala-*/jars/"))
A reviewer (Member) commented on this line:

Should we respect and use $SPARK_SCALA_VERSION here if defined? We do that in bin/spark-class.

@holdenk (Contributor, Author) replied:

It might not be defined if someone is just building their own sdist or manually installing from source rather than using the packaging scripts, so I'd rather avoid assuming $SPARK_SCALA_VERSION is present.
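
For reference, the reviewer's suggestion could look something like the sketch below, with the glob kept as the fallback. This is a hypothetical alternative, not what the PR's setup.py does (the PR keeps the glob-only approach).

import glob
import os

def find_jars_path(spark_home):
    # Honour $SPARK_SCALA_VERSION when it is set, as bin/spark-class does.
    scala_version = os.environ.get("SPARK_SCALA_VERSION")
    if scala_version:
        candidate = os.path.join(spark_home, "assembly/target/scala-%s/jars/" % scala_version)
        if os.path.isdir(candidate):
            return [candidate]
    # Otherwise fall back to globbing across whatever Scala versions were built.
    return glob.glob(os.path.join(spark_home, "assembly/target/scala-*/jars/"))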

@holdenk (Contributor, Author) commented Nov 16, 2016

cc @jkbradley / @davies / @JoshRosen / @rxin - this seems to be in an OK state to merge if one of you has the bandwidth. I'd really like to get this in before we cut RC1.

@JoshRosen (Contributor) commented:
I'm going to merge this to master now and will look into cherry-picking to branch-2.1.

@asfgit closed this in a36a76a on Nov 16, 2016
@rgbkrk (Contributor) commented Nov 16, 2016

🙇 Thank you everyone!

@holdenk (Contributor, Author) commented Nov 17, 2016

@JoshRosen thank you so much! That's awesome. I've had some offline chats about why I think we should merge this for 2.1 as well (hopefully @rxin can chime in here with his thoughts), but the general argument is:

I'd like to argue for its inclusion in 2.1, since it seems to follow both the official and de facto conventions for inclusion: namely, this PR existed before the branch cut and is primarily additive, and the idea itself has been proposed in various forms since 2014. Also, de facto, we added much of the streaming functionality between RCs during 2.0, which was even larger than this.

If we want to publish to PyPI for 2.2 (which I think would be awesome), it would probably be best to have the pip-installable artifacts available in a prior release, so we can work out any unexpected kinks before going ahead with full PyPI publishing, which might reach a different audience.

Thank you so much to everyone involved @rgbkrk, @nchammas, @jhlch, @felixcheung, @viirya, @davies, @minrk, @mariusvniekerk - this is the PR I'm most excited about getting in recently 💃 😄 📦 👯‍♀️ 🍀 👍 🍵 🐱 🐈 😹 😸 🐍 #2ammytimesorryfortheemoji #pip4life

@JoshRosen (Contributor) commented:
Agreed, so I'm going to cherry-pick this into branch-2.1.

asfgit pushed a commit that referenced this pull request Nov 17, 2016
(The commit message repeats the PR description quoted at the top of this page.)

Author: Holden Karau <[email protected]>
Author: Juliet Hougland <[email protected]>
Author: Juliet Hougland <[email protected]>

Closes #15659 from holdenk/SPARK-1267-pip-install-pyspark.
@holdenk (Contributor, Author) commented Nov 17, 2016

Yay! :)

@minrk commented Nov 17, 2016

Awesome, thanks @holdenk!

@rxin (Contributor) commented Nov 29, 2016

@holdenk why are we introducing so many tarballs for Python?


pyspark-2.1.0+hadoop2.3.tar.gz           2016-11-28 20:17   165M
pyspark-2.1.0+hadoop2.3.tar.gz.asc       2016-11-28 20:17   490
pyspark-2.1.0+hadoop2.3.tar.gz.md5       2016-11-28 20:17   81
pyspark-2.1.0+hadoop2.3.tar.gz.sha       2016-11-28 20:17   272
pyspark-2.1.0+hadoop2.4.tar.gz           2016-11-28 20:17   166M
pyspark-2.1.0+hadoop2.4.tar.gz.asc       2016-11-28 20:17   490
pyspark-2.1.0+hadoop2.4.tar.gz.md5       2016-11-28 20:17   81
pyspark-2.1.0+hadoop2.4.tar.gz.sha       2016-11-28 20:17   272
pyspark-2.1.0+hadoop2.6.tar.gz           2016-11-28 20:17   170M
pyspark-2.1.0+hadoop2.6.tar.gz.asc       2016-11-28 20:17   490
pyspark-2.1.0+hadoop2.6.tar.gz.md5       2016-11-28 20:17   81
pyspark-2.1.0+hadoop2.6.tar.gz.sha       2016-11-28 20:17   272
pyspark-2.1.0+hadoop2.7.tar.gz           2016-11-28 20:17   172M
pyspark-2.1.0+hadoop2.7.tar.gz.asc       2016-11-28 20:17   490
pyspark-2.1.0+hadoop2.7.tar.gz.md5       2016-11-28 20:17   81
pyspark-2.1.0+hadoop2.7.tar.gz.sha       2016-11-28 20:17   272
pyspark-2.1.0+without.hadoop.tar.gz      2016-11-28 20:17   103M
pyspark-2.1.0+without.hadoop.tar.gz.asc  2016-11-28 20:17   490
pyspark-2.1.0+without.hadoop.tar.gz.md5  2016-11-28 20:17   123
pyspark-2.1.0+without.hadoop.tar.gz.sha  2016-11-28 20:17   292

Can we just have one with the latest Hadoop version?

@holdenk (Contributor, Author) commented Nov 29, 2016 via email

@rxin (Contributor) commented Nov 29, 2016

The use case I intended for this was installing Spark on a laptop against some local file system, not using pip as a distribution mechanism across a cluster. For that use case, I think limiting this to one tarball would make more sense. This now doubles the release preparation time, which already takes a while.

@holdenk (Contributor, Author) commented Nov 29, 2016 via email

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?

Fix the flags used to specify the hadoop version

## How was this patch tested?

Manually tested as part of apache#15659 by having the build succeed.

cc joshrosen

Author: Holden Karau <[email protected]>

Closes apache#15860 from holdenk/minor-fix-release-build-script.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
(The commit message repeats the PR description quoted at the top of this page.)

Author: Holden Karau <[email protected]>
Author: Juliet Hougland <[email protected]>
Author: Juliet Hougland <[email protected]>

Closes apache#15659 from holdenk/SPARK-1267-pip-install-pyspark.