[SPARK-1267][SPARK-18129] Allow PySpark to be pip installed #15659
Changes from 98 commits
@@ -0,0 +1,41 @@
#!/usr/bin/env bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Attempts to find a proper value for SPARK_HOME. Should be included using "source" directive.

FIND_SPARK_HOME_PYTHON_SCRIPT="$(cd "`dirname "$0"`"; pwd)/find_spark_home.py"

# Short circuit if the user already has this set.
if [ ! -z "${SPARK_HOME}" ]; then
  exit 0
elif [ ! -f $FIND_SPARK_HOME_PYTHON_SCRIPT ]; then
Review comment: Actually, I'll just paste ... [ShellCheck output omitted from this excerpt]. Some of these aren't super-important, but the word-splitting ones are.
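The word-splitting problem ShellCheck flags can be seen in a few lines of standalone bash. The path below is made up for the demonstration and is not part of the PR:

```shell
# An unquoted expansion of a path containing a space is split into two
# words, so the file test misfires; quoting keeps the path as one word.
dir="/tmp/spark home"
mkdir -p "$dir"
printf 'x\n' > "$dir/f.txt"

# Unquoted: $dir splits into "/tmp/spark" and "home", so [ sees too
# many arguments and the test fails.
if [ -f $dir/f.txt ] 2>/dev/null; then
  unquoted="found"
else
  unquoted="not found"
fi

# Quoted: the path stays one word and the file is found.
if [ -f "$dir/f.txt" ]; then
  quoted="found"
else
  quoted="not found"
fi

echo "unquoted: $unquoted, quoted: $quoted"
rm -rf "$dir"
```

This is why quoting `$FIND_SPARK_HOME_PYTHON_SCRIPT` and similar expansions matters whenever the install path could contain spaces.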
  # If we are not in the same directory as find_spark_home.py we are not pip installed so we don't
  # need to search the different Python directories for a Spark installation.
  # Note that, if the user has pip installed PySpark but is directly calling pyspark-shell or
  # spark-submit in another directory, we want to use that version of PySpark rather than the
  # pip installed version of PySpark.
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
else
  # We are pip installed, use the Python script to resolve a reasonable SPARK_HOME
  # Default to standard python interpreter unless told otherwise
  if [[ -z "$PYSPARK_DRIVER_PYTHON" ]]; then
    PYSPARK_DRIVER_PYTHON="${PYSPARK_PYTHON:-"python"}"
  fi
  export SPARK_HOME=`$PYSPARK_DRIVER_PYTHON $FIND_SPARK_HOME_PYTHON_SCRIPT`
fi
@@ -23,7 +23,7 @@

 # Figure out where Spark is installed
 if [ -z "${SPARK_HOME}" ]; then
-  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
+  source `dirname $0`/find-spark-home
 fi

Review comment: Let's also apply the same fix for the ShellCheck complaints here and at all other occurrences of this line.

 if [ -z "$SPARK_ENV_LOADED" ]; then
@@ -18,7 +18,7 @@
 #

 if [ -z "${SPARK_HOME}" ]; then
-  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
+  source `dirname $0`/find-spark-home
 fi

 . "${SPARK_HOME}"/bin/load-spark-env.sh

@@ -36,7 +36,7 @@ else
 fi

 # Find Spark jars.
-if [ -f "${SPARK_HOME}/RELEASE" ]; then
+if [ -d "${SPARK_HOME}/jars" ]; then
Review comment: Why did this get changed from the check for the RELEASE file?

Review reply: Because both pip installed PySpark and RELEASE Spark have jars in the jars directory, it seems more reasonable to just check if the jars directory exists directly rather than checking for a file which indicates that the JARs directory is present.

Review reply: Makes sense. This seems reasonable to me.

   SPARK_JARS_DIR="${SPARK_HOME}/jars"
 else
   SPARK_JARS_DIR="${SPARK_HOME}/assembly/target/scala-$SPARK_SCALA_VERSION/jars"
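The branch logic above can be exercised in isolation. The sketch below uses a throwaway temp directory as a stand-in for SPARK_HOME and an assumed Scala version for the fallback branch; neither value comes from the PR:

```shell
# Both pip-installed PySpark and a release build keep their JARs under
# $SPARK_HOME/jars, so testing for the directory covers both layouts.
SPARK_HOME="$(mktemp -d)"          # stand-in install root
SPARK_SCALA_VERSION="2.11"         # assumed, only used by the fallback

mkdir -p "$SPARK_HOME/jars"        # simulate a packaged install
if [ -d "$SPARK_HOME/jars" ]; then
  SPARK_JARS_DIR="$SPARK_HOME/jars"
else
  # Source checkout: JARs live under the assembly build output.
  SPARK_JARS_DIR="$SPARK_HOME/assembly/target/scala-$SPARK_SCALA_VERSION/jars"
fi
echo "$SPARK_JARS_DIR"
rm -rf "$SPARK_HOME"
```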
@@ -29,7 +29,7 @@ esac
 set -o posix

 if [ -z "${SPARK_HOME}" ]; then
-  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
+  source `dirname $0`/find-spark-home
 fi

Review comment: Many files are changed with this. Do those scripts need to find SPARK_HOME with pip installed Spark? I assume they should be shipped with the Spark release?

Review reply: So for pip installed PySpark we need to do a bit more work to find the SPARK_HOME, using a common script to determine if we need to do more hunting rather than duplicating the logic.

 export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options]"
@@ -162,14 +162,35 @@ if [[ "$1" == "package" ]]; then
   export ZINC_PORT=$ZINC_PORT
   echo "Creating distribution: $NAME ($FLAGS)"

+  # Write out the NAME and VERSION to PySpark version info; we rewrite the - into a . and SNAPSHOT
+  # to dev0 to be closer to PEP 440. We use the NAME as a "local version".
+  PYSPARK_VERSION=`echo "$SPARK_VERSION+$NAME" | sed -r "s/-/./" | sed -r "s/SNAPSHOT/dev0/"`
+  echo "__version__='$PYSPARK_VERSION'" > python/pyspark/version.py

Review comment: Do we want to have a version string that's slightly different from the "original", just for Python? I'm thinking about what will happen if people, for example, want to do the same for R. Having 3 slightly different ways of showing the version string seems unnecessary.

Review reply: Yup we do, otherwise we can't publish on PyPI, which is the end goal.
Review comment: nit: Looks like ... [comment truncated in this excerpt]

Review reply: We need it if the user runs setup.py sdist on their own and as part of the packaging tests during any Python change in Jenkins. If the only way to build a pip installable package was make_release then yes - but that isn't the case.

Review reply: If we can, I'd like to consolidate this logic into the ... [truncated]
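The sed rewrite can be sanity-checked in isolation. The version and name values below are examples, not taken from the build; note also that `sed -r` is the GNU spelling of extended regexes (BSD sed uses `-E`):

```shell
# PEP 440 rewrite: "-" becomes "." and SNAPSHOT becomes dev0, while the
# distribution NAME rides along as a PEP 440 "local version" after "+".
SPARK_VERSION="2.1.0-SNAPSHOT"   # example value
NAME="hadoop2.7"                 # example value
PYSPARK_VERSION=$(echo "$SPARK_VERSION+$NAME" | sed -r "s/-/./" | sed -r "s/SNAPSHOT/dev0/")
echo "$PYSPARK_VERSION"          # -> 2.1.0.dev0+hadoop2.7
```

This yields a string PyPI will accept, whereas `2.1.0-SNAPSHOT` is not a valid PEP 440 version.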
   # Get maven home set by MVN
   MVN_HOME=`$MVN -version 2>&1 | grep 'Maven home' | awk '{print $NF}'`
-  ./dev/make-distribution.sh --name $NAME --mvn $MVN_HOME/bin/mvn --tgz $FLAGS \
+  echo "Creating distribution"
+  ./dev/make-distribution.sh --name $NAME --mvn $MVN_HOME/bin/mvn --tgz --pip $FLAGS \
     -DzincPort=$ZINC_PORT 2>&1 > ../binary-release-$NAME.log
   cd ..
-  cp spark-$SPARK_VERSION-bin-$NAME/spark-$SPARK_VERSION-bin-$NAME.tgz .

+  echo "Copying and signing python distribution"
+  PYTHON_DIST_NAME=pyspark-$PYSPARK_VERSION.tar.gz
Review comment: Without specifying the format, it seems the output distribution will be a zip file on Windows? Because in ... [truncated]

Review reply: I think packaging on windows can be considered a future todo.

Review reply: Also realistically I don't think packaging on windows is super supported right now given how we are doing it with shell scripts.

Review comment: Yeah, I actually have this question because ... [truncated]

Review reply: The goal is to eventually support building sdists on Windows - but I think porting the entire release process to windows is out of scope.

Review reply: We shouldn't port the release bash scripts to Windows. It's going to be a huge pain with little obvious benefit. Windows users who want to make release builds can just run this version of the script in a ... [truncated]
+  cp spark-$SPARK_VERSION-bin-$NAME/python/dist/$PYTHON_DIST_NAME .

+  echo $GPG_PASSPHRASE | $GPG --passphrase-fd 0 --armour \
+    --output $PYTHON_DIST_NAME.asc \
+    --detach-sig $PYTHON_DIST_NAME
+  echo $GPG_PASSPHRASE | $GPG --passphrase-fd 0 --print-md \
+    MD5 $PYTHON_DIST_NAME > \
+    $PYTHON_DIST_NAME.md5
+  echo $GPG_PASSPHRASE | $GPG --passphrase-fd 0 --print-md \
+    SHA512 $PYTHON_DIST_NAME > \
+    $PYTHON_DIST_NAME.sha

+  echo "Copying and signing regular binary distribution"
+  cp spark-$SPARK_VERSION-bin-$NAME/spark-$SPARK_VERSION-bin-$NAME.tgz .
+  echo $GPG_PASSPHRASE | $GPG --passphrase-fd 0 --armour \
+    --output spark-$SPARK_VERSION-bin-$NAME.tgz.asc \
+    --detach-sig spark-$SPARK_VERSION-bin-$NAME.tgz
@@ -187,10 +208,10 @@ if [[ "$1" == "package" ]]; then
   # We increment the Zinc port each time to avoid OOM's and other craziness if multiple builds
   # share the same Zinc server.
   FLAGS="-Psparkr -Phive -Phive-thriftserver -Pyarn -Pmesos"
-  make_binary_release "hadoop2.3" "-Phadoop2.3 $FLAGS" "3033" &
-  make_binary_release "hadoop2.4" "-Phadoop2.4 $FLAGS" "3034" &
-  make_binary_release "hadoop2.6" "-Phadoop2.6 $FLAGS" "3035" &
-  make_binary_release "hadoop2.7" "-Phadoop2.7 $FLAGS" "3036" &
+  make_binary_release "hadoop2.3" "-Phadoop-2.3 $FLAGS" "3033" &
+  make_binary_release "hadoop2.4" "-Phadoop-2.4 $FLAGS" "3034" &
+  make_binary_release "hadoop2.6" "-Phadoop-2.6 $FLAGS" "3035" &
+  make_binary_release "hadoop2.7" "-Phadoop-2.7 $FLAGS" "3036" &
Review comment: nit: Looks like these changes are not related and can be separated to another small pr?

Review reply: So the packaging doesn't seem to currently build without this - and I wanted to test the packaging as part of this.

Review comment: Curious why this has not been found before...

Review reply: Yah I don't know if it's just that I have a slightly older maven or some weird plugin. I can take a look some more at this.

Review comment: I think this is a new issue which was introduced in https://github.com/apache/spark/pull/14637/files#diff-01ca42240614718522afde4d4885b40dL189. I'd be in favor of fixing this separately. Do you mind splitting this change into a separate small PR which I'll merge right away?

Review reply: Done - #15860
   make_binary_release "hadoop2.4-without-hive" "-Psparkr -Phadoop-2.4 -Pyarn -Pmesos" "3037" &
   make_binary_release "without-hadoop" "-Psparkr -Phadoop-provided -Pyarn -Pmesos" "3038" &
   wait

@@ -208,6 +229,7 @@ if [[ "$1" == "package" ]]; then
   # Re-upload a second time and leave the files in the timestamped upload directory:
   LFTP mkdir -p $dest_dir
   LFTP mput -O $dest_dir 'spark-*'
+  LFTP mput -O $dest_dir 'pyspark-*'
   exit 0
 fi
@@ -0,0 +1,36 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

from __future__ import print_function

from pyspark.sql import SparkSession
import sys

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("PipSanityCheck")\
        .getOrCreate()
    sc = spark.sparkContext
    rdd = sc.parallelize(range(100), 10)
    value = rdd.reduce(lambda x, y: x + y)
    if (value != 4950):
        print("Value {0} did not match expected value.".format(value), file=sys.stderr)
        sys.exit(-1)
    print("Successfully ran pip sanity check")

    spark.stop()
@@ -0,0 +1,35 @@
#!/usr/bin/env bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

FWDIR="$(cd "`dirname $0`"/..; pwd)"
cd "$FWDIR"

# Run the tests, we wrap the underlying test script for cleanup and because early exit
# doesn't always properly exit a virtualenv.
$FWDIR/dev/run-pip-tests-2
export success=$?

# Clean up the virtual env environment used if we created one.
if [ -f ./virtual_env_tmp_dir ]; then
Review comment: I think that you could combine both this and the ... [code suggestion truncated in this excerpt] ... and putting that at the top of the script before you actually create the temporary directory / virtualenv.
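A minimal sketch of the trap-based cleanup being suggested, using hypothetical names (`delete_virtualenv`, `VIRTUALENV_TMP_DIR`) rather than the script's actual ones. The handler is registered before the temp directory is created, so even an early exit still removes it; the pattern runs in a child shell here only so the EXIT trap fires when the child finishes:

```shell
# Register a cleanup handler via trap, then create the temp directory.
# The child exits early (exit 1) to show the trap still cleans up.
tmpdir=$(bash -c '
  delete_virtualenv() { rm -rf "$VIRTUALENV_TMP_DIR"; }
  trap delete_virtualenv EXIT
  VIRTUALENV_TMP_DIR="$(mktemp -d)"
  echo "$VIRTUALENV_TMP_DIR"
  exit 1   # simulate an early failure
') || true

if [ ! -d "$tmpdir" ]; then
  echo "temporary directory was cleaned up"
fi
```

With this in place the marker file and the explicit post-run `rm` steps become unnecessary, since cleanup no longer depends on reaching the end of the script.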
  rm -rf `cat ./virtual_env_tmp_dir`
  rm ./virtual_env_tmp_dir
fi

exit $success
Review comment: I'm fine with making that change - just point out that we were previously using backtick execution.