[SPARK-1267][SPARK-18129] Allow PySpark to be pip installed #15659

Closed
wants to merge 109 commits into from
Changes from 55 commits
7763f3c
Adds setup.py
Apr 14, 2016
30debc7
Fix spacing.
Apr 14, 2016
5155531
updUpdate py4j dependency. Add mllib to extas_require, fix some inden…
Oct 12, 2016
2f0bf9b
Adds MANIFEST.in file.
Oct 12, 2016
4c00b98
Merge branch 'master' into SPARK-1267-pip-install-pyspark
holdenk Oct 12, 2016
7ff8d0f
Start working towards post-2.0 pip installable PypSpark (so including…
holdenk Oct 12, 2016
610b975
Merge branch 'master' into SPARK-1267-pip-install-pyspark
holdenk Oct 16, 2016
cb2e06d
So MANIFEST and setup can't refer to things above the root of the pro…
holdenk Oct 16, 2016
01f791d
Merge branch 'master' into SPARK-1267-pip-install-pyspark
holdenk Oct 18, 2016
e2e4d1c
Keep the symlink
holdenk Oct 18, 2016
fb15d7e
Some progress we need to use SDIST but is ok
holdenk Oct 18, 2016
aab7ee4
Reenable cleanup
holdenk Oct 18, 2016
5a57620
Try and provide a clear error message when pip installed directly, fi…
holdenk Oct 19, 2016
646aa23
Add two scripts
holdenk Oct 19, 2016
36c9d45
package_data doesn't work so well with nested directories so instead …
holdenk Oct 19, 2016
a78754b
Use copyfile also check for jars dir too
holdenk Oct 20, 2016
955e92b
Check if pip installed when finding the shell file
holdenk Oct 20, 2016
2d88a40
Check if jars dir exists rather than release file
holdenk Oct 20, 2016
9e5c532
Start working a bit on the docs
holdenk Oct 23, 2016
be7eadd
Merge branch 'master' into SPARK-1267-pip-install-pyspark
holdenk Oct 23, 2016
07d3849
Try and include pyspark zip file for yarn use
holdenk Oct 23, 2016
11b5fa8
Copy pyspark zip for use in yarn cluster mode
holdenk Oct 23, 2016
8791f82
Start adding scripts to test pip installability
holdenk Oct 24, 2016
92837a3
Works on yarn, works with spark submit, still need to fix import base…
holdenk Oct 24, 2016
6947a85
Start updating find-spark-home to be available in many cases.
holdenk Oct 24, 2016
944160c
Use Switch to find_spark_home.py
holdenk Oct 24, 2016
5bf0746
Move to under pyspark
holdenk Oct 24, 2016
435f842
Update to py4j 0.10.4 in the deps, also switch how we are copying fin…
holdenk Oct 24, 2016
27ca27e
Update java gateway to use _find_spark_home function, add quick sanit…
holdenk Oct 24, 2016
df126cf
Lint fixes
holdenk Oct 24, 2016
70a78a0
Merge branch 'master' into SPARK-1267-pip-install-pyspark
holdenk Oct 24, 2016
555d443
More progress on running the pip installability tests
holdenk Oct 24, 2016
051abe5
Try and unify path used for shell script file, add a README.md file f…
holdenk Oct 25, 2016
b345bdb
Add README file
holdenk Oct 25, 2016
28da44b
Switch version to a PEP440 version otherwise it can't go on PyPiTest,…
holdenk Oct 25, 2016
0f16c08
More notes
holdenk Oct 25, 2016
574c1f0
Add pip-sanity-check.py to the linter list and add a note that we sho…
holdenk Oct 25, 2016
6299744
Fix handling of long_description, add check for existing artifacts in…
holdenk Oct 25, 2016
17104c1
Fix check for number of sdists
holdenk Oct 25, 2016
0447ea2
Typo fixes, make sure SPARK_HOME isn't being set based on PWD during …
holdenk Oct 25, 2016
c335c80
More typo fixes
holdenk Oct 25, 2016
146567b
We are python 2 and 3 compat :)
holdenk Oct 25, 2016
0e2223d
Use more standard version.py file, check sys version is greater than …
holdenk Oct 25, 2016
849ded0
First pass at updating the release-build script
holdenk Oct 25, 2016
cf5ab7e
consider handling being inside a release
holdenk Oct 25, 2016
4b69871
Merge branch 'master' into SPARK-1267-pip-install-pyspark
holdenk Oct 26, 2016
3788bfb
Fix up make-distribution to build the python artifacts, update releas…
holdenk Oct 26, 2016
308a168
Fix python lint errors and add linting to setup.py
holdenk Oct 26, 2016
74b79c4
Add python packaging tests to run-tests script
holdenk Oct 26, 2016
3056553
Add license header to setup.cfg
holdenk Oct 26, 2016
125ae2a
Fix typo PyPi to PyPI
holdenk Oct 26, 2016
d2da8b0
Fix typo PyPi to PyPI (2)
holdenk Oct 26, 2016
595409f
Use copytree and rmtree on windows - note: still not explicitly teste…
holdenk Oct 27, 2016
cf421b0
Fix style issues
holdenk Oct 27, 2016
31ac8e2
Add license header to version.py and manifest.in
holdenk Oct 27, 2016
0e9cb8d
newer version of numpy are fine
holdenk Oct 27, 2016
264b253
Add BLOCK_PYSPARK_PIP_TESTS to jenkins test error codes
holdenk Oct 27, 2016
802f682
Add README.md as description file to metadata in setup.cfg
holdenk Oct 27, 2016
fba37a0
We store version in a different file now
holdenk Oct 27, 2016
8ba499f
Early PR feedback, switch to os.path.join rather than strings, add a …
holdenk Oct 27, 2016
1c177f3
Add BLOCK_PYSPARK_PIP_TESTS to error code set
holdenk Oct 27, 2016
6ace070
Fix path used to run the pip tests in jenkins
holdenk Oct 28, 2016
ab8ca53
Fix typo
holdenk Oct 28, 2016
77f8eca
Show how to build the sdist in building-spark.md
holdenk Oct 30, 2016
f590898
Have clearer messages (as suggested by @viirya)
holdenk Oct 30, 2016
f956a5d
Try and improve the wording a little bit
holdenk Oct 30, 2016
489d4e3
Fix typo
holdenk Oct 31, 2016
9e4fdb5
Drop extra .gz
holdenk Oct 31, 2016
e668af6
Drop '
holdenk Oct 31, 2016
3bf961e
Merge branch 'master' into SPARK-1267-pip-install-pyspark
holdenk Oct 31, 2016
c9d48d3
Make packaging PySpark as pip optional part of make-distirbution the …
holdenk Nov 1, 2016
e9f1e8e
Fix indentation and clarify error message (since we still technically…
holdenk Nov 1, 2016
1cdcf61
Merge branch 'master' into SPARK-1267-pip-install-pyspark
holdenk Nov 1, 2016
7af912a
Move Python version check up earlier.
holdenk Nov 2, 2016
c77d9fd
Fix python3 setup
holdenk Nov 2, 2016
7b1d8b7
test both python/python3 if they are installed on the system for pip …
holdenk Nov 2, 2016
298bda6
Merge branch 'master' into SPARK-1267-pip-install-pyspark
holdenk Nov 3, 2016
9770260
Actually run the python3 packaging tests and fix path finding
holdenk Nov 3, 2016
99940ee
Merge branch 'master' into SPARK-1267-pip-install-pyspark
holdenk Nov 4, 2016
f6806b2
Break up sentence in setup.py error message, drop 3.0-3.3 tags from s…
holdenk Nov 5, 2016
b0cd655
Just copy shell in advance because the setup time copy has issues wit…
holdenk Nov 6, 2016
6bb422e
Change shell symlink
holdenk Nov 6, 2016
b5b4713
Move the copy up earlier for python3 venv install issue
holdenk Nov 6, 2016
2b808dc
Fix normalizaiton of paths
holdenk Nov 6, 2016
b958f7e
Handle edit mode based installations
holdenk Nov 6, 2016
577554b
Just skip caching rather than cleaning up the wheels
holdenk Nov 6, 2016
9cf2ec9
Merge branch 'master' into SPARK-1267-pip-install-pyspark
holdenk Nov 6, 2016
154a287
Remove % formatting and replace with format and os.path.join
holdenk Nov 6, 2016
b478bdf
s/True/pass/ in the places where it makes sense, fix a formatting issue
holdenk Nov 7, 2016
fb62a8a
Test both edit mode and regular installs
holdenk Nov 7, 2016
6540964
Add exit(-1)
holdenk Nov 7, 2016
d2389ed
CR feedback - switch symlink support checking into a function and use…
holdenk Nov 8, 2016
48cd1ad
Add a docstring comment just cause
holdenk Nov 8, 2016
23109a4
Fix support_symlinks / docstring
holdenk Nov 8, 2016
49fc6db
use update to usr bin env python
holdenk Nov 8, 2016
7001f90
s/deps/TEMP_PATH/ incase we change it later
holdenk Nov 9, 2016
8d74672
Merge branch 'master' into SPARK-1267-pip-install-pyspark
holdenk Nov 10, 2016
210c9d4
drop usr/bin/env python since we don't want MANIFEST to run as a script
holdenk Nov 11, 2016
9efca67
Use python2 if available and fallback to python
holdenk Nov 12, 2016
fd3e89c
Fix more shell check issues
holdenk Nov 12, 2016
587c0eb
Fix shellcheck issues - note most of these were prexisting but since …
holdenk Nov 12, 2016
2904998
Move pip tests into a self cleaning up script instead of 2
holdenk Nov 12, 2016
3345eb9
Clarify what is required to build the PySpark pip installable artifacts.
holdenk Nov 12, 2016
f86574a
Make messaging more consistent
holdenk Nov 12, 2016
05fc25f
Switch to "s cause its easier to do that with sed rewrites
holdenk Nov 14, 2016
dd243a2
Update release tagging script
holdenk Nov 14, 2016
df5a3f9
Drop the notice since the script does it now
holdenk Nov 14, 2016
d753d80
Fix the next version output and update the comment to be more precise
holdenk Nov 14, 2016
e139855
Add a global-exclude and add a format to the setup.py for multiple as…
holdenk Nov 15, 2016
1 change: 1 addition & 0 deletions .gitignore
@@ -57,6 +57,7 @@ project/plugins/project/build.properties
project/plugins/src_managed/
project/plugins/target/
python/lib/pyspark.zip
python/deps
reports/
scalastyle-on-compile.generated.xml
scalastyle-output.xml
2 changes: 1 addition & 1 deletion bin/beeline
@@ -25,7 +25,7 @@ set -o posix

# Figure out if SPARK_HOME is set
if [ -z "${SPARK_HOME}" ]; then
export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
source `dirname $0`/find-spark-home
Contributor

In bin/beeline line 28:
  source `dirname $0`/find-spark-home
  ^-- SC1090: Can't follow non-constant source. Use a directive to specify location.
         ^-- SC2046: Quote this to prevent word splitting.
         ^-- SC2006: Use $(..) instead of legacy `..`.
                  ^-- SC2086: Double quote to prevent globbing and word splitting.

Contributor Author

I'm fine with making that change; just pointing out that we were previously using backtick execution.
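
For illustration, the quoted form that shellcheck asks for can be sketched as below. This is a minimal demo under stated assumptions, not the change that landed in the PR: the temporary directory, the `FOUND_HELPER` variable, and `script_path` (standing in for `$0`) are all made up for the example. It uses `.`, the POSIX spelling of `source`.

```shell
# Hypothetical layout: a script directory whose name contains a space,
# which is the case the unquoted backtick form breaks on.
demo_dir="$(mktemp -d)/dir with space"
mkdir -p "$demo_dir"
echo 'FOUND_HELPER=yes' > "$demo_dir/find-spark-home"
script_path="$demo_dir/beeline"   # stand-in for "$0"

# $(...) replaces the legacy backticks, and both expansions are quoted
# so the path reaches dirname and "." as a single word.
# shellcheck source=/dev/null
. "$(dirname "$script_path")/find-spark-home"

echo "FOUND_HELPER=$FOUND_HELPER"
```

SC1090 cannot be fixed by quoting, since the sourced path is only known at run time; the `shellcheck source=` directive is one way to acknowledge it.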

fi

CLASS="org.apache.hive.beeline.BeeLine"
41 changes: 41 additions & 0 deletions bin/find-spark-home
@@ -0,0 +1,41 @@
#!/usr/bin/env bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Attempts to find a proper value for SPARK_HOME. Should be included using "source" directive.

FIND_SPARK_HOME_PYTHON_SCRIPT="$(cd "`dirname "$0"`"; pwd)/find_spark_home.py"

# Short circuit if the user already has this set.
if [ ! -z "${SPARK_HOME}" ]; then
exit 0
elif [ ! -f $FIND_SPARK_HOME_PYTHON_SCRIPT ]; then
Contributor

shellcheck complains that FIND_SPARK_HOME_PYTHON_SCRIPT should be quoted here to avoid word splitting issues:

In find-spark-home line 27:
elif [ ! -f $FIND_SPARK_HOME_PYTHON_SCRIPT ]; then
            ^-- SC2086: Double quote to prevent globbing and word splitting.

Contributor

Actually, I'll just paste shellcheck's full output:

In find-spark-home line 22:
FIND_SPARK_HOME_PYTHON_SCRIPT="$(cd "`dirname "$0"`"; pwd)/find_spark_home.py"
                                 ^-- SC2164: Use cd ... || exit in case cd fails.
                                     ^-- SC2006: Use $(..) instead of legacy `..`.


In find-spark-home line 27:
elif [ ! -f $FIND_SPARK_HOME_PYTHON_SCRIPT ]; then
            ^-- SC2086: Double quote to prevent globbing and word splitting.


In find-spark-home line 33:
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
         ^-- SC2155: Declare and assign separately to avoid masking return values.
                       ^-- SC2164: Use cd ... || exit in case cd fails.
                           ^-- SC2006: Use $(..) instead of legacy `..`.


In find-spark-home line 40:
  export SPARK_HOME=`$PYSPARK_DRIVER_PYTHON $FIND_SPARK_HOME_PYTHON_SCRIPT`
         ^-- SC2155: Declare and assign separately to avoid masking return values.
                    ^-- SC2006: Use $(..) instead of legacy `..`.
                                            ^-- SC2086: Double quote to prevent globbing and word splitting.

Some of these aren't super-important, but the word-splitting ones are.
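
A sketch of the same resolution logic with the shellcheck findings above applied might look like this. It is an assumption-laden rewrite, not the author's final script: the function name `resolve_spark_home`, the `finder` and `driver_python` variables, and the throwaway directory layout are hypothetical, and the function takes the script path as an argument instead of reading `$0` so it can be exercised in isolation. Quoting `"$driver_python"` also assumes the interpreter setting is a bare command with no extra arguments.

```shell
# Hypothetical helper; $1 is the path of the calling script (the role of "$0").
resolve_spark_home() {
  # SC2164/SC2006: use $(...) and stop if cd fails.
  script_dir="$(cd "$(dirname "$1")" && pwd)" || return 1
  finder="$script_dir/find_spark_home.py"

  if [ -n "${SPARK_HOME:-}" ]; then
    return 0   # respect a value the user already set
  elif [ ! -f "$finder" ]; then   # SC2086: quoted
    # Not pip installed: SPARK_HOME is the parent of the script directory.
    # SC2155: assign first, export afterwards.
    SPARK_HOME="$(cd "$script_dir/.." && pwd)" || return 1
  else
    # Pip installed: ask the Python helper for a reasonable SPARK_HOME.
    driver_python="${PYSPARK_DRIVER_PYTHON:-${PYSPARK_PYTHON:-python}}"
    SPARK_HOME="$("$driver_python" "$finder")" || return 1
  fi
  export SPARK_HOME
}

# Demo against a throwaway layout that has no find_spark_home.py.
unset SPARK_HOME
demo_root="$(mktemp -d)"
mkdir -p "$demo_root/bin"
resolve_spark_home "$demo_root/bin/spark-submit"
echo "SPARK_HOME=$SPARK_HOME"
```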

# If we are not in the same directory as find_spark_home.py we are not pip installed so we don't
# need to search the different Python directories for a Spark installation.
# Note also that, if the user has pip installed PySpark but is directly calling pyspark-shell or
# spark-submit in another directory we want to use that version of PySpark rather than the
# pip installed version of PySpark.
export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
else
# We are pip installed, use the Python script to resolve a reasonable SPARK_HOME
# Default to standard python interpreter unless told otherwise
if [[ -z "$PYSPARK_DRIVER_PYTHON" ]]; then
PYSPARK_DRIVER_PYTHON="${PYSPARK_PYTHON:-"python"}"
fi
export SPARK_HOME=`$PYSPARK_DRIVER_PYTHON $FIND_SPARK_HOME_PYTHON_SCRIPT`
fi
2 changes: 1 addition & 1 deletion bin/load-spark-env.sh
@@ -23,7 +23,7 @@

# Figure out where Spark is installed
if [ -z "${SPARK_HOME}" ]; then
export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
source `dirname $0`/find-spark-home
Contributor

Let's also apply the same fix for Shellcheck complaints here and at all other occurrences of this line.

fi

if [ -z "$SPARK_ENV_LOADED" ]; then
2 changes: 1 addition & 1 deletion bin/pyspark
@@ -18,7 +18,7 @@
#

if [ -z "${SPARK_HOME}" ]; then
export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
source `dirname $0`/find-spark-home
fi

source "${SPARK_HOME}"/bin/load-spark-env.sh
2 changes: 1 addition & 1 deletion bin/run-example
@@ -18,7 +18,7 @@
#

if [ -z "${SPARK_HOME}" ]; then
export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
source `dirname $0`/find-spark-home
fi

export _SPARK_CMD_USAGE="Usage: ./bin/run-example [options] example-class [example args]"
4 changes: 2 additions & 2 deletions bin/spark-class
@@ -18,7 +18,7 @@
#

if [ -z "${SPARK_HOME}" ]; then
export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
source `dirname $0`/find-spark-home
fi

. "${SPARK_HOME}"/bin/load-spark-env.sh
@@ -36,7 +36,7 @@ else
fi

# Find Spark jars.
if [ -f "${SPARK_HOME}/RELEASE" ]; then
if [ -d "${SPARK_HOME}/jars" ]; then
Contributor

Why did this get changed from RELEASE to jars?

Contributor Author

Because both pip installed PySpark and RELEASE Spark have jars in the jars directory, it seems more reasonable to just check if the jars directory exists directly rather than checking for a file which indicates that the JARs directory is present.

Contributor

Makes sense. This seems reasonable to me.
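
The reasoning in this thread can be sketched as a small check on the directory itself. The helper name `find_jars_dir`, the two temporary directories, and the Scala version `2.11` are assumptions for the demo; only the two candidate paths come from the diff.

```shell
# Hypothetical helper mirroring the check in bin/spark-class.
# $1 = SPARK_HOME, $2 = Scala version
find_jars_dir() {
  if [ -d "$1/jars" ]; then
    # Release tarballs and pip installs both ship a jars directory.
    echo "$1/jars"
  else
    # Otherwise assume a source checkout with a built assembly.
    echo "$1/assembly/target/scala-$2/jars"
  fi
}

release_like="$(mktemp -d)"
mkdir -p "$release_like/jars"
source_tree="$(mktemp -d)"

with_jars="$(find_jars_dir "$release_like" 2.11)"
without_jars="$(find_jars_dir "$source_tree" 2.11)"
echo "$with_jars"
echo "$without_jars"
```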

SPARK_JARS_DIR="${SPARK_HOME}/jars"
else
SPARK_JARS_DIR="${SPARK_HOME}/assembly/target/scala-$SPARK_SCALA_VERSION/jars"
2 changes: 1 addition & 1 deletion bin/spark-shell
@@ -29,7 +29,7 @@ esac
set -o posix

if [ -z "${SPARK_HOME}" ]; then
export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
source `dirname $0`/find-spark-home
Member

Many files are changed with this. Do those scripts need to find SPARK_HOME with pip installed Spark? I assume they should be shipped with Spark release?

Contributor Author

So for pip installed PySpark we need to do a bit more work to find SPARK_HOME. Using a common script lets us determine whether we need to do more hunting, rather than duplicating the logic in every script.

fi

export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options]"
2 changes: 1 addition & 1 deletion bin/spark-sql
@@ -18,7 +18,7 @@
#

if [ -z "${SPARK_HOME}" ]; then
export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
source `dirname $0`/find-spark-home
fi

export _SPARK_CMD_USAGE="Usage: ./bin/spark-sql [options] [cli option]"
2 changes: 1 addition & 1 deletion bin/spark-submit
@@ -18,7 +18,7 @@
#

if [ -z "${SPARK_HOME}" ]; then
export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
source `dirname $0`/find-spark-home
fi

# disable randomized hash for string in Python 3.3+
2 changes: 1 addition & 1 deletion bin/sparkR
@@ -18,7 +18,7 @@
#

if [ -z "${SPARK_HOME}" ]; then
export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
source `dirname $0`/find-spark-home
fi

source "${SPARK_HOME}"/bin/load-spark-env.sh
32 changes: 27 additions & 5 deletions dev/create-release/release-build.sh
@@ -162,14 +162,35 @@ if [[ "$1" == "package" ]]; then
export ZINC_PORT=$ZINC_PORT
echo "Creating distribution: $NAME ($FLAGS)"

# Write out the NAME and VERSION to the PySpark version info. We rewrite the - into a . and SNAPSHOT
Contributor

Do we want to have a version string that's slightly different from the "original", just for Python?

I'm thinking about what will happen if people, for example, want to do the same for R. Having 3 slightly different ways of showing the version string seems unnecessary.

Contributor Author

Yup we do, otherwise we can't publish on PyPI which is the end goal.

# to dev0 to be closer to PEP440. We use the NAME as a "local version".
PYSPARK_VERSION=`echo "$SPARK_VERSION+$NAME" | sed -r "s/-/./" | sed -r "s/SNAPSHOT/dev0/"`
echo "__version__='$PYSPARK_VERSION'" > python/pyspark/version.py
Member

nit: Looks like python/pyspark/version.py is only used when building a release and will be overwritten. Should we list it as an ignored file?

Contributor Author

We need it if the user runs setup.py sdist on their own, and as part of the packaging tests during any Python change in Jenkins. If the only way to build a pip installable package was make_release then yes, but that isn't the case.

Contributor

If we can, I'd like to consolidate this logic into the release-tag shell script mentioned upthread.
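
The version rewrite discussed in this thread can be tried in isolation with a sketch like the one below. The function name `to_pyspark_version` and the sample inputs are assumptions; the two `sed` substitutions are the ones from the diff, written without `-r` since neither pattern needs extended syntax.

```shell
# Hypothetical wrapper around the rewrite in release-build.sh.
# $1 = Spark version, $2 = distribution name (used as the PEP 440 local version)
to_pyspark_version() {
  # The first sed replaces only the first "-"; the second maps SNAPSHOT to dev0.
  echo "$1+$2" | sed 's/-/./' | sed 's/SNAPSHOT/dev0/'
}

snapshot_version="$(to_pyspark_version "2.1.0-SNAPSHOT" "hadoop2.7")"
release_version="$(to_pyspark_version "2.1.0" "hadoop2.7")"
echo "$snapshot_version"
echo "$release_version"
```

Because the first substitution has no `g` flag, a distribution name that itself contains hyphens keeps all but one of them, which may or may not be what a consolidated script wants.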


# Get maven home set by MVN
MVN_HOME=`$MVN -version 2>&1 | grep 'Maven home' | awk '{print $NF}'`

echo "Creating distribution"
./dev/make-distribution.sh --name $NAME --mvn $MVN_HOME/bin/mvn --tgz $FLAGS \
-DzincPort=$ZINC_PORT 2>&1 > ../binary-release-$NAME.log
cd ..
cp spark-$SPARK_VERSION-bin-$NAME/spark-$SPARK_VERSION-bin-$NAME.tgz .

echo "Copying and signing python distribution"
PYTHON_DIST_NAME=pyspark-$PYSPARK_VERSION.tar.gz
Member

Without specifying a format, it seems the output distribution will be a zip file on Windows? Since setup.py has support for Windows, I am wondering if this is an issue.

Contributor Author

I think packaging on windows can be considered a future todo

Contributor Author

Also realistically I don't think packaging on windows is super supported right now given how we are doing it with shell scripts.

Member

Yeah, actually I asked this because setup.py has some code for Windows.

Contributor Author

The goal is to eventually support building sdists on Windows - but I think porting the entire release process to windows is out of scope.

Contributor

We shouldn't port the release bash scripts to Windows. It's going to be a huge pain with little obvious benefit. Windows users who want to make release builds can just run this version of the script in a *nix VM.

cp spark-$SPARK_VERSION-bin-$NAME/python/dist/$PYTHON_DIST_NAME .

echo $GPG_PASSPHRASE | $GPG --passphrase-fd 0 --armour \
--output $PYTHON_DIST_NAME.asc \
--detach-sig $PYTHON_DIST_NAME
echo $GPG_PASSPHRASE | $GPG --passphrase-fd 0 --print-md \
MD5 $PYTHON_DIST_NAME.gz > \
Member

Do we have a wrongly appended .gz here?

$PYTHON_DIST_NAME.md5
echo $GPG_PASSPHRASE | $GPG --passphrase-fd 0 --print-md \
SHA512 $PYTHON_DIST_NAME > \
$PYTHON_DIST_NAME.sha

echo "Copying and signing regular binary distribution"
cp spark-$SPARK_VERSION-bin-$NAME/spark-$SPARK_VERSION-bin-$NAME.tgz .
echo $GPG_PASSPHRASE | $GPG --passphrase-fd 0 --armour \
--output spark-$SPARK_VERSION-bin-$NAME.tgz.asc \
--detach-sig spark-$SPARK_VERSION-bin-$NAME.tgz
@@ -187,10 +208,10 @@ if [[ "$1" == "package" ]]; then
# We increment the Zinc port each time to avoid OOM's and other craziness if multiple builds
# share the same Zinc server.
FLAGS="-Psparkr -Phive -Phive-thriftserver -Pyarn -Pmesos"
make_binary_release "hadoop2.3" "-Phadoop2.3 $FLAGS" "3033" &
make_binary_release "hadoop2.4" "-Phadoop2.4 $FLAGS" "3034" &
make_binary_release "hadoop2.6" "-Phadoop2.6 $FLAGS" "3035" &
make_binary_release "hadoop2.7" "-Phadoop2.7 $FLAGS" "3036" &
make_binary_release "hadoop2.3" "-Phadoop-2.3 $FLAGS" "3033" &
make_binary_release "hadoop2.4" "-Phadoop-2.4 $FLAGS" "3034" &
make_binary_release "hadoop2.6" "-Phadoop-2.6 $FLAGS" "3035" &
make_binary_release "hadoop2.7" "-Phadoop-2.7 $FLAGS" "3036" &
Member

viirya Oct 28, 2016

nit: Looks like these changes are not related and can be separated to another small pr?

Contributor Author

So the packaging doesn't seem to currently build without this - and I wanted to test the packaging as part of this.

Member

Curious why this has not been found before...

Contributor Author

Yah I don't know if it's just I have a slightly older maven or some weird plugin. I can take a look some more at this.

Contributor

I think this is a new issue which was introduced in https://github.com/apache/spark/pull/14637/files#diff-01ca42240614718522afde4d4885b40dL189. I'd be in favor of fixing this separately. Do you mind splitting this change into a separate small PR which I'll merge right away?

Contributor Author

Done - #15860

make_binary_release "hadoop2.4-without-hive" "-Psparkr -Phadoop-2.4 -Pyarn -Pmesos" "3037" &
make_binary_release "without-hadoop" "-Psparkr -Phadoop-provided -Pyarn -Pmesos" "3038" &
wait
@@ -208,6 +229,7 @@ if [[ "$1" == "package" ]]; then
# Re-upload a second time and leave the files in the timestamped upload directory:
LFTP mkdir -p $dest_dir
LFTP mput -O $dest_dir 'spark-*'
LFTP mput -O $dest_dir 'pyspark-*'
exit 0
fi

4 changes: 3 additions & 1 deletion dev/lint-python
@@ -20,7 +20,9 @@
SCRIPT_DIR="$( cd "$( dirname "$0" )" && pwd )"
SPARK_ROOT_DIR="$(dirname "$SCRIPT_DIR")"
PATHS_TO_CHECK="./python/pyspark/ ./examples/src/main/python/ ./dev/sparktestsupport"
PATHS_TO_CHECK="$PATHS_TO_CHECK ./dev/run-tests.py ./python/run-tests.py ./dev/run-tests-jenkins.py"
# TODO: fix pep8 errors with the rest of the Python scripts under dev
PATHS_TO_CHECK="$PATHS_TO_CHECK ./dev/run-tests.py ./python/*.py ./dev/run-tests-jenkins.py"
PATHS_TO_CHECK="$PATHS_TO_CHECK ./dev/pip-sanity-check.py"
PEP8_REPORT_PATH="$SPARK_ROOT_DIR/dev/pep8-report.txt"
PYLINT_REPORT_PATH="$SPARK_ROOT_DIR/dev/pylint-report.txt"
PYLINT_INSTALL_INFO="$SPARK_ROOT_DIR/dev/pylint-info.txt"
6 changes: 6 additions & 0 deletions dev/make-distribution.sh
@@ -201,6 +201,12 @@ fi
# Copy data files
cp -r "$SPARK_HOME/data" "$DISTDIR"

# Make pip package
echo "Building python distribution package"
cd $SPARK_HOME/python
python setup.py sdist
Member

Add a parameter so we only build this if the parameter is given? I think we may not need to build this by default.

Contributor Author

Almost everything else is built by default (except for maven profiles and the big tgz file). I don't think adding a parameter would really improve it (although if on some machines people have difficulty or it's slow we can revisit, but so far building the Python distribution is very fast).

Member

Can we be sure that the machine building the distribution always has Python installed? If users don't plan to use PySpark, the original procedure for making a distribution doesn't ask them to do anything with PySpark. But with this change, it seems to assume all users want a Python distribution.

Member

Speed might not be a problem as you said. My concern is this step might need the users to meet some requirements they don't expect and don't want to use, e.g., some python modules.

Contributor Author

I think if a user is publishing a release it's reasonable to assume they have Python and at least pandas and numpy. But if this isn't the case, working around it with an environment variable, the same as TGZ, sounds like a good suggestion.
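
One way the environment-variable workaround floated here could look is sketched below. The switch name `MAKE_PIP`, the function `build_python_sdist`, and the messages are all hypothetical and not part of this diff; only `python setup.py sdist` run from the `python` directory comes from the patch.

```shell
# Hypothetical opt-in switch for the Python packaging step.
# $1 = SPARK_HOME
build_python_sdist() {
  if [ "${MAKE_PIP:-false}" != "true" ]; then
    echo "Skipping Python sdist (MAKE_PIP is not true)"
    return 0
  fi
  # Fail with a clear message instead of a cryptic error when Python is missing.
  if ! command -v python >/dev/null 2>&1; then
    echo "python not found; cannot build the Python sdist" >&2
    return 1
  fi
  echo "Building python distribution package"
  (cd "$1/python" && python setup.py sdist)
}

# Demo of the default (skip) path; the directory argument is never touched.
MAKE_PIP=false
skip_message="$(build_python_sdist "/path/not/used")"
echo "$skip_message"
```

Defaulting to skip keeps the existing distribution flow unchanged for users who never touch PySpark, at the cost of release managers having to opt in explicitly.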

cd ..

# Copy other things
mkdir "$DISTDIR"/conf
cp "$SPARK_HOME"/conf/*.template "$DISTDIR"/conf
36 changes: 36 additions & 0 deletions dev/pip-sanity-check.py
@@ -0,0 +1,36 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

from __future__ import print_function

from pyspark.sql import SparkSession
import sys

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("PipSanityCheck")\
        .getOrCreate()
    sc = spark.sparkContext
    rdd = sc.parallelize(range(100), 10)
    value = rdd.reduce(lambda x, y: x + y)
    if (value != 4950):
        print("Value %d did not match expected value." % value, file=sys.stderr)
        sys.exit(-1)
    print("Successfuly ran pip sanity check")
Member

nit: Successfully.


    spark.stop()
35 changes: 35 additions & 0 deletions dev/run-pip-tests
@@ -0,0 +1,35 @@
#!/usr/bin/env bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#


FWDIR="$(cd "`dirname $0`"/..; pwd)"
Contributor

In dev/run-pip-tests line 21:
FWDIR="$(cd "`dirname $0`"/..; pwd)"
         ^-- SC2164: Use cd ... || exit in case cd fails.
             ^-- SC2006: Use $(..) instead of legacy `..`.
                      ^-- SC2086: Double quote to prevent globbing and word splitting.


In dev/run-pip-tests line 22:
cd "$FWDIR"
^-- SC2164: Use cd ... || exit in case cd fails.


In dev/run-pip-tests line 26:
$FWDIR/dev/run-pip-tests-2
^-- SC2086: Double quote to prevent globbing and word splitting.


In dev/run-pip-tests line 31:
  rm -rf `cat ./virtual_env_temp_dir`
         ^-- SC2046: Quote this to prevent word splitting.
         ^-- SC2006: Use $(..) instead of legacy `..`.
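
For illustration, here is a minimal sketch of how the flagged patterns could be rewritten to address these warnings. The helper names `resolve_root` and `remove_listed_dir` are hypothetical and not part of this PR; they only isolate the two idioms so they can be read on their own.

```shell
# Hypothetical helper: print the parent of the directory that contains a script.
# Uses $(...) instead of backticks (SC2006), quotes every expansion (SC2086),
# and only prints a path when cd succeeded (SC2164).
resolve_root() {
  (cd "$(dirname "$1")/.." && pwd)
}

# Hypothetical helper: remove the directory whose path is stored in a file.
# The command substitution is quoted so a path with spaces stays one word (SC2046).
remove_listed_dir() {
  rm -rf "$(cat "$1")"
}

# Inside the script itself the same idioms might read:
#   FWDIR="$(resolve_root "$0")" || exit 1
#   cd "$FWDIR" || exit 1
#   "$FWDIR/dev/run-pip-tests-2"
```

Running the `cd` in a subshell keeps the caller's working directory unchanged, and the `&&` means a failed `cd` produces no output and a non-zero status rather than a misleading path.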

cd "$FWDIR"

# Run the tests. We wrap the underlying test script so we can clean up afterwards and
# because an early exit doesn't always properly exit a virtualenv.
$FWDIR/dev/run-pip-tests-2
export success=$?

# Clean up the virtualenv environment used if we created one.
if [ -f ./virtual_env_temp_dir ]; then
Contributor

I think that you could combine both this and the run-pip-tests-2 into a single script if you used Bash exit traps, e.g.

function delete_virtualenv() {
  echo "Deleting temp directory $tmpdir"
  rm -rf "$tmpdir"
}
trap delete_virtualenv EXIT

and putting that at the top of the script before you actually create the temporary directory / virtualenv.
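
A slightly fuller sketch of that suggestion, under stated assumptions: `run_in_temp_env` is a hypothetical wrapper (not something in this PR) that registers the trap before creating the directory and runs the given command inside a subshell, so the cleanup fires whether the command succeeds or fails.

```shell
# Hypothetical wrapper illustrating the exit-trap pattern described above.
# Usage: run_in_temp_env some_command [args...]
# The command receives the temp directory path as its last argument.
run_in_temp_env() {
  (
    tmpdir=""
    delete_virtualenv() {
      # Guard against the trap firing before mktemp has run.
      if [ -n "$tmpdir" ]; then
        echo "Deleting temp directory $tmpdir"
        rm -rf "$tmpdir"
      fi
    }
    # Register the trap first, then create the directory.
    trap delete_virtualenv EXIT
    tmpdir="$(mktemp -d)"
    "$@" "$tmpdir"
    status=$?
    # Leaving the subshell triggers the EXIT trap; the status is preserved.
    exit "$status"
  )
}
```

In a single combined script the subshell would not be needed: the `trap ... EXIT` line at the top of the script covers every exit path, including an early exit caused by `set -e`.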

  rm -rf `cat ./virtual_env_temp_dir`
  rm ./virtual_env_temp_dir
fi

exit $success
77 changes: 77 additions & 0 deletions dev/run-pip-tests-2
@@ -0,0 +1,77 @@
#!/usr/bin/env bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Stop on error
set -e
# Set nullglob for when we are checking existence based on globs
shopt -s nullglob

FWDIR="$(cd "`dirname $0`"/..; pwd)"
cd "$FWDIR"
# Some systems don't have pip or virtualenv - in those cases our tests won't work.
if ! hash virtualenv 2>/dev/null; then
  echo "Missing virtualenv, skipping pip installability tests."
  exit 0
fi
if ! hash pip 2>/dev/null; then
  echo "Missing pip, skipping pip installability tests."
  exit 0
fi

if [ -d ~/.cache/pip/wheels/ ]; then
  echo "Cleaning up pip wheel cache so we install the fresh package"
  rm -rf ~/.cache/pip/wheels/
fi

# Create a temp directory for us to work in and save its name to a file for cleanup
echo "Constructing virtual env for testing"
mktemp -d > ./virtual_env_temp_dir
VIRTUALENV_BASE="$(cat ./virtual_env_temp_dir)"
echo "Using $VIRTUALENV_BASE for virtualenv"
virtualenv "$VIRTUALENV_BASE"
source "$VIRTUALENV_BASE/bin/activate"
# Upgrade pip
pip install --upgrade pip

echo "Creating pip installable source dist"
cd python
python setup.py sdist


echo "Installing dist into virtual env"
cd dist
# Verify that the dist directory only contains one thing to install
sdists=(*.tar.gz)
if [ ${#sdists[@]} -ne 1 ]; then
  echo "Unexpected number of targets found in dist directory - please clean up existing sdists first."
  exit -1
fi
# Do the actual installation
pip install --upgrade --force-reinstall *.tar.gz

cd /

echo "Run basic sanity check on pip installed version with spark-submit"
spark-submit $FWDIR/dev/pip-sanity-check.py
echo "Run basic import-based sanity check"
python $FWDIR/dev/pip-sanity-check.py
echo "Run the tests for context.py"
python $FWDIR/python/pyspark/context.py

exit 0
7 changes: 7 additions & 0 deletions dev/run-tests.py
@@ -432,6 +432,12 @@ def run_python_tests(test_modules, parallelism):
    run_cmd(command)


def run_python_packaging_tests():
    set_title_and_block("Running PySpark packaging tests", "BLOCK_PYSPARK_PIP_TESTS")
    command = [os.path.join(SPARK_HOME, "dev", "run-pip-tests")]
    run_cmd(command)


def run_build_tests():
    set_title_and_block("Running build tests", "BLOCK_BUILD_TESTS")
    run_cmd([os.path.join(SPARK_HOME, "dev", "test-dependencies.sh")])
@@ -583,6 +589,7 @@ def main():
    modules_with_python_tests = [m for m in test_modules if m.python_test_goals]
    if modules_with_python_tests:
        run_python_tests(modules_with_python_tests, opts.parallelism)
        run_python_packaging_tests()
Member

Do we need to run the pip packaging tests every time? It would be better if we could choose whether to run them.

Contributor Author

It's really hard to determine whether a given Python change is one that's going to break packaging, and the packaging tests are quite fast. I think for now erring on the side of testing slightly more often than we need is the best course of action.

Member

I would +1 on this given the logic in setup.py that should be checked

Contributor

+1 as well; this seems cheap to run and it's better to err on the side of running things more often.

    if any(m.should_run_r_tests for m in test_modules):
        run_sparkr_tests()
