[SPARK-31382][BUILD] Show a better error message for different python and pip installation mistake #28152
Conversation
@holdenk, @srowen, @BryanCutler, @viirya can you take a look when you guys find some time?
Test build #120960 has finished for PR 28152 at commit
Test build #120962 has finished for PR 28152 at commit
Test build #120965 has finished for PR 28152 at commit
Test build #120966 has finished for PR 28152 at commit
I have faced this issue. The recommended command
I had to provide the
LGTM, just had a couple minor suggestions
Thanks guys!
Test build #120989 has finished for PR 28152 at commit
Merged to master, branch-3.0, and branch-2.4.
What changes were proposed in this pull request?
This PR proposes to show a better error message when a user mistakenly installs `pyspark` from pip but the default `python` does not point to the corresponding `pip`. See https://stackoverflow.com/questions/46286436/running-pyspark-after-pip-install-pyspark/49587560 as an example.

It can be reproduced as below. I have two Python executables: `python` is Python 3.7, `pip` binds to Python 3.7, and `python2.7` is Python 2.7.

```bash
pip install pyspark
```

```bash
pyspark
```

```
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.5
      /_/

Using Python version 3.7.3 (default, Mar 27 2019 09:23:15)
SparkSession available as 'spark'.
...
```

```bash
PYSPARK_PYTHON=python2.7 pyspark
```

```
Could not find valid SPARK_HOME while searching ['/Users', '/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/bin']
/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/bin/pyspark: line 24: /bin/load-spark-env.sh: No such file or directory
/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/bin/pyspark: line 77: /bin/spark-submit: No such file or directory
/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/bin/pyspark: line 77: exec: /bin/spark-submit: cannot execute: No such file or directory
```
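As a quick sanity check for this kind of mismatch, standard commands already reveal which interpreter each tool is bound to (a sketch; the version strings in the comments are illustrative, not from this PR):

```bash
# Which executables do the bare commands resolve to?
which python pip

# pip reports the interpreter it installs into, e.g.
# "pip 19.0.3 from ... (python 3.7)"
pip --version

# pip invoked through the default python; if this output differs
# from 'pip --version', the two are bound to different interpreters
python -m pip --version
```

If the two `pip --version` outputs disagree, packages installed with the bare `pip` will not be importable from the default `python`.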
Why are the changes needed?
There are multiple questions out there about this error, and the users asking them have no idea what's going on. See:

- https://stackoverflow.com/questions/46286436/running-pyspark-after-pip-install-pyspark/49587560
- https://stackoverflow.com/questions/45991888/path-issue-could-not-find-valid-spark-home-while-searching
- https://stackoverflow.com/questions/49707239/pyspark-could-not-find-valid-spark-home
- https://stackoverflow.com/questions/55569985/pyspark-could-not-find-valid-spark-home
- https://stackoverflow.com/questions/48296474/error-could-not-find-valid-spark-home-while-searching-pycharm-in-windows
- ContinuumIO/anaconda-issues#8076
The answer is usually to set `SPARK_HOME`; however, this isn't completely correct. Setting `SPARK_HOME` works because the `pyspark` executable script directly imports the library by using `SPARK_HOME` (see https://github.com/apache/spark/blob/master/bin/pyspark#L52-L53) instead of the default package location known to the `python` executable. So this way you use a package installed under a different Python, which isn't ideal.
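To see whether a given interpreter can actually import PySpark, and from where, a minimal check looks like this (a sketch; `python2.7` mirrors the reproduction above and is assumed not to have PySpark installed):

```bash
# Can the default python import pyspark, and from which location?
python -c 'import pyspark; print(pyspark.__file__)'

# The same check for the interpreter passed via PYSPARK_PYTHON;
# in the reproduction above this one fails with ImportError
python2.7 -c 'import pyspark; print(pyspark.__file__)'

# If the import fails, install into that exact interpreter
python -m pip install pyspark
```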
Does this PR introduce any user-facing change?
Yes, it improves the error message.
Before:

```
Could not find valid SPARK_HOME while searching ['/Users', '/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/bin']
...
```

After:

```
Could not find valid SPARK_HOME while searching ['/Users', '/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/bin']

Did you install PySpark via a package manager such as pip or Conda? If so,
PySpark was not found in your Python environment. It is possible your
Python environment does not properly bind with your package manager.

Please check your default 'python' and if you set PYSPARK_PYTHON and/or
PYSPARK_DRIVER_PYTHON environment variables, and see if you can import
PySpark, for example, 'python -c 'import pyspark'.

If you cannot import, you can install by using the Python executable directly,
for example, 'python -m pip install pyspark [--user]'. Otherwise, you can also
explicitly set the Python executable, that has PySpark installed, to
PYSPARK_PYTHON or PYSPARK_DRIVER_PYTHON environment variables, for example,
'PYSPARK_PYTHON=python3 pyspark'.
...
```
How was this patch tested?
Manually tested as described above.
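For reference, a minimal manual check along those lines (reusing the mismatched `python2.7` interpreter from the reproduction above, assumed to lack PySpark):

```bash
# Re-run pyspark with an interpreter that lacks the pyspark package;
# the improved SPARK_HOME error message should be printed
PYSPARK_PYTHON=python2.7 pyspark
```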