[SPARK-31382][BUILD] Show a better error message for different python and pip installation mistake #28152
Conversation
@holdenk, @srowen, @BryanCutler, @viirya can you take a look when you guys find some time?
Test build #120960 has finished for PR 28152 at commit
Test build #120962 has finished for PR 28152 at commit
Test build #120965 has finished for PR 28152 at commit
Test build #120966 has finished for PR 28152 at commit
I have faced this issue. The recommended command
I had to provide the
LGTM, just had a couple minor suggestions
Thanks guys!
Test build #120989 has finished for PR 28152 at commit
Merged to master, branch-3.0, and branch-2.4.
What changes were proposed in this pull request?
This PR proposes to show a better error message when a user mistakenly installs `pyspark` from pip but the default `python` does not point to the corresponding `pip`. See https://stackoverflow.com/questions/46286436/running-pyspark-after-pip-install-pyspark/49587560 as an example.

It can be reproduced as below. I have two Python executables: `python` is Python 3.7, `pip` binds to Python 3.7, and `python2.7` is Python 2.7.

```bash
pip install pyspark
```

```bash
pyspark
```

```
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.5
      /_/

Using Python version 3.7.3 (default, Mar 27 2019 09:23:15)
SparkSession available as 'spark'.
...
```

```bash
PYSPARK_PYTHON=python2.7 pyspark
```

```
Could not find valid SPARK_HOME while searching ['/Users', '/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/bin']
/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/bin/pyspark: line 24: /bin/load-spark-env.sh: No such file or directory
/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/bin/pyspark: line 77: /bin/spark-submit: No such file or directory
/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/bin/pyspark: line 77: exec: /bin/spark-submit: cannot execute: No such file or directory
```
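As a quick sanity check for this kind of mismatch, standard commands already reveal which interpreter each tool is bound to (a sketch; the version strings in the comments are illustrative, not from this PR):

```bash
# Which executables do the bare commands resolve to?
which python pip

# pip reports the interpreter it installs into, e.g.
# "pip 19.0.3 from ... (python 3.7)"
pip --version

# pip invoked through the default python; if this output differs
# from 'pip --version', the two are bound to different interpreters
python -m pip --version
```

If the two `pip --version` outputs disagree, packages installed with the bare `pip` will not be importable from the default `python`.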
Why are the changes needed?
There are multiple questions out there about this error, and the users asking them have no idea what's going on. See:

- https://stackoverflow.com/questions/46286436/running-pyspark-after-pip-install-pyspark/49587560
- https://stackoverflow.com/questions/45991888/path-issue-could-not-find-valid-spark-home-while-searching
- https://stackoverflow.com/questions/49707239/pyspark-could-not-find-valid-spark-home
- https://stackoverflow.com/questions/55569985/pyspark-could-not-find-valid-spark-home
- https://stackoverflow.com/questions/48296474/error-could-not-find-valid-spark-home-while-searching-pycharm-in-windows
- ContinuumIO/anaconda-issues#8076
The answer is usually to set `SPARK_HOME`; however, this isn't completely correct. Setting `SPARK_HOME` works because the `pyspark` executable script directly imports the library by using `SPARK_HOME` (see https://github.com/apache/spark/blob/master/bin/pyspark#L52-L53) instead of the default package location known to the `python` executable. So this way you use a package installed under a different Python, which isn't ideal.
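To see whether a given interpreter can actually import PySpark, and from where, a minimal check looks like this (a sketch; `python2.7` mirrors the reproduction above and is assumed not to have PySpark installed):

```bash
# Can the default python import pyspark, and from which location?
python -c 'import pyspark; print(pyspark.__file__)'

# The same check for the interpreter passed via PYSPARK_PYTHON;
# in the reproduction above this one fails with ImportError
python2.7 -c 'import pyspark; print(pyspark.__file__)'

# If the import fails, install into that exact interpreter
python -m pip install pyspark
```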
Does this PR introduce any user-facing change?
Yes, it improves the error message.
Before:

```
Could not find valid SPARK_HOME while searching ['/Users', '/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/bin']
...
```

After:

```
Could not find valid SPARK_HOME while searching ['/Users', '/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/bin']

Did you install PySpark via a package manager such as pip or Conda? If so,
PySpark was not found in your Python environment. It is possible your
Python environment does not properly bind with your package manager.

Please check your default 'python' and if you set PYSPARK_PYTHON and/or
PYSPARK_DRIVER_PYTHON environment variables, and see if you can import
PySpark, for example, 'python -c 'import pyspark'.

If you cannot import, you can install by using the Python executable directly,
for example, 'python -m pip install pyspark [--user]'. Otherwise, you can also
explicitly set the Python executable, that has PySpark installed, to
PYSPARK_PYTHON or PYSPARK_DRIVER_PYTHON environment variables, for example,
'PYSPARK_PYTHON=python3 pyspark'.
...
```
How was this patch tested?
Manually tested as described above.
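For reference, a minimal manual check along those lines (reusing the mismatched `python2.7` interpreter from the reproduction above, assumed to lack PySpark):

```bash
# Re-run pyspark with an interpreter that lacks the pyspark package;
# the improved SPARK_HOME error message should be printed
PYSPARK_PYTHON=python2.7 pyspark
```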