spark installation issue #1186

Merged 4 commits on Sep 10, 2020
14 changes: 7 additions & 7 deletions SETUP.md
@@ -85,7 +85,7 @@ To install the PySpark environment:

> Additionally, if you want to test a particular version of spark, you may pass the --pyspark-version argument:
>
- > python tools/generate_conda_file.py --pyspark-version 2.4.0
+ > python tools/generate_conda_file.py --pyspark-version 2.4.5
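
For context, a typical end-to-end sequence with the updated version might look like the sketch below. The output file name `reco_pyspark.yaml` is an assumption based on how the script names its output when PySpark is enabled; if `--pyspark-version` alone does not enable the PySpark dependencies in your copy, add the `--pyspark` flag as well:

python tools/generate_conda_file.py --pyspark-version 2.4.5
conda env create -f reco_pyspark.yaml
conda activate reco_pyspark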

Then, we need to set the environment variables `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` to point to the conda python executable.

@@ -94,29 +94,29 @@ Click on the following menus to see details:
<summary><strong><em>Set PySpark environment variables on Linux or MacOS</em></strong></summary>

To set these variables every time the environment is activated, we can follow the steps of this [guide](https://conda.io/docs/user-guide/tasks/manage-environments.html#macos-and-linux).

First, get the path where the `reco_pyspark` environment is installed:

RECO_ENV=$(conda env list | grep reco_pyspark | awk '{print $NF}')
mkdir -p $RECO_ENV/etc/conda/activate.d
mkdir -p $RECO_ENV/etc/conda/deactivate.d

You also need to find where Spark is installed and set the `SPARK_HOME` variable; on the DSVM, `SPARK_HOME=/dsvm/tools/spark/current`.
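
If you are not on the DSVM, the sketch below shows one way to look for an existing installation; the listed paths are only common examples, not guaranteed locations:

# check a few common install locations and pick the one that exists
ls -d /opt/spark /usr/local/spark $HOME/spark 2>/dev/null
export SPARK_HOME=/opt/spark  # replace with the path found above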

Then, create the file `$RECO_ENV/etc/conda/activate.d/env_vars.sh` and add:

#!/bin/sh
RECO_ENV=$(conda env list | grep reco_pyspark | awk '{print $NF}')
export PYSPARK_PYTHON=$RECO_ENV/bin/python
export PYSPARK_DRIVER_PYTHON=$RECO_ENV/bin/python
export SPARK_HOME_BACKUP=$SPARK_HOME
unset SPARK_HOME
export SPARK_HOME=/dsvm/tools/spark/current

-This will export the variables every time we do `conda activate reco_pyspark`.
-To unset these variables when we deactivate the environment, create the file `$RECO_ENV/etc/conda/deactivate.d/env_vars.sh` and add:
+This will export the variables every time we do `conda activate reco_pyspark`. To unset these variables when we deactivate the environment, create the file `$RECO_ENV/etc/conda/deactivate.d/env_vars.sh` and add:

#!/bin/sh
unset PYSPARK_PYTHON
unset PYSPARK_DRIVER_PYTHON
export SPARK_HOME=$SPARK_HOME_BACKUP
unset SPARK_HOME_BACKUP
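
A quick, optional sanity check of both hooks (a minimal sketch, not part of the original instructions) is to activate and deactivate the environment and inspect the variables:

conda activate reco_pyspark
echo $PYSPARK_PYTHON $SPARK_HOME    # should point to the env python and the Spark install
conda deactivate
echo ${PYSPARK_PYTHON:-unset} ${SPARK_HOME:-unset}    # PYSPARK_PYTHON should be unset, SPARK_HOME restored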


</details>

6 changes: 3 additions & 3 deletions tools/generate_conda_file.py
@@ -13,7 +13,7 @@
# For generating a conda file for running python gpu and pyspark:
# $ python generate_conda_file.py --gpu --pyspark
# For generating a conda file for running python gpu and pyspark with a particular version:
-# $ python generate_conda_file.py --gpu --pyspark-version 2.4.0
+# $ python generate_conda_file.py --gpu --pyspark-version 2.4.5

import argparse
import textwrap
@@ -61,7 +61,7 @@
"tqdm": "tqdm>=4.31.1",
}

CONDA_PYSPARK = {"pyarrow": "pyarrow>=0.8.0", "pyspark": "pyspark==2.4.3"}
CONDA_PYSPARK = {"pyarrow": "pyarrow>=0.8.0", "pyspark": "pyspark==2.4.5"}

CONDA_GPU = {
"fastai": "fastai==1.0.46",
@@ -134,7 +134,7 @@
"PySpark version input must be valid numeric format (e.g. --pyspark-version=2.3.1)"
)
else:
args.pyspark_version = "2.4.3"
args.pyspark_version = "2.4.5"

# set name for environment and output yaml file
conda_env = "reco_base"
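
As a quick check that the regenerated environment file picks up the new default (a hedged sketch; the output file name depends on the flags used and is assumed here to be reco_pyspark.yaml):

python tools/generate_conda_file.py --pyspark
grep pyspark reco_pyspark.yaml    # expect a line pinning pyspark==2.4.5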