Feature/container #492

Merged
merged 15 commits into from
Oct 30, 2024
Changes from 7 commits
Expand Up @@ -114,18 +114,21 @@ The latest containers are updated periodically. If you have trouble using contai


## Troubleshooting Common Issues
1. **Permission Denied Error**: If you encounter permission errors during the build

- **Permission Denied Error**: If you encounter permission errors during the build
- Check your quota and delete any unnecessary files.
- Clean-up apptainer cache, `~/.apptainer/cache`, and set the apptainer tmp and cache directories as below:
- Clean up the Apptainer cache, `~/.apptainer/cache`, and set the Apptainer tmp and cache directories as below. If your home directory is full and you are building your container on a compute node, set the tmpdir and cachedir to local scratch:
```bash
export APPTAINER_TMPDIR=/tmp/apptainer-tmpdir
mkdir $APPTAINER_TMPDIR
export APPTAINER_CACHEDIR=/tmp/apptainer-cachedir/
mkdir $APPTAINER_CACHEDIR
```
```bash
export APPTAINER_TMPDIR=/local/scratch/apptainer-tmpdir
mkdir -p "$APPTAINER_TMPDIR"
export APPTAINER_CACHEDIR=/local/scratch/apptainer-cachedir/
mkdir -p "$APPTAINER_CACHEDIR"
```
- Make sure you are not in a directory accessed through a symlink, i.e., check that `pwd` and `pwd -P` return the same path.
- If none of the above works, try running the build in your home directory.

2. **Mapping to rank 0 on all nodes**: Ensure that the container's MPI aligns with the system MPI. Follow the additional steps outlined in the [container registry documentation for MPI on Polaris](https://github.com/argonne-lcf/container-registry/tree/main/containers/mpi/Polaris)
- **Mapping to rank 0 on all nodes**: Ensure that the container's MPI aligns with the system MPI. Follow the additional steps outlined in the [container registry documentation for MPI on Polaris](https://github.com/argonne-lcf/container-registry/tree/main/containers/mpi/Polaris)

- **libmpi.so.40 not found**: This can happen if the container's application has an OpenMPI dependency, which is not currently supported on Polaris. It can also occur if the container's base environment is not a Debian-based distribution such as Ubuntu. Ensure the application has an MPICH implementation as well. Also try removing the `.conda`, `.cache`, and `.local` folders from your home directory and rebuilding the container.

3. **libmpi.so.40 not found**: This can happen if the container's application has an OpenMPI dependency which is not currently supported on Polaris. It can also spring up if the containers base environment is not debian architecture like Ubuntu. Ensure the application has an MPICH implementation as well. Also try removing .conda, .cache, and .local folders from your home directory and rebuild the container.
- **Disabled port mapping, user namespaces, and network virtualization**: [Network virtualization](https://apptainer.org/docs/user/main/networking.html) is disabled for the container due to security constraints. See issue [#2533](https://github.com/apptainer/apptainer/issues/2553).
2 changes: 1 addition & 1 deletion docs/polaris/data-science-workflows/frameworks/pytorch.md
Expand Up @@ -30,7 +30,7 @@ $ echo $CUDA_HOME

If you need to build applications that use this version of PyTorch and CUDA, we recommend using these CUDA libraries to ensure compatibility. We periodically update the PyTorch release, though updates will come in the form of new versions of the `conda` module.

PyTorch is also available through nvidia containers that have been translated to Singularity containers. For more information about containers, please see the [containers](../containers/containers.md) documentation page.
PyTorch is also available through NVIDIA containers that have been translated to Singularity containers. For more information about containers, please see the [containers](../../containers/containers.md) documentation page.

## PyTorch Best Practices on Polaris

Expand Up @@ -29,7 +29,7 @@ $ echo $CUDA_HOME

If you need to build applications that use this version of TensorFlow and CUDA, we recommend using these CUDA libraries to ensure compatibility. We periodically update the TensorFlow release, though updates will come in the form of new versions of the `conda` module.

TensorFlow is also available through NVIDIA containers that have been translated to Singularity containers. For more information about containers, please see the [Containers](../containers/containers.md) documentation page.
TensorFlow is also available through NVIDIA containers that have been translated to Singularity containers. For more information about containers, please see the [Containers](../../containers/containers.md) documentation page.

## TensorFlow Best Practices on Polaris

55 changes: 55 additions & 0 deletions docs/sophia/containers/containers.md
@@ -0,0 +1,55 @@
# Containers on Sophia
Sophia, powered by NVIDIA A100 GPUs, benefits from container-based workloads for seamless compatibility across NVIDIA systems. This guide details the use of containers on Sophia, including custom container creation, large-scale execution, and common pitfalls.

## Apptainer Setup

Sophia employs Apptainer (formerly known as Singularity) for container management. To set up Apptainer, run:

```bash
module use /soft/spack/base/0.7.1/install/modulefiles/Core/
module load apptainer
apptainer version #1.3.3
```

The Apptainer version on Sophia is 1.3.3. Detailed user documentation is available [here](https://apptainer.org/docs/user/1.3/).

## Building from Docker or Argonne GitHub Container Registry

Containers on Sophia can be built either by writing a Dockerfile on a local machine and publishing the container to DockerHub, or by writing an Apptainer recipe file and building directly on an ALCF compute node. If you prefer to use existing containers, you can pull them from registries such as DockerHub and run them on Sophia.

Since Docker requires root privileges, which users do not have on Sophia, existing Docker containers must be converted to Apptainer. To build a Docker-based container on Sophia, use the following as an example:


```bash
qsub -I -A <Project> -l select=1:ngpus=8:ncpus=256 -l walltime=01:00:00 -l filesystems=home:eagle -l singularity_fakeroot=True -q by-node -k doe
export HTTP_PROXY=http://proxy.alcf.anl.gov:3128
export HTTPS_PROXY=http://proxy.alcf.anl.gov:3128
export http_proxy=http://proxy.alcf.anl.gov:3128
export https_proxy=http://proxy.alcf.anl.gov:3128
module use /soft/spack/base/0.7.1/install/modulefiles/Core/
module load apptainer
apptainer build --fakeroot pytorch:22.06-py3.sing docker://nvcr.io/nvidia/pytorch:22.06-py3
```
You can find the latest prebuilt NVIDIA PyTorch containers [here](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch). The TensorFlow containers are [here](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow) (though note that LCF doesn't typically prebuild the TF-1 containers). You can search the full container registry [here](https://catalog.ngc.nvidia.com/containers). For custom containers tailored for Sophia, visit [ALCF's GitHub container registry](https://github.com/argonne-lcf/container-registry/tree/main).

> **Note:** Currently, container builds and executions are supported only on the Sophia compute nodes.

## Recipe-Based Container Building

As mentioned earlier, you can build Apptainer containers from recipe files. Instructions are available [here](https://apptainer.org/docs/user/1.3/build_a_container.html#building-containers-from-apptainer-definition-files).

> Note: You can also build custom recipes by bootstrapping from prebuilt images. For example, the first two lines of a recipe that uses our custom TensorFlow implementation would be `Bootstrap: oras` followed by `From: ghcr.io/argonne-lcf/tf2-mpich-nvidia-gpu:latest`.
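
As a minimal sketch of such a recipe, the snippet below writes a definition file that bootstraps from the prebuilt image named in the note. The file name `tf2-custom.def`, the `%post` contents, the `numpy` package, and the label are illustrative assumptions, not part of the ALCF registry. The build command is left as a comment because it has to run on a Sophia compute node with the Apptainer module loaded.

```bash
# Write a hypothetical Apptainer definition file (the file name is an assumption).
cat > tf2-custom.def <<'EOF'
Bootstrap: oras
From: ghcr.io/argonne-lcf/tf2-mpich-nvidia-gpu:latest

%post
    # Illustrative only: install whatever extra packages your application needs.
    # Assumes pip is available inside the base image.
    pip install --no-cache-dir numpy

%labels
    Purpose example-recipe
EOF

# On a compute node, after loading the apptainer module, the build would look like:
# apptainer build --fakeroot tf2-custom.sif tf2-custom.def
```

Keeping the recipe in a file next to your job scripts makes the container reproducible, since the same definition can be rebuilt when the base image is updated.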

## Troubleshooting Common Issues

- **Permission Denied Error**: If you encounter permission errors during the build
- Check your quota and delete any unnecessary files.
- Clean up the Apptainer cache, `~/.apptainer/cache`, and set the Apptainer tmp and cache directories as below. If your home directory is full and you are building your container on a compute node, set the tmpdir and cachedir to local scratch:
```bash
export APPTAINER_TMPDIR=/local/scratch/apptainer-tmpdir
mkdir -p "$APPTAINER_TMPDIR"
export APPTAINER_CACHEDIR=/local/scratch/apptainer-cachedir/
mkdir -p "$APPTAINER_CACHEDIR"
```
- Make sure you are not in a directory accessed through a symlink, i.e., check that `pwd` and `pwd -P` return the same path.
- If none of the above works, try running the build in your home directory.

- **Disabled port mapping, user namespaces, and network virtualization**: [Network virtualization](https://apptainer.org/docs/user/main/networking.html) is disabled for the container due to security constraints. See issue [#2533](https://github.com/apptainer/apptainer/issues/2553).
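
The permission-related checks in this list can be combined into one pre-build snippet. This is a rough sketch: `/local/scratch` is taken from the text above, while the `SCRATCH_BASE` variable and the fallback to `${TMPDIR:-/tmp}` are assumptions added so the snippet also runs where local scratch is absent.

```bash
# Base directory for Apptainer scratch data. /local/scratch comes from the text above;
# falling back to ${TMPDIR:-/tmp} is an assumption for machines without it.
SCRATCH_BASE=/local/scratch
[ -d "$SCRATCH_BASE" ] && [ -w "$SCRATCH_BASE" ] || SCRATCH_BASE="${TMPDIR:-/tmp}"

export APPTAINER_TMPDIR="$SCRATCH_BASE/apptainer-tmpdir"
export APPTAINER_CACHEDIR="$SCRATCH_BASE/apptainer-cachedir"
# -p makes the step safe to repeat when the directories already exist.
mkdir -p "$APPTAINER_TMPDIR" "$APPTAINER_CACHEDIR"

# Compare the logical and physical working directories to detect a symlinked path.
if [ "$(pwd)" != "$(pwd -P)" ]; then
    echo "Warning: $(pwd) is reached through a symlink; try building from $(pwd -P)" >&2
fi
```

Run it in the same shell session as `apptainer build`, since the exported variables only apply to that session.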
135 changes: 107 additions & 28 deletions docs/sophia/data-science/python.md
@@ -1,39 +1,118 @@
# Setting Up a Python Virtual Environment on Sophia
# Python

## Default Python Version
The default Python on Sophia is located at `/usr/bin/python` with version 3.9.18.
We provide prebuilt `conda` environments containing GPU-supported builds of
`torch`, `tensorflow` (both with `horovod` support for multi-node
calculations), `jax`, and many other commonly-used Python modules.

## Creating a Virtual Environment
Creating a virtual environment allows you to manage dependencies for your Python projects independently.
This is particularly useful when working on multiple projects with different dependencies or versions of packages.
Follow these steps to set up a virtual environment on Sophia:
Users can activate this environment by first loading the `conda` module, and
then activating the base environment.

1. **Create a Virtual Environment**:
```bash
python -m venv myenv
```
Replace `myenv` with your preferred name for the virtual environment.
This will create a directory with the specified name containing the virtual environment.
Explicitly (either from an interactive job, or inside a job script):

2. **Activate the Virtual Environment**:
```bash
source myenv/bin/activate
```
After activation, your command prompt will change to indicate the virtual environment is active.
```bash
module use /soft/modulefiles; module load conda ; conda activate base
```

3. **Upgrade `pip`**:

While this step is optional, it is a good habit to use the latest version of `pip`:
```bash
pip install --upgrade pip
```
This will load and activate the base environment.

## Virtual environments via `venv`

To install additional packages that are missing from the `base` environment,
we can build a `venv` on top of it.

!!! success "Conda `base` environment + `venv`"

If you need a package that is **not** already
installed in the `base` environment,
this is generally the recommended approach.

We can create a `venv` on top of the base
Anaconda environment (with
`#!bash --system-site-packages` to inherit
the `base` packages):

4. **Install Packages**:
Use `pip` to install necessary packages:
```bash
pip install package_name
module use /soft/modulefiles ; module load conda; conda activate base
CONDA_NAME=$(echo ${CONDA_PREFIX} | tr '\/' '\t' | sed -E 's/mconda3|\/base//g' | awk '{print $NF}')
VENV_DIR="$(pwd)/venvs/${CONDA_NAME}"
mkdir -p "${VENV_DIR}"
python -m venv "${VENV_DIR}" --system-site-packages
source "${VENV_DIR}/bin/activate"
```
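
The `CONDA_NAME` line is dense, so here is a rough trace of what it computes. The prefix `/soft/applications/conda/2024-08-08/mconda3` is a made-up value for illustration only, and `EXAMPLE_PREFIX` is a hypothetical variable used so that the real `CONDA_PREFIX` is left untouched; check `echo $CONDA_PREFIX` on Sophia for the actual path.

```bash
# Hypothetical prefix for illustration; the real value is set by `conda activate base`.
EXAMPLE_PREFIX=/soft/applications/conda/2024-08-08/mconda3

# 1. tr turns every "/" into a tab, so each path component becomes a field.
# 2. sed deletes the "mconda3" component (the "/base" alternative can no longer
#    match, because the slashes are already gone).
# 3. awk prints the last remaining field, i.e. the dated release directory.
CONDA_NAME=$(echo "${EXAMPLE_PREFIX}" | tr '/' '\t' | sed -E 's/mconda3|\/base//g' | awk '{print $NF}')

echo "${CONDA_NAME}"                 # prints 2024-08-08 for the prefix above
echo "$(pwd)/venvs/${CONDA_NAME}"    # the directory the venv would be created in
```

Naming the venv after the release should keep environments built against different `conda` module versions in separate directories.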
Replace `package_name` with the name of the package you want to install.

You can always retroactively change the `#!bash --system-site-packages` flag
state for this virtual environment by editing `#!bash ${VENV_DIR}/pyvenv.cfg` and
changing the value of the line `#!bash include-system-site-packages=false`.
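
A small helper can make that edit repeatable. This is a sketch only: `set_system_site_packages` is a hypothetical function name, and it assumes `pyvenv.cfg` contains a single `include-system-site-packages` line, with or without spaces around the `=`.

```bash
# Hypothetical helper: set include-system-site-packages in a venv's pyvenv.cfg.
# Usage: set_system_site_packages <venv-dir> <true|false>
set_system_site_packages() {
    local cfg="$1/pyvenv.cfg" value="$2"
    [ -f "$cfg" ] || { echo "no pyvenv.cfg in $1" >&2; return 1; }
    # Rewrite only the matching line, tolerating optional spaces around "=".
    sed -E "s/^(include-system-site-packages[[:space:]]*=[[:space:]]*).*/\1${value}/" \
        "$cfg" > "$cfg.tmp" && mv "$cfg.tmp" "$cfg"
}

# Example: stop inheriting packages from the base environment.
# set_system_site_packages "${VENV_DIR}" false
```

The change should take effect the next time Python is started from that virtual environment.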

To install a different version of a package that is already installed in the
base environment, you can use:

```bash
python3 -m pip install --ignore-installed <package> # or -I
```

The shared base environment is not writable, so it is impossible to remove or
uninstall packages from it. The packages installed with the above `pip` command
should shadow those installed in the base environment.

## Cloning the base Anaconda environment

!!! warning

This approach is generally not recommended as it can be quite slow and can
use significant storage space.

If you need more flexibility, you can clone the conda environment into a custom
path, which would then allow for root-like installations via `#!bash conda install
<module>` or `#!bash pip install <module>`.

Unlike the `venv` approach, using a cloned Anaconda environment requires you to
copy the entirety of the base environment, which can use significant storage
space.

To clone the `base` environment:

```bash
module load conda ; conda activate base
conda create --clone base --prefix /path/to/envs/base-clone
conda activate /path/to/envs/base-clone
```

where `#!bash /path/to/envs/base-clone` should be replaced by a suitably chosen
path.

**Note**: The cloning process can be _quite_ slow.

## Using `pip install --user` (not recommended)

!!! danger

This is typically _not_ recommended.

With the conda environment set up, one can install common Python modules using
`#!bash python3 -m pip install --user '<module-name>'`, which will install
packages in `#!bash $PYTHONUSERBASE/lib/pythonX.Y/site-packages`.

The `#!bash $PYTHONUSERBASE` environment variable is automatically set when you
load the base conda module, and is equal to `#!bash
/home/$USER/.local/polaris/conda/YYYY-MM-DD`.

Note, Python modules installed this way that contain command line binaries will
not have those binaries automatically added to the shell's `#!bash $PATH`. To
manually add the path:

```bash
export PATH="$PYTHONUSERBASE/bin:$PATH"
```

Be sure to remove this location from `#!bash $PATH` if you deactivate the base
Anaconda environment or unload the module.
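
One way to do that cleanup is sketched below. `path_remove` is a hypothetical helper, not something provided by the `conda` module, and it assumes a bash shell.

```bash
# Hypothetical helper: drop every occurrence of a directory from PATH.
# Usage: path_remove <directory>
path_remove() {
    local dir="$1" entry new_path=""
    local IFS=':'
    # With IFS set to ":", the unquoted $PATH expands to one word per entry.
    for entry in $PATH; do
        [ "$entry" = "$dir" ] && continue
        new_path="${new_path:+$new_path:}$entry"
    done
    PATH="$new_path"
}

# Undo the earlier export, but only if PYTHONUSERBASE is still set.
if [ -n "${PYTHONUSERBASE:-}" ]; then
    path_remove "$PYTHONUSERBASE/bin"
fi
```

Call it before `conda deactivate` or `module unload conda`, while `$PYTHONUSERBASE` still holds the path that was added.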

Both cloning the Anaconda environment and using `venv` are more flexible and
transparent than `#!bash --user` installs.

## Default Python Version
The default Python on Sophia is located at `/usr/bin/python` with version 3.9.18.

## Creating a Jupyter Kernel

6 changes: 4 additions & 2 deletions mkdocs.yml
Expand Up @@ -69,10 +69,10 @@ nav:
- Cabana: polaris/applications-and-libraries/libraries/cabana-polaris.md
- Spack PE: polaris/applications-and-libraries/libraries/spack-pe.md
- XALT: polaris/applications-and-libraries/libraries/xalt.md
- Containers: polaris/containers/containers.md
- Data Science:
- Julia: polaris/data-science-workflows/julia.md
- Python: polaris/data-science-workflows/python.md
- Containers: polaris/data-science-workflows/containers/containers.md
- Frameworks:
- TensorFlow: polaris/data-science-workflows/frameworks/tensorflow.md
- PyTorch: polaris/data-science-workflows/frameworks/pytorch.md
Expand Up @@ -109,7 +109,9 @@ nav:
- Getting Started: sophia/getting-started.md
- Running Jobs: sophia/queueing-and-running-jobs/running-jobs.md
- Compiling and Linking: sophia/compiling-and-linking/compiling-and-linking-overview.md
- Data Science: sophia/data-science/python.md
- Containers: sophia/containers/containers.md
- Data Science:
- Python: sophia/data-science/python.md
- AI Testbed:
- Getting Started: ai-testbed/getting-started.md
- Cerebras: