Updated frameworks documentation pages #502

Merged: 21 commits merged on Nov 8, 2024. Changes shown from 5 commits.
87 changes: 79 additions & 8 deletions docs/aurora/data-science/frameworks/pytorch.md
@@ -12,15 +12,15 @@ the frameworks module. To use it from a compute node, please load the following

```
module use /soft/modulefiles/
module load frameworks/2023.12.15.001
module load frameworks
```
Then you can `import` PyTorch as usual; the following is output from the
`frameworks/2023.12.15.001` module
`frameworks` module

```
>>> import torch
>>> torch.__version__
'2.0.1a0+cxx11.abi'
'2.3.1+cxx11.abi'
```
A simple but useful check could be to use PyTorch to get device information on
a compute node. You can do this the following way:
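A minimal sketch of such a check (it assumes the `frameworks` module provides `intel_extension_for_pytorch`, which exposes the `torch.xpu` namespace; outputs depend on the node):

```
>>> import torch
>>> import intel_extension_for_pytorch as ipex
>>> torch.xpu.is_available()          # True on a compute node with Intel GPUs
>>> torch.xpu.device_count()          # number of visible XPU devices (tiles)
>>> torch.xpu.get_device_name(0)      # name of the first visible device
```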
@@ -128,6 +128,10 @@ Some of the Aurora specific details might be helpful to you:
The following environment variables should be set in the batch submission
script (PBSPro script) when attempting to run on more than 16 nodes.

**oneCCL optimal setup**

Please refer to [oneCCL](./oneCCL.md) for details.

```shell
# This is a fix for running over 16 nodes:
export FI_CXI_DEFAULT_CQ_SIZE=131072
@@ -138,13 +142,49 @@ export FI_LOG_LEVEL=warn
#export FI_LOG_PROV=tcp
export FI_LOG_PROV=cxi

export MPIR_CVAR_ENABLE_GPU=0
# This is to disable certain GPU optimizations like the use of XeLinks between
# GPUs, collectives with GPU-placed data etc., in order to reduce `MPI_Init`
# overheads. Benefits are application dependent.
export CCL_KVS_GET_TIMEOUT=600

export LD_LIBRARY_PATH=$CCL_ROOT/lib:$LD_LIBRARY_PATH
export CPATH=$CCL_ROOT/include:$CPATH
export LIBRARY_PATH=$CCL_ROOT/lib:$LIBRARY_PATH

export CCL_PROCESS_LAUNCHER=pmix
export CCL_ATL_TRANSPORT=mpi
export CCL_ALLREDUCE=topo
export CCL_ALLREDUCE_SCALEOUT=rabenseifner # currently best allreduce algorithm at large scale
export CCL_BCAST=double_tree # currently best bcast algorithm at large scale

export CCL_KVS_MODE=mpi
export CCL_CONFIGURATION_PATH=""
export CCL_CONFIGURATION=cpu_gpu_dpcpp
export CCL_KVS_CONNECTION_TIMEOUT=600

export CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=1024
export CCL_KVS_USE_MPI_RANKS=1
```

Other optional settings for oneCCL:

```bash
export FI_MR_ZE_CACHE_MONITOR_ENABLED=0
export FI_MR_CACHE_MONITOR=disabled
export FI_CXI_RX_MATCH_MODE=hybrid
export FI_CXI_OFLOW_BUF_SIZE=8388608
export FI_CXI_DEFAULT_CQ_SIZE=1048576
export FI_CXI_CQ_FILL_PERCENT=30
export MPI_PROVIDER=$FI_PROVIDER
unset MPIR_CVAR_CH4_COLL_SELECTION_TUNING_JSON_FILE
unset MPIR_CVAR_COLL_SELECTION_TUNING_JSON_FILE
export INTELGT_AUTO_ATTACH_DISABLE=1
export PALS_PING_PERIOD=240
export PALS_RPC_TIMEOUT=240
export MPIR_CVAR_GATHERV_INTER_SSEND_MIN_PROCS=-1 # to solve a sync-send issue that causes a Horovod seg fault
export CCL_ATL_SYNC_COLL=1 # to avoid a potential hang at large scale
export CCL_OP_SYNC=1 # to avoid a potential hang at large scale
```

These settings will probably be included in the frameworks module file in the future. For now, users need to set them explicitly in the submission script.

In order to run an application with the `TF32` precision type, one must set the
following environment variable:

@@ -314,7 +354,7 @@ export IPEX_FP32_MATH_MODE=TF32
#####################################################################

module use /soft/modulefiles
module load frameworks/2023.12.15.001
module load frameworks

export NUMEXPR_NUM_THREADS=64
# This is to resolve an issue due to a package called "numexpr".
@@ -333,6 +373,37 @@ export NUMEXPR_NUM_THREADS=64
# JOB LAUNCH
######################################################################


## CCL setup
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_OVFLOW_BUF_SIZE=8388608
export FI_CXI_CQ_FILL_PERCENT=20

export FI_LOG_LEVEL=warn
#export FI_LOG_PROV=tcp
export FI_LOG_PROV=cxi

export CCL_KVS_GET_TIMEOUT=600

export LD_LIBRARY_PATH=$CCL_ROOT/lib:$LD_LIBRARY_PATH
export CPATH=$CCL_ROOT/include:$CPATH
export LIBRARY_PATH=$CCL_ROOT/lib:$LIBRARY_PATH

export CCL_PROCESS_LAUNCHER=pmix
export CCL_ATL_TRANSPORT=mpi
export CCL_ALLREDUCE=topo
export CCL_ALLREDUCE_SCALEOUT=rabenseifner # currently best allreduce algorithm at large scale
export CCL_BCAST=double_tree # currently best bcast algorithm at large scale

export CCL_KVS_MODE=mpi
export CCL_CONFIGURATION_PATH=""
export CCL_CONFIGURATION=cpu_gpu_dpcpp
export CCL_KVS_CONNECTION_TIMEOUT=600

export CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=1024
export CCL_KVS_USE_MPI_RANKS=1


export CCL_LOG_LEVEL="WARN"
export CPU_BIND="verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96"
HOROVOD_THREAD_AFFINITY="4,12,20,28,36,44,56,64,72,80,88,96"
97 changes: 93 additions & 4 deletions docs/aurora/data-science/frameworks/tensorflow.md
@@ -13,17 +13,19 @@ module. To use it from a compute node, please do:

```
module use /soft/modulefiles/
module load frameworks/2023.12.15.001
module load frameworks
```

Then you can `import` TensorFlow as usual; the following is output from the
`frameworks/2023.12.15.001` module:
`frameworks` module:

```
>>> import tensorflow as tf
>>> tf.__version__
'2.14.1'
```
Note that this import will fail on login nodes, which do not have XPU devices.

A simple but useful check could be to use TensorFlow to get device information
on a compute node. You can do this the following way:

@@ -213,9 +215,66 @@ export MPIR_CVAR_ENABLE_GPU=0
# This is to disable certain GPU optimizations like the use of XeLinks between
# GPUs, collectives with GPU-placed data etc., in order to reduce `MPI_Init`
# overheads. Benefits are application dependent.
```

**oneCCL optimal setup**

Please refer to [oneCCL](./oneCCL.md) for details.

```shell
# This is a fix for running over 16 nodes:
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_OVFLOW_BUF_SIZE=8388608
export FI_CXI_CQ_FILL_PERCENT=20

export FI_LOG_LEVEL=warn
#export FI_LOG_PROV=tcp
export FI_LOG_PROV=cxi

export CCL_KVS_GET_TIMEOUT=600

export LD_LIBRARY_PATH=$CCL_ROOT/lib:$LD_LIBRARY_PATH
export CPATH=$CCL_ROOT/include:$CPATH
export LIBRARY_PATH=$CCL_ROOT/lib:$LIBRARY_PATH

export CCL_PROCESS_LAUNCHER=pmix
export CCL_ATL_TRANSPORT=mpi
export CCL_ALLREDUCE=topo
export CCL_ALLREDUCE_SCALEOUT=rabenseifner # currently best allreduce algorithm at large scale
export CCL_BCAST=double_tree # currently best bcast algorithm at large scale

export CCL_KVS_MODE=mpi
export CCL_CONFIGURATION_PATH=""
export CCL_CONFIGURATION=cpu_gpu_dpcpp
export CCL_KVS_CONNECTION_TIMEOUT=600

export CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=1024
export CCL_KVS_USE_MPI_RANKS=1
```

Other optional settings for oneCCL:

```bash
export FI_MR_ZE_CACHE_MONITOR_ENABLED=0
export FI_MR_CACHE_MONITOR=disabled
export FI_CXI_RX_MATCH_MODE=hybrid
export FI_CXI_OFLOW_BUF_SIZE=8388608
export FI_CXI_DEFAULT_CQ_SIZE=1048576
export FI_CXI_CQ_FILL_PERCENT=30
export MPI_PROVIDER=$FI_PROVIDER
unset MPIR_CVAR_CH4_COLL_SELECTION_TUNING_JSON_FILE
unset MPIR_CVAR_COLL_SELECTION_TUNING_JSON_FILE
export INTELGT_AUTO_ATTACH_DISABLE=1
export PALS_PING_PERIOD=240
export PALS_RPC_TIMEOUT=240
export MPIR_CVAR_GATHERV_INTER_SSEND_MIN_PROCS=-1 # to solve a sync-send issue that causes a Horovod seg fault
export CCL_ATL_SYNC_COLL=1 # to avoid a potential hang at large scale
export CCL_OP_SYNC=1 # to avoid a potential hang at large scale
```

These settings will probably be included in the frameworks module file in the future. For now, users need to set them explicitly in the submission script.


### CPU Affinity

The CPU affinity should be set manually through mpiexec.
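For example (a sketch only: the ranks-per-node count, the binding list taken from the `CPU_BIND` value used later in the example script, and `train.py` are all illustrative; adjust them for your application):

```shell
NNODES=$(wc -l < "$PBS_NODEFILE")
NRANKS_PER_NODE=12   # assumed: one rank per GPU tile
mpiexec -n $(( NNODES * NRANKS_PER_NODE )) -ppn ${NRANKS_PER_NODE} \
  --cpu-bind verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96 \
  python3 train.py
```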
@@ -309,7 +368,6 @@ export FI_LOG_PROV=cxi
# These allow for logging from a specific provider (libfabric)

export MPIR_CVAR_ENABLE_GPU=0
export CCL_KVS_GET_TIMEOUT=600

#####################################################################
# FRAMEWORK Variables that make a performance difference
@@ -327,7 +385,7 @@ export ITEX_FP32_MATH_MODE=TF32
#####################################################################

module use /soft/modulefiles
module load frameworks/2023.12.15.001
module load frameworks

export NUMEXPR_NUM_THREADS=64
# This is to resolve an issue due to a package called "numexpr".
@@ -338,6 +396,36 @@ export NUMEXPR_NUM_THREADS=64
# or equal to '64' or to increase the 'NUMEXPR_MAX_THREADS' to the available
# number of threads. Both of these variables can be set manually.


## CCL setup
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_OVFLOW_BUF_SIZE=8388608
export FI_CXI_CQ_FILL_PERCENT=20

export FI_LOG_LEVEL=warn
#export FI_LOG_PROV=tcp
export FI_LOG_PROV=cxi

export CCL_KVS_GET_TIMEOUT=600

export LD_LIBRARY_PATH=$CCL_ROOT/lib:$LD_LIBRARY_PATH
export CPATH=$CCL_ROOT/include:$CPATH
export LIBRARY_PATH=$CCL_ROOT/lib:$LIBRARY_PATH

export CCL_PROCESS_LAUNCHER=pmix
export CCL_ATL_TRANSPORT=mpi
export CCL_ALLREDUCE=topo
export CCL_ALLREDUCE_SCALEOUT=rabenseifner # currently best allreduce algorithm at large scale
export CCL_BCAST=double_tree # currently best bcast algorithm at large scale

export CCL_KVS_MODE=mpi
export CCL_CONFIGURATION_PATH=""
export CCL_CONFIGURATION=cpu_gpu_dpcpp
export CCL_KVS_CONNECTION_TIMEOUT=600

export CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=1024
export CCL_KVS_USE_MPI_RANKS=1

#####################################################################
# End of environment setup section
#####################################################################
Expand All @@ -346,6 +434,7 @@ export NUMEXPR_NUM_THREADS=64
# JOB LAUNCH
######################################################################


export CCL_LOG_LEVEL="WARN"
export CPU_BIND="verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96"
HOROVOD_THREAD_AFFINITY="4,12,20,28,36,44,56,64,72,80,88,96"
27 changes: 27 additions & 0 deletions docs/polaris/applications-and-libraries/libraries/nccl.md
@@ -0,0 +1,27 @@
# NCCL

NCCL (pronounced "Nickel") is a stand-alone library of standard communication routines for GPUs, implementing all-reduce, all-gather, reduce, broadcast, reduce-scatter, as well as any send/receive based communication pattern. It has been optimized to achieve high bandwidth on platforms using PCIe, NVLink, NVswitch, as well as networking using InfiniBand Verbs or TCP/IP sockets. NCCL supports an arbitrary number of GPUs installed in a single node or across multiple nodes, and can be used in either single- or multi-process (e.g., MPI) applications.

NCCL is a key library for scaling AI applications on NVIDIA systems. The conda modules on Polaris are built with NCCL as the communication backend for distributed training, but HPC applications can also choose NCCL over MPI for communication. The library is available in the following folder: ```/soft/libraries/nccl```.
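As an illustration of selecting NCCL from an application (a sketch, not Polaris-specific guidance; it assumes the launcher or job script provides the usual `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`, and `LOCAL_RANK` environment variables):

```python
import os
import torch
import torch.distributed as dist

# Select NCCL as the torch.distributed backend; rank and world size are read from the environment.
dist.init_process_group(backend="nccl", init_method="env://")
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
```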

We have done extensive performance tests and identified the following best environment setup.

```bash
export NCCL_NET_GDR_LEVEL=PHB
export NCCL_CROSS_NIC=1
export NCCL_COLLNET_ENABLE=1
export NCCL_NET="AWS Libfabric"
export LD_LIBRARY_PATH=/soft/libraries/aws-ofi-nccl/v1.9.1-aws/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/soft/libraries/hwloc/lib/:$LD_LIBRARY_PATH
export FI_CXI_DISABLE_HOST_REGISTER=1
export FI_MR_CACHE_MONITOR=userfaultfd
export FI_CXI_DEFAULT_CQ_SIZE=131072
```
This setup leads to a 2-3x performance improvement. For details, please refer to https://github.com/argonne-lcf/alcf-nccl-tests.

As of now (October 29, 2024), this setup is included in the conda module on Polaris, as one can confirm as follows:
```bash
module load conda
env | grep NCCL
env | grep FI
```
6 changes: 3 additions & 3 deletions docs/polaris/data-science-workflows/frameworks/jax.md
@@ -9,16 +9,16 @@ JAX is installed on Polaris via the `conda` module, available with:
module load conda; conda activate
```

Then, you can load JAX in `python` as usual (below showing results from the `conda/2022-07-19` module):
Then, you can load JAX in `python` as usual (below showing results from the `conda/2024-04-29` module):

```python
>>> import jax
>>> jax.__version__
'0.3.15'
'0.4.26'
>>>
```

## Notes on JAX 0.3.15
## Notes on JAX 0.4.26

On Polaris, due to a bug, an environment variable must be set to use JAX on GPUs. The following code will crash:
```python
34 changes: 25 additions & 9 deletions docs/polaris/data-science-workflows/frameworks/pytorch.md
@@ -12,20 +12,20 @@ module load conda
conda activate
```

Then, you can load PyTorch in `python` as usual (below showing results from the `conda/2022-07-19` module):
Then, you can load PyTorch in `python` as usual (below showing results from the `conda/2024-04-29` module):

```python
>>> import torch
>>> torch.__version__
'1.12.0a0+git67ece03'
'2.3.0'
>>>
```

This installation of PyTorch was built from source and the cuda libraries it uses are found via the `CUDA_HOME` environment variable (below showing results from the `conda/2022-07-19` module):
This installation of PyTorch was built from source and the cuda libraries it uses are found via the `CUDA_HOME` environment variable (below showing results from the `conda/2024-04-29` module):

```bash
$ echo $CUDA_HOME
/soft/datascience/cuda/cuda_11.5.2_495.29.05_linux
/soft/compilers/cudatoolkit/cuda-12.4.1/
```

If you need to build applications that use this version of PyTorch and CUDA, we recommend using these CUDA libraries to ensure compatibility. We periodically update the PyTorch release; updates will come in the form of new versions of the `conda` module.
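For example, a build environment could point at the same toolkit (a sketch; the path is the `CUDA_HOME` value shown above and will change as the `conda` module is updated):

```bash
export CUDA_HOME=/soft/compilers/cudatoolkit/cuda-12.4.1/
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
nvcc --version   # should report the same CUDA release PyTorch was built against
```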
@@ -48,10 +48,20 @@ When running PyTorch applications, we have found the following practices to be g
PyTorch is compatible with scaling up to multiple GPUs per node, and across multiple nodes. Good scaling performance has been seen up to the entire Polaris system, > 2048 GPUs. Good performance with PyTorch has been seen with both DDP and Horovod. For details, please see the [Horovod documentation](https://horovod.readthedocs.io/en/stable/pytorch.html) or the [Distributed Data Parallel documentation](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html). Some Polaris-specific details that may be helpful to you:

1. CPU affinity and NCCL settings can improve scaling performance, particularly at the largest scales. In particular, we encourage users to try their scaling measurements with the following settings:
- Set the environment variable `NCCL_COLLNET_ENABLE=1`
- Set the environment variable `NCCL_NET_GDR_LEVEL=PHB`
- Manually set the CPU affinity via mpiexec, such as with `--cpu-bind verbose,list:0,8,16,24`
- We have also included the following optimal NCCL settings in the conda module:
```bash
export NCCL_NET_GDR_LEVEL=PHB
export NCCL_CROSS_NIC=1
export NCCL_COLLNET_ENABLE=1
export NCCL_NET="AWS Libfabric"
export LD_LIBRARY_PATH=/soft/libraries/aws-ofi-nccl/v1.9.1-aws/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/soft/libraries/hwloc/lib/:$LD_LIBRARY_PATH
export FI_CXI_DISABLE_HOST_REGISTER=1
export FI_MR_CACHE_MONITOR=userfaultfd
export FI_CXI_DEFAULT_CQ_SIZE=131072
```
Users no longer have to set the above environment variables themselves, and we do not suggest modifying them. For more info on NCCL, please see [NCCL](../../applications-and-libraries/libraries/nccl.md).

2. Horovod and DDP work best when you limit the visible devices to only one GPU. Note that if you import `mpi4py` or `horovod`, and then do something like `os.environ["CUDA_VISIBLE_DEVICES"] = hvd.local_rank()`, it may not actually work! You must set the `CUDA_VISIBLE_DEVICES` environment variable prior to doing `MPI.COMM_WORLD.init()`, which is done in `horovod.init()` as well as implicitly in `from mpi4py import MPI`. On Polaris specifically, you can use the environment variable `PMI_LOCAL_RANK` (as well as `PMI_LOCAL_SIZE`) to learn information about the node-local MPI ranks.
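A minimal sketch of this ordering (the print statement is only for illustration):

```python
import os

# Pin this process to one GPU *before* any MPI initialization happens.
os.environ["CUDA_VISIBLE_DEVICES"] = os.environ.get("PMI_LOCAL_RANK", "0")

from mpi4py import MPI  # MPI is initialized here, after the device is pinned

comm = MPI.COMM_WORLD
print(f"rank {comm.Get_rank()} uses GPU {os.environ['CUDA_VISIBLE_DEVICES']}")
```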

@@ -61,6 +71,12 @@ DeepSpeed is also available and usable on Polaris. For more information, please

## PyTorch `DataLoader` and multi-node Horovod

Please note there is a bug that causes a hang when using PyTorch's multithreaded data loaders with distributed training across multiple nodes. To work around this, NVIDIA recommends setting `num_workers=0` in the dataloader configuration, which serializes data loading.
For best performance, it is crucial to enable multiple workers in the data loader so that data loading and I/O overlap with compute and the dataset is loaded concurrently. This is controlled by the `num_workers` parameter of ```DataLoader``` (see https://pytorch.org/docs/stable/data.html). In our experience, 4 or 8 workers generally give the best performance. Due to the total number of CPU cores available on a node, the maximum number of workers one can choose is 16. It is always worth tuning this value to find the optimal setup for your own application.
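A minimal sketch (the toy dataset, batch size, and worker count are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset purely for illustration; replace with your own Dataset.
train_dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))

# 4 or 8 workers generally performed best in our tests; Polaris nodes support at most 16.
train_loader = DataLoader(train_dataset, batch_size=64, num_workers=8, pin_memory=True)
```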

For more details, see [Polaris Known Issues](../../known-issues.md).
Aside from this, one also has to make sure that the worker threads are spread over different CPU cores. To do this, specify the CPU binding to be `depth` and choose a depth value larger than ```num_workers``` through the following flags in the ```mpiexec``` command:

```
mpiexec -np $NUM_GPUS -ppn 4 --cpu-bind depth -d 16 python3 ...
```

Enabling multiple workers used to cause a hang, but this has been addressed after the OS upgrade on Polaris.