Updated frameworks documentation pages #502

Merged: 21 commits merged on Nov 8, 2024. Changes shown from 5 commits.
87 changes: 79 additions & 8 deletions docs/aurora/data-science/frameworks/pytorch.md
@@ -12,15 +12,15 @@ the frameworks module. To use it from a compute node, please load the following

```
module use /soft/modulefiles/
module load frameworks/2023.12.15.001
module load frameworks
```
Then you can `import` PyTorch as usual; the following is output from the
`frameworks/2023.12.15.001` module
`frameworks` module

```
>>> import torch
>>> torch.__version__
'2.0.1a0+cxx11.abi'
'2.3.1+cxx11.abi'
```
A simple but useful check could be to use PyTorch to get device information on
a compute node. You can do this the following way:
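A minimal sketch of such a check (it assumes the `frameworks` module provides `intel_extension_for_pytorch`, which exposes the `torch.xpu` namespace; outputs depend on the node):

```
>>> import torch
>>> import intel_extension_for_pytorch as ipex
>>> torch.xpu.is_available()          # True on a compute node with Intel GPUs
>>> torch.xpu.device_count()          # number of visible XPU devices (tiles)
>>> torch.xpu.get_device_name(0)      # name of the first visible device
```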
@@ -128,6 +128,10 @@ Some of the Aurora specific details might be helpful to you:
The following environment variables should be set in the batch submission
script (PBSPro script) when attempting to run on more than 16 nodes.

**oneCCL optimal setup**

Please refer to [oneCCL](./oneCCL.md) for details.

```shell
# This is a fix for running over 16 nodes:
export FI_CXI_DEFAULT_CQ_SIZE=131072
@@ -138,13 +142,49 @@ export FI_LOG_LEVEL=warn
#export FI_LOG_PROV=tcp
export FI_LOG_PROV=cxi

export MPIR_CVAR_ENABLE_GPU=0
# This is to disable certain GPU optimizations like the use of XeLinks between
# GPUs, collectives with GPU-placed data etc., in order to reduce `MPI_Init`
# overheads. Benefits are application dependent.
export CCL_KVS_GET_TIMEOUT=600

export LD_LIBRARY_PATH=$CCL_ROOT/lib:$LD_LIBRARY_PATH
export CPATH=$CCL_ROOT/include:$CPATH
export LIBRARY_PATH=$CCL_ROOT/lib:$LIBRARY_PATH

export CCL_PROCESS_LAUNCHER=pmix
export CCL_ATL_TRANSPORT=mpi
export CCL_ALLREDUCE=topo
export CCL_ALLREDUCE_SCALEOUT=rabenseifner # currently best allreduce algorithm at large scale
export CCL_BCAST=double_tree # currently best bcast algorithm at large scale

export CCL_KVS_MODE=mpi
export CCL_CONFIGURATION_PATH=""
export CCL_CONFIGURATION=cpu_gpu_dpcpp
export CCL_KVS_CONNECTION_TIMEOUT=600

export CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=1024
export CCL_KVS_USE_MPI_RANKS=1
```

Other optional settings for oneCCL:

```bash
export FI_MR_ZE_CACHE_MONITOR_ENABLED=0
export FI_MR_CACHE_MONITOR=disabled
export FI_CXI_RX_MATCH_MODE=hybrid
export FI_CXI_OFLOW_BUF_SIZE=8388608
export FI_CXI_DEFAULT_CQ_SIZE=1048576
export FI_CXI_CQ_FILL_PERCENT=30
export MPI_PROVIDER=$FI_PROVIDER
unset MPIR_CVAR_CH4_COLL_SELECTION_TUNING_JSON_FILE
unset MPIR_CVAR_COLL_SELECTION_TUNING_JSON_FILE
export INTELGT_AUTO_ATTACH_DISABLE=1
export PALS_PING_PERIOD=240
export PALS_RPC_TIMEOUT=240
export MPIR_CVAR_GATHERV_INTER_SSEND_MIN_PROCS=-1 # to solve a sync-send issue that causes a Horovod seg fault
export CCL_ATL_SYNC_COLL=1 # to avoid a potential hang at large scale
export CCL_OP_SYNC=1 # to avoid a potential hang at large scale
```

These settings will probably be included in the frameworks module file in the future. For now, users need to set them explicitly in the submission script.

In order to run an application with the `TF32` precision type, one must set the
following environment variable:

@@ -314,7 +354,7 @@ export IPEX_FP32_MATH_MODE=TF32
#####################################################################

module use /soft/modulefiles
module load frameworks/2023.12.15.001
module load frameworks

export NUMEXPR_NUM_THREADS=64
# This is to resolve an issue due to a package called "numexpr".
@@ -333,6 +373,37 @@ export NUMEXPR_NUM_THREADS=64
# JOB LAUNCH
######################################################################


## CCL setup
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_OVFLOW_BUF_SIZE=8388608
export FI_CXI_CQ_FILL_PERCENT=20

export FI_LOG_LEVEL=warn
#export FI_LOG_PROV=tcp
export FI_LOG_PROV=cxi

export CCL_KVS_GET_TIMEOUT=600

export LD_LIBRARY_PATH=$CCL_ROOT/lib:$LD_LIBRARY_PATH
export CPATH=$CCL_ROOT/include:$CPATH
export LIBRARY_PATH=$CCL_ROOT/lib:$LIBRARY_PATH

export CCL_PROCESS_LAUNCHER=pmix
export CCL_ATL_TRANSPORT=mpi
export CCL_ALLREDUCE=topo
export CCL_ALLREDUCE_SCALEOUT=rabenseifner # currently best allreduce algorithm at large scale
export CCL_BCAST=double_tree # currently best bcast algorithm at large scale

export CCL_KVS_MODE=mpi
export CCL_CONFIGURATION_PATH=""
export CCL_CONFIGURATION=cpu_gpu_dpcpp
export CCL_KVS_CONNECTION_TIMEOUT=600

export CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=1024
export CCL_KVS_USE_MPI_RANKS=1


export CCL_LOG_LEVEL="WARN"
export CPU_BIND="verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96"
HOROVOD_THREAD_AFFINITY="4,12,20,28,36,44,56,64,72,80,88,96"
97 changes: 93 additions & 4 deletions docs/aurora/data-science/frameworks/tensorflow.md
@@ -13,17 +13,19 @@ module. To use it from a compute node, please do:

```
module use /soft/modulefiles/
module load frameworks/2023.12.15.001
module load frameworks
```

Then you can `import` TensorFlow as usual; the following is output from the
`frameworks/2023.12.15.001` module:
`frameworks` module:

```
>>> import tensorflow as tf
>>> tf.__version__
'2.14.1'
```
Note that this import will fail on login nodes, which do not have XPU devices.

A simple but useful check could be to use TensorFlow to get device information
on a compute node. You can do this the following way:

@@ -213,9 +215,66 @@ export MPIR_CVAR_ENABLE_GPU=0
# This is to disable certain GPU optimizations like the use of XeLinks between
# GPUs, collectives with GPU-placed data etc., in order to reduce `MPI_Init`
# overheads. Benefits are application dependent.
```

**oneCCL optimal setup**

Please refer to [oneCCL](./oneCCL.md) for details.

```shell
# This is a fix for running over 16 nodes:
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_OVFLOW_BUF_SIZE=8388608
export FI_CXI_CQ_FILL_PERCENT=20

export FI_LOG_LEVEL=warn
#export FI_LOG_PROV=tcp
export FI_LOG_PROV=cxi

export CCL_KVS_GET_TIMEOUT=600

export LD_LIBRARY_PATH=$CCL_ROOT/lib:$LD_LIBRARY_PATH
export CPATH=$CCL_ROOT/include:$CPATH
export LIBRARY_PATH=$CCL_ROOT/lib:$LIBRARY_PATH

export CCL_PROCESS_LAUNCHER=pmix
export CCL_ATL_TRANSPORT=mpi
export CCL_ALLREDUCE=topo
export CCL_ALLREDUCE_SCALEOUT=rabenseifner # currently best allreduce algorithm at large scale
export CCL_BCAST=double_tree # currently best bcast algorithm at large scale

export CCL_KVS_MODE=mpi
export CCL_CONFIGURATION_PATH=""
export CCL_CONFIGURATION=cpu_gpu_dpcpp
export CCL_KVS_CONNECTION_TIMEOUT=600

export CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=1024
export CCL_KVS_USE_MPI_RANKS=1
```

Other optional settings for oneCCL:

```bash
export FI_MR_ZE_CACHE_MONITOR_ENABLED=0
export FI_MR_CACHE_MONITOR=disabled
export FI_CXI_RX_MATCH_MODE=hybrid
export FI_CXI_OFLOW_BUF_SIZE=8388608
export FI_CXI_DEFAULT_CQ_SIZE=1048576
export FI_CXI_CQ_FILL_PERCENT=30
export MPI_PROVIDER=$FI_PROVIDER
unset MPIR_CVAR_CH4_COLL_SELECTION_TUNING_JSON_FILE
unset MPIR_CVAR_COLL_SELECTION_TUNING_JSON_FILE
export INTELGT_AUTO_ATTACH_DISABLE=1
export PALS_PING_PERIOD=240
export PALS_RPC_TIMEOUT=240
export MPIR_CVAR_GATHERV_INTER_SSEND_MIN_PROCS=-1 # to solve a sync-send issue that causes a Horovod seg fault
export CCL_ATL_SYNC_COLL=1 # to avoid a potential hang at large scale
export CCL_OP_SYNC=1 # to avoid a potential hang at large scale
```

These settings will probably be included in the frameworks module file in the future. For now, users need to set them explicitly in the submission script.


### CPU Affinity

The CPU affinity should be set manually through mpiexec.
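For example (a sketch only: the ranks-per-node count, the binding list taken from the `CPU_BIND` value used later in the example script, and `train.py` are all illustrative; adjust them for your application):

```shell
NNODES=$(wc -l < "$PBS_NODEFILE")
NRANKS_PER_NODE=12   # assumed: one rank per GPU tile
mpiexec -n $(( NNODES * NRANKS_PER_NODE )) -ppn ${NRANKS_PER_NODE} \
  --cpu-bind verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96 \
  python3 train.py
```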
@@ -309,7 +368,6 @@ export FI_LOG_PROV=cxi
# These allow for logging from a specific provider (libfabric)

export MPIR_CVAR_ENABLE_GPU=0
export CCL_KVS_GET_TIMEOUT=600

#####################################################################
# FRAMEWORK Variables that make a performance difference
@@ -327,7 +385,7 @@ export ITEX_FP32_MATH_MODE=TF32
#####################################################################

module use /soft/modulefiles
module load frameworks/2023.12.15.001
module load frameworks

export NUMEXPR_NUM_THREADS=64
# This is to resolve an issue due to a package called "numexpr".
@@ -338,6 +396,36 @@ export NUMEXPR_NUM_THREADS=64
# or equal to '64' or to increase the 'NUMEXPR_MAX_THREADS' to the available
# number of threads. Both of these variables can be set manually.


## CCL setup
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_OVFLOW_BUF_SIZE=8388608
export FI_CXI_CQ_FILL_PERCENT=20

export FI_LOG_LEVEL=warn
#export FI_LOG_PROV=tcp
export FI_LOG_PROV=cxi

export CCL_KVS_GET_TIMEOUT=600

export LD_LIBRARY_PATH=$CCL_ROOT/lib:$LD_LIBRARY_PATH
export CPATH=$CCL_ROOT/include:$CPATH
export LIBRARY_PATH=$CCL_ROOT/lib:$LIBRARY_PATH

export CCL_PROCESS_LAUNCHER=pmix
export CCL_ATL_TRANSPORT=mpi
export CCL_ALLREDUCE=topo
export CCL_ALLREDUCE_SCALEOUT=rabenseifner # currently best allreduce algorithm at large scale
export CCL_BCAST=double_tree # currently best bcast algorithm at large scale

export CCL_KVS_MODE=mpi
export CCL_CONFIGURATION_PATH=""
export CCL_CONFIGURATION=cpu_gpu_dpcpp
export CCL_KVS_CONNECTION_TIMEOUT=600

export CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=1024
export CCL_KVS_USE_MPI_RANKS=1

#####################################################################
# End of environment setup section
#####################################################################
Expand All @@ -346,6 +434,7 @@ export NUMEXPR_NUM_THREADS=64
# JOB LAUNCH
######################################################################


export CCL_LOG_LEVEL="WARN"
export CPU_BIND="verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96"
HOROVOD_THREAD_AFFINITY="4,12,20,28,36,44,56,64,72,80,88,96"
27 changes: 27 additions & 0 deletions docs/polaris/applications-and-libraries/libraries/nccl.md
@@ -0,0 +1,27 @@
# NCCL

NCCL (pronounced "Nickel") is a stand-alone library of standard communication routines for GPUs, implementing all-reduce, all-gather, reduce, broadcast, reduce-scatter, as well as any send/receive based communication pattern. It has been optimized to achieve high bandwidth on platforms using PCIe, NVLink, NVswitch, as well as networking using InfiniBand Verbs or TCP/IP sockets. NCCL supports an arbitrary number of GPUs installed in a single node or across multiple nodes, and can be used in either single- or multi-process (e.g., MPI) applications.

NCCL is a key library for scaling AI applications on NVIDIA systems. The conda modules on Polaris are built with NCCL as the communication backend for distributed training, but HPC applications can also choose NCCL over MPI for communication. The library is available in the following folder: ```/soft/libraries/nccl```.
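As an illustration of selecting NCCL from an application (a sketch, not Polaris-specific guidance; it assumes the launcher or job script provides the usual `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`, and `LOCAL_RANK` environment variables):

```python
import os
import torch
import torch.distributed as dist

# Select NCCL as the torch.distributed backend; rank and world size are read from the environment.
dist.init_process_group(backend="nccl", init_method="env://")
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
```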

We have done extensive performance tests and identified the following best environment setup.

```bash
export NCCL_NET_GDR_LEVEL=PHB
export NCCL_CROSS_NIC=1
export NCCL_COLLNET_ENABLE=1
export NCCL_NET="AWS Libfabric"
export LD_LIBRARY_PATH=/soft/libraries/aws-ofi-nccl/v1.9.1-aws/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/soft/libraries/hwloc/lib/:$LD_LIBRARY_PATH
export FI_CXI_DISABLE_HOST_REGISTER=1
export FI_MR_CACHE_MONITOR=userfaultfd
export FI_CXI_DEFAULT_CQ_SIZE=131072
```
This setup leads to a 2-3x performance improvement. For details, please refer to https://github.com/argonne-lcf/alcf-nccl-tests.

As of now (October 29, 2024), this setup is included in the conda module on Polaris, as one can confirm as follows:
```bash
module load conda
env | grep NCCL
env | grep FI
```
6 changes: 3 additions & 3 deletions docs/polaris/data-science-workflows/frameworks/jax.md
@@ -9,16 +9,16 @@ JAX is installed on Polaris via the `conda` module, available with:
module load conda; conda activate
```

Then, you can load JAX in `python` as usual (below showing results from the `conda/2022-07-19` module):
Then, you can load JAX in `python` as usual (below showing results from the `conda/2024-04-29` module):

```python
>>> import jax
>>> jax.__version__
'0.3.15'
'0.4.26'
>>>
```

## Notes on JAX 0.3.15
## Notes on JAX 0.4.26

On Polaris, due to a bug, an environment variable must be set to use JAX on GPUs. The following code will crash:
```python
34 changes: 25 additions & 9 deletions docs/polaris/data-science-workflows/frameworks/pytorch.md
@@ -12,20 +12,20 @@ module load conda
conda activate
```

Then, you can load PyTorch in `python` as usual (below showing results from the `conda/2022-07-19` module):
Then, you can load PyTorch in `python` as usual (below showing results from the `conda/2024-04-29` module):

```python
>>> import torch
>>> torch.__version__
'1.12.0a0+git67ece03'
'2.3.0'
>>>
```

This installation of PyTorch was built from source and the cuda libraries it uses are found via the `CUDA_HOME` environment variable (below showing results from the `conda/2022-07-19` module):
This installation of PyTorch was built from source and the cuda libraries it uses are found via the `CUDA_HOME` environment variable (below showing results from the `conda/2024-04-29` module):

```bash
$ echo $CUDA_HOME
/soft/datascience/cuda/cuda_11.5.2_495.29.05_linux
/soft/compilers/cudatoolkit/cuda-12.4.1/
```

If you need to build applications that use this version of PyTorch and CUDA, we recommend using these CUDA libraries to ensure compatibility. We periodically update the PyTorch release; updates will come in the form of new versions of the `conda` module.
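For example, a build environment could point at the same toolkit (a sketch; the path is the `CUDA_HOME` value shown above and will change as the `conda` module is updated):

```bash
export CUDA_HOME=/soft/compilers/cudatoolkit/cuda-12.4.1/
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
nvcc --version   # should report the same CUDA release PyTorch was built against
```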
@@ -48,10 +48,20 @@ When running PyTorch applications, we have found the following practices to be g
PyTorch is compatible with scaling up to multiple GPUs per node, and across multiple nodes. Good scaling performance has been seen up to the entire Polaris system, > 2048 GPUs. Good performance with PyTorch has been seen with both DDP and Horovod. For details, please see the [Horovod documentation](https://horovod.readthedocs.io/en/stable/pytorch.html) or the [Distributed Data Parallel documentation](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html). Some Polaris-specific details that may be helpful to you:

1. CPU affinity and NCCL settings can improve scaling performance, particularly at the largest scales. In particular, we encourage users to try their scaling measurements with the following settings:
- Set the environment variable `NCCL_COLLNET_ENABLE=1`
- Set the environment variable `NCCL_NET_GDR_LEVEL=PHB`
- Manually set the CPU affinity via mpiexec, such as with `--cpu-bind verbose,list:0,8,16,24`
- We have also included the following optimal NCCL settings in the conda module:
```bash
export NCCL_NET_GDR_LEVEL=PHB
export NCCL_CROSS_NIC=1
export NCCL_COLLNET_ENABLE=1
export NCCL_NET="AWS Libfabric"
export LD_LIBRARY_PATH=/soft/libraries/aws-ofi-nccl/v1.9.1-aws/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/soft/libraries/hwloc/lib/:$LD_LIBRARY_PATH
export FI_CXI_DISABLE_HOST_REGISTER=1
export FI_MR_CACHE_MONITOR=userfaultfd
export FI_CXI_DEFAULT_CQ_SIZE=131072
```
Users no longer have to set the above environment variables themselves, and we do not suggest modifying them. For more info on NCCL, please see [NCCL](../../applications-and-libraries/libraries/nccl.md).

2. Horovod and DDP work best when you limit the visible devices to only one GPU. Note that if you import `mpi4py` or `horovod`, and then do something like `os.environ["CUDA_VISIBLE_DEVICES"] = hvd.local_rank()`, it may not actually work! You must set the `CUDA_VISIBLE_DEVICES` environment variable prior to doing `MPI.COMM_WORLD.init()`, which is done in `horovod.init()` as well as implicitly in `from mpi4py import MPI`. On Polaris specifically, you can use the environment variable `PMI_LOCAL_RANK` (as well as `PMI_LOCAL_SIZE`) to learn information about the node-local MPI ranks.
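A minimal sketch of this ordering (the print statement is only for illustration):

```python
import os

# Pin this process to one GPU *before* any MPI initialization happens.
os.environ["CUDA_VISIBLE_DEVICES"] = os.environ.get("PMI_LOCAL_RANK", "0")

from mpi4py import MPI  # MPI is initialized here, after the device is pinned

comm = MPI.COMM_WORLD
print(f"rank {comm.Get_rank()} uses GPU {os.environ['CUDA_VISIBLE_DEVICES']}")
```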

@@ -61,6 +71,12 @@ DeepSpeed is also available and usable on Polaris. For more information, please

## PyTorch `DataLoader` and multi-node Horovod

Please note there is a bug that causes a hang when using PyTorch's multithreaded data loaders with distributed training across multiple nodes. To work around this, NVIDIA recommends setting `num_workers=0` in the dataloader configuration, which serializes data loading.
For best performance, it is crucial to enable multiple workers in the data loader so that data loading and I/O overlap with compute and the dataset is loaded concurrently. This is controlled by the `num_workers` parameter of ```DataLoader``` (see https://pytorch.org/docs/stable/data.html). In our experience, 4 or 8 workers generally give the best performance. Due to the total number of CPU cores available on a node, the maximum number of workers one can choose is 16. It is always worth tuning this value to find the optimal setup for your own application.
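A minimal sketch (the toy dataset, batch size, and worker count are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset purely for illustration; replace with your own Dataset.
train_dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))

# 4 or 8 workers generally performed best in our tests; Polaris nodes support at most 16.
train_loader = DataLoader(train_dataset, batch_size=64, num_workers=8, pin_memory=True)
```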

For more details, see [Polaris Known Issues](../../known-issues.md).
Aside from this, one also has to make sure that the worker threads are spread over different CPU cores. To do this, specify the CPU binding to be `depth` and choose a depth value larger than ```num_workers``` through the following flags in the ```mpiexec``` command:

```
mpiexec -np $NUM_GPUS -ppn 4 --cpu-bind depth -d 16 python3 ...
```

Enabling multiple workers used to cause a hang, but this has been addressed after the OS upgrade on Polaris.