Merge branch 'main' into smartsim
rickybalin committed Nov 8, 2024
2 parents 99a1a60 + a680538 commit c5ceb1d
Showing 14 changed files with 291 additions and 119 deletions.
61 changes: 33 additions & 28 deletions docs/aurora/data-management/daos/daos-overview.md
@@ -10,7 +10,7 @@ DAOS is fully integrated with the wider Aurora compute fabric as can be seen in



# DAOS Overview
## DAOS Overview

The first step in using DAOS is to get DAOS POOL space allocated for your project.
Users should submit a request as noted below to have a DAOS pool created for your project.
@@ -66,20 +66,18 @@ Total size: 6.0 TB
Rebuild done, 4 objs, 0 recs
```
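Once the pool has been created, a quick sanity check is to query it; a minimal sketch, assuming the `daos` module used later on this page and the example pool label `datascience`:

```bash
module use /soft/modulefiles
module load daos
daos pool query datascience   # replace "datascience" with your project's pool label
# Healthy output reports total/free size and a rebuild status like the one shown above.
```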



## DAOS Container

The container is the basic unit of storage. A POSIX container can contain hundreds of millions of files, and you can use it to store all of your data.
You only need a small set of containers; perhaps just one per major unit of project work is sufficient.

There are three modes in which we can operate with DAOS containers (a creation sketch for the POSIX case follows the list):
1. Posix container Posix Mode
2. Posix Container MPI-IO Mode
1. POSIX container POSIX Mode
2. POSIX Container MPI-IO Mode
3. DFS container through DAOS APIs.
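For the POSIX case (mode 1), a minimal creation sketch; the pool and container labels are placeholders, and the exact flag spelling can vary between DAOS releases:

```bash
DAOS_POOL_NAME=datascience        # your project's pool label
DAOS_CONT_NAME=my_posix_cont      # any container label you choose

# Create a POSIX-type container in the pool.
daos container create --type=POSIX ${DAOS_POOL_NAME} ${DAOS_CONT_NAME}

# Sanity-check it, mirroring the check shown later in this section.
daos container check --pool=${DAOS_POOL_NAME} --cont=${DAOS_CONT_NAME}
```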


### Create a posix container
### Create a POSIX container


```bash
@@ -128,11 +126,11 @@ daos container check --pool=$DAOS_POOL_NAME --cont=$DAOS_CONT_NAME
```
### Mount a posix container
### Mount a POSIX container
Currently, you must manually mount your container prior to use on any node you are working on.
In the future, we hope to automate some of this via additional `qsub` options.
#### To mount a posix container on a login node
#### To mount a POSIX container on a login node
```bash
@@ -151,7 +149,7 @@ fusermount3 -u /tmp/${DAOS_POOL}/${DAOS_CONT} # To unmount
```
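A fuller sketch of the login-node mount, assuming the `DAOS_POOL` and `DAOS_CONT` variables used elsewhere on this page and a mount point under `/tmp` (pool and container labels are placeholders):

```bash
module use /soft/modulefiles
module load daos

export DAOS_POOL=datascience        # placeholder pool label
export DAOS_CONT=my_posix_cont      # placeholder container label

mkdir -p /tmp/${DAOS_POOL}/${DAOS_CONT}
# Single-process dfuse mount on the login node.
dfuse --pool=${DAOS_POOL} --container=${DAOS_CONT} -m /tmp/${DAOS_POOL}/${DAOS_CONT}
mount | grep dfuse                              # confirm the mount
ls /tmp/${DAOS_POOL}/${DAOS_CONT}               # basic sanity check
fusermount3 -u /tmp/${DAOS_POOL}/${DAOS_CONT}   # unmount when done
```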
#### To mount a posix container on Compute Nodes
#### To mount a POSIX container on Compute Nodes
You need to mount the container on all compute nodes.
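A hedged sketch of doing this from inside a batch job by starting one `dfuse` process per node (the node-count logic and launcher flags are assumptions, not taken from this page):

```bash
NNODES=$(wc -l < ${PBS_NODEFILE})

# Create the mount point and start one dfuse instance on every compute node.
mpiexec -np ${NNODES} -ppn 1 mkdir -p /tmp/${DAOS_POOL}/${DAOS_CONT}
mpiexec -np ${NNODES} -ppn 1 dfuse --pool=${DAOS_POOL} --container=${DAOS_CONT} \
        -m /tmp/${DAOS_POOL}/${DAOS_CONT}

# When the job is done, unmount on every node.
mpiexec -np ${NNODES} -ppn 1 fusermount3 -u /tmp/${DAOS_POOL}/${DAOS_CONT}
```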
@@ -196,7 +194,7 @@ CPU_BINDING1=list:4:9:14:19:20:25:56:61:66:71:74:79
## Interception library for posix containers
## Interception library for POSIX containers
The interception library (IL) is the next step in improving DAOS performance, providing kernel bypass for I/O data.
The libioil IL intercepts basic read and write POSIX calls, while all metadata calls still go through dFuse. The libpil4dfs IL should be used to intercept both data and metadata calls.
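A minimal sketch of enabling the libpil4dfs IL for an application launch (the library path matches the one used in the Darshan example below; the application name and rank counts are placeholders):

```bash
# Intercept both data and metadata POSIX calls with libpil4dfs.
mpiexec -np ${NRANKS} -ppn ${RANKS_PER_NODE} \
        --env LD_PRELOAD=/usr/lib64/libpil4dfs.so \
        ./my_app /tmp/${DAOS_POOL}/${DAOS_CONT}/output.dat
```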
@@ -377,7 +375,7 @@ Each DAOS server node is based on the Intel Coyote Pass platform.
## Darshan profiler for DAOS
Currently, you need to install your own local build of the darshan-daos profiler.
You need to use DFS mode (3) or Posix with interception library to profile
You need to use DFS mode (3) or POSIX with the interception library to profile.
```bash
module use /soft/modulefiles
@@ -403,7 +401,7 @@ cd /home/kaushikvelusamy/soft/profilers/darshan-daos/darshan-logs

```
Preload darshan first then daos interception library
Preload Darshan first, then the DAOS interception library:
```
mpiexec --env LD_PRELOAD=~/soft/profilers/darshan-daos/darshan-install/lib/libdarshan.so:/usr/lib64/libpil4dfs.so
@@ -413,7 +411,7 @@ mpiexec --env LD_PRELOAD=~/soft/profilers/darshan-daos/darshan-install/lib/libda
```
install darshan-util from laptop
Install darshan-util on your laptop:
```bash
@@ -432,7 +430,7 @@ python3 -m darshan summary ~/Downloads/kaushikv_ior_id917110-44437_10-23-55830-6
## Cluster Size
DAOS Cluster size is the number of available DAOS servers. While we are working towards bringing up the entire 1024 daos server available users, currently different number of daos nodes could be up. Please check with support or run an IOR test to get an estimate on the current number of daos servers available.
DAOS cluster size is the number of available DAOS servers. While we are working towards making all 1024 DAOS servers available to users, a varying number of DAOS nodes may be up at any given time. Please check with support or run an IOR test to estimate the number of DAOS servers currently available.
![expected Bandwidth](expectedBW.png "Expected number of daos servers and its approximate expected bandwidth")
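A hedged IOR sketch for such an estimate, doing file-per-process writes and reads through a dfuse-mounted container with the interception library (rank counts, block size, and transfer size are illustrative assumptions):

```bash
# POSIX API through the dfuse mount; -w/-r write then read, -F file-per-process.
mpiexec -np ${NRANKS} -ppn ${RANKS_PER_NODE} \
        --env LD_PRELOAD=/usr/lib64/libpil4dfs.so \
        ior -a POSIX -b 16m -t 2m -w -r -F \
            -o /tmp/${DAOS_POOL}/${DAOS_CONT}/ior_testfile
```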
@@ -441,18 +439,25 @@ DAOS Cluster size is the number of available DAOS servers. While we are working
## Best practices
```bash
Check qsub –l daos=default
Daos sanity checks mentioned above
Did you load DAOS module? module load daos
Do you have your DAOS pool allocated? daos pool query datascience
Is Daos client running on all your nodes? ps –ef | grep daos
Is your container mounted on all nodes? mount | grep dfuse
Can you ls in your container? ls /tmp/${DAOS_POOL}/${DAOS_CONT}
Did your I/O Actually fail?
What is the health property in your container? daos container get-prop $DAOS_POOL $CONT
Is your space full? Min and max daos pool query datascience
Does your query show failed targets or rebuild in process? daos pool query datascience
daos pool autotest
Daos container check

# Check that you requested DAOS
qsub -l daos=default

# Did you load the DAOS module?
module load daos

# Do you have your DAOS pool allocated?
daos pool query datascience

# Is the DAOS client running on all your nodes?
ps -ef | grep daos

# Is your container mounted on all nodes?
mount | grep dfuse

# Can you ls in your container?
ls /tmp/${DAOS_POOL}/${DAOS_CONT}

# Did your I/O actually fail?

# What is the health property in your container?
daos container get-prop $DAOS_POOL $CONT

# Is your space full? Check min and max.
daos pool query datascience

# Does your query show failed targets or rebuild in progress?
daos pool query datascience

daos pool autotest
daos container check
```
Binary file not shown.
1 change: 1 addition & 0 deletions docs/aurora/data-management/lustre/flare.md
@@ -4,3 +4,4 @@

Home is a 12 PB **Gecko** Lustre filesystem with 32 OSTs and 12 MDTs.

[Follow this link for basic information on I/O optimization for the Lustre filesystem](https://anl.box.com/s/uqmgnkn7i3z22c9xrwef8nn702wl22uy)
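As a first pass at Lustre I/O tuning, checking and adjusting the stripe layout of a project directory is a common starting point; a minimal sketch (the path, stripe count, and stripe size below are placeholders, not recommendations):

```bash
# Inspect the current stripe layout (path is a placeholder).
lfs getstripe /path/to/project/dir
# Stripe new files in this directory across 8 OSTs with 16 MiB stripes (illustrative values).
lfs setstripe -c 8 -S 16m /path/to/project/dir
```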
21 changes: 10 additions & 11 deletions docs/aurora/data-science/frameworks/oneCCL.md
@@ -18,16 +18,12 @@ kaushikvelusamy@aurora-uan-0012:~> module load frameworks
/opt/aurora/24.180.0/CNDA/oneapi/ccl/2021.13.1_20240808.145507
```


<!-- --8<-- [start:onecclenv] -->
**OneCCL mandatory environment variables**

```bash
module load frameworks
echo $CCL_ROOT
export LD_LIBRARY_PATH=$CCL_ROOT/lib:$LD_LIBRARY_PATH
export CPATH=$CCL_ROOT/include:$CPATH
export LIBRARY_PATH=$CCL_ROOT/lib:$LIBRARY_PATH
The parameters below are recommended to be set at all times, as they either give the best performance for all applications or are required to address potential hangs or crashes at large scale.

```bash
export CCL_PROCESS_LAUNCHER=pmix
export CCL_ATL_TRANSPORT=mpi
export CCL_ALLREDUCE=topo
@@ -41,9 +37,15 @@ export CCL_KVS_CONNECTION_TIMEOUT=600

export CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=1024
export CCL_KVS_USE_MPI_RANKS=1

export MPI_PROVIDER=$FI_PROVIDER
unset MPIR_CVAR_CH4_POSIX_COLL_SELECTION_TUNING_JSON_FILE
unset MPIR_CVAR_CH4_COLL_SELECTION_TUNING_JSON_FILE
unset MPIR_CVAR_COLL_SELECTION_TUNING_JSON_FILE
```

**OneCCL optional environment variables**
The impact of the following environment variables might be application dependent. Users are encouraged to try setting them and see whether they help their applications.

```bash
ulimit -c unlimited
@@ -53,17 +55,14 @@ export FI_CXI_RX_MATCH_MODE=hybrid
export FI_CXI_OFLOW_BUF_SIZE=8388608
export FI_CXI_DEFAULT_CQ_SIZE=1048576
export FI_CXI_CQ_FILL_PERCENT=30
export MPI_PROVIDER=$FI_PROVIDER
unset MPIR_CVAR_CH4_COLL_SELECTION_TUNING_JSON_FILE
unset MPIR_CVAR_COLL_SELECTION_TUNING_JSON_FILE
export INTELGT_AUTO_ATTACH_DISABLE=1
export PALS_PING_PERIOD=240
export PALS_RPC_TIMEOUT=240
export MPIR_CVAR_GATHERV_INTER_SSEND_MIN_PROCS=-1 # to work around a synchronous-send issue that causes a Horovod segfault
export CCL_ATL_SYNC_COLL=1 #to avoid potential hang at large scale
export CCL_OP_SYNC=1 #to avoid potential hang at large scale
```

<!-- --8<-- [end:onecclenv] -->

**Algorithm selection**

59 changes: 40 additions & 19 deletions docs/aurora/data-science/frameworks/pytorch.md
@@ -12,15 +12,15 @@ the frameworks module. To use it from a compute node, please load the following

```
module use /soft/modulefiles/
module load frameworks/2023.12.15.001
module load frameworks
```
Then you can `import` PyTorch as usual; the following is an output from the
`frameworks/2023.12.15.001` module
`frameworks` module

```
>>> import torch
>>> torch.__version__
'2.0.1a0+cxx11.abi'
'2.3.1+cxx11.abi'
```
A simple but useful check could be to use PyTorch to get device information on
a compute node. You can do this in the following way:
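A minimal sketch of such a check, assuming the frameworks module provides Intel Extension for PyTorch (`intel_extension_for_pytorch`) and the XPU backend:

```bash
module use /soft/modulefiles/
module load frameworks
# Query the XPU (GPU) devices visible to PyTorch.
python - <<'EOF'
import torch
import intel_extension_for_pytorch as ipex  # assumed to ship with the frameworks module
print(torch.__version__)
print("XPU available:", torch.xpu.is_available())
print("XPU device count:", torch.xpu.device_count())
EOF
```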
@@ -128,22 +128,12 @@ Some of the Aurora specific details might be helpful to you:
The following environment variables should be set in the batch submission
script (PBSPro script) when attempting to run beyond 16 nodes.

```shell
# This is a fix for running over 16 nodes:
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_OFLOW_BUF_SIZE=8388608
export FI_CXI_CQ_FILL_PERCENT=20
<!-- --8<-- [start:commononecclenv] -->
#### oneCCL environment variables
--8<-- "./docs/aurora/data-science/frameworks/oneCCL.md:onecclenv"

export FI_LOG_LEVEL=warn
#export FI_LOG_PROV=tcp
export FI_LOG_PROV=cxi

export MPIR_CVAR_ENABLE_GPU=0
# This is to disable certain GPU optimizations like the use of XeLinks between
# GPUs, collectives with GPU-placed data etc., in order to reduce `MPI_Init`
# overheads. Benefits are application dependent.
export CCL_KVS_GET_TIMEOUT=600
```
These environment variable settings will probably be included in the frameworks module file in the future, but for now users need to set them explicitly in the submission script.
<!-- --8<-- [end:commononecclenv] -->

In order to run an application with the `TF32` precision type, one must set the
following environment variable:
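For example (this same setting appears in the job-script example later on this page):

```bash
# Ask IPEX to use TF32 math mode for FP32 operations.
export IPEX_FP32_MATH_MODE=TF32
```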
@@ -314,7 +304,7 @@ export IPEX_FP32_MATH_MODE=TF32
#####################################################################

module use /soft/modulefiles
module load frameworks/2023.12.15.001
module load frameworks

export NUMEXPR_NUM_THREADS=64
# This is to resolve an issue due to a package called "numexpr".
@@ -333,6 +323,37 @@ export NUMEXPR_NUM_THREADS=64
# JOB LAUNCH
######################################################################


## CCL setup
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_OFLOW_BUF_SIZE=8388608
export FI_CXI_CQ_FILL_PERCENT=20

export FI_LOG_LEVEL=warn
#export FI_LOG_PROV=tcp
export FI_LOG_PROV=cxi

export CCL_KVS_GET_TIMEOUT=600

export LD_LIBRARY_PATH=$CCL_ROOT/lib:$LD_LIBRARY_PATH
export CPATH=$CCL_ROOT/include:$CPATH
export LIBRARY_PATH=$CCL_ROOT/lib:$LIBRARY_PATH

export CCL_PROCESS_LAUNCHER=pmix
export CCL_ATL_TRANSPORT=mpi
export CCL_ALLREDUCE=topo
export CCL_ALLREDUCE_SCALEOUT=rabenseifner # currently best allreduce algorithm at large scale
export CCL_BCAST=double_tree # currently best bcast algorithm at large scale

export CCL_KVS_MODE=mpi
export CCL_CONFIGURATION_PATH=""
export CCL_CONFIGURATION=cpu_gpu_dpcpp
export CCL_KVS_CONNECTION_TIMEOUT=600

export CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=1024
export CCL_KVS_USE_MPI_RANKS=1


export CCL_LOG_LEVEL="WARN"
export CPU_BIND="verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96"
HOROVOD_THREAD_AFFINITY="4,12,20,28,36,44,56,64,72,80,88,96"