Intel Solutions for the HPC Toolkit
The pfs-daos.yaml blueprint describes an environment with
- Two DAOS server instances
- Two DAOS client instances
The pfs-daos.yaml blueprint uses a Packer template and Terraform modules from the Google Cloud DAOS repository. Please review the introduction to image building for general information on building custom images using the Toolkit.
Identify a project to work in and substitute its unique id wherever you see <<PROJECT_ID>> in the instructions below.
Before provisioning the DAOS cluster you must follow the steps listed in the Google Cloud DAOS Pre-deployment Guide.
Skip the "Build DAOS Images" step at the end of the Pre-deployment Guide. The pfs-daos.yaml blueprint will build the images as part of the deployment.
The Pre-deployment Guide provides instructions for:
- installing the Google Cloud CLI
- enabling service accounts
- enabling APIs
- establishing minimum resource quotas
- creating a Cloud NAT to allow instances without public IPs to access the DAOS Yum repository
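If you prefer to script parts of this preparation, the commands below are a minimal sketch of enabling APIs and creating a Cloud NAT. The API list, network, region, and resource names are assumptions for illustration only; the Pre-deployment Guide remains the authoritative reference.
# Enable core APIs (the exact list required is described in the Pre-deployment Guide)
gcloud services enable compute.googleapis.com iam.googleapis.com --project=<<PROJECT_ID>>
# Create a Cloud Router and Cloud NAT so instances without public IPs can reach the DAOS Yum repository
gcloud compute routers create daos-nat-router --network=default --region=us-central1 --project=<<PROJECT_ID>>
gcloud compute routers nats create daos-nat --router=daos-nat-router --region=us-central1 --auto-allocate-nat-external-ips --nat-all-subnet-ip-ranges --project=<<PROJECT_ID>>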
After completing the steps in the Pre-deployment Guide, use ghpc to provision the blueprint:
ghpc create community/examples/intel/pfs-daos.yaml \
--vars project_id=<<PROJECT_ID>> \
[--backend-config bucket=<GCS tf backend bucket>]
This will create the deployment directory containing Terraform modules and Packer templates. The --backend-config option is not required but recommended. It will save the Terraform state in a pre-existing Google Cloud Storage bucket. For more information see Setting up a remote terraform state.
Use ghpc deploy to provision your DAOS storage cluster:
ghpc deploy pfs-daos --auto-approve
- Open the following URL in a new tab: https://console.cloud.google.com/compute
  This will take you to Compute Engine > VM instances in the Google Cloud Console. Select the project in which the DAOS cluster will be provisioned.
- Click on the SSH button associated with the daos-client-0001 instance to open a window with a terminal into the first DAOS client instance.
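If you prefer to connect from a terminal instead of the console, a gcloud equivalent is sketched below; the zone is an assumption and must match the zone used by your deployment.
gcloud compute ssh daos-client-0001 --zone=us-central1-a --project=<<PROJECT_ID>>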
The community/examples/intel/pfs-daos.yaml blueprint does not contain configuration for DAOS pools and containers. Therefore, pools and containers will need to be created manually.
Before pools and containers can be created, the storage system must be formatted. Formatting the storage is done automatically by the startup script that runs on the daos-server-0001 instance. The startup script runs the dmg storage format command. It may take a few minutes for all DAOS server instances to join.
Verify that the storage system has been formatted and that the daos-server instances have joined.
sudo dmg system query -v
The command will not return output until the system is ready.
The output will look similar to
Rank UUID                                 Control Address   Fault Domain      State  Reason
---- ----                                 ---------------   ------------      -----  ------
0    225a0a51-d4ed-4ac3-b1a5-04b31c08b559 10.128.0.51:10001 /daos-server-0001 Joined
1    553ab1dc-99af-460e-a57c-3350611d1d09 10.128.0.43:10001 /daos-server-0002 Joined
Both daos-server instances should show a state of Joined.
The DAOS Management tool dmg is used by System Administrators to manage the DAOS storage system and DAOS pools. Therefore, sudo must be used when running dmg.
The DAOS CLI daos is used by both users and System Administrators to create and manage containers. It is not necessary to use sudo with the daos command.
View how much free space is available.
sudo dmg storage query usage
Create a single pool owned by root which uses 100% of the available free space.
sudo dmg pool create --size=100% --user=root pool1
Set ACLs to allow any user to create a container in pool1.
sudo dmg pool update-acl -e A::EVERYONE@:rcta pool1
See the Pool Operations section of the DAOS Administration Guide for more information about creating pools.
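To confirm that the pool was created and that the ACL entry was applied, the standard dmg query subcommands can be used, for example (a sketch; output formats vary by DAOS version):
sudo dmg pool query pool1
sudo dmg pool get-acl pool1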
At this point it is necessary to determine who will need to access the container and how it will be used. The ACLs will need to be set properly to allow users and/or groups to access the container.
For the purpose of this demo create the container without specifying ACLs. The container will be owned by your user account and you will have full access to the container.
daos container create --type=POSIX --properties=rf:0 pool1 cont1
See the Container Management section of the DAOS User Guide for more information about creating containers.
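To confirm that the container exists and to inspect its properties, the daos listing and query subcommands can be used, for example:
daos container list pool1
daos container query pool1 cont1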
Mount the container with dfuse (DAOS Fuse)
mkdir -p "${HOME}/daos/cont1"
dfuse --singlethread --pool=pool1 --container=cont1 --mountpoint="${HOME}/daos/cont1"
Verify that the container is mounted
df -h -t fuse.daos
The cont1 container is now mounted on ${HOME}/daos/cont1
Create a 20GiB file which will be stored in the DAOS filesystem.
time LD_PRELOAD=/usr/lib64/libioil.so \
dd if=/dev/zero of="${HOME}/daos/cont1/test20GiB.img" iflag=fullblock bs=1G count=20
Known Issue:
When you run ls -lh "${HOME}/daos/cont1" you may see that the test20GiB.img file shows a size of 0 bytes. If you unmount the container and mount it again, the file size will show as 20G.
fusermount3 -u "${HOME}/daos/cont1"
dfuse --singlethread --pool=pool1 --container=cont1 --mountpoint="${HOME}/daos/cont1"
ls -lh "${HOME}/daos/cont1"
A work-around for this issue is to disable caching when mounting the container:
dfuse --singlethread --disable-caching --pool=pool1 --container=cont1 --mountpoint="${HOME}/daos/cont1"
See the File System section of the DAOS User Guide for more information about DFuse.
The container will need to be unmounted before you log out. If this is not done it can leave open file handles and prevent the container from being mounted when you log in again.
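Unmount the container:
fusermount3 -u "${HOME}/daos/cont1"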
Verify that the container is unmounted
df -h -t fuse.daos
Log out of the DAOS client instance:
logout
See the DFuse (DAOS FUSE) section of the DAOS User Guide for more information about mounting POSIX containers.
NOTE: Data stored in the DAOS container will be permanently lost after cluster deletion.
Delete the remaining infrastructure:
ghpc destroy pfs-daos --auto-approve
The hpc-slurm-daos.yaml blueprint can be used to deploy a Slurm cluster and four DAOS server instances. The Slurm compute instances are configured as DAOS clients.
The blueprint uses modules from
- google-cloud-daos
- community/modules/compute/schedmd-slurm-gcp-v6-nodeset
- community/modules/compute/schedmd-slurm-gcp-v6-partition
- community/modules/scheduler/schedmd-slurm-gcp-v6-login
- community/modules/scheduler/schedmd-slurm-gcp-v6-controller
The blueprint also uses a Packer template from the Google Cloud DAOS repository. Please review the introduction to image building for general information on building custom images using the Toolkit.
Substitute your project ID wherever you see <<PROJECT_ID>> in the instructions below.
Before provisioning the DAOS cluster you must follow the steps listed in the Google Cloud DAOS Pre-deployment Guide.
Skip the "Build DAOS Images" step at the end of the Pre-deployment Guide. The hpc-slurm-daos.yaml blueprint will build the DAOS server image as part of the deployment.
The Pre-deployment Guide provides instructions for enabling service accounts, APIs, establishing minimum resource quotas and other necessary steps to prepare your project for DAOS server deployment.
Follow the Toolkit guidance to enable APIs and establish minimum resource quotas for Slurm.
The following available quota is required in the region used by Slurm:
- Filestore: 2560GB
- C2 CPUs: 6000 (fully-scaled "compute" partition)
  - This quota is not necessary at initial deployment, but will be required to successfully scale the partition to its maximum size
- C2 CPUs: 4 (login node)
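One way to review the current Compute Engine quota for the region from the command line is to describe the region; the quotas list in the output includes metrics such as C2_CPUS with their limits and usage. The region below is an assumption. Filestore capacity quota is tracked under the Filestore API and is easiest to confirm in the Cloud Console.
gcloud compute regions describe us-central1 --project=<<PROJECT_ID>>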
Use ghpc to provision the blueprint, supplying your project ID:
ghpc create community/examples/intel/hpc-slurm-daos.yaml \
--vars project_id=<<PROJECT_ID>> \
[--backend-config bucket=<GCS tf backend bucket>]
This will create a set of directories containing Terraform modules and Packer templates.
The --backend-config option is not required but recommended. It will save the Terraform state in a pre-existing Google Cloud Storage bucket. For more information see Setting up a remote terraform state.
Follow the ghpc instructions to deploy the environment:
ghpc deploy hpc-slurm-daos --auto-approve
Once the startup script has completed and Slurm reports readiness, connect to the login node.
- Open the following URL in a new tab: https://console.cloud.google.com/compute
  This will take you to Compute Engine > VM instances in the Google Cloud Console. Select the project in which the cluster will be provisioned.
- Click on the SSH button associated with the hpcslurmda-login-login-001 instance. This will open a separate pop-up window with a terminal into the newly created Slurm login VM.
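Before continuing, you can optionally confirm that Slurm is ready by listing the partitions from the login node; the compute partition defined by the blueprint should appear in the output.
sinfo
If the command is unavailable or reports errors, the Slurm startup scripts may still be running; wait a few minutes and try again.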
The community/examples/intel/hpc-slurm-daos.yaml blueprint defines a single DAOS pool named pool1. The pool will be created when the daos-server instances are provisioned.
You will need to create your own DAOS container in the pool that can be used by your Slurm jobs.
While logged into the login node, create a container named cont1 in the pool1 pool:
daos cont create --type=POSIX --properties=rf:0 pool1 cont1
NOTE: If you encounter an error daos: command not found, it's likely that the startup scripts have not finished running yet. Wait a few minutes and try again.
Since the cont1 container is owned by your account, your Slurm jobs will need to run as your user account to access the container.
Create a mount point for the container and mount it with dfuse (DAOS Fuse)
mkdir -p ${HOME}/daos/cont1
dfuse --singlethread \
--pool=pool1 \
--container=cont1 \
--mountpoint=${HOME}/daos/cont1
Verify that the container is mounted
df -h -t fuse.daos
On the login node, create a daos_job.sh file with the following content:
#!/bin/bash
JOB_HOSTNAME="$(hostname)"
TIMESTAMP="$(date '+%Y%m%d%H%M%S')"
echo "Timestamp = ${TIMESTAMP}"
echo "Date = $(date)"
echo "Hostname = $(hostname)"
echo "User = $(whoami)"
echo "Working Directory = $(pwd)"
echo ""
echo "Number of Nodes Allocated = $SLURM_JOB_NUM_NODES"
echo "Number of Tasks Allocated = $SLURM_NTASKS"
MOUNT_DIR="${HOME}/daos/cont1"
LOG_FILE="${MOUNT_DIR}/${TIMESTAMP}_${JOB_HOSTNAME}.log"
echo "${JOB_HOSTNAME} : Creating directory: ${MOUNT_DIR}"
mkdir -p "${MOUNT_DIR}"
echo "${JOB_HOSTNAME} : Mounting with dfuse"
dfuse --singlethread --pool=pool1 --container=cont1 --mountpoint="${MOUNT_DIR}"
sleep 5
echo "${JOB_HOSTNAME} : Creating log file"
echo "Job ${SLURM_JOB_ID} running on ${JOB_HOSTNAME}" | tee "${MOUNT_DIR}/${TIMESTAMP}_${JOB_HOSTNAME}.log"
echo "${JOB_HOSTNAME} : Unmounting dfuse"
fusermount3 -u "${MOUNT_DIR}"
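Make the script executable before submitting it; this assumes daos_job.sh was saved in your current working directory on the login node.
chmod +x daos_job.sh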
Run the daos_job.sh script in an interactive Slurm job on 4 nodes:
srun --nodes=4 \
--ntasks-per-node=1 \
--time=00:10:00 \
--job-name=daos \
--output=srunjob_%j.log \
--partition=compute \
daos_job.sh &
Run squeue to see the status of the job. The daos_job.sh script will run once on each of the 4 nodes. Each time it runs, it creates a log file which is stored in the cont1 DAOS container.
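The srun command also writes the job's standard output to srunjob_<job id>.log in the submission directory (from the --output=srunjob_%j.log option). You can inspect it while the job runs, for example:
tail srunjob_*.log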
Wait for the job to complete and then view the files that were created in the cont1 DAOS container mounted on ${HOME}/daos/cont1.
ls -l ${HOME}/daos/cont1/*.log
cat ${HOME}/daos/cont1/*.log
The container will need to be unmounted before you log out. If this is not done it can leave open file handles and prevent the container from being mounted when you log in again.
fusermount3 -u ${HOME}/daos/cont1
Verify that the container is unmounted
df -h -t fuse.daos
See the DFuse (DAOS FUSE) section of the DAOS User Guide for more information about mounting POSIX containers.
NOTE:
- Data on the DAOS file system will be permanently lost after cluster deletion.
- If the Slurm controller is shut down before the auto-scale instances are destroyed, those compute instances will be left running.
Open your browser to the VM instances page and ensure that instances named "compute" have been shut down and deleted by the Slurm autoscaler.
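One way to check this from the command line is to list instances whose names contain "compute"; an empty result means the autoscaled compute nodes have been removed. The filter expression is an assumption based on the node naming used by this blueprint.
gcloud compute instances list --project=<<PROJECT_ID>> --filter="name~compute"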
Delete the remaining infrastructure:
ghpc destroy hpc-slurm-daos --auto-approve