Intel Solutions for the HPC Toolkit

DAOS Cluster

The pfs-daos.yaml blueprint describes an environment with

  • Two DAOS server instances
  • Two DAOS client instances

The pfs-daos.yaml blueprint uses a Packer template and Terraform modules from the Google Cloud DAOS repository. Please review the introduction to image building for general information on building custom images using the Toolkit.

Identify a project to work in and substitute its unique ID wherever you see <<PROJECT_ID>> in the instructions below.

Initial Setup for DAOS Cluster

Before provisioning the DAOS cluster you must follow the steps listed in the Google Cloud DAOS Pre-deployment Guide.

Skip the "Build DAOS Images" step at the end of the Pre-deployment Guide. The pfs-daos.yaml blueprint will build the images as part of the deployment.

The Pre-deployment Guide provides instructions for:

  • installing the Google Cloud CLI
  • enabling service accounts
  • enabling APIs
  • establishing minimum resource quotas
  • creating a Cloud NAT to allow instances without public IPs to access the DAOS Yum repository.

Deploy the DAOS Cluster

After completing the steps in the Pre-deployment Guide, use ghpc to create the deployment from the blueprint:

ghpc create community/examples/intel/pfs-daos.yaml  \
  --vars project_id=<<PROJECT_ID>> \
  [--backend-config bucket=<GCS tf backend bucket>]

This will create the deployment directory containing Terraform modules and Packer templates. The --backend-config option is optional but recommended: it saves the Terraform state in a pre-existing Google Cloud Storage bucket. For more information see Setting up a remote terraform state. Use ghpc deploy to provision your DAOS storage cluster:

ghpc deploy pfs-daos --auto-approve
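As a rough sketch of how the optional --backend-config flag fits into the create command, the helper below assembles and prints the command line instead of running it. The function name print_ghpc_create, the project ID and the bucket name are made up for illustration; only the ghpc arguments themselves come from the instructions above.

```shell
# Hypothetical helper: print the ghpc create command for a blueprint.
# The backend bucket argument is optional and only added when provided.
print_ghpc_create() {
  blueprint="$1"
  project_id="$2"
  bucket="${3:-}"

  set -- ghpc create "$blueprint" --vars "project_id=${project_id}"
  if [ -n "$bucket" ]; then
    set -- "$@" --backend-config "bucket=${bucket}"
  fi
  printf '%s\n' "$*"
}

# Example with placeholder values (not real resources):
print_ghpc_create community/examples/intel/pfs-daos.yaml my-project my-tf-state-bucket
```

Leaving out the third argument prints the command without the --backend-config flag, which keeps the Terraform state local to the deployment directory.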

Connect to a client node

  1. Open the following URL in a new tab.

    https://console.cloud.google.com/compute

    This will take you to Compute Engine > VM instances in the Google Cloud Console.

    Select the project in which the DAOS cluster will be provisioned.

  2. Click on the SSH button associated with the daos-client-0001 instance to open a window with a terminal into the first DAOS client instance.

Verify the DAOS storage system

The community/examples/intel/pfs-daos.yaml blueprint does not contain configuration for DAOS pools and containers. Therefore, pools and containers will need to be created manually.

Before pools and containers can be created the storage system must be formatted. Formatting the storage is done automatically by the startup script that runs on the daos-server-0001 instance. The startup script will run the dmg storage format command. It may take a few minutes for all DAOS server instances to join.

Verify that the storage system has been formatted and that the daos-server instances have joined.

sudo dmg system query -v

The command will not return output until the system is ready.

The output will look similar to

Rank UUID                                 Control Address   Fault Domain      State  Reason
---- ----                                 ---------------   ------------      -----  ------
0    225a0a51-d4ed-4ac3-b1a5-04b31c08b559 10.128.0.51:10001 /daos-server-0001 Joined
1    553ab1dc-99af-460e-a57c-3350611d1d09 10.128.0.43:10001 /daos-server-0002 Joined

Both daos-server instances should show a state of Joined.
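If you would rather check the state from a script than by eye, a minimal sketch is to count the rows whose State column reads Joined. The helper name count_joined_ranks is hypothetical, and it assumes the column layout shown in the sample output above; other DAOS versions may print the table differently.

```shell
# Hypothetical helper: count ranks in the "Joined" state.
# Reads `dmg system query -v` style output on stdin.
# Data rows start with a numeric rank; State is the fifth field.
count_joined_ranks() {
  awk '$1 ~ /^[0-9]+$/ && $5 == "Joined" { n++ } END { print n + 0 }'
}

# Demo on the sample output shown above (no cluster required):
count_joined_ranks <<'EOF'
Rank UUID                                 Control Address   Fault Domain      State  Reason
---- ----                                 ---------------   ------------      -----  ------
0    225a0a51-d4ed-4ac3-b1a5-04b31c08b559 10.128.0.51:10001 /daos-server-0001 Joined
1    553ab1dc-99af-460e-a57c-3350611d1d09 10.128.0.43:10001 /daos-server-0002 Joined
EOF
# prints 2
```

On a client node the same function can be fed live data by piping the output of sudo dmg system query -v into it; for this blueprint the expected count is 2.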

Create a DAOS Pool and Container

About the DAOS Command Line Tools

The DAOS Management tool dmg is used by System Administrators to manage the DAOS storage system and DAOS pools. Therefore, sudo must be used when running dmg.

The DAOS CLI daos is used by both users and System Administrators to create and manage containers. It is not necessary to use sudo with the daos command.

View Free Space

View how much free space is available.

sudo dmg storage query usage

Create a Pool

Create a single pool owned by root which uses 100% of the available free space.

sudo dmg pool create --size=100% --user=root  pool1

Set ACLs to allow any user to create a container in pool1.

sudo dmg pool update-acl -e A::EVERYONE@:rcta pool1

See the Pool Operations section of the DAOS Administration Guide for more information about creating pools.
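The ACL entry A::EVERYONE@:rcta packs four colon-separated fields into one string. As far as the DAOS access control documentation describes it, these are the entry type (A for allow), optional flags (empty here), the principal (EVERYONE@), and the permissions (r read, c create container, t get properties, a get ACL). The sketch below only splits the string so the fields are easier to see; describe_ace is a hypothetical name and not part of the DAOS tools.

```shell
# Hypothetical helper: split a DAOS ACL entry into its colon-separated fields.
# Field order assumed here: TYPE:FLAGS:PRINCIPAL:PERMISSIONS
describe_ace() {
  IFS=: read -r ace_type ace_flags ace_principal ace_perms <<EOF
$1
EOF
  printf 'type=%s flags=%s principal=%s perms=%s\n' \
    "$ace_type" "$ace_flags" "$ace_principal" "$ace_perms"
}

describe_ace 'A::EVERYONE@:rcta'
# prints: type=A flags= principal=EVERYONE@ perms=rcta
```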

Create a Container

At this point it is necessary to determine who will need to access the container and how it will be used. The ACLs will need to be set properly to allow users and/or groups to access the container.

For the purpose of this demo create the container without specifying ACLs. The container will be owned by your user account and you will have full access to the container.

daos container create --type=POSIX --properties=rf:0 pool1 cont1

See the Container Management section of the DAOS User Guide for more information about creating containers.

Mount the DAOS Container

Mount the container with dfuse (DAOS Fuse)

mkdir -p "${HOME}/daos/cont1"
dfuse --singlethread --pool=pool1 --container=cont1 --mountpoint="${HOME}/daos/cont1"

Verify that the container is mounted

df -h -t fuse.daos
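For scripted checks, a minimal sketch is to look for the mount point in /proc/mounts with the fuse.daos file system type, which is the same type the df command above filters on. The helper name is_dfuse_mounted is hypothetical. It assumes the mount path contains no spaces, since /proc/mounts escapes those.

```shell
# Hypothetical helper: succeed if the given path is listed as a fuse.daos mount.
# Reads /proc/mounts style lines (device mountpoint fstype options ...) on stdin.
is_dfuse_mounted() {
  awk -v mp="$1" '$2 == mp && $3 == "fuse.daos" { found = 1 } END { exit !found }'
}

# Typical use on the client node:
if is_dfuse_mounted "${HOME}/daos/cont1" < /proc/mounts; then
  echo "cont1 is mounted"
else
  echo "cont1 is not mounted"
fi
```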

Use DAOS Storage

The cont1 container is now mounted on ${HOME}/daos/cont1

Create a 20GiB file which will be stored in the DAOS filesystem.

time LD_PRELOAD=/usr/lib64/libioil.so \
dd if=/dev/zero of="${HOME}/daos/cont1/test20GiB.img" iflag=fullblock bs=1G count=20
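For reference, dd writes count blocks of bs bytes each, so bs=1G with count=20 should produce a file of 20 GiB. The short calculation below converts that to bytes for comparison with what the file system reports, for example through stat -c %s on systems with GNU stat. The function name gib_to_bytes is made up for this sketch.

```shell
# Hypothetical helper: convert a size in GiB to bytes (1 GiB = 1024^3 bytes).
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

gib_to_bytes 20
# prints 21474836480
```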

Known Issue:

When you run ls -lh "${HOME}/daos/cont1" you may see that the test20GiB.img file shows a size of 0 bytes.

If you unmount the container and mount it again, the file size will show as 20G.

fusermount3 -u "${HOME}/daos/cont1"
dfuse --singlethread --pool=pool1 --container=cont1 --mountpoint="${HOME}/daos/cont1"
ls -lh "${HOME}/daos/cont1"

A work-around for this issue is to disable caching when mounting the container.

dfuse --singlethread --disable-caching --pool=pool1 --container=cont1 --mountpoint="${HOME}/daos/cont1"

See the File System section of the DAOS User Guide for more information about DFuse.

Unmount the DAOS Container

The container will need to be unmounted before you log out. If this is not done it can leave open file handles and prevent the container from being mounted when you log in again.

fusermount3 -u "${HOME}/daos/cont1"

Verify that the container is unmounted

df -h -t fuse.daos

Log out of the DAOS client instance.

logout

See the DFuse (DAOS FUSE) section of the DAOS User Guide for more information about mounting POSIX containers.

Delete the DAOS infrastructure when not in use

NOTE: Data stored in the DAOS container will be permanently lost after cluster deletion.

Delete the remaining infrastructure

ghpc destroy pfs-daos --auto-approve

DAOS Server with Slurm cluster

The hpc-slurm-daos.yaml blueprint can be used to deploy a Slurm cluster and four DAOS server instances. The Slurm compute instances are configured as DAOS clients.

The blueprint uses modules from

The blueprint also uses a Packer template from the Google Cloud DAOS repository. Please review the introduction to image building for general information on building custom images using the Toolkit.

Substitute your project ID wherever you see <<PROJECT_ID>> in the instructions below.

Initial Setup for the DAOS/Slurm cluster

Before provisioning the DAOS cluster you must follow the steps listed in the Google Cloud DAOS Pre-deployment Guide.

Skip the "Build DAOS Images" step at the end of the Pre-deployment Guide. The hpc-slurm-daos.yaml blueprint will build the DAOS server image as part of the deployment.

The Pre-deployment Guide provides instructions for enabling service accounts and APIs, establishing minimum resource quotas, and completing the other steps needed to prepare your project for DAOS server deployment.

Follow the Toolkit guidance to enable APIs and establish minimum resource quotas for Slurm.

The following available quota is required in the region used by Slurm:

  • Filestore: 2560GB
  • C2 CPUs: 6000 (fully-scaled "compute" partition)
    • This quota is not necessary at initial deployment, but will be required to successfully scale the partition to its maximum size
  • C2 CPUs: 4 (login node)

Deploy the DAOS/Slurm Cluster

Use ghpc to provision the blueprint, supplying your project ID

ghpc create community/examples/intel/hpc-slurm-daos.yaml \
  --vars project_id=<<PROJECT_ID>> \
  [--backend-config bucket=<GCS tf backend bucket>]

This will create a set of directories containing Terraform modules and Packer templates.

The --backend-config option is optional but recommended: it saves the Terraform state in a pre-existing Google Cloud Storage bucket. For more information see Setting up a remote terraform state.

Follow ghpc instructions to deploy the environment

ghpc deploy hpc-slurm-daos --auto-approve

Connect to the DAOS/Slurm Cluster login node

Once the startup script has completed and Slurm reports readiness, connect to the login node.

  1. Open the following URL in a new tab.

    https://console.cloud.google.com/compute

    This will take you to Compute Engine > VM instances in the Google Cloud Console.

    Select the project in which the cluster will be provisioned.

  2. Click on the SSH button associated with the hpcslurmda-login-login-001 instance.

    This will open a separate pop-up window with a terminal into the newly created Slurm login VM.

Create and Mount a DAOS Container

The community/examples/intel/hpc-slurm-daos.yaml blueprint defines a single DAOS pool named pool1. The pool will be created when the daos-server instances are provisioned.

You will need to create your own DAOS container in the pool that can be used by your Slurm jobs.

While logged into the login node create a container named cont1 in the pool1 pool:

daos cont create --type=POSIX --properties=rf:0 pool1 cont1

NOTE: If you encounter an error daos: command not found, it's likely that the startup scripts have not finished running yet. Wait a few minutes and try again.
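Instead of retrying by hand, a small polling loop can wait until the command appears on the PATH. This is only a sketch: wait_for_command is a hypothetical helper, and the attempt count and delay are arbitrary defaults, not values taken from the blueprint.

```shell
# Hypothetical helper: wait until a command is available on the PATH.
# Usage: wait_for_command NAME [ATTEMPTS] [DELAY_SECONDS]
# Returns 0 as soon as the command is found, 1 if all attempts are used up.
wait_for_command() {
  cmd="$1"
  attempts="${2:-30}"
  delay="${3:-10}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if command -v "$cmd" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Quick demo with a command that is always present:
wait_for_command sh 1 0 && echo "sh is available"
```

On the login node, running wait_for_command daos before the daos cont create command would block for up to five minutes with the defaults above. If the startup scripts change the PATH through a profile script, a fresh login shell may still be needed.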

Since the cont1 container is owned by your account, your Slurm jobs will need to run as your user account to access the container.

Create a mount point for the container and mount it with dfuse (DAOS Fuse)

mkdir -p "${HOME}/daos/cont1"

dfuse --singlethread \
  --pool=pool1 \
  --container=cont1 \
  --mountpoint="${HOME}/daos/cont1"

Verify that the container is mounted

df -h -t fuse.daos

Run a Job that uses the DAOS Container

On the login node create a daos_job.sh file with the following content

#!/bin/bash
JOB_HOSTNAME="$(hostname)"
TIMESTAMP="$(date '+%Y%m%d%H%M%S')"

echo "Timestamp         = ${TIMESTAMP}"
echo "Date              = $(date)"
echo "Hostname          = $(hostname)"
echo "User              = $(whoami)"
echo "Working Directory = $(pwd)"
echo ""
echo "Number of Nodes Allocated = $SLURM_JOB_NUM_NODES"
echo "Number of Tasks Allocated = $SLURM_NTASKS"

MOUNT_DIR="${HOME}/daos/cont1"
LOG_FILE="${MOUNT_DIR}/${TIMESTAMP}_${JOB_HOSTNAME}.log"

echo "${JOB_HOSTNAME} : Creating directory: ${MOUNT_DIR}"
mkdir -p "${MOUNT_DIR}"

echo "${JOB_HOSTNAME} : Mounting with dfuse"
dfuse --singlethread --pool=pool1 --container=cont1 --mountpoint="${MOUNT_DIR}"
sleep 5

echo "${JOB_HOSTNAME} : Creating log file"
echo "Job ${SLURM_JOB_ID} running on ${JOB_HOSTNAME}" | tee "${LOG_FILE}"

echo "${JOB_HOSTNAME} : Unmounting dfuse"
fusermount3 -u "${MOUNT_DIR}"

Run the daos_job.sh script in an interactive Slurm job on 4 nodes

srun --nodes=4 \
  --ntasks-per-node=1 \
  --time=00:10:00 \
  --job-name=daos \
  --output=srunjob_%j.log \
  --partition=compute \
  daos_job.sh &

Run squeue to see the status of the job. The daos_job.sh script will run once on each of the 4 nodes. Each time it runs it creates a log file which is stored in the cont1 DAOS container.

Wait for the job to complete and then view the files that were created in the cont1 DAOS container mounted on ${HOME}/daos/cont1.

ls -l ${HOME}/daos/cont1/*.log
cat ${HOME}/daos/cont1/*.log
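Since the script runs once per node, a 4-node job should leave four log files in the container. A minimal sketch for checking that count follows; count_job_logs is a hypothetical helper name, and it counts every .log file in the directory, including files left by earlier jobs.

```shell
# Hypothetical helper: count the *.log files in a directory.
count_job_logs() {
  dir="$1"
  n=0
  for f in "$dir"/*.log; do
    # When nothing matches, the pattern stays literal, so test for existence.
    if [ -e "$f" ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}

count_job_logs "${HOME}/daos/cont1"
```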

Unmount the Container

The container will need to be unmounted before you log out. If this is not done it can leave open file handles and prevent the container from being mounted when you log in again.

fusermount3 -u "${HOME}/daos/cont1"

Verify that the container is unmounted

df -h -t fuse.daos

See the DFuse (DAOS FUSE) section of the DAOS User Guide for more information about mounting POSIX containers.

Delete the DAOS/Slurm Cluster infrastructure when not in use

NOTE:

  • Data on the DAOS file system will be permanently lost after cluster deletion.
  • If the Slurm controller is shut down before the auto-scale instances are destroyed, those compute instances will be left running.

Open your browser to the VM instances page and ensure that instances named "compute" have been shut down and deleted by the Slurm autoscaler.
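The same check can be sketched on the command line by counting instance names that contain "compute". The helper name count_compute_instances and the instance names in the demo are made up; the real names depend on your deployment.

```shell
# Hypothetical helper: count instance names containing "compute".
# Reads one instance name per line on stdin.
count_compute_instances() {
  # grep -c exits non-zero when the count is 0, so mask that status.
  grep -c 'compute' || true
}

# Demo with made-up instance names:
printf '%s\n' hpcslurmda-login-login-001 example-compute-0 example-compute-1 | count_compute_instances
# prints 2
```

With the Google Cloud CLI installed, the names can come from gcloud compute instances list with --format="value(name)" and your project passed through --project. A count of 0 suggests the autoscaler has removed the compute instances.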

Delete the remaining infrastructure:

ghpc destroy hpc-slurm-daos --auto-approve