profile_gpu

NCSA Configuration specific to Hosts with GPUs.

Description

This Puppet profile handles configuration specific to hosts with GPUs.

DCGM metrics collection is currently enabled by default, but DCGM is specific to NVIDIA GPUs. See the Usage section for how to disable it on hosts with other GPUs.

Setup

Include profile_gpu in a puppet profile file:

include ::profile_gpu

Usage

DCGM Telegraf Metrics

Note: DCGM Telegraf metrics are specific to NVIDIA GPUs. If a node has GPUs that are not NVIDIA, set these hiera variables (there is a TODO to make this use a custom fact):

profile_gpu::dcgm::install::install_dcgm: false
profile_gpu::dcgm::telegraf::enable: false

To collect Telegraf metrics, you must define the hiera value profile_gpu::dcgm::install::bind_mount_install.

It has no value in data/common.yaml and must be defined in your project control-repo.
See REFERENCE.md for details.
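
As a quick sanity check before deploying, a control-repo can be searched for the key. This is a minimal sketch, assuming hiera data is kept in YAML files under a single directory; the function name hiera_key_defined and the directory layout are illustrative and not part of this profile. It only checks that the key is present, not that its value is valid.

```shell
#!/bin/bash
# Hypothetical helper (not part of profile_gpu): succeeds if any YAML file
# under the given hiera data directory defines the key at the start of a line.
# It checks for presence only; it does not validate the value.
hiera_key_defined() {
   local key="$1" dir="${2:-data}"
   grep -rqs --include='*.yaml' --include='*.yml' -e "^${key}:" "$dir"
}
```

For example, `hiera_key_defined profile_gpu::dcgm::install::bind_mount_install data || echo "key missing"` run from the root of a control-repo would report a missing definition, assuming the data lives in a directory named data.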

To enable NVIDIA performance counters on Ampere and older cards (Hopper may not require this workaround), DCGM must not be running and collecting data. DCGM and Telegraf can be disabled via a Slurm prolog/epilog (an example is listed below). To keep this profile from restarting the services, a fact looks for a flag file at the hardcoded path '/var/spool/slurmd/nvperfenabled'. If this file is found, DCGM and Telegraf will not be restarted.

Prolog:

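#!/bin/bash

# Flag file checked by the profile's fact: while it exists, DCGM and Telegraf are not restarted
touch /var/spool/slurmd/nvperfenabled

# Split the job's comma-separated constraints into an array
IFS=',' read -ra features <<< "$SLURM_JOB_CONSTRAINTS"

for feature in "${features[@]}"; do
   echo "$feature"
   if [ "$feature" = "nvperf" ]; then
      /usr/bin/systemctl stop nvidia-dcgm.service
      /usr/bin/systemctl stop nvidia-persistenced.service
      # Reload the driver with profiling opened up to non-admin users
      /usr/sbin/modprobe -rf nvidia_uvm nvidia_drm nvidia_modeset nvidia
      /usr/sbin/modprobe nvidia NVreg_RestrictProfilingToAdminUsers=0
      # -a is needed to load several modules; without it the extra names are treated as module parameters
      /usr/sbin/modprobe -a nvidia_uvm nvidia_drm nvidia_modeset
      /usr/bin/systemctl start nvidia-persistenced.service
   fi
done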

Epilog:

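#!/bin/bash

# Remove the flag file so the profile may restart DCGM and Telegraf again
rm -f /var/spool/slurmd/nvperfenabled

# Split the job's comma-separated constraints into an array
IFS=',' read -ra features <<< "$SLURM_JOB_CONSTRAINTS"

for feature in "${features[@]}"; do
   if [ "$feature" = "nvperf" ]; then
      /usr/bin/systemctl stop nvidia-dcgm.service
      /usr/bin/systemctl stop nvidia-persistenced.service
      # Reload the driver with its default profiling restriction
      /usr/sbin/modprobe -rf nvidia_uvm nvidia_drm nvidia_modeset nvidia
      /usr/sbin/modprobe nvidia
      # -a is needed to load several modules; without it the extra names are treated as module parameters
      /usr/sbin/modprobe -a nvidia_uvm nvidia_drm nvidia_modeset
      /usr/bin/systemctl start nvidia-persistenced.service
   fi
done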
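
The constraint parsing and flag-file check used by the scripts above can be exercised without touching the driver. The following is a minimal sketch; the function names has_feature and nvperf_flag_present are illustrative and not part of this profile. It assumes the constraint string is a plain comma-separated list, as in the scripts above, and that the fact performs a simple existence check on the flag file.

```shell
#!/bin/bash
# Illustrative helpers (not part of profile_gpu).

# Succeeds if the comma-separated list in $1 contains the feature named in $2.
has_feature() {
   local constraints="$1" wanted="$2" feature
   local -a features
   IFS=',' read -ra features <<< "$constraints"
   for feature in "${features[@]}"; do
      [ "$feature" = "$wanted" ] && return 0
   done
   return 1
}

# Succeeds if the flag file exists. The default path is the one named in this
# README; a different path can be passed for testing.
nvperf_flag_present() {
   [ -e "${1:-/var/spool/slurmd/nvperfenabled}" ]
}
```

For example, `has_feature "$SLURM_JOB_CONSTRAINTS" nvperf && echo "nvperf requested"` reports whether a job asked for the nvperf feature without stopping any services or reloading any modules.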

Dependencies

puppet/systemd

If collecting DCGM Telegraf metrics, Telegraf must be installed (there is no dependency on a particular Telegraf module, only that Telegraf is installed and working).

Reference

See: REFERENCE.md

Limitations

n/a

Development

This Common Puppet Profile is managed by NCSA for internal usage.
