Profiling metrics not being collected #22
Comments
Hello, The DCP metrics (field IDs 1001-1012) are supported on Volta and newer architectures only. Kepler is not supported. WBR, |
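As a side note for anyone hitting this: a quick way to check which profiling (DCP) fields a given GPU and driver support is to list them with dcgmi, assuming DCGM is installed on the host (the same command appears later in this thread); on unsupported architectures such as Kepler it is expected to fail because the profiling module cannot load:
dcgmi profile -l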
Hello, I have an Ampere A40 GPU, but I also get the same error:
What could be the reason for this? |
There may be several reasons. Could you provide us with the debug logs from the nv-hostengine? WBR, |
Hi @nikkon-dev, the only related entry I found in the log is:
I uploaded the full version of the log here: https://fex.net/s/2p0p1bm. I will be grateful for any help! |
Could you confirm that persistence mode is enabled on the GPU? |
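For reference, persistence mode can be checked and, if needed, enabled with the standard nvidia-smi commands below (run as root on the host; this is a driver-level setting, independent of dcgm-exporter):
nvidia-smi --query-gpu=persistence_mode --format=csv
nvidia-smi -pm 1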
Hi @nikkon-dev, I'm currently running dcgm-exporter 2.3.5-2.6.5 without any problems except the DCP metrics for MIG. Any help would be appreciated. Env
Apps related NVIDIA
dcgm-exporter log
|
Running the nv-hostengine inside a Docker container when MIG is enabled may be tricky. The nv-hostengine uses the MIG management API to get MIG profile information (this is privileged functionality). By default, a container does not have the proper capability to access MIG profile information.
Usually, when MIG is enabled, we recommend running nv-hostengine on bare metal and letting dcgm-exporter connect to it instead of running an embedded hostengine. I hope that would help. WBR, |
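A minimal sketch of that recommended setup, assuming DCGM is installed on the bare-metal host and nv-hostengine is listening on its default port 5555 (adjust the address to your environment; dcgm-exporter's -r/--remote-hostengine-info flag points it at a standalone hostengine):
# on the bare-metal host
nv-hostengine
# dcgm-exporter connects to the standalone hostengine instead of starting an embedded one
dcgm-exporter -r localhost:5555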
It works well. I solved the issue by connecting dcgm-exporter with nv-hostengine which is running on host. Thank you! |
Hi @yh0413, |
Could you provide more information about your setup? Do you use passthrough or vGPU? |
@nikkon-dev |
I'm a bit confused. vGPUs do not allow MIG configurations unless you are using the passthrough approach (i.e., granting the VM exclusive access to the whole GPU). What hypervisor are you using? |
Hi, I get the same error, even if I start dcgm-exporter with nv-hostengine. root@release-name-dcgm-exporter-b2xrs:/# ENV |
To determine the cause of the profiling module load failure, we must analyze the nv-hostengine debug logs. The reasons could be varied, ranging from unsupported GPUs to insufficient privileges. To obtain the debug logs, you can restart the nv-hostengine with the following arguments: |
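As a concrete example, the command used later in this thread restarts the hostengine with debug-level logging written to a file that can then be shared:
nv-hostengine -f host.log --log-level debug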
I have the same problem.
Environment: nvidia-smi cmd result (table output omitted).
I can't seem to get profiling metrics to show up, though other metrics show up fine.
Any help would be appreciated. |
In your case, you need to update the dcgm-exporter to a newer version. You are using DCGM 2.4.6, which is quite outdated and does not support L40S GPUs. Try using dcgm-exporter based on the 3.2.x or 3.3.x releases. |
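For example, the newer image tag mentioned later in this thread could be pulled and run like this (verify on NGC which exact dcgm-exporter tag matches your driver):
docker run -d --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.3.6-3.4.2-ubuntu22.04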
@nikkon-dev The environment information is as follows:
dcgmi profile -l
What is the reason for the failure? CUDA is not installed. Do the profiling metrics depend on CUDA? |
Encountered this problem on GKE's NVIDIA L4 machine; fixed by upgrading the Docker image of dcgm-exporter & dcgm to 3.3.0. |
I've discovered some driver/CUDA compatibility issues when collecting DCP metrics. NVIDIA Driver 460.73.01, which was shipped with CUDA 11.2, is not compatible with nvidia-dcgm-exporter 3.0.4-3.0.0 as it was built on CUDA 11.7. In my case, I resolved this issue by using an older image that was built on CUDA 11.2. |
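A quick way to spot this kind of mismatch is to compare the driver and CUDA versions reported on the host against the CUDA version the dcgm-exporter image was built with (standard nvidia-smi queries; the grep is just a convenience):
nvidia-smi --query-gpu=driver_version --format=csv,noheader
nvidia-smi | grep "CUDA Version"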
Hi @nikkon-dev, we're using dcgm-exporter:3.1.8-3.1.5-ubuntu20.04 on Kubernetes (v1.26.6). Additionally, we are utilizing GRID-A100D-7-80C-MIG-7g.80gb. We have observed some errors in the dcgm-exporter pod logs.
It looks like the profiling module fails to load:
We encountered an error with code -33 while running the `dcgmi dmon -e 1010` command.
We tried the command you shared on the dcgm-exporter DaemonSet, as shown below.
spec:
revisionHistoryLimit: 10
selector:
matchLabels:
app: nvidia-dcgm-exporter
template:
metadata:
creationTimestamp: null
labels:
app: nvidia-dcgm-exporter
app.kubernetes.io/managed-by: gpu-operator
helm.sh/chart: gpu-operator-v23.6.1
spec:
containers:
- env:
- name: DCGM_EXPORTER_LISTEN
value: :9400
- name: DCGM_EXPORTER_KUBERNETES
value: "true"
- name: DCGM_EXPORTER_COLLECTORS
value: /etc/dcgm-exporter/dcp-metrics-included.csv
- name: NVIDIA_VISIBLE_DEVICES
value: all
- name: NVIDIA_MIG_CONFIG_DEVICES
value: all
- name: NVIDIA_MIG_MONITOR_DEVICES
value: all
- name: NVIDIA_DRIVER_CAPABILITIES
value: all
image: nvcr.io/nvidia/k8s/dcgm-exporter:3.1.8-3.1.5-ubuntu20.04
imagePullPolicy: IfNotPresent
name: nvidia-dcgm-exporter
ports:
- containerPort: 9400
name: metrics
protocol: TCP
securityContext:
capabilities:
add:
- SYS_ADMIN
privileged: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/lib/kubelet/pod-resources
name: pod-gpu-resources
readOnly: true
dnsConfig:
options:
- name: ndots
value: "2"
dnsPolicy: ClusterFirst
initContainers:
- args:
- until [ -f /run/nvidia/validations/toolkit-ready ]; do echo waiting for
nvidia container stack to be setup; sleep 5; done
command:
- sh
- -c
image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.1
imagePullPolicy: IfNotPresent
name: toolkit-validation
securityContext:
capabilities:
add:
- SYS_ADMIN
privileged: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /run/nvidia
mountPropagation: HostToContainer
name: run-nvidia
nodeSelector:
nvidia.com/gpu.deploy.dcgm-exporter: "true"
priorityClassName: system-node-critical
restartPolicy: Always
runtimeClassName: nvidia
schedulerName: default-scheduler
serviceAccount: nvidia-dcgm-exporter
serviceAccountName: nvidia-dcgm-exporter
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoSchedule
key: nvidia.com/gpu
operator: Exists
volumes:
- hostPath:
path: /var/lib/kubelet/pod-resources
type: ""
name: pod-gpu-resources
- hostPath:
path: /run/nvidia
type: ""
name: run-nvidia We also want to share the node label added by the gpu-operator. Labels:
beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
feature.node.kubernetes.io/cpu-cpuid.ADX=true
feature.node.kubernetes.io/cpu-cpuid.AESNI=true
feature.node.kubernetes.io/cpu-cpuid.AVX=true
feature.node.kubernetes.io/cpu-cpuid.AVX2=true
feature.node.kubernetes.io/cpu-cpuid.AVX512BITALG=true
feature.node.kubernetes.io/cpu-cpuid.AVX512BW=true
feature.node.kubernetes.io/cpu-cpuid.AVX512CD=true
feature.node.kubernetes.io/cpu-cpuid.AVX512DQ=true
feature.node.kubernetes.io/cpu-cpuid.AVX512F=true
feature.node.kubernetes.io/cpu-cpuid.AVX512IFMA=true
feature.node.kubernetes.io/cpu-cpuid.AVX512VBMI=true
feature.node.kubernetes.io/cpu-cpuid.AVX512VBMI2=true
feature.node.kubernetes.io/cpu-cpuid.AVX512VL=true
feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI=true
feature.node.kubernetes.io/cpu-cpuid.AVX512VPOPCNTDQ=true
feature.node.kubernetes.io/cpu-cpuid.AVXVNNIINT8=true
feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8=true
feature.node.kubernetes.io/cpu-cpuid.FLUSH_L1D=true
feature.node.kubernetes.io/cpu-cpuid.FMA3=true
feature.node.kubernetes.io/cpu-cpuid.FSRM=true
feature.node.kubernetes.io/cpu-cpuid.FXSR=true
feature.node.kubernetes.io/cpu-cpuid.FXSROPT=true
feature.node.kubernetes.io/cpu-cpuid.GFNI=true
feature.node.kubernetes.io/cpu-cpuid.HYPERVISOR=true
feature.node.kubernetes.io/cpu-cpuid.IA32_ARCH_CAP=true
feature.node.kubernetes.io/cpu-cpuid.IBPB=true
feature.node.kubernetes.io/cpu-cpuid.LAHF=true
feature.node.kubernetes.io/cpu-cpuid.MD_CLEAR=true
feature.node.kubernetes.io/cpu-cpuid.MOVBE=true
feature.node.kubernetes.io/cpu-cpuid.OSXSAVE=true
feature.node.kubernetes.io/cpu-cpuid.PSFD=true
feature.node.kubernetes.io/cpu-cpuid.SHA=true
feature.node.kubernetes.io/cpu-cpuid.SPEC_CTRL_SSBD=true
feature.node.kubernetes.io/cpu-cpuid.STIBP=true
feature.node.kubernetes.io/cpu-cpuid.SYSCALL=true
feature.node.kubernetes.io/cpu-cpuid.SYSEE=true
feature.node.kubernetes.io/cpu-cpuid.VAES=true
feature.node.kubernetes.io/cpu-cpuid.VPCLMULQDQ=true
feature.node.kubernetes.io/cpu-cpuid.WBNOINVD=true
feature.node.kubernetes.io/cpu-cpuid.X87=true
feature.node.kubernetes.io/cpu-cpuid.XGETBV1=true
feature.node.kubernetes.io/cpu-cpuid.XSAVE=true
feature.node.kubernetes.io/cpu-cpuid.XSAVEC=true
feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT=true
feature.node.kubernetes.io/cpu-cpuid.XSAVES=true
feature.node.kubernetes.io/cpu-hardware_multithreading=false
feature.node.kubernetes.io/cpu-model.family=6
feature.node.kubernetes.io/cpu-model.id=106
feature.node.kubernetes.io/cpu-model.vendor_id=Intel
feature.node.kubernetes.io/custom-rdma.available=true
feature.node.kubernetes.io/kernel-config.NO_HZ=true
feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE=true
feature.node.kubernetes.io/kernel-version.full=5.15.0-91-generic
feature.node.kubernetes.io/kernel-version.major=5
feature.node.kubernetes.io/kernel-version.minor=15
feature.node.kubernetes.io/kernel-version.revision=0
feature.node.kubernetes.io/memory-numa=true
feature.node.kubernetes.io/pci-10de.present=true
feature.node.kubernetes.io/pci-15ad.present=true
feature.node.kubernetes.io/system-os_release.ID=ubuntu
feature.node.kubernetes.io/system-os_release.VERSION_ID=20.04
feature.node.kubernetes.io/system-os_release.VERSION_ID.major=20
feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=04
kubernetes.io/arch=amd64
kubernetes.io/os=linux
node-role.kubernetes.io/gpu-operator=
nvidia.com/cuda.driver.major=535
nvidia.com/cuda.driver.minor=154
nvidia.com/cuda.driver.rev=05
nvidia.com/cuda.runtime.major=12
nvidia.com/cuda.runtime.minor=2
nvidia.com/device-plugin.config=a100d-7-80c-mig-7g-80gb
nvidia.com/gfd.timestamp=1707994189
nvidia.com/gpu-driver-upgrade-state=upgrade-done
nvidia.com/gpu.compute.major=8
nvidia.com/gpu.compute.minor=0
nvidia.com/gpu.count=1
nvidia.com/gpu.deploy.container-toolkit=true
nvidia.com/gpu.deploy.dcgm=true
nvidia.com/gpu.deploy.dcgm-exporter=true
nvidia.com/gpu.deploy.device-plugin=true
nvidia.com/gpu.deploy.driver=pre-installed
nvidia.com/gpu.deploy.gpu-feature-discovery=true
nvidia.com/gpu.deploy.mig-manager=true
nvidia.com/gpu.deploy.node-status-exporter=true
nvidia.com/gpu.deploy.nvsm=paused-for-mig-change
nvidia.com/gpu.deploy.operator-validator=true
nvidia.com/gpu.engines.copy=7
nvidia.com/gpu.engines.decoder=5
nvidia.com/gpu.engines.encoder=0
nvidia.com/gpu.engines.jpeg=1
nvidia.com/gpu.engines.ofa=1
nvidia.com/gpu.family=ampere
nvidia.com/gpu.memory=81920
nvidia.com/gpu.multiprocessors=98
nvidia.com/gpu.present=true
nvidia.com/gpu.product=GRID-A100D-7-80C-MIG-7g.80gb
nvidia.com/gpu.replicas=1
nvidia.com/gpu.slices.ci=7
nvidia.com/gpu.slices.gi=7
nvidia.com/mig.capable=true
nvidia.com/mig.config=all-7g.80gb
nvidia.com/mig.config.state=success
nvidia.com/mig.strategy=single
nvidia.com/vgpu.host-driver-branch=r538_10
nvidia.com/vgpu.host-driver-version=535.154.02
nvidia.com/vgpu.present=true
We also ran nv-hostengine and obtained debug logs:
nv-hostengine -f host.log --log-level debug
2024-02-16 08:59:03.773 ERROR [91:93] [[Profiling]] DcgmModuleProfiling failed to initialize. See the logs. [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:502] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::DcgmModuleProfiling]
2024-02-16 08:59:03.773 ERROR [91:93] [[Profiling]] A runtime exception occured when creating module. Ex: DcgmModuleProfiling failed to initialize. See the logs. [/workspaces/dcgm-rel_dcgm_3_1-postmerge/modules/DcgmModule.h:146] [{anonymous}::SafeWrapper]
2024-02-16 08:59:03.773 ERROR [91:93] Failed to load module 8 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:1740] [DcgmHostEngineHandler::LoadModule]
2024-02-16 08:59:03.773 ERROR [91:93] DCGM_PROFILING_SR_WATCH_FIELDS failed with -33 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:3542] [DcgmHostEngineHandler::WatchFieldGroup]
2024-02-16 08:59:03.773 WARN [91:93] Skipping loading of module 8 in status 2 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:1664] [DcgmHostEngineHandler::LoadModule]
2024-02-16 08:59:03.773 ERROR [91:93] DCGM_PROFILING_SR_UNWATCH_FIELDS failed with -33 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:3636] [DcgmHostEngineHandler::UnwatchFieldGroup]
nvidia-smi output:
[root@nvidia-device-plugin-daemonset-w7q69 /]# nvidia-smi
Fri Feb 16 13:20:33 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 GRID A100D-7-80C On | 00000000:02:00.0 Off | On |
| N/A N/A P0 N/A / N/A | 0MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+--------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+================================+===========+=======================|
| 0 0 0 0 | 0MiB / 76011MiB | 98 0 | 7 0 5 1 1 |
| | 0MiB / 4096MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
[root@nvidia-device-plugin-daemonset-w7q69 /]# nvidia-smi -q
==============NVSMI LOG==============
Timestamp : Fri Feb 16 13:20:35 2024
Driver Version : 535.154.05
CUDA Version : 12.2
Attached GPUs : 1
GPU 00000000:02:00.0
Product Name : GRID A100D-7-80C
Product Brand : NVIDIA Virtual Compute Server
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : None
MIG Mode
Current : Enabled
Pending : Enabled
MIG Device
Index : 0
GPU Instance ID : 0
Compute Instance ID : 0
Device Attributes
Shared
Multiprocessor count : 98
Copy Engine count : 7
Encoder count : 0
Decoder count : 5
OFA count : 1
JPG count : 1
ECC Errors
Volatile
SRAM Uncorrectable : 0
FB Memory Usage
Total : 76011 MiB
Reserved : 0 MiB
Used : 0 MiB
Free : 76011 MiB
BAR1 Memory
Total : 4096 MiB
Used : 0 MiB
Free : 4096 MiB
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
Minor Number : 0
MultiGPU Board : No
FRU Part Number : N/A
Module ID : N/A
Inforom Version
Image Version : N/A
OEM Object : N/A
ECC Object : N/A
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : VGPU
Host VGPU Mode : N/A
vGPU Software Licensed Product
Product Name : NVIDIA Virtual Compute Server
GPU Reset Status
Reset Required : N/A
Drain and Reset Recommended : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
GPU Link Info
PCIe Generation
Max : N/A
Current : N/A
Device Current : N/A
Device Max : N/A
Host Max : N/A
Link Width
Max : N/A
Current : N/A
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : N/A
Replay Number Rollovers : N/A
Tx Throughput : N/A
Rx Throughput : N/A
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : N/A
Performance State : P0
Clocks Event Reasons : N/A
FB Memory Usage
Total : 81920 MiB
Reserved : 5908 MiB
Used : 0 MiB
Free : 76011 MiB
BAR1 Memory Usage
Total : 4096 MiB
Used : 0 MiB
Free : 4096 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : N/A
Memory : N/A
Encoder : N/A
Decoder : N/A
JPEG : N/A
OFA : N/A
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows : N/A
Temperature
GPU Current Temp : N/A
GPU T.Limit Temp : N/A
GPU Shutdown Temp : N/A
GPU Slowdown Temp : N/A
GPU Max Operating Temp : N/A
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
GPU Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1512 MHz
Video : 1275 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : N/A
SM : N/A
Memory : N/A
Video : N/A
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : N/A
Fabric
State : N/A
Status : N/A
Processes : None
We checked the health of the GPU using the script available at https://github.com/aws/aws-parallelcluster-cookbook/blob/v3.8.0/cookbooks/aws-parallelcluster-slurm/files/default/config_slurm/scripts/health_checks/gpu_health_check.sh. It successfully passed all the steps.
cc: @Dentrax |
It sounds like the issue was resolved. |
root@68e97f630ad1:/etc/dcgm-exporter# dcgm-exporter -f dcp-metrics-included.csv |
According to this error message:
You already have another dcgm-exporter instance running, or another process is occupying port 9400. |
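To confirm that something is already bound to port 9400, and to identify which process it is, either of these standard Linux commands can be used:
ss -ltnp | grep 9400
lsof -i :9400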
Hi, thank you for your reply. (nvidia-smi output omitted) |
@jacksonyi0, |
Thank you for your reply. I will have to think of other solutions, then. |
Hello, I use docker run -d --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.3.6-3.4.2-ubuntu22.04 to start the container, and I get an error message:
readlink: missing operand
goroutine 0 [idle]: |
Hello,
dcgmi version: 2.2.9
I built dcgm-exporter from source and am running it on a single GPU (Tesla K80). I can't seem to get profiling metrics to show up, though other metrics show up fine.
It looks like the profiling module fails to load:
Though I'm not sure whether this is attributable to dcgm-exporter or dcgm, because I can't get the metrics to load even when using dcgmi directly:
I've directly followed the instructions to build dcgm-exporter from source, and the service runs inside a sidecar container that is responsible for collecting metrics.
How can I enable the collection of profiling metrics?
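One way to narrow this down without dcgm-exporter is to query a single profiling field directly through dcgmi, as done elsewhere in this thread (field 1010 is one of the 1001-1012 DCP fields; on a Kepler-class GPU such as the K80 this is expected to fail with a profiling module error, since those GPUs are not supported):
dcgmi dmon -e 1010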