Profiling metrics not being collected #22

Closed

ppreet opened this issue Oct 18, 2021 · 27 comments

@ppreet

ppreet commented Oct 18, 2021

Hello,

dcgmi version: 2.2.9

I built dcgm-exporter from source and am running it on a single GPU (Tesla K80). I can't seem to get profiling metrics to show up, though other metrics show up fine.

root@node-0:/etc/dcgm-exporter# dcgm-exporter -f etc/dcp-metrics-included.csv  -a :9402
INFO[0000] Starting dcgm-exporter
INFO[0000] DCGM successfully initialized!
INFO[0000] Not collecting DCP metrics: Error getting supported metrics: This request is serviced by a module of DCGM that is not currently loaded
INFO[0000] No configmap data specified, falling back to metric file etc/dcp-metrics-included.csv
WARN[0000] Skipping line 55 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): DCP metrics not enabled
WARN[0000] Skipping line 58 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): DCP metrics not enabled
WARN[0000] Skipping line 59 ('DCGM_FI_PROF_DRAM_ACTIVE'): DCP metrics not enabled
WARN[0000] Skipping line 63 ('DCGM_FI_PROF_PCIE_TX_BYTES'): DCP metrics not enabled

Error: Unable to Get supported metric groups: This request is serviced by a module of DCGM that is not currently loaded.

It looks like the profiling module fails to load:

root@node-0:/etc/dcgm-exporter# dcgmi modules -l
+-----------+--------------------+--------------------------------------------------+
| List Modules                                                                      |
| Status: Success                                                                   |
+===========+====================+==================================================+
| Module ID | Name               | State                                            |
+-----------+--------------------+--------------------------------------------------+
| 0         | Core               | Loaded                                           |
| 1         | NvSwitch           | Loaded                                           |
| 2         | VGPU               | Not loaded                                       |
| 3         | Introspection      | Not loaded                                       |
| 4         | Health             | Not loaded                                       |
| 5         | Policy             | Not loaded                                       |
| 6         | Config             | Not loaded                                       |
| 7         | Diag               | Not loaded                                       |
| 8         | Profiling          | Failed to load                                   |
+-----------+--------------------+--------------------------------------------------+

I'm not sure whether this is attributable to dcgm-exporter or DCGM, though, because I can't get the metrics to load even when using dcgmi directly:

root@node-0:/home/user# dcgmi dmon -e 1010
# Entity                 PCIRX
      Id
Error setting watches. Result: This request is serviced by a module of DCGM that is not currently loaded

I followed the instructions to build dcgm-exporter from source, and the service runs inside a sidecar container that is responsible for collecting metrics.

How can I enable the collection of profiling metrics?

@nikkon-dev
Collaborator

Hello,

The DCP metrics (field IDs 1001-1012) are supported only on Volta and newer architectures. Kepler is not supported.
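
You can confirm what your installed DCGM supports by listing the profiling metric groups; on unsupported GPUs this returns the same "module not loaded" error:

# List the DCP metric groups DCGM can collect on this system
dcgmi profile -l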

WBR,
Nik

@babinskiy

Hello,

I have an Ampere A40 GPU, but I get the same error:

dcgmi modules -l
+-----------+--------------------+--------------------------------------------------+
| List Modules                                                                      |
| Status: Success                                                                   |
+===========+====================+==================================================+
| Module ID | Name               | State                                            |
+-----------+--------------------+--------------------------------------------------+
| 0         | Core               | Loaded                                           |
| 1         | NvSwitch           | Loaded                                           |
| 2         | VGPU               | Not loaded                                       |
| 3         | Introspection      | Not loaded                                       |
| 4         | Health             | Not loaded                                       |
| 5         | Policy             | Not loaded                                       |
| 6         | Config             | Not loaded                                       |
| 7         | Diag               | Not loaded                                       |
| 8         | Profiling          | Failed to load                                   |
+-----------+--------------------+--------------------------------------------------+

What could be the reason for this?

@nikkon-dev
Collaborator

@babinskiy,

There may be several reasons. Could you provide us with the debug logs from the nv-hostengine?
nv-hostengine -f host.log --log-level debug

WBR,
Nik

@babinskiy

Hi @nikkon-dev,
Thanks for your response.

The only related entries I found in the log:

2022-04-26 06:20:26.707 DEBUG [22375:22377] Processing request of type 10 for connectionId 1 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:2436] [DcgmHostEngineHandler::ProcessRequest]
2022-04-26 06:20:26.707 DEBUG [22375:22377] Added GroupId 2 name dcgmi_22409_1 for connectionId 1 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmGroupManager.cpp:273] [DcgmGroupManager::AddNewGroup]
2022-04-26 06:20:26.707 DEBUG [22375:22377] Processing request of type 47 for connectionId 1 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:2436] [DcgmHostEngineHandler::ProcessRequest]
2022-04-26 06:20:26.707 DEBUG [22375:22377] Got 2 entities and 1 fields [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:5763] [DcgmHostEngineHandler::WatchFieldGroup]
2022-04-26 06:20:26.707 DEBUG [22375:22377] Adding WatchInfo on entityKey 0x103e900000000 (eg 1, entityId 0, fieldId 1001) [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2054] [DcgmCacheManager::GetEntityWatchInfo]
2022-04-26 06:20:26.708 DEBUG [22375:22377] Adding new watcher type 0, connectionId 1 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3021] [DcgmCacheManager::AddOrUpdateWatcher]
2022-04-26 06:20:26.708 DEBUG [22375:22377] UpdateWatchFromWatchers minMonitorFreqUsec 5000, minMaxAgeUsec 1000000, hsw 0 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3063] [DcgmCacheManager::UpdateWatchFromWatchers]
2022-04-26 06:20:26.708 DEBUG [22375:22377] AddFieldWatch eg 1, eid 0, fieldId 1001, mfu 5000, msa 0.000000, mka 2, sfu 0 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3156] [DcgmCacheManager::AddEntityFieldWatch]
2022-04-26 06:20:26.708 DEBUG [22375:22377] Adding WatchInfo on entityKey 0x103e900000001 (eg 1, entityId 1, fieldId 1001) [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2054] [DcgmCacheManager::GetEntityWatchInfo]
2022-04-26 06:20:26.708 DEBUG [22375:22377] Adding new watcher type 0, connectionId 1 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3021] [DcgmCacheManager::AddOrUpdateWatcher]
2022-04-26 06:20:26.708 DEBUG [22375:22377] UpdateWatchFromWatchers minMonitorFreqUsec 5000, minMaxAgeUsec 1000000, hsw 0 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3063] [DcgmCacheManager::UpdateWatchFromWatchers]
2022-04-26 06:20:26.708 DEBUG [22375:22377] AddFieldWatch eg 1, eid 1, fieldId 1001, mfu 5000, msa 0.000000, mka 2, sfu 0 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3156] [DcgmCacheManager::AddEntityFieldWatch]
2022-04-26 06:20:26.708 DEBUG [22375:22377] Entering dcgmModuleIdToName(dcgmModuleId_t id, char const **name) (8, 0x7f70fb244028) [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/entry_point.h:908] [dcgmModuleIdToName]
2022-04-26 06:20:26.708 DEBUG [22375:22377] Returning 0 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/entry_point.h:908] [dcgmModuleIdToName]
2022-04-26 06:20:26.708 DEBUG [22375:22377] [[Profiling]] Initialized logging for module 8 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/modules/DcgmModule.h:91] [DcgmModuleWithCoreProxy<moduleId>::DcgmModuleWithCoreProxy]
2022-04-26 06:20:26.708 DEBUG [22375:22377] [[Profiling]] Logger address 0x7f70f8294740 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/modules/DcgmModule.h:92] [DcgmModuleWithCoreProxy<moduleId>::DcgmModuleWithCoreProxy]
2022-04-26 06:20:26.708 DEBUG [22375:22377] [[Profiling]] __DCGM_PROF_NO_SKU_CHECK was NOT set. [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:450] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::ReadEnvironmentalVariables]
2022-04-26 06:20:26.722 DEBUG [22375:22377] [[Profiling]] NVPW_InitializeTarget() was successful. [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1215] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2022-04-26 06:20:26.722 ERROR [22375:22377] [[Profiling]] NVPW_DCGM_LoadDriver returned1 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1216] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2022-04-26 06:20:26.722 ERROR [22375:22377] [[Profiling]] DcgmModuleProfiling failed to initialize. See the logs. [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:385] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::DcgmModuleProfiling]
2022-04-26 06:20:26.723 ERROR [22375:22377] [[Profiling]] A runtime exception occured when creating module. Ex: DcgmModuleProfiling failed to initialize. See the logs. [/workspaces/dcgm-rel_dcgm_2_3-postmerge/modules/DcgmModule.h:148] [{anonymous}::SafeWrapper]
2022-04-26 06:20:26.723 ERROR [22375:22377] Failed to load module 8 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:3617] [DcgmHostEngineHandler::LoadModule]
2022-04-26 06:20:26.723 ERROR [22375:22377] DCGM_PROFILING_SR_WATCH_FIELDS failed with -33 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:5828] [DcgmHostEngineHandler::WatchFieldGroup]
2022-04-26 06:20:26.723 DEBUG [22375:22377] Got 2 entities and 1 fields [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:5870] [DcgmHostEngineHandler::UnwatchFieldGroup]
2022-04-26 06:20:26.723 DEBUG [22375:22377] RemoveWatcher removing existing watcher type 0, connectionId 1 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2966] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:26.723 DEBUG [22375:22377] RemoveEntityFieldWatch eg 1, eid 0, nvmlFieldId 1001, clearCache 0 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3212] [DcgmCacheManager::RemoveEntityFieldWatch]
2022-04-26 06:20:26.723 DEBUG [22375:22377] RemoveWatcher removing existing watcher type 0, connectionId 1 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2966] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:26.723 DEBUG [22375:22377] RemoveEntityFieldWatch eg 1, eid 1, nvmlFieldId 1001, clearCache 0 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3212] [DcgmCacheManager::RemoveEntityFieldWatch]
2022-04-26 06:20:26.723 WARN  [22375:22377] Skipping loading of module 8 in status 2 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:3534] [DcgmHostEngineHandler::LoadModule]
2022-04-26 06:20:26.723 ERROR [22375:22377] DCGM_PROFILING_SR_UNWATCH_FIELDS failed with -33 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:5914] [DcgmHostEngineHandler::UnwatchFieldGroup]
2022-04-26 06:20:44.586 DEBUG [22375:22377] Processing request of type 3 for connectionId 2 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:2436] [DcgmHostEngineHandler::ProcessRequest]
2022-04-26 06:20:44.586 DEBUG [22375:22377] persistAfterDisconnect 0 for connectionId 2 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:231] [DcgmHostEngineHandler::ProcessClientLogin]
2022-04-26 06:20:44.587 DEBUG [22375:22377] Removed 0 groups for connectionId 2 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmGroupManager.cpp:364] [DcgmGroupManager::RemoveAllGroupsForConnection]
2022-04-26 06:20:44.587 DEBUG [22375:22377] No field groups found for connectionId 2 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmFieldGroup.cpp:392] [DcgmFieldGroupManager::OnConnectionRemove]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
...

I uploaded the full version of the log here: https://fex.net/s/2p0p1bm

I'd be grateful for any help!

@nikkon-dev
Collaborator

@babinskiy,

Could you confirm that persistence mode is enabled on the GPU?
The nvidia-smi output will tell you.
Run nvidia-smi -pm 1 to enable it.
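
For example (the query field assumes a reasonably recent driver):

# Check the current persistence mode, then enable it if it is off
nvidia-smi --query-gpu=persistence_mode --format=csv
nvidia-smi -pm 1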

@yh0413

yh0413 commented Nov 21, 2022

Hi @nikkon-dev,

I'm currently running dcgm-exporter 2.3.5-2.6.5 without any problem except DCP metrics for MIG.
To solve some DCGM issues with DCP metrics for MIG, I tried updating dcgm-exporter to 3.0.4-3.0.0,
but the same problem occurs as above.

Any help would be appreciated.

Env

  • Kubernetes v1.19.9
  • A30
  • NVIDIA Driver 460.73.01 (persistence mode is enabled)

NVIDIA-related apps

  • nvidia-device-plugin v0.11.0
  • nvidia-dcgm-exporter 3.0.4-3.0.0 (starts nv-hostengine as an embedded process)

dcgm-exporter log

time="2022-11-21T05:43:47Z" level=info msg="Starting dcgm-exporter"
time="2022-11-21T05:43:47Z" level=info msg="DCGM successfully initialized!"
time="2022-11-21T05:43:47Z" level=info msg="Not collecting DCP metrics: Error getting supported metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2022-11-21T05:43:47Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/dcp-metrics-included.csv"
time="2022-11-21T05:43:47Z" level=warning msg="Skipping line 19 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
time="2022-11-21T05:43:47Z" level=warning msg="Skipping line 20 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled"
time="2022-11-21T05:43:47Z" level=warning msg="Skipping line 21 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled"
time="2022-11-21T05:43:47Z" level=warning msg="Skipping line 22 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
time="2022-11-21T05:43:47Z" level=warning msg="Skipping line 23 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled"
time="2022-11-21T05:43:49Z" level=info msg="Kubernetes metrics collection enabled!"
time="2022-11-21T05:43:49Z" level=info msg="Starting webserver"
time="2022-11-21T05:43:49Z" level=info msg="Pipeline starting"

@nikkon-dev
Collaborator

@yh0413,

Running nv-hostengine inside a Docker container when MIG is enabled can be tricky. nv-hostengine uses the MIG management API to read MIG profile information, which is privileged functionality. By default, a container does not have the capability required to access it.
For example, this is how you could run a Docker container to allow it to access the MIG API:

$ docker run --cap-add SYS_ADMIN --runtime=nvidia \
  --gpus all \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_MIG_CONFIG_DEVICES=all \
  -e NVIDIA_MIG_MONITOR_DEVICES=all \
  ...

Usually, when MIG is enabled, we recommend running nv-hostengine on bare metal and letting dcgm-exporter connect to it, instead of running an embedded hostengine.
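
A minimal sketch of that setup (the metrics file and host address are examples; 5555 is the default hostengine port):

# On the bare-metal host: start a standalone hostengine
nv-hostengine

# In the container: point dcgm-exporter at the host's engine instead of embedding one
dcgm-exporter -r <host-ip>:5555 -f /etc/dcgm-exporter/dcp-metrics-included.csv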

I hope that helps.

WBR,
Nik

@yh0413

yh0413 commented Nov 21, 2022

It works well now. I solved the issue by connecting dcgm-exporter to the nv-hostengine running on the host.

Thank you!

@wpso

wpso commented Nov 22, 2022

Hi @yh0413,
My VM with MIG hits the same problem: the dcgmi profiling module fails to load. The CUDA version is 11.4 and the NVIDIA driver is 470.141.03. Do you have any suggestions?

@nikkon-dev
Collaborator

@wpso,

Could you provide more information about your setup? Do you use passthrough or vGPU?

@wpso

wpso commented Nov 22, 2022

@nikkon-dev
We use MIG vGPU for the VMs. We tried three DCGM versions (2.0.13, 2.0.15, and 2.1.5); both the host and the guest have the problem. The card is an A100 80G (20b5).

@nikkon-dev
Collaborator

@wpso,

I'm a bit confused. vGPUs do not allow MIG configurations unless you are using the passthrough approach (i.e., granting exclusive access to the whole GPU to the VM). What hypervisor are you using?
In general, DCGM needs full access to the hardware, and the driver must be able to reach the MIG management API, which is usually not virtualized.
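
From inside the VM, you can check how the GPU is exposed (the field names follow the nvidia-smi -q output shown later in this thread):

# "Virtualization Mode" reads VGPU for a vGPU and Pass-Through for passthrough
nvidia-smi -q | grep -A 2 'GPU Virtualization Mode'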

@jack161641

jack161641 commented Feb 27, 2023

@nikkon-dev

Hi, I get the same error, even when I start dcgm-exporter against a remote nv-hostengine:

root@release-name-dcgm-exporter-b2xrs:/# dcgm-exporter -r localhost:5555 -f /etc/dcgm-exporter/custom-collectors.csv -a :9401
INFO[0000] Starting dcgm-exporter
INFO[0000] Attemping to connect to remote hostengine at localhost:5555
INFO[0000] DCGM successfully initialized!
INFO[0000] Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded
INFO[0000] No configmap data specified, falling back to metric file /etc/dcgm-exporter/custom-collectors.csv
WARN[0000] Skipping line 6 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled
WARN[0000] Skipping line 7 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled
WARN[0000] Skipping line 21 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled
WARN[0000] Skipping line 22 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled
WARN[0000] Skipping line 23 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled

ENV
DCGM Exporter version 3.1.6-3.1.3
Driver Version : 460.91.03
CUDA Version : 11.2
Persistence-M ON
Tesla V100-SXM2-32GB

@nikkon-dev
Collaborator

@jack161641,

To determine the cause of the profiling module load failure, we must analyze the nv-hostengine debug logs. The reasons could be varied, ranging from unsupported GPUs to insufficient privileges.

To obtain the debug logs, you can restart the nv-hostengine with the following arguments: nv-hostengine -f /tmp/host.debug.log --log-level debug
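
If a hostengine is already running, stop it first so the new flags take effect (a sketch; assumes nv-hostengine is on PATH and you have root):

# Terminate any running hostengine, then restart it with debug logging
sudo nv-hostengine -t
sudo nv-hostengine -f /tmp/host.debug.log --log-level debug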

@chenaidong1

I have the same problem.

Environment
# dcgmi -v
Version        : 2.4.6
Build ID       : 11
Build Date     : 2022-07-06
Build Type     : Release
Commit ID      : b21fb88d38b2d70a5b3330e5806962ad6f207e69
Branch Name    : rel_dcgm_2_4
CPU Arch       : x86_64
Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64

nvidia-smi output:

Wed Nov 15 07:27:20 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L40S                    On  | 00000000:27:00.0 Off |                    0 |
| N/A   39C    P8              34W / 350W |      3MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|

I can't seem to get profiling metrics to show up, though other metrics show up fine.

INFO[0000] Starting dcgm-exporter
INFO[0000] DCGM successfully initialized!
INFO[0000] Not collecting DCP metrics: Error getting supported metrics: This request is serviced by a module of DCGM that is not currently loaded
INFO[0000] No configmap data specified, falling back to metric file /etc/dcgm-exporter/default-counters.csv
WARN[0000] Skipping line 13 ('DCGM_FI_PROF_SM_ACTIVE'): metric not enabled
WARN[0000] Skipping line 14 ('DCGM_FI_PROF_SM_OCCUPANCY'): metric not enabled
WARN[0000] Skipping line 15 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled
WARN[0000] Skipping line 16 ('DCGM_FI_PROF_PIPE_FP64_ACTIVE'): metric not enabled
WARN[0000] Skipping line 17 ('DCGM_FI_PROF_PIPE_FP32_ACTIVE'): metric not enabled
WARN[0000] Skipping line 18 ('DCGM_FI_PROF_PIPE_FP16_ACTIVE'): metric not enabled
INFO[0000] Pipeline starting
INFO[0000] Starting webserver

/var/log/nv-hostengine.log:

2023-11-15 07:18:06.426 ERROR [104:104] [[NvSwitch]] AttachToNscq() returned -25 [/workspaces/dcgm-rel_dcgm_2_4-postmerge@2/modules/nvswitch/DcgmNvSwitchManager.cpp:317] [DcgmNs::DcgmNvSwitchManager::Init]
2023-11-15 07:18:06.426 ERROR [104:104] [[NvSwitch]] Could not initialize switch manager. Ret: DCGM library could not be found [/workspaces/dcgm-rel_dcgm_2_4-postmerge@2/modules/nvswitch/DcgmModuleNvSwitch.cpp:34] [DcgmNs::DcgmModuleNvSwitch::DcgmModuleNvSwitch]
2023-11-15 07:18:06.453 ERROR [104:104] [[Profiling]] NVPW_DCGM_LoadDriver returned1 [/workspaces/dcgm-rel_dcgm_2_4-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1353] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2023-11-15 07:18:06.453 ERROR [104:104] [[Profiling]] DcgmModuleProfiling failed to initialize. See the logs. [/workspaces/dcgm-rel_dcgm_2_4-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:481] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::DcgmModuleProfiling]
2023-11-15 07:18:06.453 ERROR [104:104] [[Profiling]] A runtime exception occured when creating module. Ex: DcgmModuleProfiling failed to initialize. See the logs. [/workspaces/dcgm-rel_dcgm_2_4-postmerge@2/modules/DcgmModule.h:148] [{anonymous}::SafeWrapper]
2023-11-15 07:18:06.453 ERROR [104:104] Failed to load module 8 [/workspaces/dcgm-rel_dcgm_2_4-postmerge@2/dcgmlib/src/DcgmHostEngineHandler.cpp:3671] [DcgmHostEngineHandler::LoadModule]
2023-11-15 07:18:06.542 ERROR [104:118] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_4-postmerge@2/dcgmlib/src/DcgmCacheManager.cpp:10670] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2023-11-15 07:18:06.542 ERROR [104:118] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_4-postmerge@2/dcgmlib/src/DcgmCacheManager.cpp:10670] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2023-11-15 07:18:06.544 ERROR [104:118] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_4-postmerge@2/dcgmlib/src/DcgmCacheManager.cpp:10670] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2023-11-15 07:18:06.544 ERROR [104:118] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_4-postmerge@2/dcgmlib/src/DcgmCacheManager.cpp:10670] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]

Any help would be appreciated.

@nikkon-dev
Collaborator

@chenaidong1,

In your case, you need to update the dcgm-exporter to a newer version.

You are using DCGM 2.4.6, which is quite outdated and does not support L40S GPUs. Try using dcgm-exporter based on the 3.2.x or 3.3.x releases.
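
For example, with the containerized exporter (the tag below is a placeholder; check NGC for a current 3.2.x/3.3.x tag. DCP metrics also need the SYS_ADMIN capability, as noted earlier in this thread):

docker run --gpus all --cap-add SYS_ADMIN -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:<3.3.x-tag>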

@chenaidong1

@nikkon-dev
Thanks for your reply.
I tried dcgm-exporter based on the 3.2.x/3.3.x releases on the host, and profiling metrics are collected now.
In another environment, using dcgm-exporter based on version 3.2.5, it fails to collect profiling metrics.

The environment information is as follows:
nvidia-smi
Tue Nov 14 03:38:24 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:09.0 Off |                  Off |
| N/A   53C    P8    18W /  70W |     86MiB / 16384MiB |      0%      Default |

dcgmi profile -l
Error: Unable to Get supported metric groups: This request is serviced by a module of DCGM that is not currently loaded.

What is the reason for the failure? CUDA is not installed on that machine; do profiling metrics depend on CUDA?

@ryan4yin

Encountered this problem on a GKE NVIDIA L4 machine; fixed by upgrading the Docker images of dcgm-exporter and dcgm to 3.3.0.

@NierYYDS

NierYYDS commented Dec 19, 2023

Hi @nikkon-dev,

I'm currently running dcgm-exporter 2.3.5-2.6.5 without any problem except DCP metrics for MIG. To solve some DCGM issues with DCP metrics for MIG, I tried updating dcgm-exporter to 3.0.4-3.0.0, but the same problem occurs as above.

Any help would be appreciated.

Env

  • Kubernetes v1.19.9
  • A30
  • NVIDIA Driver 460.73.01 (persistence mode is enabled)

NVIDIA-related apps

  • nvidia-device-plugin v0.11.0
  • nvidia-dcgm-exporter 3.0.4-3.0.0 (starts nv-hostengine as an embedded process)

dcgm-exporter log

time="2022-11-21T05:43:47Z" level=info msg="Starting dcgm-exporter"
time="2022-11-21T05:43:47Z" level=info msg="DCGM successfully initialized!"
time="2022-11-21T05:43:47Z" level=info msg="Not collecting DCP metrics: Error getting supported metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2022-11-21T05:43:47Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/dcp-metrics-included.csv"
time="2022-11-21T05:43:47Z" level=warning msg="Skipping line 19 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
time="2022-11-21T05:43:47Z" level=warning msg="Skipping line 20 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled"
time="2022-11-21T05:43:47Z" level=warning msg="Skipping line 21 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled"
time="2022-11-21T05:43:47Z" level=warning msg="Skipping line 22 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
time="2022-11-21T05:43:47Z" level=warning msg="Skipping line 23 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled"
time="2022-11-21T05:43:49Z" level=info msg="Kubernetes metrics collection enabled!"
time="2022-11-21T05:43:49Z" level=info msg="Starting webserver"
time="2022-11-21T05:43:49Z" level=info msg="Pipeline starting"

I've discovered some driver/CUDA compatibility issues when collecting DCP metrics. NVIDIA Driver 460.73.01, which was shipped with CUDA 11.2, is not compatible with nvidia-dcgm-exporter 3.0.4-3.0.0 as it was built on CUDA 11.7. In my case, I resolved this issue by using an older image that was built on CUDA 11.2.
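
A quick sanity check is the nvidia-smi banner: the "CUDA Version" it prints is the newest CUDA the installed driver supports, which you can compare against the CUDA base of the exporter image:

# Print just the banner with driver and supported CUDA versions
nvidia-smi | head -n 4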

@melikeiremguler

melikeiremguler commented Feb 16, 2024

Hi, @nikkon-dev

We're using dcgm-exporter:3.1.8-3.1.5-ubuntu20.04 on Kubernetes (v1.26.6). Additionally, we are using GRID-A100D-7-80C-MIG-7g.80gb. We have observed some errors in the dcgm-exporter pod logs.

time="2024-02-16T11:25:47Z" level=info msg="Starting dcgm-exporter"
time="2024-02-16T11:25:47Z" level=info msg="DCGM successfully initialized!"
time="2024-02-16T11:25:47Z" level=info msg="Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2024-02-16T11:25:47Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/dcp-metrics-included.csv"
time="2024-02-16T11:25:47Z" level=warning msg="Skipping line 20 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
time="2024-02-16T11:25:47Z" level=warning msg="Skipping line 21 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled"
time="2024-02-16T11:25:47Z" level=warning msg="Skipping line 22 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled"
time="2024-02-16T11:25:47Z" level=warning msg="Skipping line 23 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
time="2024-02-16T11:25:47Z" level=warning msg="Skipping line 24 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled"

It looks like the profiling module fails to load:

root@nvidia-dcgm-exporter-8kvn6:/# dcgmi modules -l
+-----------+--------------------+--------------------------------------------------+
| List Modules                                                                      |
| Status: Success                                                                   |
+===========+====================+==================================================+
| Module ID | Name               | State                                            |
+-----------+--------------------+--------------------------------------------------+
| 0         | Core               | Loaded                                           |
| 1         | NvSwitch           | Loaded                                           |
| 2         | VGPU               | Not loaded                                       |
| 3         | Introspection      | Not loaded                                       |
| 4         | Health             | Not loaded                                       |
| 5         | Policy             | Not loaded                                       |
| 6         | Config             | Not loaded                                       |
| 7         | Diag               | Loaded                                           |
| 8         | Profiling          | Failed to load                                   |
+-----------+--------------------+--------------------------------------------------+

We encountered an error with code -33 while running the `dcgmi dmon -e 1010` command.

root@nvidia-dcgm-exporter-8kvn6:/# dcgmi dmon -e 1010
#Entity   PCIRX
ID
Error setting watches. Result: -33: This request is serviced by a module of DCGM that is not currently loaded

$ docker run --cap-add SYS_ADMIN --runtime=nvidia \
  --gpus all \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_MIG_CONFIG_DEVICES=all \
  -e NVIDIA_MIG_MONITOR_DEVICES=all \
  ...

We applied the settings you shared to the dcgm-exporter DaemonSet, as shown below:
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: nvidia-dcgm-exporter
        app.kubernetes.io/managed-by: gpu-operator
        helm.sh/chart: gpu-operator-v23.6.1
    spec:
      containers:
      - env:
        - name: DCGM_EXPORTER_LISTEN
          value: :9400
        - name: DCGM_EXPORTER_KUBERNETES
          value: "true"
        - name: DCGM_EXPORTER_COLLECTORS
          value: /etc/dcgm-exporter/dcp-metrics-included.csv
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_MIG_CONFIG_DEVICES
          value: all
        - name: NVIDIA_MIG_MONITOR_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: all
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.1.8-3.1.5-ubuntu20.04
        imagePullPolicy: IfNotPresent
        name: nvidia-dcgm-exporter
        ports:
        - containerPort: 9400
          name: metrics
          protocol: TCP
        securityContext:
          capabilities:
            add:
            - SYS_ADMIN
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/kubelet/pod-resources
          name: pod-gpu-resources
          readOnly: true
      dnsConfig:
        options:
        - name: ndots
          value: "2"
      dnsPolicy: ClusterFirst
      initContainers:
      - args:
        - until [ -f /run/nvidia/validations/toolkit-ready ]; do echo waiting for
          nvidia container stack to be setup; sleep 5; done
        command:
        - sh
        - -c
        image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.1
        imagePullPolicy: IfNotPresent
        name: toolkit-validation
        securityContext:
          capabilities:
            add:
            - SYS_ADMIN
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /run/nvidia
          mountPropagation: HostToContainer
          name: run-nvidia
      nodeSelector:
        nvidia.com/gpu.deploy.dcgm-exporter: "true"
      priorityClassName: system-node-critical
      restartPolicy: Always
      runtimeClassName: nvidia
      schedulerName: default-scheduler
      serviceAccount: nvidia-dcgm-exporter
      serviceAccountName: nvidia-dcgm-exporter
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
      volumes:
      - hostPath:
          path: /var/lib/kubelet/pod-resources
          type: ""
        name: pod-gpu-resources
      - hostPath:
          path: /run/nvidia
          type: ""
        name: run-nvidia
We also want to share the node labels added by the gpu-operator:

Labels:
beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
feature.node.kubernetes.io/cpu-cpuid.ADX=true
feature.node.kubernetes.io/cpu-cpuid.AESNI=true
feature.node.kubernetes.io/cpu-cpuid.AVX=true
feature.node.kubernetes.io/cpu-cpuid.AVX2=true
feature.node.kubernetes.io/cpu-cpuid.AVX512BITALG=true
feature.node.kubernetes.io/cpu-cpuid.AVX512BW=true
feature.node.kubernetes.io/cpu-cpuid.AVX512CD=true
feature.node.kubernetes.io/cpu-cpuid.AVX512DQ=true
feature.node.kubernetes.io/cpu-cpuid.AVX512F=true
feature.node.kubernetes.io/cpu-cpuid.AVX512IFMA=true
feature.node.kubernetes.io/cpu-cpuid.AVX512VBMI=true
feature.node.kubernetes.io/cpu-cpuid.AVX512VBMI2=true
feature.node.kubernetes.io/cpu-cpuid.AVX512VL=true
feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI=true
feature.node.kubernetes.io/cpu-cpuid.AVX512VPOPCNTDQ=true
feature.node.kubernetes.io/cpu-cpuid.AVXVNNIINT8=true
feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8=true
feature.node.kubernetes.io/cpu-cpuid.FLUSH_L1D=true
feature.node.kubernetes.io/cpu-cpuid.FMA3=true
feature.node.kubernetes.io/cpu-cpuid.FSRM=true
feature.node.kubernetes.io/cpu-cpuid.FXSR=true
feature.node.kubernetes.io/cpu-cpuid.FXSROPT=true
feature.node.kubernetes.io/cpu-cpuid.GFNI=true
feature.node.kubernetes.io/cpu-cpuid.HYPERVISOR=true
feature.node.kubernetes.io/cpu-cpuid.IA32_ARCH_CAP=true
feature.node.kubernetes.io/cpu-cpuid.IBPB=true
feature.node.kubernetes.io/cpu-cpuid.LAHF=true
feature.node.kubernetes.io/cpu-cpuid.MD_CLEAR=true
feature.node.kubernetes.io/cpu-cpuid.MOVBE=true
feature.node.kubernetes.io/cpu-cpuid.OSXSAVE=true
feature.node.kubernetes.io/cpu-cpuid.PSFD=true
feature.node.kubernetes.io/cpu-cpuid.SHA=true
feature.node.kubernetes.io/cpu-cpuid.SPEC_CTRL_SSBD=true
feature.node.kubernetes.io/cpu-cpuid.STIBP=true
feature.node.kubernetes.io/cpu-cpuid.SYSCALL=true
feature.node.kubernetes.io/cpu-cpuid.SYSEE=true
feature.node.kubernetes.io/cpu-cpuid.VAES=true
feature.node.kubernetes.io/cpu-cpuid.VPCLMULQDQ=true
feature.node.kubernetes.io/cpu-cpuid.WBNOINVD=true
feature.node.kubernetes.io/cpu-cpuid.X87=true
feature.node.kubernetes.io/cpu-cpuid.XGETBV1=true
feature.node.kubernetes.io/cpu-cpuid.XSAVE=true
feature.node.kubernetes.io/cpu-cpuid.XSAVEC=true
feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT=true
feature.node.kubernetes.io/cpu-cpuid.XSAVES=true
feature.node.kubernetes.io/cpu-hardware_multithreading=false
feature.node.kubernetes.io/cpu-model.family=6
feature.node.kubernetes.io/cpu-model.id=106
feature.node.kubernetes.io/cpu-model.vendor_id=Intel
feature.node.kubernetes.io/custom-rdma.available=true
feature.node.kubernetes.io/kernel-config.NO_HZ=true
feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE=true
feature.node.kubernetes.io/kernel-version.full=5.15.0-91-generic
feature.node.kubernetes.io/kernel-version.major=5
feature.node.kubernetes.io/kernel-version.minor=15
feature.node.kubernetes.io/kernel-version.revision=0
feature.node.kubernetes.io/memory-numa=true
feature.node.kubernetes.io/pci-10de.present=true
feature.node.kubernetes.io/pci-15ad.present=true
feature.node.kubernetes.io/system-os_release.ID=ubuntu
feature.node.kubernetes.io/system-os_release.VERSION_ID=20.04
feature.node.kubernetes.io/system-os_release.VERSION_ID.major=20
feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=04
kubernetes.io/arch=amd64
kubernetes.io/os=linux
node-role.kubernetes.io/gpu-operator=
nvidia.com/cuda.driver.major=535
nvidia.com/cuda.driver.minor=154
nvidia.com/cuda.driver.rev=05
nvidia.com/cuda.runtime.major=12
nvidia.com/cuda.runtime.minor=2
nvidia.com/device-plugin.config=a100d-7-80c-mig-7g-80gb
nvidia.com/gfd.timestamp=1707994189
nvidia.com/gpu-driver-upgrade-state=upgrade-done
nvidia.com/gpu.compute.major=8
nvidia.com/gpu.compute.minor=0
nvidia.com/gpu.count=1
nvidia.com/gpu.deploy.container-toolkit=true
nvidia.com/gpu.deploy.dcgm=true
nvidia.com/gpu.deploy.dcgm-exporter=true
nvidia.com/gpu.deploy.device-plugin=true
nvidia.com/gpu.deploy.driver=pre-installed
nvidia.com/gpu.deploy.gpu-feature-discovery=true
nvidia.com/gpu.deploy.mig-manager=true
nvidia.com/gpu.deploy.node-status-exporter=true
nvidia.com/gpu.deploy.nvsm=paused-for-mig-change
nvidia.com/gpu.deploy.operator-validator=true
nvidia.com/gpu.engines.copy=7
nvidia.com/gpu.engines.decoder=5
nvidia.com/gpu.engines.encoder=0
nvidia.com/gpu.engines.jpeg=1
nvidia.com/gpu.engines.ofa=1
nvidia.com/gpu.family=ampere
nvidia.com/gpu.memory=81920
nvidia.com/gpu.multiprocessors=98
nvidia.com/gpu.present=true
nvidia.com/gpu.product=GRID-A100D-7-80C-MIG-7g.80gb
nvidia.com/gpu.replicas=1
nvidia.com/gpu.slices.ci=7
nvidia.com/gpu.slices.gi=7
nvidia.com/mig.capable=true
nvidia.com/mig.config=all-7g.80gb
nvidia.com/mig.config.state=success
nvidia.com/mig.strategy=single
nvidia.com/vgpu.host-driver-branch=r538_10
nvidia.com/vgpu.host-driver-version=535.154.02
nvidia.com/vgpu.present=true
We also ran nv-hostengine and obtained debug logs.

nv-hostengine -f host.log --log-level debug

2024-02-16 08:59:03.773 ERROR [91:93] [[Profiling]] DcgmModuleProfiling failed to initialize. See the logs. [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:502] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::DcgmModuleProfiling]
2024-02-16 08:59:03.773 ERROR [91:93] [[Profiling]] A runtime exception occured when creating module. Ex: DcgmModuleProfiling failed to initialize. See the logs. [/workspaces/dcgm-rel_dcgm_3_1-postmerge/modules/DcgmModule.h:146] [{anonymous}::SafeWrapper]
2024-02-16 08:59:03.773 ERROR [91:93] Failed to load module 8 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:1740] [DcgmHostEngineHandler::LoadModule]
2024-02-16 08:59:03.773 ERROR [91:93] DCGM_PROFILING_SR_WATCH_FIELDS failed with -33 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:3542] [DcgmHostEngineHandler::WatchFieldGroup]
2024-02-16 08:59:03.773 WARN  [91:93] Skipping loading of module 8 in status 2 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:1664] [DcgmHostEngineHandler::LoadModule]
2024-02-16 08:59:03.773 ERROR [91:93] DCGM_PROFILING_SR_UNWATCH_FIELDS failed with -33 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:3636] [DcgmHostEngineHandler::UnwatchFieldGroup]
Nvidia-smi output
[root@nvidia-device-plugin-daemonset-w7q69 /]# nvidia-smi
Fri Feb 16 13:20:33 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  GRID A100D-7-80C               On  | 00000000:02:00.0 Off |                   On |
| N/A   N/A    P0              N/A /  N/A |      0MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    0   0   0  |               0MiB / 76011MiB  | 98      0 |  7   0    5    1    1 |
|                  |               0MiB /  4096MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
[root@nvidia-device-plugin-daemonset-w7q69 /]# nvidia-smi -q

==============NVSMI LOG==============

Timestamp                                 : Fri Feb 16 13:20:35 2024
Driver Version                            : 535.154.05
CUDA Version                              : 12.2

Attached GPUs                             : 1
GPU 00000000:02:00.0
  Product Name                          : GRID A100D-7-80C
  Product Brand                         : NVIDIA Virtual Compute Server
  Product Architecture                  : Ampere
  Display Mode                          : Enabled
  Display Active                        : Disabled
  Persistence Mode                      : Enabled
  Addressing Mode                       : None
  MIG Mode
      Current                           : Enabled
      Pending                           : Enabled
  MIG Device
      Index                             : 0
      GPU Instance ID                   : 0
      Compute Instance ID               : 0
      Device Attributes
          Shared
              Multiprocessor count      : 98
              Copy Engine count         : 7
              Encoder count             : 0
              Decoder count             : 5
              OFA count                 : 1
              JPG count                 : 1
      ECC Errors
          Volatile
              SRAM Uncorrectable        : 0
      FB Memory Usage
          Total                         : 76011 MiB
          Reserved                      : 0 MiB
          Used                          : 0 MiB
          Free                          : 76011 MiB
      BAR1 Memory
          Total                         : 4096 MiB
          Used                          : 0 MiB
          Free                          : 4096 MiB
  Accounting Mode                       : Disabled
  Accounting Mode Buffer Size           : 4000
  Driver Model
      Current                           : N/A
      Pending                           : N/A
  Serial Number                         : N/A
  Minor Number                          : 0
  MultiGPU Board                        : No
  FRU Part Number                       : N/A
  Module ID                             : N/A
  Inforom Version
      Image Version                     : N/A
      OEM Object                        : N/A
      ECC Object                        : N/A
      Power Management Object           : N/A
  Inforom BBX Object Flush
      Latest Timestamp                  : N/A
      Latest Duration                   : N/A
  GPU Operation Mode
      Current                           : N/A
      Pending                           : N/A
  GSP Firmware Version                  : N/A
  GPU Virtualization Mode
      Virtualization Mode               : VGPU
      Host VGPU Mode                    : N/A
  vGPU Software Licensed Product
      Product Name                      : NVIDIA Virtual Compute Server
  GPU Reset Status
      Reset Required                    : N/A
      Drain and Reset Recommended       : N/A
  IBMNPU
      Relaxed Ordering Mode             : N/A
  PCI
      GPU Link Info
          PCIe Generation
              Max                       : N/A
              Current                   : N/A
              Device Current            : N/A
              Device Max                : N/A
              Host Max                  : N/A
          Link Width
              Max                       : N/A
              Current                   : N/A
      Bridge Chip
          Type                          : N/A
          Firmware                      : N/A
      Replays Since Reset               : N/A
      Replay Number Rollovers           : N/A
      Tx Throughput                     : N/A
      Rx Throughput                     : N/A
      Atomic Caps Inbound               : N/A
      Atomic Caps Outbound              : N/A
  Fan Speed                             : N/A
  Performance State                     : P0
  Clocks Event Reasons                  : N/A
  FB Memory Usage
      Total                             : 81920 MiB
      Reserved                          : 5908 MiB
      Used                              : 0 MiB
      Free                              : 76011 MiB
  BAR1 Memory Usage
      Total                             : 4096 MiB
      Used                              : 0 MiB
      Free                              : 4096 MiB
  Conf Compute Protected Memory Usage
      Total                             : 0 MiB
      Used                              : 0 MiB
      Free                              : 0 MiB
  Compute Mode                          : Default
  Utilization
      Gpu                               : N/A
      Memory                            : N/A
      Encoder                           : N/A
      Decoder                           : N/A
      JPEG                              : N/A
      OFA                               : N/A
  Encoder Stats
      Active Sessions                   : 0
      Average FPS                       : 0
      Average Latency                   : 0
  FBC Stats
      Active Sessions                   : 0
      Average FPS                       : 0
      Average Latency                   : 0
  ECC Mode
      Current                           : Enabled
      Pending                           : Enabled
  ECC Errors
      Volatile
          SRAM Correctable              : 0
          SRAM Uncorrectable            : 0
          DRAM Correctable              : 0
          DRAM Uncorrectable            : 0
      Aggregate
          SRAM Correctable              : 0
          SRAM Uncorrectable            : 0
          DRAM Correctable              : 0
          DRAM Uncorrectable            : 0
  Retired Pages
      Single Bit ECC                    : N/A
      Double Bit ECC                    : N/A
      Pending Page Blacklist            : N/A
  Remapped Rows                         : N/A
  Temperature
      GPU Current Temp                  : N/A
      GPU T.Limit Temp                  : N/A
      GPU Shutdown Temp                 : N/A
      GPU Slowdown Temp                 : N/A
      GPU Max Operating Temp            : N/A
      GPU Target Temperature            : N/A
      Memory Current Temp               : N/A
      Memory Max Operating Temp         : N/A
  GPU Power Readings
      Power Draw                        : N/A
      Current Power Limit               : N/A
      Requested Power Limit             : N/A
      Default Power Limit               : N/A
      Min Power Limit                   : N/A
      Max Power Limit                   : N/A
  Module Power Readings
      Power Draw                        : N/A
      Current Power Limit               : N/A
      Requested Power Limit             : N/A
      Default Power Limit               : N/A
      Min Power Limit                   : N/A
      Max Power Limit                   : N/A
  Clocks
      Graphics                          : 1410 MHz
      SM                                : 1410 MHz
      Memory                            : 1512 MHz
      Video                             : 1275 MHz
  Applications Clocks
      Graphics                          : N/A
      Memory                            : N/A
  Default Applications Clocks
      Graphics                          : N/A
      Memory                            : N/A
  Deferred Clocks
      Memory                            : N/A
  Max Clocks
      Graphics                          : N/A
      SM                                : N/A
      Memory                            : N/A
      Video                             : N/A
  Max Customer Boost Clocks
      Graphics                          : N/A
  Clock Policy
      Auto Boost                        : N/A
      Auto Boost Default                : N/A
  Voltage
      Graphics                          : N/A
  Fabric
      State                             : N/A
      Status                            : N/A
  Processes                             : None

We checked the health of the GPU using the script available at https://github.com/aws/aws-parallelcluster-cookbook/blob/v3.8.0/cookbooks/aws-parallelcluster-slurm/files/default/config_slurm/scripts/health_checks/gpu_health_check.sh, and it successfully passed all the steps.

root@nvidia-dcgm-exporter-8kvn6:/# dcgmi diag -i 0 -r 2
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 3.1.8                                          |
| Driver Version Detected   | 535.154.05                                     |
| GPU Device IDs Detected   | 20b5                                           |
|-----  Deployment  --------+------------------------------------------------|
| Denylist                  | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Pass                                           |
| Environment Variables     | Pass                                           |
| Page Retirement/Row Remap | Pass                                           |
| Graphics Processes        | Pass                                           |
| Inforom                   | Skip                                           |
+-----  Integration  -------+------------------------------------------------+
| PCIe                      | Pass - All                                     |
+-----  Hardware  ----------+------------------------------------------------+
| GPU Memory                | Pass - All                                     |
+-----  Stress  ------------+------------------------------------------------+
+---------------------------+------------------------------------------------+

cc: @Dentrax

@nvvfedorov added the dcp and mig labels (duplicate and invalid were added and then removed) on Mar 15, 2024
@nvvfedorov
Collaborator

It sounds like the issue was resolved.

@jacksonyi0

jacksonyi0 commented May 16, 2024

root@68e97f630ad1:/etc/dcgm-exporter# dcgm-exporter -f dcp-metrics-included.csv
2024/05/16 03:45:44 maxprocs: Leaving GOMAXPROCS=64: CPU quota undefined
INFO[0000] Starting dcgm-exporter
INFO[0000] DCGM successfully initialized!
INFO[0000] Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded
INFO[0000] Falling back to metric file 'dcp-metrics-included.csv'
WARN[0000] Skipping line 20 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled
WARN[0000] Skipping line 21 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled
WARN[0000] Skipping line 22 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled
WARN[0000] Skipping line 23 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled
WARN[0000] Skipping line 24 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled
INFO[0000] Initializing system entities of type: GPU
INFO[0000] Not collecting NvSwitch metrics; no fields to watch for device type: 3
INFO[0000] Not collecting NvLink metrics; no fields to watch for device type: 6
INFO[0000] Not collecting CPU metrics; no fields to watch for device type: 7
INFO[0000] Not collecting CPU Core metrics; no fields to watch for device type: 8
INFO[0000] Pipeline starting
INFO[0000] Starting webserver
FATA[0000] Failed to Listen and Server HTTP server. error="listen tcp :9400: bind: address already in use"
What should I do?
I started dcgm-exporter in container mode and ran nv-hostengine -f host.log --log-level debug on the host, which failed with Err: Failed to start DCGM Server: -7.
dcgmi modules -l displays:
+-----------+--------------------+--------------------------------------------------+
| List Modules                                                                      |
| Status: Success                                                                   |
+===========+====================+==================================================+
| Module ID | Name               | State                                            |
+-----------+--------------------+--------------------------------------------------+
| 0         | Core               | Loaded                                           |
| 1         | NvSwitch           | Loaded                                           |
| 2         | VGPU               | Not loaded                                       |
| 3         | Introspection      | Not loaded                                       |
| 4         | Health             | Not loaded                                       |
| 5         | Policy             | Not loaded                                       |
| 6         | Config             | Not loaded                                       |
| 7         | Diag               | Not loaded                                       |
| 8         | Profiling          | Not loaded                                       |
| 9         | SysMon             | Not loaded                                       |
+-----------+--------------------+--------------------------------------------------+
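As a further check on this host (assuming the installed dcgmi build includes the profile subcommand), one can list which DCP metric groups the hardware supports at all; a sketch:

# If this prints no metric groups (or errors out), the GPUs themselves
# do not expose DCP profiling counters.
dcgmi profile --list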

@nikkon-dev
Collaborator

@jack161641,

According to this error message:

FATA[0000] Failed to Listen and Server HTTP server. error="listen tcp :9400: bind: address already in use"

You already have another dcgm-exporter instance running, or another process is occupying port 9400.
There can be only one nv-hostengine instance per GPU (whether standalone, embedded, bare-metal, or containerized).
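A minimal way to act on that, sketched under the assumption that iproute2's ss is available on the host and that port 9401 is free:

# See which process already owns port 9400.
ss -ltnp | grep ':9400'

# Either stop that process, or bind dcgm-exporter to a different port:
dcgm-exporter -f dcp-metrics-included.csv -a :9401

# Or reuse the already-running hostengine instead of embedding a second one
# (localhost:5555 is nv-hostengine's default address, assumed here):
dcgm-exporter -r localhost:5555 -a :9401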

@jacksonyi0

jacksonyi0 commented May 17, 2024

@jack161641,

According to this error message:

FATA[0000] Failed to Listen and Server HTTP server. error="listen tcp :9400: bind: address already in use"

You already have another dcgm-exporter instance running, or another process is occupying port 9400. There can be only one nv-hostengine instance per GPU (whether standalone, embedded, bare-metal, or containerized).

Hi, thank you for your reply.
But I started it through docker-compose and still could not collect these metrics.
Even with DCGM_FI_PROF_DRAM_ACTIVE, DCGM_FI_PROF_PIPE_FP32_ACTIVE, and DCGM_FI_PROF_PIPE_FP16_ACTIVE configured, it still reports:
INFO[0000] Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded
INFO[0000] Falling back to metric file 'default-counters.csv'
WARN[0000] Skipping line 25 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled
WARN[0000] Skipping line 26 ('DCGM_FI_PROF_PIPE_FP32_ACTIVE'): metric not enabled
WARN[0000] Skipping line 27 ('DCGM_FI_PROF_PIPE_FP16_ACTIVE'): metric not enabled
WARN[0000] Skipping line 28 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled
This is the information displayed by nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:00:08.0 Off |                  N/A |
|  0%   24C    P8    15W / 350W |  22204MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:00:09.0 Off |                  N/A |
|  0%   25C    P8    17W / 350W |  19814MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:00:0A.0 Off |                  N/A |
|  0%   24C    P8    16W / 350W |      0MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:00:0B.0 Off |                  N/A |
|  0%   25C    P8    20W / 350W |      0MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
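For reference, and independent of the hardware question addressed below: when DCP metrics are expected from inside a container, the dcgm-exporter documentation notes that the container needs extra capabilities. A docker-compose sketch, where the service name and image tag are assumptions:

services:
  dcgm-exporter:
    image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04
    cap_add:
      - SYS_ADMIN          # needed for DCP (profiling) metrics in a container
    ports:
      - "9400:9400"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]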

@nikkon-dev
Collaborator

@jacksonyi0,
The profiling module that handles the DCGM_FI_PROF_* metrics does not support consumer-grade GPUs (GeForce GTX/RTX). These metrics, known as DCP (Data Center Profiling) metrics, require datacenter-grade GPUs (V10x/A10x/H10x) or workstation-grade GPUs (formerly known as Quadro).
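On GeForce boards the classic non-DCP utilization counters can still be exported; a minimal counters-CSV sketch using field names from DCGM's field list (the help strings are illustrative):

# Format: DCGM field, Prometheus metric type, help string
DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
DCGM_FI_DEV_FB_USED,       gauge, Framebuffer memory used (in MiB).
DCGM_FI_DEV_POWER_USAGE,   gauge, Power draw (in W).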

@jacksonyi0

@jacksonyi0, The profiling module that handles the DCGM_FI_PROF_* metrics does not support consumer-grade GPUs (GeForce GTX/RTX). These metrics, known as DCP (Data Center Profiling) metrics, require datacenter-grade GPUs (V10x/A10x/H10x) or workstation-grade GPUs (formerly known as Quadro).

Thank you for your reply, then I will have to think of other solutions.

@jacksonyi0

Hello, I used docker run -d --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.3.6-3.4.2-ubuntu22.04 to start the container and got the error readlink: missing operand
Try 'readlink --help' for more information.
Entering the container with docker run -ti --entrypoint=/bin/sh --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04 and running bash /usr/local/dcgm/dcgm-exporter-entrypoint.sh still reports readlink: missing operand
Try 'readlink --help' for more information.
Running /usr/bin/dcgm-exporter directly reports runtime/cgo: pthread_create failed: Operation not permitted.
SIGABRT: abort
PC=0x7f33397539fc m=0 sigcode=18446744073709551610

goroutine 0 [idle]:
runtime: g 0: unknown pc 0x7f33397539fc
stack: frame={sp:0x7ffdbe6fa820, fp:0x0} stack=[0x7ffdbdefbda0,0x7ffdbe6fadb0)
0x00007ffdbe6fa720: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa730: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa740: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa750: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa760: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa770: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa780: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa790: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa7a0: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa7b0: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa7c0: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa7d0: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa7e0: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa7f0: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa800: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa810: 0x0000000000000000 0x00007f33397539ee
0x00007ffdbe6fa820: <0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa830: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa840: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa850: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa860: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa870: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa880: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa890: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa8a0: 0x0000000000000000 0xa8d8867e7227a900
0x00007ffdbe6fa8b0: 0x00007f33396ba740 0x0000000000000006
0x00007ffdbe6fa8c0: 0x0000000001d0e4f7 0x00007ffdbe6fabf0
0x00007ffdbe6fa8d0: 0x0000000002992bc0 0x00007f33396ff476
0x00007ffdbe6fa8e0: 0x00007f33398d8e90 0x00007f33396e57f3
0x00007ffdbe6fa8f0: 0x0000000000000020 0x0000000000000000
0x00007ffdbe6fa900: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa910: 0x0000000000000000 0x0000000000000000
runtime: g 0: unknown pc 0x7f33397539fc
What is causing this problem? Please help.
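The runtime/cgo: pthread_create failed: Operation not permitted error from a Go binary inside a container is commonly caused by the container runtime's seccomp profile rejecting the clone/clone3 syscall, typically when an older Docker/runc is paired with a newer Go toolchain. A hedged diagnostic sketch; updating Docker/runc is the real fix:

# Temporary test only: relax seccomp to confirm the diagnosis.
docker run -d --gpus all --rm -p 9400:9400 \
    --security-opt seccomp=unconfined \
    nvcr.io/nvidia/k8s/dcgm-exporter:3.3.6-3.4.2-ubuntu22.04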
