
Add workaround for corrupted nsys GPU utilization data #104

Closed
GregoryKimball opened this issue Nov 10, 2022 · 3 comments

Comments

@GregoryKimball

libcudf benchmarks run using nvbench show a conflict with Nsight Systems when collecting GPU utilization data. This issue is tracked on the Nsight Systems Jira board (Slack thread, Jira Issue).

The current consensus is that the root cause lies in how Nsight Systems collects utilization data. I'm opening this issue to request that nvbench investigate a workaround. C++ Google Benchmark and Python pytest benchmarks have no issues collecting GPU utilization data with Nsight Systems, so there must be some way for nvbench users running with the --profile flag to access GPU utilization data.
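
For reference, a minimal sketch of how a non-nvbench benchmark binary is typically profiled with Nsight Systems; the binary name and output path are placeholders (not taken from this issue), and the flags mirror the nvbench command shown later in this thread.

# Hypothetical example: profiling a google-benchmark binary with GPU metrics sampling enabled.
# GROUPBY_BENCH and /tmp/groupby_gbench are placeholder names.
nsys profile -t nvtx,cuda,osrt -f true \
    --cuda-memory-usage=true \
    --gpu-metrics-device=0 \
    --output=/tmp/groupby_gbench \
    cpp/build/benchmarks/GROUPBY_BENCH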

Reference profile with nvbench: [image]

Reference profile with gbench: [image]

@bdice
Contributor

bdice commented Jan 3, 2023

The corresponding JIRA issue has been closed. @GregoryKimball Can this be closed?

@GregoryKimball
Author

GregoryKimball commented Jan 21, 2023

Thank you @bdice for checking in. I'm sorry to say that there is still a problem here at the intersection of nsys and nvbench.

When running this command on RAPIDS devel image 9ddc9c4c2046:

B=JOIN_NVBENCH && /nfs/nsight-systems-2022.5.1/bin/nsys profile -t nvtx,cuda,osrt -f true --cuda-memory-usage=true --gpu-metrics-device=0 --output=/nfs/20230113_mixed_join/"$B" cpp/build/benchmarks/"$B" --devices 0 --profile --json /nfs/20230113_mixed_join/"$B".json | tee /nfs/20230113_mixed_join/"$B".txt
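
For readability, the same command broken across lines, with comments describing what each flag does; the descriptions are based on standard nsys and nvbench usage, not on anything specific to this issue.

# -t nvtx,cuda,osrt        trace NVTX ranges, CUDA API calls, and OS runtime libraries
# -f true                  overwrite any existing report file
# --cuda-memory-usage=true record CUDA memory usage in the report
# --gpu-metrics-device=0   sample GPU metrics (the utilization data in question) on device 0
# --profile (nvbench)      nvbench's profiling mode, intended for runs under an external profiler
B=JOIN_NVBENCH
/nfs/nsight-systems-2022.5.1/bin/nsys profile -t nvtx,cuda,osrt -f true \
    --cuda-memory-usage=true --gpu-metrics-device=0 \
    --output=/nfs/20230113_mixed_join/"$B" \
    cpp/build/benchmarks/"$B" --devices 0 --profile \
    --json /nfs/20230113_mixed_join/"$B".json \
  | tee /nfs/20230113_mixed_join/"$B".txt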

Instead of valid GPU utilization metrics, we receive this nsys diagnostics error:

Error when processing events: Source ID=
Type=ErrorInformation (18)
 Error information:
 ProcessEventsError (4005)
  Properties:
  ErrorText (100)=/dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Host/Analysis/EventHandler/GpuMetricsEventHandler.cpp(202): Throw in function void QuadDAnalysis::EventHandler::GpuMetricsEventHandler::PutEvent(QuadDAnalysis::EventHandler::GpuMetricsEventHandler::EventPtr)
Dynamic exception type: boost::wrapexcept
std::exception::what: ChronologicalOrderError
[QuadDCommon::tag_message*] = GPU Metrics event chronological order was broken.

...

Error	Daemon		00:32.551	
GPU Metrics [0]: Sampling buffer overflow.
Error	Daemon		00:46.632	
GPU Metrics [0]: Sampling buffer overflow.
Error	Daemon		00:55.676	
GPU Metrics [0]: Sampling buffer overflow.
Error	Daemon		01:03.454	
GPU Metrics [0]: Sampling buffer overflow.
Error	Daemon		01:17.560	
GPU Metrics [0]: Sampling buffer overflow.
Error	Daemon		01:26.606	
GPU Metrics [0]: Sampling buffer overflow.
Error	Daemon		01:34.367	
GPU Metrics [0]: Sampling buffer overflow.
Error	Daemon		01:38.231	
GPU Metrics [0]: Sampling buffer overflow.

As of cudf 23.02, we find the "GPU Metrics event chronological order was broken" error for every libcudf microbenchmark that uses nvbench, including the following (a scripted version of these runs is sketched after the list):

GROUPBY_NVBENCH
JOIN_NVBENCH
PARQUET_READER_NVBENCH
PARQUET_WRITER_NVBENCH
REDUCTION_NVBENCH
SORT_NVBENCH
STREAM_COMPACTION_NVBENCH
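
A minimal sketch of how those runs can be scripted, assuming the same nsys install, container image, and output directory as the JOIN_NVBENCH command above:

# Hypothetical loop over the affected nvbench binaries; the paths are the ones
# from the JOIN_NVBENCH command above and are specific to that environment.
OUT=/nfs/20230113_mixed_join
for B in GROUPBY_NVBENCH JOIN_NVBENCH PARQUET_READER_NVBENCH PARQUET_WRITER_NVBENCH \
         REDUCTION_NVBENCH SORT_NVBENCH STREAM_COMPACTION_NVBENCH; do
    /nfs/nsight-systems-2022.5.1/bin/nsys profile -t nvtx,cuda,osrt -f true \
        --cuda-memory-usage=true --gpu-metrics-device=0 --output="$OUT/$B" \
        cpp/build/benchmarks/"$B" --devices 0 --profile --json "$OUT/$B".json \
        | tee "$OUT/$B".txt
done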

The error only occurs with nvbench, never with Google Benchmark or pytest, so I still think we need an option in nvbench that prevents corrupting the GPU Metrics event data. Perhaps solving #100 will also address this. I am hoping the workaround would take the form of a "cuda safe mode" that makes nvbench act like a GPU-naive benchmarking tool.
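
One nsys-side knob that might be worth trying while an nvbench workaround is investigated is the GPU metrics sampling rate; the "Sampling buffer overflow" errors above suggest the sampler cannot keep up. Whether lowering it avoids the chronological-order error is an untested assumption, not something verified in this thread.

# Untested idea: sample GPU metrics at a lower rate than the nsys default
# to reduce pressure on the sampling buffer that overflows in the logs above.
/nfs/nsight-systems-2022.5.1/bin/nsys profile -t nvtx,cuda,osrt -f true \
    --cuda-memory-usage=true --gpu-metrics-device=0 --gpu-metrics-frequency=1000 \
    --output=/nfs/20230113_mixed_join/JOIN_NVBENCH \
    cpp/build/benchmarks/JOIN_NVBENCH --devices 0 --profile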

@GregoryKimball
Author

Closed by rapidsai/cudf#12728
