
Add workaround for corrupted nsys GPU utilization data #104

Closed
GregoryKimball opened this issue Nov 10, 2022 · 3 comments

Comments

@GregoryKimball

libcudf benchmarks run using nvbench show a conflict with Nsight Systems when collecting GPU utilization data. This issue is tracked on the Nsight Systems Jira board (Slack thread, Jira Issue).

The current consensus is that the root cause lies in how Nsight Systems collects utilization data. I'm opening this issue to request that nvbench investigate a workaround. C++ Google Benchmark and Python pytest benchmarks have no issues collecting GPU utilization data with Nsight Systems, so there must be some way for nvbench users running with the --profile flag to access GPU utilization data.
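
For reference, a minimal sketch of how a non-nvbench benchmark binary is typically profiled with Nsight Systems; the binary name and output path are placeholders (not taken from this issue), and the flags mirror the nvbench command shown later in this thread.

# Hypothetical example: profiling a google-benchmark binary with GPU metrics sampling enabled.
# GROUPBY_BENCH and /tmp/groupby_gbench are placeholder names.
nsys profile -t nvtx,cuda,osrt -f true \
    --cuda-memory-usage=true \
    --gpu-metrics-device=0 \
    --output=/tmp/groupby_gbench \
    cpp/build/benchmarks/GROUPBY_BENCH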

Reference profile with nvbench: [image]

Reference profile with gbench: [image]

@bdice
Contributor

bdice commented Jan 3, 2023

The corresponding JIRA issue has been closed. @GregoryKimball Can this be closed?

@GregoryKimball
Author

GregoryKimball commented Jan 21, 2023

Thank you @bdice for checking in. I'm sorry to say that there is still a problem here at the intersection of nsys and nvbench.

When running this command on RAPIDS devel image 9ddc9c4c2046:

B=JOIN_NVBENCH && /nfs/nsight-systems-2022.5.1/bin/nsys profile -t nvtx,cuda,osrt -f true --cuda-memory-usage=true --gpu-metrics-device=0 --output=/nfs/20230113_mixed_join/"$B" cpp/build/benchmarks/"$B" --devices 0 --profile --json /nfs/20230113_mixed_join/"$B".json | tee /nfs/20230113_mixed_join/"$B".txt
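
For readability, the same command broken across lines, with comments describing what each flag does; the descriptions are based on standard nsys and nvbench usage, not on anything specific to this issue.

# -t nvtx,cuda,osrt        trace NVTX ranges, CUDA API calls, and OS runtime libraries
# -f true                  overwrite any existing report file
# --cuda-memory-usage=true record CUDA memory usage in the report
# --gpu-metrics-device=0   sample GPU metrics (the utilization data in question) on device 0
# --profile (nvbench)      nvbench's profiling mode, intended for runs under an external profiler
B=JOIN_NVBENCH
/nfs/nsight-systems-2022.5.1/bin/nsys profile -t nvtx,cuda,osrt -f true \
    --cuda-memory-usage=true --gpu-metrics-device=0 \
    --output=/nfs/20230113_mixed_join/"$B" \
    cpp/build/benchmarks/"$B" --devices 0 --profile \
    --json /nfs/20230113_mixed_join/"$B".json \
  | tee /nfs/20230113_mixed_join/"$B".txt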

Instead of valid GPU utilization metrics, we receive this nsys diagnostics error:

Error when processing events: Source ID=
Type=ErrorInformation (18)
 Error information:
 ProcessEventsError (4005)
  Properties:
  ErrorText (100)=/dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Host/Analysis/EventHandler/GpuMetricsEventHandler.cpp(202): Throw in function void QuadDAnalysis::EventHandler::GpuMetricsEventHandler::PutEvent(QuadDAnalysis::EventHandler::GpuMetricsEventHandler::EventPtr)
Dynamic exception type: boost::wrapexcept
std::exception::what: ChronologicalOrderError
[QuadDCommon::tag_message*] = GPU Metrics event chronological order was broken.

...

Error	Daemon		00:32.551	
GPU Metrics [0]: Sampling buffer overflow.
Error	Daemon		00:46.632	
GPU Metrics [0]: Sampling buffer overflow.
Error	Daemon		00:55.676	
GPU Metrics [0]: Sampling buffer overflow.
Error	Daemon		01:03.454	
GPU Metrics [0]: Sampling buffer overflow.
Error	Daemon		01:17.560	
GPU Metrics [0]: Sampling buffer overflow.
Error	Daemon		01:26.606	
GPU Metrics [0]: Sampling buffer overflow.
Error	Daemon		01:34.367	
GPU Metrics [0]: Sampling buffer overflow.
Error	Daemon		01:38.231	
GPU Metrics [0]: Sampling buffer overflow.

As of cudf 23.02, we find the "GPU Metrics event chronological order was broken" error for every libcudf microbenchmark that uses nvbench, including the following (a scripted version of these runs is sketched after the list):

GROUPBY_NVBENCH
JOIN_NVBENCH
PARQUET_READER_NVBENCH
PARQUET_WRITER_NVBENCH
REDUCTION_NVBENCH
SORT_NVBENCH
STREAM_COMPACTION_NVBENCH
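
A minimal sketch of how those runs can be scripted, assuming the same nsys install, container image, and output directory as the JOIN_NVBENCH command above:

# Hypothetical loop over the affected nvbench binaries; the paths are the ones
# from the JOIN_NVBENCH command above and are specific to that environment.
OUT=/nfs/20230113_mixed_join
for B in GROUPBY_NVBENCH JOIN_NVBENCH PARQUET_READER_NVBENCH PARQUET_WRITER_NVBENCH \
         REDUCTION_NVBENCH SORT_NVBENCH STREAM_COMPACTION_NVBENCH; do
    /nfs/nsight-systems-2022.5.1/bin/nsys profile -t nvtx,cuda,osrt -f true \
        --cuda-memory-usage=true --gpu-metrics-device=0 --output="$OUT/$B" \
        cpp/build/benchmarks/"$B" --devices 0 --profile --json "$OUT/$B".json \
        | tee "$OUT/$B".txt
done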

The error only occurs with nvbench, never with Google Benchmark or pytest, so I still think we need an option in nvbench that prevents corrupting the GPU Metrics event data. Perhaps solving #100 will also address this. I am hoping the workaround would take the form of a "cuda safe mode" that makes nvbench act like a GPU-naive benchmarking tool.
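
One nsys-side knob that might be worth trying while an nvbench workaround is investigated is the GPU metrics sampling rate; the "Sampling buffer overflow" errors above suggest the sampler cannot keep up. Whether lowering it avoids the chronological-order error is an untested assumption, not something verified in this thread.

# Untested idea: sample GPU metrics at a lower rate than the nsys default
# to reduce pressure on the sampling buffer that overflows in the logs above.
/nfs/nsight-systems-2022.5.1/bin/nsys profile -t nvtx,cuda,osrt -f true \
    --cuda-memory-usage=true --gpu-metrics-device=0 --gpu-metrics-frequency=1000 \
    --output=/nfs/20230113_mixed_join/JOIN_NVBENCH \
    cpp/build/benchmarks/JOIN_NVBENCH --devices 0 --profile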

@GregoryKimball
Author

Closed by rapidsai/cudf#12728
