Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rocprofiler does not allow to change metrics when using intercept mode #71

Closed
gcongiu opened this issue Jan 24, 2022 · 10 comments
Closed

Comments

@gcongiu
Copy link

gcongiu commented Jan 24, 2022

Currently, rocprofiler does not allow to change metrics at runtime for intercepted kernels, so the following example won't work:

  rocprofiler_feature_t features[4];
  features[0].kind = ROCPROFILER_FEATURE_KIND_METRIC;
  features[0].name = "SQ_WAVES";
  unsigned feature_count = 1;

  init_intercept(features, feature_count);
  start_intercept();

  hipLaunchKernelGGL(vectoradd_float,
                  dim3(WIDTH/THREADS_PER_BLOCK_X, HEIGHT/THREADS_PER_BLOCK_Y),
                  dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y),
                  0, 0,
                  deviceA ,deviceB ,deviceC ,WIDTH ,HEIGHT);

  hipDeviceSynchronize();
  stop_intercept();
  shutdown_intercept();

  features[1].kind = ROCPROFILER_FEATURE_KIND_METRIC;
  features[1].name = "SQ_INSTS_VALU";
  feature_count += 1;

  init_intercept(features, feature_count);
  start_intercept();

  hipLaunchKernelGGL(vectoradd_float,
                  dim3(WIDTH/THREADS_PER_BLOCK_X, HEIGHT/THREADS_PER_BLOCK_Y),
                  dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y),
                  0, 0,
                  deviceA ,deviceB ,deviceC ,WIDTH ,HEIGHT);

  hipDeviceSynchronize();
  stop_intercept();
  shutdown_intercept();

Above, init_intercept() initializes the queue callbacks for intercept mode and calls rocprofiler_set_queue_callbacks(). start_intercept() and stop_intercept() call rocprofiler_start_queue_callbacks() and rocprofiler_stop_queue_callbacks(), respectively, and shutdown_intercept() calls rocprofiler_remove_queue_callbacks().

Rocprofiler does not allow users to call rocprofiler_set_queue_callbacks() if this has been already called. Thus, the second call to init_intercept() in the example code above causes the following error message:

> error(4096) "SetCallbacks(), reassigning queue callbacks - not supported”

The ability to change metrics at runtime (while using intercept mode) is a feature highly desirable for tools like PAPI. With the current implementation of rocprofiler PAPI users would have to define metrics once and have them applied to all the kernels being intercepted.

@gcongiu
Copy link
Author

gcongiu commented Jan 25, 2022

Example source code:
vectoradd_hip.cpp.txt

kikimych added a commit to kikimych/rocprofiler that referenced this issue Feb 21, 2022
         callback_data_ is not cleared in RemoveCallbacks
         fixed it
@kikimych
Copy link

Could you please check small fix? #77
Value of feature_count in https://github.com/ROCm-Developer-Tools/rocprofiler/files/7933103/vectoradd_hip.cpp.txt. can't be greater than actual number of features in vector.

@gcongiu
Copy link
Author

gcongiu commented Mar 1, 2022

@kikimych I will check it. Thanks

@gcongiu
Copy link
Author

gcongiu commented Mar 1, 2022

Previously I was getting:

LD_LIBRARY_PATH=$HOME/rocm-4.5.0/rocprofiler/lib:$LD_LIBRARY_PATH ./vectoradd_hip.exe
 System minor 0
 System major 9
 agent prop name
hip Device prop succeeded
kernel symbol(0x7f051040a800) name("vectoradd_float(float*, float const*, float const*, int, int) [clone .kd]") tid(67149) queue-id(0) gpu-id(0) > SQ_WAVES = (16384)
kernel symbol(0x7f051040a800) name("vectoradd_float(float*, float const*, float const*, int, int) [clone .kd]") tid(67149) queue-id(0) gpu-id(0) > SQ_WAVES = (16384)
kernel symbol(0x7f051040a800) name("vectoradd_float(float*, float const*, float const*, int, int) [clone .kd]") tid(67149) queue-id(0) gpu-id(0) > SQ_WAVES = (16384)
kernel symbol(0x7f051040a800) name("vectoradd_float(float*, float const*, float const*, int, int) [clone .kd]") tid(67149) queue-id(0) gpu-id(0) > SQ_WAVES = (16384)
error(4096) "SetCallbacks(), reassigning queue callbacks - not supported"
HSA_STATUS_ERROR: A generic error has occurred.
Aborted (core dumped)

Now, with the fix that sets callback_data_ = NULL I am getting:

$ LD_LIBRARY_PATH=$HOME/rocm-4.5.0/rocprofiler/lib:$LD_LIBRARY_PATH ./vectoradd_hip.exe
 System minor 0
 System major 9
 agent prop name
hip Device prop succeeded
kernel symbol(0x7f5dfe40a800) name("vectoradd_float(float*, float const*, float const*, int, int) [clone .kd]") tid(67314) queue-id(0) gpu-id(0) > SQ_WAVES = (16384)
kernel symbol(0x7f5dfe40a800) name("vectoradd_float(float*, float const*, float const*, int, int) [clone .kd]") tid(67314) queue-id(0) gpu-id(0) > SQ_WAVES = (16384)
kernel symbol(0x7f5dfe40a800) name("vectoradd_float(float*, float const*, float const*, int, int) [clone .kd]") tid(67314) queue-id(0) gpu-id(0) > SQ_WAVES = (16384)
kernel symbol(0x7f5dfe40a800) name("vectoradd_float(float*, float const*, float const*, int, int) [clone .kd]") tid(67314) queue-id(0) gpu-id(0) > SQ_WAVES = (16384)
vectoradd_hip.exe: vectoradd_hip.cpp:160: hsa_status_t _rocp_dispatch_callback(const rocprofiler_callback_data_t *, void *, rocprofiler_group_t *): Assertion `status == HSA_STATUS_SUCCESS' failed.
Aborted (core dumped)

@kikimych
Copy link

You are adding 2 features to profile, but setting feature count to 1. I have fixed it locally, but forgot to report. Could you please check with feature_count set to number of actual features?

@gcongiu
Copy link
Author

gcongiu commented Mar 30, 2022

The feature_count=1 was intended not a typo. However, if I set it to 2 instead, start/stop monitoring, then change it back to 1 and start/stop monitoring again, I get the same error(4096) "SetCallbacks(), reassigning queue callbacks - not supported" I used to see.

@kikimych
Copy link

My bad. It's not related to feature_count. You have a typo in metric name. "SQ_INSTS_VALU", not "SQ_INST_VALU".

@gcongiu
Copy link
Author

gcongiu commented Apr 3, 2023

[Update] Still not resolved with ROCm 5.4.x:

$ ./kernel-intercept
Tool lib "/home/gcongiu/rocprofiler/build/librocprofiler64.so" failed to load.
 System minor 0
 System major 9
 agent prop name AMD Instinct MI210
hip Device prop succeeded
error(4096) "SetCallbacks(), reassigning queue callbacks - not supported"
HSA_STATUS_ERROR: A generic error has occurred.
Aborted (core dumped)

@harkgill-amd
Copy link

Hi @gcongiu, apologies for the lack of response. Could you please check if the issue persists with ROCm 6.2.0? If so, we can continue to investigate the issue from there.

@harkgill-amd
Copy link

Closing this issue out. @gcongiu , if you are still encountering this error with the latest ROCm 6.2.1 release, please leave a comment and I will re-open this ticket. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants