[Issue]: RDC libraries try to dlopen missing library, and read xml file from wrong path #35
Comments
It looks to me like, in the top-level CMakeLists.txt of the RDC code, we have this:
So the RDC code really shouldn't be trying to load a library when it should know that the library doesn't exist. And if the library wasn't built, the most we should expect to see is a warning that the "rvs" support does not exist. But is the "RocmValidationSuite" something that we would care about in normal production use of AMD GPUs? Would we miss out on any of the monitoring fields in enum rdc_field_t when rvs support is missing? |
You are correct. For the error related to derived_counters.xml, it looks like the |
That default needs to be fixed. The default should be derived from the configured install target path at build time. |
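One way the suggestion above could look in code is sketched below. This is a minimal illustration, not RDC's actual implementation: the macro name `RDC_DEFAULT_ROCM_PATH` and both function names are hypothetical, and the sketch assumes the build system injects the configured install prefix as a compile definition. The `ROCM_PATH` environment variable still takes precedence when it is set.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical macro name. The assumption is that the build system defines it
// from the configured install prefix, for example with CMake's
// target_compile_definitions(). The fallback below only exists so that this
// sketch compiles on its own.
#ifndef RDC_DEFAULT_ROCM_PATH
#define RDC_DEFAULT_ROCM_PATH "/opt/rocm"
#endif

// Pure helper: use the override if it is set and non-empty, otherwise use
// the default that was baked in at build time.
inline std::string choose_rocm_path(const char* env_value,
                                    const std::string& build_default) {
  if (env_value != nullptr && env_value[0] != '\0') {
    return env_value;
  }
  return build_default;
}

// Resolve the ROCm root at run time: ROCM_PATH wins, the build-time
// default is the fallback.
inline std::string rocm_path() {
  return choose_rocm_path(std::getenv("ROCM_PATH"), RDC_DEFAULT_ROCM_PATH);
}
```

With this shape, a daemon that embeds the library needs no extra environment setup as long as the libraries stay where they were installed.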
Could you provide more information on the environment you're running on as well as the output of |
/proc/self/maps can be opened, and it does not contain librocprofiler64.so. But there is no reason to believe that my shell's "maps" file would contain librocprofiler64.so. That doesn't really tell us anything. You might be overthinking things. There is no additional system information needed to understand the problem. We have ROCm version 6.2.1 installed in:
Therefore, /opt/rocm is the incorrect default path for when ROCM_PATH is not set. For our chosen base installation path, the correct default is The code is making an incorrect assumption that everyone installs ROCm in exactly the same place. |
Hi @morrone, does the error occur when you specify |
We don't have an /opt/rocm symlink. Setting ROCM_PATH would probably work, but setting a ROCM_PATH environment variable is somewhat challenging for something that runs as a daemon out of systemd. It means we need to synchronize the runtime configuration with knowledge of how things were compiled. It seems like effort that is kicked down the road from ROCm not handling that. |
You can add environment variables to the environment file at
|
The rdc library will know how to find /opt/rocm-6.2.1/etc/rdc_options? But it can't find /opt/rocm-6.2.1 in other places in the code? |
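You should have a service file in your ROCm directory at /opt/rocm-6.2.1/libexec/rdc/rdc.service, which sets the environment file. By default, it should point to /opt/rocm-6.2.1/etc/rdc_options; see line 16 of server/rdc.service.in at a0f7290.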
|
That is not really relevant here. I'm not talking about the rdc service or
rdc.service file. I am talking about using the rdc libraries from a service
that is not part of ROCm. Just using the rdc API. In particular, we are
using the ldmsd monitoring daemon which uses librdc_bootstrap in embedded
mode.
…On Wed, Nov 20, 2024 at 12:07 PM zichguan-amd ***@***.***> wrote:
You should have a service file in your ROCm directory at
/opt/rocm-6.2.1/libexec/rdc/rdc.service, which sets the environment file.
By default, it should point to /opt/rocm-6.2.1/etc/rdc_options, see this
line
https://github.com/ROCm/rdc/blob/a0f72904286950a9060d51404decc04b042cb85d/server/rdc.service.in#L16
and docs here:
https://github.com/ROCm/rdc?tab=readme-ov-file#start-rdcd-using-systemd.
|
Then you need to figure out how to set the environment variable for ldmsd. Judging from the LDMS documentation, it already relies on environment variables: https://ovis-hpc-personal.readthedocs.io/projects/ldms/en/latest/ldms-quickstart.html#basic-configuration-and-running. If it is started with |
Pretty much the only sane way to handle this is to dynamically generate the service file at compile time with the fixed path in it. And embedding the correct path at compile time is what ROCm (at this time) seems to be refusing to do. So basically ROCm is forcing me to do the same thing that ROCm refuses to do. |
whoops didn't see this issue until now. |
@morrone I think you're right and setting it to install path is the path forward. |
and yes, trying to load modules when compiled without support for them is naive; see rdc/rdc_libs/rdc/src/RdcModuleMgrImpl.cc, line 91 in d8fec06.
It would be much better to write something less headache-inducing that uses compile definitions. If you have suggestions on that front - I'm all ears! |
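As one possible shape for the compile-definition approach, the sketch below builds the list of module libraries from flags fixed at build time, so a module that was not built is never passed to dlopen. Everything here is an assumption for illustration: the macro `RDC_HAS_RVS`, the `ModuleInfo` struct, and both function names are hypothetical and are not taken from RdcModuleMgrImpl.cc. Only the library name librdc_rvs.so comes from this issue.

```cpp
#include <string>
#include <vector>

// Hypothetical macro: the assumption is that the build system defines
// RDC_HAS_RVS only when the RVS module is actually built.
#ifdef RDC_HAS_RVS
constexpr bool kBuiltWithRvs = true;
#else
constexpr bool kBuiltWithRvs = false;
#endif

// One entry per optional module: its shared library name and whether
// it was enabled at build time.
struct ModuleInfo {
  std::string lib_name;
  bool built;
};

// The table of optional modules known to this build.
inline std::vector<ModuleInfo> known_modules() {
  std::vector<ModuleInfo> modules;
  modules.push_back({"librdc_rvs.so", kBuiltWithRvs});
  return modules;
}

// Return only the libraries that were built, in table order. A loader
// would attempt to open these names and nothing else.
inline std::vector<std::string> libs_to_load(
    const std::vector<ModuleInfo>& modules) {
  std::vector<std::string> libs;
  for (const auto& module : modules) {
    if (module.built) {
      libs.push_back(module.lib_name);
    }
  }
  return libs;
}
```

Keeping the selection in a small table means a skipped module can be reported once as a warning instead of surfacing as a load error at run time.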
@morrone we can change the default ROCm path to be the path set by the cmake flag |
Assuming that "-DROCM_DIR" is what the ROCm folks use when they build the standard rpms, then yes, that sounds like the right thing to me. |
Problem Description
Note that we are installing ROCm 6.2.1 in the path:
/opt/rocm-6.2.1/
The path includes the version to allow multiple versions of ROCm to be installed at the same time.
I have an application that uses RDC, and the RDC libraries are throwing the following ERRORs (which appear to be more like warnings, because things continue to run):
When I search the /opt/rocm-6.2.1 tree for "rvs", the only libraries that I find with that string are the following:
I have no idea if librvslib is a different name for librdc_rvs.so, or something else entirely.
In any event, there does not seem to be a "librdc_rvs.so" anywhere in the tree.
Next, the derived_counters.xml file is found in our install tree here:
Does RDC have the path improperly hard-coded?
Operating System
RHEL8
CPU
Irrelevant
GPU
Irrelevant
ROCm Version
ROCm 6.2.1
ROCm Component
rdc
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response