Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Issue]: RDC libraries try to dlopen missing library, and read xml file from wrong path #35

Open
morrone opened this issue Nov 6, 2024 · 19 comments

Comments

@morrone
Copy link

morrone commented Nov 6, 2024

Problem Description

Note that we are installing ROCm 6.2.1 in the path:

/op/rocm-6.2.1/

The path includes the version to allow multiple versions of ROCm to be installed at the same time.

I have an application that uses RDC, and the RDC libraries are throwing the following ERRORs (which appear to be more like warnings, because things continue to run):

1730927059.242 ERROR RdcModuleMgrImpl.cc(65): Failed to insert module: N3amd3rdc9RdcRVSLibE
Fail to open librdc_rvs.so: librdc_rvs.so: cannot open shared object file: No such file or directory
1730927059.295 ERROR RdcRocpLib.cc(208): failed to open ROCP_METRICS: /opt/rocm/libexec/rocprofiler/counters/derived_counters.xml
1730927059.295 ERROR RdcRocpLib.cc(46): Rocp related function will not work.
1730927059.295 ERROR RdcModuleMgrImpl.cc(65): Failed to insert module: N3amd3rdc10RdcRocpLibE
rocprofiler path could not be set

When I search the /opt/rocm-6.2.1 tree for "rvs", the only libraries that I find with that string are the following:

/opt/rocm-6.2.1/lib/librvslib.so
/opt/rocm-6.2.1/lib/librvslib.so.0
/opt/rocm-6.2.1/lib/librvslib.so.0.0.60201

I have no idea if librvslib is a different name for librdc_rvs.so, or something else entirely.

In any event, there does not seem to be a "librdc_rvs.so" anywhere in the tree.

Next, the derived_counters.xml file is found in our install tree here:

/opt/rocm-6.2.1/libexec/rocprofiler/counters/derived_counters.xml

Does RDC have the path improperly hard coded?

Operating System

RHEL8

CPU

Irrelevant

GPU

Irrelevant

ROCm Version

ROCm 6.2.1

ROCm Component

rdc

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

@morrone morrone changed the title [Issue]: RDC libraries try to dlopen mispelled library, and read xml file from wrong path [Issue]: RDC libraries try to dlopen missing library, and read xml file from wrong path Nov 6, 2024
@morrone
Copy link
Author

morrone commented Nov 6, 2024

It looks to me in like int he top-level CMakefile of the RDC code we have this:

# When cmake -DBUILD_RVS=off, it will not build the librdc_rvs.so
# which requires the RocmValidationSuite
option(BUILD_RVS "Build targets for librdc_rvs.so" OFF)

So the RDC code really shouldn't be trying to load a library when it should know that the library doesn't exist. And if the library wasn't build, the most we should expect to see is warning that the "rvs" support does not exist. But is the "RocmValidationSuite" something that we would care about in normal production use of AMD GPUs?

Would we miss out on any of the monitoring fields in enum rdc_field_t when rvs support is missing?

@harkgill-amd
Copy link
Contributor

harkgill-amd commented Nov 7, 2024

You are correct, librdc_rvs.so should not be loaded when the default is set to not build it. This is fixed in the amd-staging branch with 660c5af.

For the error related to derived_counters.xml, it looks like the ROCM_PATH environment variable is used to locate your ROCm installation. /opt/rocm is the default if the variable is not set (ref). Could you please try setting the variable and check if the error is resolved?

@morrone
Copy link
Author

morrone commented Nov 7, 2024

For the error related to derived_counters.xml, it looks like the ROCM_PATH environment variable is used to locate your ROCm installation. /opt/rocm is the default if the variable is not set (ref). Could you please try setting the variable and check if the error is resolved?

That default needs to be fixed. The default should be derived from the configured install target path at build time.

@harkgill-amd
Copy link
Contributor

Could you provide more information on the environment you're running on as well as the output of stat /proc/self/maps? /opt/rocm is only defaulted to if ROCM_PATH isn't set and /proc/self/maps cannot be opened or librocprofiler64.so is not found in /proc/self/maps.

@morrone
Copy link
Author

morrone commented Nov 14, 2024

Could you provide more information on the environment you're running on as well as the output of stat /proc/self/maps? /opt/rocm is only defaulted to if ROCM_PATH isn't set and /proc/self/maps cannot be opened or librocprofiler64.so is not found in /proc/self/maps.

/proc/self/maps can be opened, and it does not contain librocprofiler64.so. But there is no reason to believe that my shell's "maps" file would contain librocprofiler64.so. That doesn't really tell us anything.

You might be overthinking things. There is no additional system information needed to understand the problem.

We have ROCm version 6.2.1 installed in:

/opt/rocm-6.2.1

Therefore, /opt/rocm is the incorrect default path for when ROCM_PATH is not set. For our choosen base installation path, the correct default is /opt/rocm-6.2.1 when ROCM_PATH is not set.

The code is making an incorrect assumption that everyone installs ROCm in exactly the same place.

@zichguan-amd
Copy link
Contributor

Hi @morrone, does the error occur when you specify ROCM_PATH to your ROCm installation directory? If you have multiple versions of ROCm installed, you need to specify which ROCm version to use with ROCM_PATH. The default /opt/rocm path symlinks to the latest ROCm installed, you can view and change your default ROCm version with sudo update-alternatives --config rocm (https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/post-install.html).

@morrone
Copy link
Author

morrone commented Nov 19, 2024

Hi @morrone, does the error occur when you specify ROCM_PATH to your ROCm installation directory? If you have multiple versions of ROCm installed, you need to specify which ROCm version to use with ROCM_PATH. The default /opt/rocm path symlinks to the latest ROCm installed, you can view and change your default ROCm version with sudo update-alternatives --config rocm (https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/post-install.html).

We don't have an /opt/rocm symlink.

Settting ROCM_PATH would probably work, but setting a ROCM_PATH environment variable is somewhat challenging for something that runs as a daemon out of systemd. It means we need to synchronize the runtime configuration with knowledge of how things were compiled. It seems like effort that is kicked down the road from ROCm not handing that.

@zichguan-amd
Copy link
Contributor

You can add environment variables to the environment file at /opt/rocm-6.2.1/etc/rdc_options. For example

# Append 'rdc' daemon parameters here
RDC_OPTS="-u -d"
ROCM_PATH=/opt/rocm-6.2.1
#RDC_OPTS="-p 50051 -u -d"

@morrone
Copy link
Author

morrone commented Nov 20, 2024

You can add environment variables to the environment file at /opt/rocm-6.2.1/etc/rdc_options. For example

The rdc library will know how to find /opt/rocm-6.2.1/etc/rdc_options? But it can't find /opt/rocm-6.2.1 in other places in the code?

@zichguan-amd
Copy link
Contributor

You should have a service file in your ROCm directory at /opt/rocm-6.2.1/libexec/rdc/rdc.service, which sets the environment file. By default, it should point to /opt/rocm-6.2.1/etc/rdc_options, see this line

EnvironmentFile=-/@CPACK_PACKAGING_INSTALL_PREFIX@/@CMAKE_INSTALL_SYSCONFDIR@/rdc_options
and docs here: https://github.com/ROCm/rdc?tab=readme-ov-file#start-rdcd-using-systemd.

@morrone
Copy link
Author

morrone commented Nov 20, 2024 via email

@zichguan-amd
Copy link
Contributor

Then you need to figure out how to set environment variable for ldmsd. Judging from ldms documentation, it already relies on environment variables: https://ovis-hpc-personal.readthedocs.io/projects/ldms/en/latest/ldms-quickstart.html#basic-configuration-and-running. If it is started with systemd then there would've been a ldms.service file that you can configure similar to how it's done with rdc.service.

@morrone
Copy link
Author

morrone commented Nov 23, 2024

Then you need to figure out how to set environment variable for ldmsd.

Pretty much the only sane way to handle this is to dynamically generate the service file at compile time with the fixed path in it. And embedding the correct path at compile time is what ROCm (at this time) seems to be refusing to do.

So basically ROCm is forcing me to do the same thing that ROCm refuses to do.

@dmitrii-galantsev
Copy link
Collaborator

whoops didn't see this issue until now.
Hopefully we can figure something out by EOY.

@dmitrii-galantsev
Copy link
Collaborator

dmitrii-galantsev commented Nov 26, 2024

@morrone I think you're right and setting it to install path is the path forward.
I'll address in next couple weeks due to PTO conflicts.

@dmitrii-galantsev
Copy link
Collaborator

and yes, trying to load modules when compiled without the support for them is naive:

insert_modules<RdcRVSLib, RdcRocrLib, RdcRocpLib>();

It would be much better to write something less headache inducing that uses compile definitions.
I've not been great at supporting non-standard build configurations (read: whatever we have in our CI build scripts).

If you have suggestions on that front - I'm all ears!

@zichguan-amd
Copy link
Contributor

@morrone we can change the default ROCm path to be the path set by the cmake flag -DROCM_DIR at build time, would that resolve your concerns?

@morrone
Copy link
Author

morrone commented Dec 2, 2024

@morrone we can change the default ROCm path to be the path set by the cmake flag -DROCM_DIR at build time, would that resolve your concerns?

Assuming that "-DROCM_DIR" is what the ROCm folks use when they build the standard rpms, then yes, that sounds like the right thing to me.

@zichguan-amd
Copy link
Contributor

Hi @morrone, changes have landed in amd-staging c042b4f

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants