
MIOpen cache issue with SLURM and multiple jobs #1878

Open
arkhodamoradi opened this issue Nov 14, 2022 · 6 comments

arkhodamoradi commented Nov 14, 2022

The environment is a computing cluster:
Slurm 20.11.3
MI50 GPUs
PyTorch 1.12.0
ROCM 5.2.0

Code (test.py):


import torch
net = torch.nn.Conv2d(2, 28, 3).cuda()
inp = torch.randn(20, 2, 50, 50).cuda()
outputs = net(inp)


Running the code (test.py) in multiple jobs (submitted with the sbatch command) produces this error:
/long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/MLOpen/src/include/miopen/kern_db.hpp:147: Internal error while accessing SQLite database: database disk image is malformed
Traceback (most recent call last):
  File "/home/alirezak/playground/test.py", line 4, in <module>
    outputs = net(inp)
  File "/public/apps/python/3.10.6/gcc.7.3.1/base/pytorch/1.12.0/rocm.5.2.0/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/public/apps/python/3.10.6/gcc.7.3.1/base/pytorch/1.12.0/rocm.5.2.0/lib/python3.10/site-packages/torchvision/models/resnet.py", line 285, in forward
    return self._forward_impl(x)
  File "/public/apps/python/3.10.6/gcc.7.3.1/base/pytorch/1.12.0/rocm.5.2.0/lib/python3.10/site-packages/torchvision/models/resnet.py", line 268, in _forward_impl
    x = self.conv1(x)
  File "/public/apps/python/3.10.6/gcc.7.3.1/base/pytorch/1.12.0/rocm.5.2.0/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/public/apps/python/3.10.6/gcc.7.3.1/base/pytorch/1.12.0/rocm.5.2.0/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/public/apps/python/3.10.6/gcc.7.3.1/base/pytorch/1.12.0/rocm.5.2.0/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 453, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: miopenStatusInternalError

Solution:
For each (sbatch) job, do the following:

export experiment=<some unique name>
export HOME=<some tmp folder>/$experiment
mkdir -p "$HOME"
export TMPDIR=<some tmp folder>/$experiment/tmp
mkdir -p "$TMPDIR"
export MIOPEN_CUSTOM_CACHE_DIR=<some tmp folder>/$experiment/miopen
mkdir -p "$MIOPEN_CUSTOM_CACHE_DIR"

run the experiment

rm -rf <some tmp folder>/$experiment
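The steps above can be packaged as a single job script. A minimal sketch, assuming a node-local scratch directory under /tmp; the scratch path and job name are placeholders, not from this thread:

```shell
#!/bin/bash
#SBATCH --job-name=miopen-test
#SBATCH --ntasks=1

# Give this job a private MIOpen cache so concurrent jobs cannot
# corrupt a shared SQLite kernel database. SLURM_JOB_ID is set by
# Slurm; fall back to the shell PID so the script also runs locally.
experiment="miopen-job-${SLURM_JOB_ID:-$$}"
scratch="/tmp/${experiment}"            # placeholder for "some tmp folder"

mkdir -p "${scratch}/tmp" "${scratch}/miopen"
export TMPDIR="${scratch}/tmp"
export MIOPEN_CUSTOM_CACHE_DIR="${scratch}/miopen"

# run the experiment, e.g.:
# python test.py

# remove the per-job cache when the job is done
rm -rf "${scratch}"
```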

My guess:
The MIOpen cache includes gfx906_60.ukdb, gfx906_60.ukdb-shm, and gfx906_60.ukdb-wal files that are used/shared by multiple jobs. Is it possible to add some random number to these files per job?

Thank you

@dmikushin
Contributor

@arkhodamoradi, the discussion in #114 seems to be related to your guess. Could you please try:

export MIOPEN_USER_DB_PATH=<non-default-path>

@atamazov
Contributor

atamazov commented Nov 14, 2022

@arkhodamoradi Please read this first. Is your ${HOME} mapped to NFS or similar?

If yes, then we have a workaround at hand. Otherwise, some further investigation is required.

@arkhodamoradi
Author

@dmikushin
Setting MIOPEN_USER_DB_PATH to * did not resolve my issue.

@atamazov
Yes, HOME is on NFS. And setting MIOPEN_USER_DB_PATH to a unique location per job fixed my issue.
However, some jobs are running very slowly, and I still have the ~/.cache/miopen folder shared by all the jobs.
I can resolve the "slow job" issue by setting MIOPEN_CUSTOM_CACHE_DIR to a unique location per job.

@atamazov
Contributor

@dmikushin

@arkhodamoradi, the discussion in #114 seems to be related to your guess. Could you please try:

export MIOPEN_USER_DB_PATH=<non-default-path>

Well, this is necessary in order to prevent issues with the user-perf-db and user-find-db (and I highly recommend using it if ${HOME} is mapped to a network file system). However, this issue happens with the user kernel cache.

In order to change the user's kernel cache location, the user should either use MIOPEN_CACHE_DIR CMake variable during the build (see https://github.com/ROCmSoftwarePlatform/MIOpen/blob/develop/doc/src/cache.md#kernel-cache) or engage the undocumented MIOPEN_CUSTOM_CACHE_DIR environment variable.
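Put together, a per-job setup covering both the perf/find databases and the binary kernel cache might look like the following sketch (the /tmp paths are placeholders; MIOPEN_CUSTOM_CACHE_DIR is the undocumented variable mentioned above):

```shell
#!/bin/bash
# Per-job MIOpen locations (sketch):
#  - MIOPEN_USER_DB_PATH     moves the user perf-db / find-db files
#  - MIOPEN_CUSTOM_CACHE_DIR moves the binary kernel cache
job_id="${SLURM_JOB_ID:-$$}"          # fall back to the PID off-cluster
export MIOPEN_USER_DB_PATH="/tmp/miopen-db-${job_id}"
export MIOPEN_CUSTOM_CACHE_DIR="/tmp/miopen-cache-${job_id}"
mkdir -p "${MIOPEN_USER_DB_PATH}" "${MIOPEN_CUSTOM_CACHE_DIR}"
```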

@atamazov
Contributor

atamazov commented Nov 14, 2022

@arkhodamoradi

setting MIOPEN_USER_DB_PATH to *

A literal * is incorrect ;) You can find a good hint at #114 (comment)

Yes, HOME is on NFS. And setting MIOPEN_USER_DB_PATH to a unique location per job fixed my issue. However, some jobs are running very slowly, and I still have the ~/.cache/miopen folder shared by all the jobs. I can resolve the "slow job" issue by setting MIOPEN_CUSTOM_CACHE_DIR to a unique location per job.

I suspect that setting MIOPEN_USER_DB_PATH changed the timings and, therefore, changed the likelihood of simultaneous accesses to user-kernel-db from different nodes. That might camouflage the database corruption problem and, possibly, negatively affect the performance of some jobs (as you've mentioned). And yes, MIOPEN_CUSTOM_CACHE_DIR should resolve this.

WRT performance, please prefer local volumes over network volumes, if possible (see #114 (comment))

Please note the following:

  • (1) MIOpen is guaranteed to work correctly in multi-process environments. In other words, if several instances of MIOpen are used in different processes, everything should work correctly, provided that $HOME/.cache and $HOME/.config reside on a local volume.
    • If that is the case, then using things like $SLURM_PROCID in the database paths is not necessary. Moreover, it would negatively affect performance, because binary kernels would no longer be shared among processes: each process would have to build every kernel itself instead of reusing one built by another process. This is a serious impact.
  • (2) If .cache and .config reside on a network volume, then using things like $SLURM_PROCID is necessary.
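The two cases above could be handled in one job prologue. A sketch, assuming the user signals a network-mounted cache with a hypothetical CACHE_ON_NFS flag (not part of MIOpen):

```shell
#!/bin/bash
# If $HOME/.cache is on a local volume, share one kernel cache across
# processes (faster: each kernel is built once). If it is on NFS,
# isolate each process with $SLURM_PROCID to avoid database corruption.
cache_root="${HOME}/.cache/miopen"
if [ "${CACHE_ON_NFS:-0}" = "1" ]; then      # hypothetical user-set flag
    cache_root="${cache_root}-${SLURM_PROCID:-0}"
fi
export MIOPEN_CUSTOM_CACHE_DIR="${cache_root}"
mkdir -p "${MIOPEN_CUSTOM_CACHE_DIR}"
```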

@ppanchad-amd

@arkhodamoradi Has this issue been resolved? If so, please close ticket. Thanks!
