MIOpen cache issue with SLURM and multiple jobs #1878
@arkhodamoradi, the discussion in #114 seems to be related to your guess. Could you please try the suggestion there?
@arkhodamoradi Please read this first. Is your ${HOME} mapped to NFS or similar? If yes, then we have a workaround on hand; otherwise, some further investigation is required.
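A quick sketch for checking whether ${HOME} lives on a network file system (plain GNU coreutils; nothing MIOpen-specific is assumed):

# Print the filesystem type backing $HOME; values such as nfs, nfs4,
# lustre, or gpfs indicate a network file system.
df -T "$HOME"
stat -f -c %T "$HOME"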
@dmikushin @atamazov
Well, this is necessary in order to prevent issues with the user-perf-db and user-find-db (and I highly recommend it if ${HOME} is mapped to a network file system). However, this issue happens with the user kernel cache. To change the user kernel cache location, use MIOPEN_CUSTOM_CACHE_DIR instead (see below).
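A minimal sketch of the distinction between the two variables, using the documented MIOpen environment variables MIOPEN_USER_DB_PATH (user perf-db/find-db) and MIOPEN_CUSTOM_CACHE_DIR (user kernel cache); the scratch path is a placeholder:

# Relocates the user perf-db/find-db only, NOT the compiled-kernel
# cache (*.ukdb), which is the database that got corrupted here:
export MIOPEN_USER_DB_PATH=/local/scratch/$USER/miopen-db
# Relocates the user kernel cache (the *.ukdb SQLite database):
export MIOPEN_CUSTOM_CACHE_DIR=/local/scratch/$USER/miopen-cache
mkdir -p "$MIOPEN_USER_DB_PATH" "$MIOPEN_CUSTOM_CACHE_DIR"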
I suspect that setting MIOPEN_USER_DB_PATH changed the timings and, therefore, the likelihood of simultaneous accesses to the user-kernel-db from different nodes. That might camouflage the database corruption problem and, possibly, degrade the performance of some jobs (as you've mentioned). And yes, MIOPEN_CUSTOM_CACHE_DIR should resolve this. Regarding performance, please prefer local volumes over network volumes if possible (see #114 (comment)). Please note the following:
@arkhodamoradi Has this issue been resolved? If so, please close the ticket. Thanks!
The environment is a computing cluster:
Slurm 20.11.3
MI50 GPUs
PyTorch 1.12.0
ROCm 5.2.0
Code (test.py):
import torch

# A single convolution is enough to make MIOpen compile a kernel and
# touch the on-disk kernel cache.
net = torch.nn.Conv2d(2, 28, 3).cuda()
inp = torch.randn(20, 2, 50, 50).cuda()
outputs = net(inp)
Run the code (test.py) in multiple concurrent jobs (submitted with sbatch; a sketch of such a batch script follows the traceback below) to get this error:
/long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/MLOpen/src/include/miopen/kern_db.hpp:147: Internal error while accessing SQLite database: database disk image is malformed
Traceback (most recent call last):
File "/home/alirezak/playground/test.py", line 4, in
outputs = net(inp)
File "/public/apps/python/3.10.6/gcc.7.3.1/base/pytorch/1.12.0/rocm.5.2.0/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/public/apps/python/3.10.6/gcc.7.3.1/base/pytorch/1.12.0/rocm.5.2.0/lib/python3.10/site-packages/torchvision/models/resnet.py", line 285, in forward
return self._forward_impl(x)
File "/public/apps/python/3.10.6/gcc.7.3.1/base/pytorch/1.12.0/rocm.5.2.0/lib/python3.10/site-packages/torchvision/models/resnet.py", line 268, in _forward_impl
x = self.conv1(x)
File "/public/apps/python/3.10.6/gcc.7.3.1/base/pytorch/1.12.0/rocm.5.2.0/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/public/apps/python/3.10.6/gcc.7.3.1/base/pytorch/1.12.0/rocm.5.2.0/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 457, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/public/apps/python/3.10.6/gcc.7.3.1/base/pytorch/1.12.0/rocm.5.2.0/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 453, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: miopenStatusInternalError
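For reference, a minimal sketch of a batch script that reproduces the race when submitted several times in quick succession, assuming test.py sits in the submission directory and that ${HOME} (and with it the default MIOpen cache under ${HOME}/.cache/miopen) is shared across nodes; the job name and GPU request are illustrative:

#!/bin/bash
#SBATCH --job-name=miopen-repro
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1

# Every job sees the same gfx906_60.ukdb under the shared ${HOME},
# so concurrent jobs race on one SQLite kernel-cache database.
python test.py

Submitting it repeatedly, e.g. for i in $(seq 10); do sbatch repro.sbatch; done (repro.sbatch being a hypothetical file name), makes the corruption likely.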
Solution:
For each (sbatch) job, create a private scratch tree and point ${HOME}, ${TMPDIR}, and ${MIOPEN_CUSTOM_CACHE_DIR} into it (exported so the Python process sees them):
experiment=<some unique name>
export HOME=<some tmp folder>/$experiment
mkdir -p "$HOME"
export TMPDIR=$HOME/tmp
mkdir -p "$TMPDIR"
export MIOPEN_CUSTOM_CACHE_DIR=$HOME/miopen
mkdir -p "$MIOPEN_CUSTOM_CACHE_DIR"
# run the experiment
rm -rf <some tmp folder>/$experiment
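The same workaround as a self-contained sbatch sketch; two assumptions are flagged in the comments: /tmp is node-local on this cluster (adapt the path otherwise), and $SLURM_JOB_ID, which Slurm sets for every job, serves as the unique name:

#!/bin/bash
#SBATCH --job-name=miopen-cache-workaround
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1

# Private per-job scratch tree; $SLURM_JOB_ID is unique per job.
experiment=$SLURM_JOB_ID
export HOME=/tmp/$USER/$experiment     # assumption: /tmp is node-local
export TMPDIR=$HOME/tmp
export MIOPEN_CUSTOM_CACHE_DIR=$HOME/miopen
mkdir -p "$TMPDIR" "$MIOPEN_CUSTOM_CACHE_DIR"

python test.py

rm -rf "/tmp/$USER/$experiment"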
My guess:
The MIOpen cache consists of gfx906_60.ukdb together with its SQLite sidecar files gfx906_60.ukdb-shm and gfx906_60.ukdb-wal (shared-memory and write-ahead-log files), all of which are shared by multiple jobs. Would it be possible to add a per-job random suffix to these file names?
Thank you