
MIOpen cache issue with SLURM and multiple jobs #1878

Open
arkhodamoradi opened this issue Nov 14, 2022 · 6 comments

arkhodamoradi commented Nov 14, 2022

The environment is a computing cluster:
Slurm 20.11.3
MI50 GPUs
PyTorch 1.12.0
ROCM 5.2.0

Code (test.py):


import torch
net = torch.nn.Conv2d(2, 28, 3).cuda()
inp = torch.randn(20, 2, 50, 50).cuda()
outputs = net(inp)


Running the code (test.py) in multiple jobs (submitted with the sbatch command) produces this error:
/long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/MLOpen/src/include/miopen/kern_db.hpp:147: Internal error while accessing SQLite database: database disk image is malformed
Traceback (most recent call last):
  File "/home/alirezak/playground/test.py", line 4, in <module>
    outputs = net(inp)
  File "/public/apps/python/3.10.6/gcc.7.3.1/base/pytorch/1.12.0/rocm.5.2.0/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/public/apps/python/3.10.6/gcc.7.3.1/base/pytorch/1.12.0/rocm.5.2.0/lib/python3.10/site-packages/torchvision/models/resnet.py", line 285, in forward
    return self._forward_impl(x)
  File "/public/apps/python/3.10.6/gcc.7.3.1/base/pytorch/1.12.0/rocm.5.2.0/lib/python3.10/site-packages/torchvision/models/resnet.py", line 268, in _forward_impl
    x = self.conv1(x)
  File "/public/apps/python/3.10.6/gcc.7.3.1/base/pytorch/1.12.0/rocm.5.2.0/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/public/apps/python/3.10.6/gcc.7.3.1/base/pytorch/1.12.0/rocm.5.2.0/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/public/apps/python/3.10.6/gcc.7.3.1/base/pytorch/1.12.0/rocm.5.2.0/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 453, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: miopenStatusInternalError

Solution:
For each (sbatch) job, do the following:

export experiment=<some unique name>
export HOME=<some tmp folder>/$experiment
mkdir -p "$HOME"
export TMPDIR=<some tmp folder>/$experiment/tmp
mkdir -p "$TMPDIR"
export MIOPEN_CUSTOM_CACHE_DIR=<some tmp folder>/$experiment/miopen
mkdir -p "$MIOPEN_CUSTOM_CACHE_DIR"

run the experiment

rm -rf <some tmp folder>/$experiment
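The steps above can be packaged as a single job script. A minimal sketch, assuming a node-local scratch directory under /tmp; the scratch path and job name are placeholders, not from this thread:

```shell
#!/bin/bash
#SBATCH --job-name=miopen-test
#SBATCH --ntasks=1

# Give this job a private MIOpen cache so concurrent jobs cannot
# corrupt a shared SQLite kernel database. SLURM_JOB_ID is set by
# Slurm; fall back to the shell PID so the script also runs locally.
experiment="miopen-job-${SLURM_JOB_ID:-$$}"
scratch="/tmp/${experiment}"            # placeholder for "some tmp folder"

mkdir -p "${scratch}/tmp" "${scratch}/miopen"
export TMPDIR="${scratch}/tmp"
export MIOPEN_CUSTOM_CACHE_DIR="${scratch}/miopen"

# run the experiment, e.g.:
# python test.py

# remove the per-job cache when the job is done
rm -rf "${scratch}"
```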

My guess:
The MIOpen cache includes gfx906_60.ukdb, gfx906_60.ukdb-shm, and gfx906_60.ukdb-wal files that are used/shared by multiple jobs. Is it possible to add some random number to these files per job?

Thank you

@dmikushin
Contributor

@arkhodamoradi, the discussion in #114 seems to be related to your guess. Could you please try:

export MIOPEN_USER_DB_PATH=<non-default-path>

@atamazov
Contributor

atamazov commented Nov 14, 2022

@arkhodamoradi Please read this first. Is your ${HOME} mapped to NFS or similar?

If yes, then we have a workaround at hand. Otherwise, some further investigation is required.

@arkhodamoradi
Author

@dmikushin
Setting MIOPEN_USER_DB_PATH to * did not resolve my issue.

@atamazov
Yes, HOME is on NFS. And setting MIOPEN_USER_DB_PATH to a unique location per job fixed my issue.
However, some jobs are running very slowly, and I still have the ~/.cache/miopen folder shared by all the jobs.
I can resolve the "slow job" issue by setting MIOPEN_CUSTOM_CACHE_DIR to a unique location per job.

@atamazov
Contributor

@dmikushin

@arkhodamoradi, the discussion in #114 seems to be related to your guess. Could you please try:

export MIOPEN_USER_DB_PATH=<non-default-path>

Well, this is necessary in order to prevent issues with the user-perf-db and user-find-db (and I highly recommend using it if ${HOME} is mapped to a network file system). However, this issue happens with the user kernel cache.

In order to change the user's kernel cache location, the user should either use MIOPEN_CACHE_DIR CMake variable during the build (see https://github.com/ROCmSoftwarePlatform/MIOpen/blob/develop/doc/src/cache.md#kernel-cache) or engage the undocumented MIOPEN_CUSTOM_CACHE_DIR environment variable.
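Put together, a per-job setup covering both the perf/find databases and the binary kernel cache might look like the following sketch (the /tmp paths are placeholders; MIOPEN_CUSTOM_CACHE_DIR is the undocumented variable mentioned above):

```shell
#!/bin/bash
# Per-job MIOpen locations (sketch):
#  - MIOPEN_USER_DB_PATH     moves the user perf-db / find-db files
#  - MIOPEN_CUSTOM_CACHE_DIR moves the binary kernel cache
job_id="${SLURM_JOB_ID:-$$}"          # fall back to the PID off-cluster
export MIOPEN_USER_DB_PATH="/tmp/miopen-db-${job_id}"
export MIOPEN_CUSTOM_CACHE_DIR="/tmp/miopen-cache-${job_id}"
mkdir -p "${MIOPEN_USER_DB_PATH}" "${MIOPEN_CUSTOM_CACHE_DIR}"
```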

@atamazov
Contributor

atamazov commented Nov 14, 2022

@arkhodamoradi

setting MIOPEN_USER_DB_PATH to *

A literal * is incorrect ;) You can find a good hint at #114 (comment)

Yes, HOME is on NFS. And setting MIOPEN_USER_DB_PATH to a unique location per job fixed my issue. However, some jobs are running very slowly, and I still have the ~/.cache/miopen folder shared by all the jobs. I can resolve the "slow job" issue by setting MIOPEN_CUSTOM_CACHE_DIR to a unique location per job.

I suspect that setting MIOPEN_USER_DB_PATH changed the timings and, therefore, changed the likelihood of simultaneous accesses to user-kernel-db from different nodes. That might camouflage the database corruption problem and, possibly, negatively affect the performance of some jobs (as you've mentioned). And yes, MIOPEN_CUSTOM_CACHE_DIR should resolve this.

WRT performance, please prefer local volumes over network volumes, if possible (see #114 (comment))

Please note the following:

  • (1) MIOpen is guaranteed to work correctly in multi-process environments. In other words, if several instances of MIOpen are used in different processes, everything should work correctly, provided that $HOME/.cache and $HOME/.config reside on a local volume.
    • If that is the case, then using things like $SLURM_PROCID in the database paths is not necessary. Moreover, it would negatively affect performance, because binary kernels would no longer be shared among processes: each process would have to build every kernel itself instead of reusing one built by another process. This is a serious impact.
  • (2) If .cache and .config reside on a network volume, then using things like $SLURM_PROCID is necessary.
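The two cases above could be handled in one job prologue. A sketch, assuming the user signals a network-mounted cache with a hypothetical CACHE_ON_NFS flag (not part of MIOpen):

```shell
#!/bin/bash
# If $HOME/.cache is on a local volume, share one kernel cache across
# processes (faster: each kernel is built once). If it is on NFS,
# isolate each process with $SLURM_PROCID to avoid database corruption.
cache_root="${HOME}/.cache/miopen"
if [ "${CACHE_ON_NFS:-0}" = "1" ]; then      # hypothetical user-set flag
    cache_root="${cache_root}-${SLURM_PROCID:-0}"
fi
export MIOPEN_CUSTOM_CACHE_DIR="${cache_root}"
mkdir -p "${MIOPEN_CUSTOM_CACHE_DIR}"
```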

@ppanchad-amd

@arkhodamoradi Has this issue been resolved? If so, please close ticket. Thanks!
