Tensorpack FasterRCNN Memory access fault by GPU node-1 ... Reason: Page not present or supervisor privilege. #301

Closed
numPlumber opened this issue Jan 27, 2019 · 16 comments
Labels: bug (Something isn't working), gfx803 (issue specific to gfx803 GPUs)

Comments

numPlumber commented Jan 27, 2019

System information

  • Custom code: Installed tensorpack from GitHub rather than from the PyPI package, because the released version lacked an msgpack-related update and raised an error. Commented out lines 484, 491, and 492 in train.py to prevent calls to NVIDIA performance tracking.
  • OS: Linux Ubuntu 18.04
  • TensorFlow installed from: binary
  • TensorFlow version: v1.12.0-871-gf480b4a 1.12.0
  • Python version: Python 3.6.7
  • GPU model and memory:
    name: Ellesmere [Radeon RX 470/480]
    AMDGPU ISA: gfx803
    memoryClockRate (GHz) 1.15
    pciBusID 0000:01:00.0
    Total memory: 8.00GiB
    Free memory: 7.75GiB

Describe the current behavior
Memory access fault by GPU node-1 (Agent handle: 0x2b68910) on address 0xadca1b000. Reason: Page not present or supervisor privilege.

  • Usually appears on the first training iteration, but it has made it to 8 iterations before the error shows up. I assume the variance is due to the randomized ordering of the training data.

Describe the expected behavior
Not getting a fault

Code to reproduce the issue
python3 train.py --config MODE_MASK=True MODE_FPN=True DATA.BASEDIR=/.../COCO/DIR BACKBONE.WEIGHTS=/.../ImageNet-R50-AlignPadding.npz
COCO
└── DIR
    ├── annotations
    │   ├── captions_train2014.json
    │   ├── captions_val2014.json
    │   ├── instances_train2014.json
    │   ├── instances_val2014.json
    │   ├── instances_minival2014.json
    │   ├── instances_valminusminival2014.json
    │   ├── person_keypoints_train2014.json
    │   └── person_keypoints_val2014.json
    ├── train2014
    └── val2014

Let me know which logs and debug parameters you want me to include. I looked at similar issues in an attempt to fix the problem; nothing solved it, but I did run all the TensorFlow models suggested in those other issues and they all worked.

@sunway513

Hi @numPlumber, can you provide the FasterRCNN repo you've been using? I'll try to repro the failure locally.
Also, the failure signature looks similar to issues #282 and #300. We are investigating the root cause.

@numPlumber
Author

https://github.com/tensorpack/tensorpack
Master branch
Go to examples/FasterRCNN

witeko commented Jan 27, 2019

Same error in #297

@numPlumber
Author

Clarification regarding the tensorpack install process:
The tensorpack PyPI package didn't include the updated code for tensorpack/utils/serialize.py, which led to an error with msgpack. Thus, I installed via "pip3 install git+https://github.com/tensorpack/tensorpack.git".
The installed commit is 21c494697faa40db0e280a3c53abad2521b2e1f6
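For reproducibility, the same install can be pinned to that exact commit using pip's standard VCS syntax (the URL and hash are the ones quoted above):

# Install tensorpack pinned to the commit mentioned above
pip3 install "git+https://github.com/tensorpack/tensorpack.git@21c494697faa40db0e280a3c53abad2521b2e1f6"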

@sunway513

@numPlumber, I've installed it as well. However, tensorpack has a dependency on the NVML library, which prevents me from running the FasterRCNN training on ROCm at runtime.

@numPlumber
Author

Commenting out lines 484, 491, 492 in ../FasterRCNN/train.py didn't take care of that?
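
For reference, a minimal shell sketch of that edit, assuming the tensorpack commit mentioned above; the line numbers will drift on other commits, and it is an assumption that those lines contain only the NVIDIA performance-tracking calls:

# Comment out lines 484, 491, and 492 of the FasterRCNN example's train.py
cd examples/FasterRCNN &&
sed -i -e '484s/^/# /' -e '491s/^/# /' -e '492s/^/# /' train.py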

@sunway513

Thanks @numPlumber, I've hacked train.py, config.py, and callbacks/prof.py, and the training is now running. Will let you know how it goes.

@numPlumber
Author

What did you change in config.py and prof.py?

@sunway513

Hmm, it still failed after the TF auto-tuning session due to the NVML dependency.
Since your issue is similar to #282 (vmem access fault on Polaris GPUs), I'll use that to continue the investigation. Will keep you updated if we have any workarounds or fixes for the issue.

@numPlumber
Author

Can you post the traceback?

@sunway513

Hi @numPlumber, can you try the following steps and see if they fix your issue?

cd ~ && mkdir rocm1.9.2-opencl && cd rocm1.9.2-opencl &&
wget https://www.dropbox.com/s/rtwe1zrpuphbyqm/rocm-opencl-1.2.0-2018111340_amd64.deb && 
wget https://www.dropbox.com/s/6gp2g5zju66i4e9/rocm-opencl-dev-1.2.0-2018111340_amd64.deb && 
sudo dpkg -i rocm-opencl*.deb && rm -rf ~/.cache
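
A quick check (not from the original comment, just a standard dpkg query) to confirm the replacement packages actually took effect before retraining:

dpkg -l | grep rocm-opencl    # expect 1.2.0-2018111340 for rocm-opencl and rocm-opencl-dev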

@sunway513

@numPlumber, any feedback on the suggested workaround?

@sunway513 self-assigned this Feb 7, 2019
@sunway513 added the bug and gfx803 labels Feb 7, 2019
@geekboood

@sunway513 I encountered the same issue when training my semantic segmentation model. Reinstalling the OpenCL package fixed the issue the first time; however, when I rerun my program, I need to clear the MIOpen cache to get it to run normally. Is that normal?

@sunway513

Hi @geekboood, the MIOpen cache needs to be cleared each time the MIOpen library is updated. Otherwise, MIOpen will keep using the old kernel binaries compiled with the faulty toolchain.
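
A minimal sketch of clearing just the MIOpen user cache; the workaround above removes all of ~/.cache, and the exact cache location is an assumption that may vary with the MIOpen version and configuration:

rm -rf ~/.cache/miopen    # forces MIOpen to recompile its kernels with the updated toolchain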

@sunway513

Hi @numPlumber, @geekboood, we have included a set of OpenCL toolchain fixes for gfx803 targets in ROCm 2.5. Please try it and let us know if that helps with the issue.

@sunway513

I'm closing this issue as it has been fixed in ROCm 2.5.
