Tensorpack FasterRCNN Memory access fault by GPU node-1 ... Reason: Page not present or supervisor privilege. #301

Closed
numPlumber opened this issue Jan 27, 2019 · 16 comments
Labels: bug (Something isn't working), gfx803 (issue specific to gfx803 GPUs)

Comments

numPlumber commented Jan 27, 2019

System information

  • Custom code: Installed tensorpack from GitHub rather than from the PyPI package, because the released version lacked an msgpack-related update and raised an error. Commented out lines 484, 491, and 492 in train.py to prevent calls to NVIDIA performance tracking.
  • OS: Linux Ubuntu 18.04
  • TensorFlow installed from: binary
  • TensorFlow version: v1.12.0-871-gf480b4a 1.12.0
  • Python version: Python 3.6.7
  • GPU model and memory:
    name: Ellesmere [Radeon RX 470/480]
    AMDGPU ISA: gfx803
    memoryClockRate (GHz) 1.15
    pciBusID 0000:01:00.0
    Total memory: 8.00GiB
    Free memory: 7.75GiB

Describe the current behavior
Memory access fault by GPU node-1 (Agent handle: 0x2b68910) on address 0xadca1b000. Reason: Page not present or supervisor privilege.

  • Usually appears on the first training iteration, but it has made it to 8 iterations before the error shows up. I assume the variance is due to the randomized ordering of the training data.

Describe the expected behavior
Not getting a fault

Code to reproduce the issue
python3 train.py --config MODE_MASK=True MODE_FPN=True DATA.BASEDIR=/.../COCO/DIR BACKBONE.WEIGHTS=/.../ImageNet-R50-AlignPadding.npz
COCO
└── DIR
    ├── annotations
    │   ├── captions_train2014.json
    │   ├── captions_val2014.json
    │   ├── instances_train2014.json
    │   ├── instances_val2014.json
    │   ├── instances_minival2014.json
    │   ├── instances_valminusminival2014.json
    │   ├── person_keypoints_train2014.json
    │   └── person_keypoints_val2014.json
    ├── train2014
    └── val2014

Let me know which logs and debug parameters you want me to include. I looked at similar issues in an attempt to fix the problem; nothing solved it, but I did run all the TensorFlow models suggested in those other issues and they all worked.

@sunway513

Hi @numPlumber, can you provide the FasterRCNN repo you've been using? I'll try to repro the failure locally.
Also, the failure signature looks similar to issues #282 and #300. We are investigating the root cause.

@numPlumber
Author

https://github.com/tensorpack/tensorpack
Master branch
Go to examples/FasterRCNN

witeko commented Jan 27, 2019

Same error in #297

@numPlumber
Author

Clarification regarding the tensorpack install process:
The tensorpack PyPI package didn't include the updated code for tensorpack/utils/serialize.py, which led to an error with msgpack. Thus, I installed via "pip3 install git+https://github.com/tensorpack/tensorpack.git".
The installed commit is 21c494697faa40db0e280a3c53abad2521b2e1f6
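For reproducibility, the same install can be pinned to that exact commit using pip's standard VCS syntax (the URL and hash are the ones quoted above):

# Install tensorpack pinned to the commit mentioned above
pip3 install "git+https://github.com/tensorpack/tensorpack.git@21c494697faa40db0e280a3c53abad2521b2e1f6"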

@sunway513

@numPlumber, I've installed it as well. However, tensorpack has a dependency on the NVML library, which prevents me from running the FasterRCNN training on ROCm at runtime.

@numPlumber
Author

Commenting out lines 484, 491, 492 in ../FasterRCNN/train.py didn't take care of that?
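
For reference, a minimal shell sketch of that edit, assuming the tensorpack commit mentioned above; the line numbers will drift on other commits, and it is an assumption that those lines contain only the NVIDIA performance-tracking calls:

# Comment out lines 484, 491, and 492 of the FasterRCNN example's train.py
cd examples/FasterRCNN &&
sed -i -e '484s/^/# /' -e '491s/^/# /' -e '492s/^/# /' train.py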

@sunway513

Thanks @numPlumber, I've hacked train.py, config.py, and callbacks/prof.py, and the training is now running. Will let you know how it goes.

@numPlumber
Author

What did you change in config.py and prof.py?

@sunway513

Hmm, it still failed after the TF auto-tuning session due to the NVML dependency.
Since your issue is similar to #282 (vmem access fault on Polaris GPUs), I'll use that to continue the investigation. Will keep you updated if we have any workarounds or fixes for the issue.

@numPlumber
Author

Can you post the traceback?

@sunway513

Hi @numPlumber, can you try the following steps and see if they fix your issue?

cd ~ && mkdir rocm1.9.2-opencl && cd rocm1.9.2-opencl &&
wget https://www.dropbox.com/s/rtwe1zrpuphbyqm/rocm-opencl-1.2.0-2018111340_amd64.deb && 
wget https://www.dropbox.com/s/6gp2g5zju66i4e9/rocm-opencl-dev-1.2.0-2018111340_amd64.deb && 
sudo dpkg -i rocm-opencl*.deb && rm -rf ~/.cache
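
A quick check (not from the original comment, just a standard dpkg query) to confirm the replacement packages actually took effect before retraining:

dpkg -l | grep rocm-opencl    # expect 1.2.0-2018111340 for rocm-opencl and rocm-opencl-dev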

@sunway513

@numPlumber, any feedback on the suggested workaround?

@sunway513 self-assigned this Feb 7, 2019
@sunway513 added the bug and gfx803 labels Feb 7, 2019
@geekboood

@sunway513 I encountered the same issue when training my semantic segmentation model. Reinstalling the OpenCL package fixed the issue the first time; however, when I rerun my program, I need to clear the MIOpen cache to get it to run normally. Is that normal?

@sunway513

Hi @geekboood, the MIOpen cache needs to be cleared each time the MIOpen library is updated. Otherwise, MIOpen will keep using the old kernel binaries compiled with the faulty toolchain.
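
A minimal sketch of clearing just the MIOpen user cache; the workaround above removes all of ~/.cache, and the exact cache location is an assumption that may vary with the MIOpen version and configuration:

rm -rf ~/.cache/miopen    # forces MIOpen to recompile its kernels with the updated toolchain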

@sunway513

Hi @numPlumber, @geekboood, we have included a set of OpenCL toolchain fixes for gfx803 targets in ROCm 2.5. Please try it and let us know if that helps with the issue.

@sunway513

I'm closing this issue as it has been fixed in ROCm 2.5.
