Tensorpack FasterRCNN Memory access fault by GPU node-1 ... Reason: Page not present or supervisor privilege. #301
Comments
Hi @numPlumber , can you provide the FasterRCNN repo you've been using? I'll try to repro the failure locally.
https://github.com/tensorpack/tensorpack
Same error in #297
Clarification regarding tensorpack install process:
@numPlumber , I've installed it as well. However, tensorpack has a dependency on the NVML library, which prevents me from running FasterRCNN training on ROCm at runtime.
Commenting out lines 484, 491, 492 in ../FasterRCNN/train.py didn't take care of that?
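For reference, here is a minimal sketch of that kind of hack, assuming the NVML-dependent piece is the GPU-utilization callback; this is illustrative only, not the actual tensorpack source, and the exact line numbers and callback set vary by revision:

```python
# Hypothetical sketch: on ROCm there is no NVML, so the callback that depends on
# it is dropped from the callbacks list handed to the trainer.
from tensorpack.callbacks import ModelSaver, EstimatedTimeLeft

callbacks = [
    ModelSaver(),
    EstimatedTimeLeft(),
    # GPUUtilizationTracker(),  # commented out: relies on NVML, which is NVIDIA-only
]
```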
Thanks @numPlumber , I've hacked train.py, config.py and callbacks/prof.py, and the training is now moving. Will let you know how it goes.
What did you change in config.py and prof.py?
Hmm, it still failed after the TF auto-tuning session due to the NVML dependency.
Can you post the traceback?
Hi @numPlumber , can you try the following step and see if that can fix your issue?
@numPlumber , any feedback on the suggested workaround?
@sunway513 I encountered the same issue when training my semantic segmentation model. Reinstalling the OpenCL package fixed the issue the first time; however, when I rerun my program, I need to clear the MIOpen cache to get it to run normally. Is that normal?
Hi @geekboood , it's required to clean the MIOpen cache each time the MIOpen library is updated. Otherwise, MIOpen would keep using the old kernel binaries compiled with the faulty toolchain.
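In case it helps, a minimal sketch of clearing that cache; the paths below are the usual per-user MIOpen locations, but the exact directories depend on the MIOpen/ROCm version:

```python
# Assumption: MIOpen keeps its per-user kernel cache and databases under
# ~/.cache/miopen and ~/.config/miopen; removing them forces kernels to be
# rebuilt with the current toolchain on the next run.
import os
import shutil

for cache_dir in ("~/.cache/miopen", "~/.config/miopen"):
    path = os.path.expanduser(cache_dir)
    if os.path.isdir(path):
        shutil.rmtree(path)
        print("removed", path)
    else:
        print("not found:", path)
```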
Hi @numPlumber, @geekboood , we have included a set of OCL toolchain fixes for GFX803 targets in ROCm 2.5; please try it and let us know if that helps with the issue.
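As a quick check before retrying, the installed ROCm release can usually be read from the version file that standard ROCm installs drop under /opt/rocm; this is a sketch under that assumption:

```python
# Assumption: a standard ROCm package install records its release in
# /opt/rocm/.info/version; handle the case where the file is absent.
version_file = "/opt/rocm/.info/version"
try:
    with open(version_file) as f:
        print("Installed ROCm version:", f.read().strip())
except FileNotFoundError:
    print("Could not find", version_file, "- check the ROCm installation")
```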
I'm closing this issue as it has been fixed in ROCm 2.5.
System information
name: Ellesmere [Radeon RX 470/480]
AMDGPU ISA: gfx803
memoryClockRate (GHz) 1.15
pciBusID 0000:01:00.0
Total memory: 8.00GiB
Free memory: 7.75GiB
Describe the current behavior
Memory access fault by GPU node-1 (Agent handle: 0x2b68910) on address 0xadca1b000. Reason: Page not present or supervisor privilege.
Describe the expected behavior
Training runs to completion without a memory access fault.
Code to reproduce the issue
python3 train.py --config MODE_MASK=True MODE_FPN=True DATA.BASEDIR=/.../COCO/DIR BACKBONE.WEIGHTS=/.../ImageNet-R50-AlignPadding.npz
COCO
└── DIR
    ├── annotations
    │   ├── captions_train2014.json
    │   ├── captions_val2014.json
    │   ├── instances_train2014.json
    │   ├── instances_val2014.json
    │   ├── instances_minival2014.json
    │   ├── instances_valminusminival2014.json
    │   ├── person_keypoints_train2014.json
    │   └── person_keypoints_val2014.json
    ├── train2014
    └── val2014
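As a purely illustrative sanity check (not part of tensorpack), the layout above can be verified before launching training; the base directory below is a hypothetical placeholder for DATA.BASEDIR:

```python
# Hypothetical helper: confirm that DATA.BASEDIR contains the files and
# directories shown in the tree above before starting training.
import os

basedir = "/path/to/COCO/DIR"  # placeholder; substitute the real DATA.BASEDIR
expected = [
    "annotations/instances_train2014.json",
    "annotations/instances_minival2014.json",
    "annotations/instances_valminusminival2014.json",
    "train2014",
    "val2014",
]
for rel in expected:
    full = os.path.join(basedir, rel)
    print("OK " if os.path.exists(full) else "MISSING", full)
```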
Let me know what logs and debug params you want me to include. I looked at similar issues in an attempt to fix the problem; nothing solved my issue. However, I did run all the TensorFlow models suggested for other issues, and they all worked.