Throwing segfaults like rice at a wedding #41
Comments
This is with batch_size 8.
Any idea what's going on? All the tests of |
Background:
I installed ROCm from binaries a few weeks ago. I've built MIOpen from source after configuring with
I would think this shouldn't be happening, since it's a static computational graph we're talking about here: same model for each batch. If you can allocate the memory you need for the data on a single batch, it should be the same for every batch, so successive batches shouldn't be an issue. I'm confused. |
It is not that your GPU is running out of memory. That is a GPU equivalent of a segmentation fault. Let us take a look at this network and see which layer is causing it. |
@odellus What version of TensorFlow and which branch did you pull from? |
Default (develop-upstream) branch of ROCm tensorflow-upstream. tf version
rocm-1.7.2.
On May 26, 2018 09:10, "Daniel Lowell" <[email protected]> wrote:
@odellus <https://github.com/odellus> What version of Tensorflow and which
branch did you pull from?
|
Here are some environment variables that might help point us to a culprit. Just set these and collect all of the verbose output from running your workload.
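For anyone following along, here is a minimal sketch of one way to do the "set these and collect the output" step from Python. It is only an illustration: the helper name `run_with_debug_env` and the variable `EXAMPLE_DEBUG_FLAG` are placeholders made up for this example, not real ROCm or MIOpen settings, so substitute the actual debugging variables and your own training command.

```python
import os
import subprocess


def run_with_debug_env(cmd, debug_env, log_path):
    """Run cmd with extra environment variables set and write its
    combined stdout and stderr to log_path. Returns the exit code."""
    env = dict(os.environ)   # start from the current environment
    env.update(debug_env)    # add the debugging variables on top
    with open(log_path, "w") as log:
        result = subprocess.run(
            cmd, env=env, stdout=log, stderr=subprocess.STDOUT
        )
    # On POSIX, a negative return code -N means the child process was
    # killed by signal N (for example, -11 for SIGSEGV).
    return result.returncode


# Hypothetical usage; EXAMPLE_DEBUG_FLAG is a placeholder name:
# code = run_with_debug_env(
#     ["python", "config.py", "--mode", "train"],
#     {"EXAMPLE_DEBUG_FLAG": "1"},
#     "debug_output.log",
# )
```

Writing stdout and stderr to the same file keeps the verbose output in the order it was produced, which makes it easier to see what ran just before a crash.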
Also, are you able to reproduce via one of our pre-built Docker containers using TF1.3? When running with a 1.7.2
|
Btw/ this ticket appears to be better suited to one of these projects: |
I was going to submit to tensorflow-upstream, but they don't have an issues
tab.
…On Sat, May 26, 2018, 20:14 Jeff Poznanovic ***@***.***> wrote:
Btw/ this ticket appears to be better suited to one of these projects:
- https://github.com/ROCmSoftwarePlatform/tensorflow
- https://github.com/ROCmSoftwarePlatform/tensorflow-upstream
|
Here's truncated output from setting those system debugging flags and running QANet |
Just to clarify, I was trying to connect you with the folks who most often deal with framework-level triage. At this point, we don't know whether this particular issue is an MIOpen library problem, or if it is somewhere else in the stack. In this circumstance, the frameworks team often does initial triage. Edit: Thanks for opening a ticket with the tensorflow repo, we'll take a look. |
So I'm running https://github.com/NLPLearn/QANet with tensorflow-upstream and I've had to cut my batch_size down to nothing to fit the model onto the GPU.
This is what happens when I try to train the model:
The authors of the project are using a similarly sized GPU (though with twice as much desktop RAM; not sure if that is the problem) and aren't having to drop their batch size down to around 4 to fit the model on their GPU.
From localminimum/QANet#2
"""
Hi @kamalkraj I uploaded the most recent model pretrained weights (EM/F1 = 70.0/79.4) and you can download it here.
The specification of the system I used is:
CPU: i7-3930K CPU @ 3.20GHz
GPU: GTX1080 (8GB)
RAM: 16GB
Training takes about 5~8 hours depending on your gpu/cpu spec. The model takes about 8 GB gpu memory so if you're using anything bigger than 96 as your hidden unit size then you'll get an OOM error. Or if you are using a preoccupied GPU it will also cause an OOM error.
NOTE: If you are using your desktop GPU, try running it in terminal mode (alt + ctrl + F1) and close all applications that require gpu memory (e.g. Xorg)
sudo service lightdm stop
python config.py --mode train
after training,
sudo service lightdm start
"""
I followed the advice to shut down all the other applications and just use terminal too. Won't fit. Any idea why this is happening? My RX580 is supposed to have the same amount of memory. Curious as to what's going on 😕