Memory Access Faults during training #414
With …
I could not observe any crashes. Without them, however, the program failed in two ways:
… which is also the final op if the graph is only executed once by removing the while loop.
I attached main-process-gdb-bt; the thread returns to 100% after detaching and reattaching.

I've monitored the fan speed & temperature in … The fan level stays at around 20% throughout, leading to temperatures between 100 and 110°C after a while and ultimately a freeze, which drops the temperature back to 60-70°C. Fixing the fan level to 75% leads to a sustained temperature of around 70°C, and the program runs for more than 1 hour without a freeze. Resetting levels using … Setting the …

@sebpuetz experiences the same fan behaviour (25% at 105°C). Fixing the fans to a higher speed, however, does not fix #325. Could the freezes then be caused by a driver issue where the fan speeds are not adjusted? |
Hi @twuebi, could you try lowering the GPU sclk and see if the issue is reproducible? |
BTW, please provide the dmesg log if you see the system hanging again. That can provide more hints on the issue you are facing right now. |
I'll give it a shot. I'd expect it to work, but I wouldn't see it as a satisfactory workaround. So far, my observations were that the fans do not adjust to the GPU temperature, leading to temps around 110°C, which is the temperature range where the crashes occur. Fixing the fan speeds at 75% or setting the serialize flags both lead to lower GPU temps, and in those cases I could not observe crashes. From that experience, my guess is that lowering the GPU sclk will work as long as the fans at 20% can match the heat produced at the lower sclk level. Interestingly, the fans do ramp up when running the benchmarks at https://github.com/tensorflow/benchmarks.
GPU[0] : VBIOS version: 113-D3600200-105
Will do the next time I observe it. To prevent a misunderstanding: the system does not hang; the program becomes unresponsive and only terminates via … |
Setting sclk to 5 ran for >1h without issues at a temperature of ~80-90°C. Resetting clocks to auto via …

The fan speed sat at 21.96% throughout the whole run. |
Thanks @twuebi, I agree the fan speed should be higher when the temperature increases to the 105-110°C range. |
Fwiw, I'm getting the same fan behaviour on 106 VBIOS, so I wouldn't expect the BIOS upgrade to change that behaviour. |
@sebpuetz , thanks for the heads up. |
Upgraded the BIOS to 106. The fan behavior persists. |
Please see my post here for more details about what may be happening with your fans. To wit, my first guess is: your GPU card vendor set the acoustic limit of your fans too low, and possibly set the thermal throttling limit of your GPU too high. As the GPU heats up, that heat bleeds over to the HBM that's situated next to the GPU chip. As the HBM starts to get too hot, it starts to see corruptions before the periodic refresh cycle comes around -- and you end up with corrupted memory causing a corrupted pointer and thus a crash.

This is a hypothesis, however, and it will be a little difficult to test. Could you answer the following questions? These aren't to try to lay blame on you, I just want to make sure of the system setup before I start doing any deep dives to try to solve the problem. :)
|
Hi @sebpuetz -- since you say you're also observing the problem, could you also give us info about your GPU vendor and potentially answer those same questions? |
Thanks for the extensive answer! A weird thing about the fans - for the benchmarks at https://github.com/tensorflow/benchmarks they actually ramp up.
Sapphire, same is true for @sebpuetz
No changes of the clock speeds.
I ran both ROCm 2.2 and 2.3, and also a kernel 5 release candidate.
No patterns besides the temperature thing.
Sure!
|
OK, this doesn't appear to be a pp_table issue directly. I see the maximum fan speed is 3850 RPM, and the acoustic limit is 2900 RPM. You say you have two situations (one benchmark where the fan doesn't ramp up, one where it does). For both of these situations, could you show me the values in the following files?
|
I have my own TensorFlow models and those contained in the benchmark repository. For my own models, like https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/files/3090116/transformer_model.py.txt, the fans don't ramp up. Sometimes they sit at around 30% when starting the program; the temperatures then increase while the fan speed gradually decreases to a minimum of 20.xx-21.96%. For the benchmarks, the fan speeds do actually ramp up.
No ramp:
Ramp:
|
Yeah, the … Could you also show me the following:
In addition, for those "no ramp" and "ramp" cases, can you also read |
I had some time to log the values you requested over the course of about 10 minutes for the "no ramp" and "ramp" cases. Both files contain tab-separated columns for: … The rows were printed at 1-second intervals.
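For illustration, a polling loop along the following lines can produce such a trace. This is a minimal sketch that assumes the standard amdgpu hwmon sysfs interface (temp1_input in millidegrees, fan1_input in RPM, pwm1 as a 0-255 duty cycle) and a hwmon path under /sys/class/drm/card0/device/hwmon; both assumptions may differ per system, and the exact columns logged in the original files were elided above.

```python
import glob
import time

# Assumption: the GPU is exposed through the standard amdgpu hwmon sysfs
# interface; the exact hwmon index differs from system to system.
hwmon = glob.glob("/sys/class/drm/card0/device/hwmon/hwmon*")[0]

def read_value(name):
    with open("{}/{}".format(hwmon, name)) as f:
        return int(f.read().strip())

print("time\ttemp_C\tfan_rpm\tpwm_pct")
while True:
    temp_c = read_value("temp1_input") / 1000.0   # millidegrees -> degrees C
    fan_rpm = read_value("fan1_input")            # current fan speed in RPM
    pwm_pct = read_value("pwm1") / 255.0 * 100    # duty cycle as a percentage
    print("{:.0f}\t{:.1f}\t{}\t{:.1f}".format(time.time(), temp_c, fan_rpm, pwm_pct))
    time.sleep(1)                                 # 1-second sampling interval
```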
On my system that's the case,
|
|
I did some more experiments. Besides general model architecture, the difference between the benchmarks and my models is the way the graphs are executed. In my custom models, the session is executed within a Python loop using a feed dictionary. In the benchmarks, the inputs are fed using input queues. The usage of queues ensures that the next batch is always ready, minimizing idle GPU time (AFAIK in some cases the next batch is even prefetched to the GPU). With the feed dictionary, the GPU idles between each session call in Python (a minimal sketch contrasting the two styles follows after this comment).

My hypothesis is now that the idle time between the session calls somehow prevents the fans from being adjusted to the temperature. To test this, I altered my original model so that it only features a single session call with an infinite while loop in the graph (inf_transformer_model.py.txt). The result is a fan that adjusts to the temperature: ramp_transformer.txt.

Another observation, which may be the reason for the behavior, is that the avg. power consumption with the feed-dictionary models fluctuates between 50 and 200W, while it seems to be much more consistent with the non-feed-dictionary models.

edit 2: I forgot to alter the header in the following files, the last column is the output of:
edit: For the sake of completeness, tf_benchmark with power consumption:
|
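To make the contrast between the two execution styles described above concrete, here is a minimal sketch against the TensorFlow 1.x API used throughout this thread. The model is a hypothetical one-layer stand-in, not the transformer model attached to this issue, and tf.data is used as an analogous in-graph input pipeline to the queues/StagingArea used in the benchmarks.

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API, as used in this thread

def build_train_op(features, labels):
    # Hypothetical toy model standing in for the attached transformer model.
    pred = tf.layers.dense(features, 1)
    loss = tf.losses.mean_squared_error(labels, pred)
    return tf.train.GradientDescentOptimizer(0.01).minimize(loss)

# Style 1: feed_dict -- one session.run per step; the GPU idles while the
# next batch is prepared in Python between calls.
x = tf.placeholder(tf.float32, [None, 128])
y = tf.placeholder(tf.float32, [None, 1])
train_feed = build_train_op(x, y)

# Style 2: in-graph input pipeline -- batches are produced inside the graph
# (the benchmarks use queues/StagingArea; tf.data plays the same role here),
# so consecutive steps keep the GPU busy.
data = (np.random.rand(6400, 128).astype(np.float32),
        np.random.rand(6400, 1).astype(np.float32))
dataset = tf.data.Dataset.from_tensor_slices(data).repeat().batch(64).prefetch(8)
xb, yb = dataset.make_one_shot_iterator().get_next()
train_pipeline = build_train_op(xb, yb)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(100):  # feed_dict variant
        sess.run(train_feed, feed_dict={
            x: np.random.rand(64, 128).astype(np.float32),
            y: np.random.rand(64, 1).astype(np.float32)})
    for _ in range(100):  # in-graph pipeline variant
        sess.run(train_pipeline)
```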
Any suggestions on how to proceed? |
Hi @twuebi, if this is holding up your work, I would say that for now your best bet would be to manually set the fan speeds using … |
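The exact tool recommended in the comment above did not survive in this thread. As one possible way to pin the fans, assuming the standard amdgpu hwmon sysfs interface and root privileges, something along these lines sets a fixed duty cycle of about 75%; writing 2 back to pwm1_enable returns the fan to automatic control.

```python
import glob

# Assumption: fan control is exposed via the amdgpu hwmon sysfs files
# pwm1_enable and pwm1; this needs root and the hwmon index may differ.
hwmon = glob.glob("/sys/class/drm/card0/device/hwmon/hwmon*")[0]

with open("{}/pwm1_enable".format(hwmon), "w") as f:
    f.write("1")                    # 1 = manual fan control, 2 = automatic
with open("{}/pwm1".format(hwmon), "w") as f:
    f.write(str(int(0.75 * 255)))   # duty cycle 0-255; ~75% here
```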
Following ROCm/ROCm#564 (comment), I checked the …

My theory is now that my vendor set the … |
System information
Describe the current behavior
I'm experiencing memory access faults like this:
in various models. They usually happen during the first 20-45 minutes of training. I am experiencing them across a variety of environments (on upstream kernel 5.0 without rock-dkms, on 4.15.0-47-generic with rock-dkms, in the #316 docker, and in the rocm2.3-tf1.13-python3 docker). Due to these errors I haven't been able to finish training a network yet, as the runs either crash with a shape-related error (#325) or fail with the memory access fault.
Describe the expected behavior
No memory errors.
Code to reproduce the issue
This file contains the minimal code I could replicate the issue with. In my experience it takes up to 45 minutes until the error occurs.
Other info / logs
I am attempting to reproduce the issue with
as per #302, but could not observe another instance yet, as the flags seem to be slowing things down quite a lot; I'm currently at step 10k with no crash, whereas without the flags the crash happened at 8k.

Following #394 I also ran the HIP tests, where all but the native fp16 test pass.
rocm-dev.txt
rocm_info.txt