System freezes when running this Kaggle kernel #297
Comments
Thanks for the report, @witeko. Looks like we've got a few similar failures. We'll look at this too. |
Hi @witeko - Would you be able to follow these steps on the system that hits the issue? Please gather the logs for this run:
That should help us understand where the issue is. Thanks! |
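For context, a minimal sketch of one way to capture more verbose TensorFlow logs for such a run; the exact steps and log files requested by the ROCm team are not quoted in this thread, and the environment variables below are standard TensorFlow ones, not ROCm-specific:

```python
import os

# Raise TensorFlow's C++ log verbosity before importing tensorflow,
# so kernel-launch and placement messages end up in the captured output.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "0"   # do not suppress INFO/WARNING
os.environ["TF_CPP_MIN_VLOG_LEVEL"] = "2"  # enable extra VLOG output

import tensorflow as tf
print(tf.__version__)  # include the exact version in the report
```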
@parallelo from Ubuntu 18.04: |
Hi @witeko , could you try the following workaround?
|
@sunway513 , still, the end result is that I can't fit the model.
What's still wrong:
My results (not full):
Epoch 00004: ReduceLROnPlateau reducing learning rate to 0.00800000037997961.
Epoch 00007: ReduceLROnPlateau reducing learning rate to 0.0020000000949949026.
Epoch 00010: ReduceLROnPlateau reducing learning rate to 0.002.
A perfect run (as on Kaggle: https://www.kaggle.com/martinpiotte/bounding-box-model/output):
Epoch 00008: ReduceLROnPlateau reducing learning rate to 0.00800000037997961.
Epoch 00011: ReduceLROnPlateau reducing learning rate to 0.0020000000949949026.
Epoch 00024: ReduceLROnPlateau reducing learning rate to 0.002. |
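For reference, the quoted messages come from Keras' ReduceLROnPlateau callback. A minimal sketch of a configuration consistent with those logs follows; the exact arguments used by the Kaggle kernel are assumptions here:

```python
from keras.callbacks import ReduceLROnPlateau

# Assumed configuration: the quoted logs are consistent with a callback that
# multiplies the learning rate by 0.25 and is clamped at min_lr=0.002.
reduce_lr = ReduceLROnPlateau(
    monitor="val_loss",  # assumed monitored quantity
    factor=0.25,         # assumed reduction factor
    patience=2,          # assumed patience, in epochs
    min_lr=0.002,        # matches the final "reducing learning rate to 0.002"
    verbose=1,
)

# model.fit(..., callbacks=[reduce_lr])
```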
@witeko thanks for confirming the vmem fault is gone with the workaround. |
@witeko I'm trying to reproduce the issue now. I noticed that the Kaggle website doesn't provide a pre-built GPU-enabled Docker image. I just want to confirm whether you started from a fresh ROCm-enabled Docker container or not. If you started from an image, it would be helpful for me to start from the same image so that we are on the same page. |
@jerryyin I don't use any images; it would be impractical for me.
Then you will get the same issue as I did. |
@sunway513 , @jerryyin , c'mon guys. :) People using TensorFlow to create deep learning models are not developers; we don't use Docker/images/... for a living. |
@witeko Sure, we've been working on it closely. As an update, we've been able to reproduce it on gfx803, but not on gfx900. I'm still trying my best to investigate the root cause. Thank you for your patience. |
@jerryyin thx, "being impatient" is actually what I do for a living. :) |
Giving an update on the things I have tried so far, with hints from #337 and #251, both non-converging kernel issues on gfx803. The claim from issue #251 is that keras.optimizers.Adam malfunctions when used together with MaxPooling2D, which is exactly the use case of this issue. The claim from issue #337 relates only to tf.train.AdamOptimizer. Given this, my debugging so far focuses on narrowing the problem down to a specific operator; the things I have tried so far:
Looking at the debugging attempts, there is a strong indication of a malfunction in the Adam optimizer, in both the tf and keras domains. I suggest revisiting this issue once we are able to isolate the problem in the Adam optimizer. |
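A minimal toy sketch of the suspected keras.optimizers.Adam + MaxPooling2D combination described above; the real model comes from the Kaggle kernel, so the layer sizes and data here are only illustrative:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.optimizers import Adam

# Toy model exercising the suspected Conv2D + MaxPooling2D + Adam combination.
model = Sequential([
    Conv2D(16, (3, 3), activation="relu", input_shape=(64, 64, 1)),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(4),  # e.g. a bounding-box regression head
])
model.compile(optimizer=Adam(lr=0.01), loss="mse")

# Random data, only to exercise the forward and backward pass.
x = np.random.rand(32, 64, 64, 1).astype("float32")
y = np.random.rand(32, 4).astype("float32")
model.fit(x, y, epochs=1, batch_size=8)
```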
@jerryyin thanks for the update :) |
Giving another update. After several trial runs, the operator that very likely malfunctions is one of the training/Adam/gradients/conv2d* operators. This is based on manually placing all Conv2D operators, both inference and training, on the GPU; the model run so far is converging. I will work on compiling a complete list of the operators that get switched to the CPU. That should help us narrow down the problem scope greatly. |
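One common way to compile such a list is to log device placement; a sketch assuming the TF 1.x session API in use at the time:

```python
import tensorflow as tf
import keras.backend as K

# Log where every operator is placed (GPU vs. CPU) so that ops silently
# falling back to the CPU can be collected from the console output.
config = tf.ConfigProto(log_device_placement=True, allow_soft_placement=True)
K.set_session(tf.Session(config=config))
```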
@sunway513 don't we have other tickets where the Adam optimizer is behaving funky on gfx803? |
Yes. From @jerryyin's investigation, they seem to be similar issues. |
Right, I'm suspecting those two issues have the same root cause. Besides, #325 can be related as well. |
Just now took a look at the operators placed on the CPU. It is a rather long list. A summary of that: |
Confirmed that the following patch makes the model converge. However, please note that it will make the model run 10x ~ 100x slower. |
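The patch itself is not quoted above. A hypothetical sketch of the general approach, pinning the suspect gradient/Adam training ops to the CPU, which would also explain the 10x ~ 100x slowdown:

```python
import numpy as np
import tensorflow as tf

# Hypothetical illustration only, not the actual patch referenced above.
# Building the gradient/Adam update ops under a CPU device scope keeps the
# suspect training kernels off the GPU, at a large performance cost.
x = tf.placeholder(tf.float32, [None, 8])
y = tf.placeholder(tf.float32, [None, 1])
pred = tf.layers.dense(x, 1)
loss = tf.losses.mean_squared_error(y, pred)

with tf.device("/cpu:0"):
    train_op = tf.train.AdamOptimizer(0.01).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op, {x: np.random.rand(4, 8).astype(np.float32),
                        y: np.random.rand(4, 1).astype(np.float32)})
```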
@jerryyin could you help re-validate this issue with the ROCm 2.5 Docker containers? |
@sunway513 I did a re-validation just now and the run crashed straight away. Looking at the TensorFlow VLOG context, I don't think it is even related to TensorFlow. The complaint is:
|
@jerryyin , thanks Jerry. |
Providing an update here: an internal ticket has been opened with additional details for reproduction: SWDEV-193136 "Tensorflow report GPUVM fault on gfx803". Will update once further information is received from the ticket. |
Thanks for reaching out. |
System information
Describe the current behavior
When I run this exact Kaggle kernel (code and data provided), https://www.kaggle.com/martinpiotte/bounding-box-model/notebook, my system always freezes (I can still move the mouse cursor, but nothing else).
I checked batch sizes all the way down from 32 to 2.
The problem occurs during the first epoch, but not immediately (after varying periods of time).
I also tried GPU options such as limiting memory to a given fraction and allowing memory growth.
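For reference, the GPU options mentioned (memory fraction and memory growth) correspond to the TF 1.x configuration below; the exact values tried are not stated, so 0.7 is just an example:

```python
import tensorflow as tf
import keras.backend as K

# Options mentioned in the report: cap the GPU memory fraction and/or allow growth.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.7,  # example value
                            allow_growth=True)
K.set_session(tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)))
```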
Edit: I switched to my other Linux distro, Ubuntu 18.04 (everything newest: ROCm, TensorFlow, ...), and the system does not freeze, but I get the error message:
"Memory access fault by GPU node-1 (Agent handle: 0x5557c91b8950) on address 0x12dba01000. Reason: Page not present or supervisor privilege."