Sockeye freezes at new validation start [v1.18.54] #544
This sounds similar to #529.
I'm always using
Just fishing in the dark here, but I wonder if this could be related to https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods.
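To make the suggestion concrete, here is a minimal sketch of switching the start method via an explicit context, as described in the linked Python docs. The `forkserver` method avoids forking a process that already holds CUDA/MKL state, which is one suspected cause of such hangs. `_decode` and `run_decoder` are hypothetical stand-ins, not Sockeye code:

```python
import multiprocessing as mp

def _decode(checkpoint):
    # Placeholder for the decoder subprocess work (hypothetical).
    return checkpoint * 2

def run_decoder(checkpoint):
    # Use an explicit forkserver context so only this pool is affected,
    # leaving the global default start method untouched.
    ctx = mp.get_context("forkserver")
    with ctx.Pool(processes=1) as pool:
        return pool.apply_async(_decode, (checkpoint,)).get()

if __name__ == "__main__":
    print(run_decoder(7))  # prints 14
```

Note the `if __name__ == "__main__":` guard: with `forkserver` (as with `spawn`), worker processes re-import the main module, so unguarded module-level code would run again in each worker.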
True, and in general this works fine, but I wonder if sometimes this logic runs into some corner case with system file descriptor limits, semaphores or whatever.
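As a quick way to check the limits speculated about here: if each validation spawns subprocesses holding pipes and semaphores, a low soft limit on open file descriptors could plausibly be exhausted after enough checkpoints. This sketch inspects and raises the limit with the standard `resource` module (Unix only):

```python
import resource

# Inspect the current per-process open-file-descriptor limits.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE: soft={soft}, hard={hard}")

# The soft limit can be raised up to (but not above) the hard limit without
# root privileges, which is a cheap way to rule out fd exhaustion.
if soft < hard:
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```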
I'm not sure if this is relevant, but the issue happened to me when starting the 3rd validation on a machine I had just rebooted before the training began. On another one that had not been rebooted for a while, it happened at the 9th validation start. Also, the former was CentOS and the latter Ubuntu.
I have always put the command starting
Thanks for sharing more information! We are investigating this issue, and any additional details may be helpful.
What version of MXNet are you using in your virtualenv? mxnet-mkl? Are you also using an MKL-optimized version of numpy?
I had the issue with MXNet versions
Thanks, these are the versions, but are you pip-installing
For MXNet, I pip-installed the requirements from here, which runs
As for Numpy,
Ok, it seems there is some prior knowledge about MXNet/CUDA and multiprocessing:
Some previous fix: apache/mxnet#8995
@franckbrl I have a work-around for a related MKL hang. Would you be able to test whether this would also fix your problem? I committed the change to:
It's going well so far, but I get a strange message about the decoder not being alive:
That should be fine. I added this as a debug logging statement, but I think you can safely ignore it. It mainly means that your decoder finished before reaching the next checkpoint, which is to be expected. Let's see how the rest of training goes. Keep us posted.
Btw, would you have normally hit the issue at this point, or is it too early to say?
I'm running frequent validations and the decoder has been successfully started for the 22nd time. I'll let it run all night and tell you how it went tomorrow.
Perfect, thanks!
258 validations have been successfully run. I can't explain why and when the issue occurred before, so I can't tell you for sure that I would have encountered it with the master branch version of the code.
That's good to hear. We will work on integrating a version of this fix into master. Let us know in case you still run into issues.
I'm not sure this was really helpful. I repeated exactly what earlier brought me to this issue: I went back to the
The issue happened again with MXNet MKL version
That's unfortunate! If you could try the forkserver branch again, to see whether this fixes your issue, that would be highly appreciated. I'm currently still looking into this issue and trying to confirm that the forkserver method successfully fixes it. Given the difficulty of reproducing the issue, it is also difficult to confirm the fix. So any additional datapoints would be very helpful :)
Thank you, I'll use it and report here. Here is one new thing I haven't seen before. When I kill Sockeye, I now get the following message:
@tdomhan It's been nearly 2 months and I've trained several systems on different machines using the
That is great to hear. In the internal evaluation we ran, we also did not observe this issue with the forkserver branch anymore. We should now move ahead and integrate this change into the master branch :)
Great! We'll be waiting for this one! Shall we close the issue now?
Let's leave the issue open while we still don't have this in master.
I merged the change. Let us know if you have any issues. I will close the issue for now. |
For the third time in a few days and on 2 independent trainings, I observed that Sockeye freezes after starting some new validation, i.e. it does not crash, does not send any warning, but stops going forward (0% on CPU/GPU). Here are the last lines of my log file before this issue occurs:

So at this point, it has outputted `params.00007`. When I kill the Sockeye process and restart to continue training, it starts again after validation 6 (update 260000), then later overwrites `params.00007`, starts `Decoder-7` and continues training successfully.

I noted that the freezing occurs at the same moment as in #462, but I have no idea whether it is related to this case. I checked all parameters of the last param file after the issue with `numpy.isnan()` and no nans were reported.
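The NaN check described above can be sketched like this. A toy dict of numpy arrays stands in for a real checkpoint here; an actual Sockeye `.params` file would first need to be loaded through MXNet's parameter loading, and the names below are made up for illustration:

```python
import numpy as np

def find_nan_params(params):
    """Return the names of parameter arrays containing at least one NaN."""
    return [name for name, arr in params.items() if np.isnan(arr).any()]

# Toy "checkpoint": one healthy array and one deliberately corrupted one.
params = {
    "encoder.weight": np.ones((2, 3)),
    "decoder.weight": np.array([[1.0, np.nan], [0.0, 2.0]]),
}
print(find_nan_params(params))  # ['decoder.weight']
```

A clean result (an empty list), as reported above, suggests the freeze is not caused by the parameters themselves diverging.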