Sockeye freezes at new validation start [v1.18.54] #544

Closed
franckbrl opened this issue Sep 25, 2018 · 30 comments

@franckbrl
Contributor

For the third time in a few days, and on 2 independent training runs, I have observed that Sockeye freezes after starting a new validation: it does not crash and emits no warning, but simply stops making progress (0% CPU/GPU usage). Here are the last lines of my log file before the issue occurs:

[2018-09-24:21:45:33:INFO:sockeye.training:__call__] Epoch[3] Batch [270000]    Speed: 650.11 samples/sec 22445.47 tokens/sec 2.06 updates/sec  perplexity=3.546109
[2018-09-24:21:45:34:INFO:root:save_params_to_file] Saved params to "/run/work/generic_fr2en/model_baseline/params.00007"
[2018-09-24:21:45:34:INFO:sockeye.training:fit] Checkpoint [7]  Updates=270000 Epoch=3 Samples=81602144 Time-cost=4711.141 Updates/sec=2.123
[2018-09-24:21:45:34:INFO:sockeye.training:fit] Checkpoint [7]  Train-perplexity=3.546109
[2018-09-24:21:45:36:INFO:sockeye.training:fit] Checkpoint [7]  Validation-perplexity=3.752938
[2018-09-24:21:45:36:INFO:sockeye.utils:log_gpu_memory_usage] GPU 0: 10093/11178 MB (90.29%) GPU 1: 9791/11178 MB (87.59%) GPU 2: 9795/11178 MB (87.63%) GPU 3: 9789/11178 MB (87.57%)
[2018-09-24:21:45:36:INFO:sockeye.training:collect_results] Decoder-6 finished: {'rouge2-val': 0.4331754429258854, 'rouge1-val': 0.6335038896620699, 'decode-walltime-val': 3375.992604494095, 'rougel-val': 0.5947101830587342, 'avg-sec-per-sent-val': 1.794786073627908, 'chrf-val': 0.6585073715647153, 'bleu-val': 0.43439024563194745}
[2018-09-24:21:45:36:INFO:sockeye.training:start_decoder] Starting process: Decoder-7

So at this point, it has written params.00007. When I kill the Sockeye process and restart to continue training, it resumes after validation 6 (update 260000), later overwrites params.00007, starts Decoder-7 and continues training successfully.

I noted that the freezing occurs at the same moment as in #462, but I have no idea whether it is related to that case. After the issue, I checked all parameters of the last param file with numpy.isnan() and no NaNs were reported.
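
(A minimal sketch of how such a NaN check might look, assuming the standard MXNet parameter file format; the path is the one from the log above, and this is not necessarily the exact code that was used:)

    import mxnet as mx
    import numpy as np

    # Load the saved checkpoint parameters: a dict mapping names to NDArrays.
    params = mx.nd.load("/run/work/generic_fr2en/model_baseline/params.00007")
    for name, array in params.items():
        if np.isnan(array.asnumpy()).any():
            print("NaNs found in", name)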

@fhieber
Contributor

fhieber commented Sep 25, 2018

This sounds similar to #529.
Can you reproduce it when using --decode-and-evaluate-use-cpu? It'd be useful to know whether this is somehow related to another MXNet process on an already occupied GPU.

@franckbrl
Contributor Author

I'm always using --decode-and-evaluate-use-cpu. And it just happened again for the 4th time.

@fhieber
Contributor

fhieber commented Sep 25, 2018

Just fishing in the dark here, but I wonder if this could be related to https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods.
In training.py, we use the spawn method to create the CheckpointDecoder process. If I set this to 'fork', the main process waits indefinitely for the first decoder process to finish. 'spawn' and 'forkserver' work on my laptop, but maybe there is an issue with this on certain Unix systems? The multiprocessing documentation also mentions this:

On Unix using the spawn or forkserver start methods will also start a semaphore tracker process which tracks the unlinked named semaphores created by processes of the program. When all processes have exited the semaphore tracker unlinks any remaining semaphores. Usually there should be none, but if a process was killed by a signal there may be some “leaked” semaphores. (Unlinking the named semaphores is a serious matter since the system allows only a limited number, and they will not be automatically unlinked until the next reboot.)
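
(To make the start-method discussion concrete, here is a minimal sketch, not the actual Sockeye code; the worker function and names are placeholders:)

    import multiprocessing as mp

    def run_decoder(checkpoint):
        # stand-in for the checkpoint decoding work
        print("decoding checkpoint", checkpoint)

    if __name__ == "__main__":
        # 'spawn' is what training.py uses; 'fork' and 'forkserver' are the alternatives
        ctx = mp.get_context("spawn")
        decoder = ctx.Process(target=run_decoder, args=(7,), name="Decoder-7")
        decoder.start()   # training carries on right after this
        # ... the result is only collected at a later checkpoint ...
        decoder.join()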

@franckbrl
Contributor Author

In this case, I guess that once the decoder starts here, training should carry on regardless of what happens in the CheckpointDecoder process. We would then get stuck only when waiting for the decoder output, which would produce a warning, right?

@fhieber
Contributor

fhieber commented Sep 25, 2018

True, and in general this works fine, but I wonder if sometimes this logic runs into some corner case with system file descriptor limits, semaphores or whatever.

@franckbrl
Contributor Author

I'm not sure if this is relevant, but the issue happened to me at the start of the 3rd validation on a machine I had rebooted just before training began. On another machine that had not been rebooted for a while, it happened at the start of the 9th validation. Also, the former was CentOS and the latter Ubuntu.

@franckbrl
Contributor Author

I have always put the command starting sockeye.train in a bash script. Both Sockeye and MxNet were installed in a virtual environment that I activated manually BEFORE running the script (source /my/env/bin/activate). Recently, I started adding the virtualenv activation command line to the bash script, and that's when I repeatedly got the reported issue. Now I have removed the virtualenv activation from the bash script and started running it manually again. It's been 5 days and the issue has not occurred yet. I have no clue how to explain this behavior.

@fhieber
Contributor

fhieber commented Sep 30, 2018

Thanks for sharing more information! We are investigating this issue, and any additional details may be helpful.
No updates so far, unfortunately :/

@fhieber
Contributor

fhieber commented Sep 30, 2018

What version of mxnet are you using in your virtualenv? Mxnet-mkl? Are you also using an mkl-optimized version of numpy?

@franckbrl
Contributor Author

I had the issue with MxNet versions 1.2.1 and 1.3.0, and numpy version 1.14.5 in both cases.

@fhieber
Contributor

fhieber commented Oct 1, 2018

Thanks, those are the versions, but are you pip-installing mxnet-cuXmkl==1.3.0 or mxnet-cuX==1.3.0?
Similarly, is your numpy version MKL-optimized? If you use Anaconda as your Python distribution, you can check via conda list | grep numpy; if it says <pip> in the last column, it's not using MKL.

@franckbrl
Contributor Author

For MxNet, I pip-installed the requirements from here, which installs mxnet-cu90mkl.

As for Numpy, numpy.show_config() says:

blas_opt_info:
    libraries = ['openblas', 'openblas']
    define_macros = [('HAVE_CBLAS', None)]
    library_dirs = ['/usr/local/lib']
    language = c
blas_mkl_info:
  NOT AVAILABLE

@tdomhan
Contributor

tdomhan commented Oct 2, 2018

@franckbrl I have a work-around for a related MKL hang. Would you be able to test whether this would also fix your problem? I committed the change to:
https://github.com/awslabs/sockeye/tree/forkserver
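
(As far as one can tell from the branch name, the gist of the work-around is to create the CheckpointDecoder process from a 'forkserver' context instead of 'spawn'; a rough sketch of that kind of change, not the actual branch diff:)

    import multiprocessing as mp

    def run_decoder(checkpoint):
        print("decoding checkpoint", checkpoint)

    if __name__ == "__main__":
        ctx = mp.get_context("forkserver")   # instead of "spawn"
        decoder = ctx.Process(target=run_decoder, args=(8,), name="Decoder-8")
        decoder.start()
        decoder.join()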

@franckbrl
Contributor Author

It's going well so far, but I get a strange message about the decoder not being alive:

[INFO:sockeye.training] Decoder no longer alive...
[INFO:sockeye.training] Decoder-11 finished: {'bleu-val': 0.38399161973481205, 'chrf-val': 0.621943354132811, 'rouge1-val': 0.6120198908499329, 'rouge2-val': 0.4257106522131144, 'rougel-val': 0.5846460755885678, 'avg-sec-per-sent-val': 1.4005149364471436, 'decode-walltime-val': 14.005149364471436}
[INFO:sockeye.training] Starting process: Decoder-12

@tdomhan
Contributor

tdomhan commented Oct 2, 2018

that should be fine. I added this as a debug logging statement, but I think you can safely ignore it. It mainly means that your decoder finished before reaching the next checkpoint, which is to be expected.

Let's see how the rest of training goes. Keep us posted.
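
(Schematically, the situation described above looks roughly like this; a hypothetical sketch, not the actual training loop:)

    import multiprocessing as mp
    import time

    def decode(q):
        q.put({"bleu-val": 0.384})  # placeholder metrics

    if __name__ == "__main__":
        ctx = mp.get_context("forkserver")
        results = ctx.Queue()
        decoder = ctx.Process(target=decode, args=(results,), name="Decoder-11")
        decoder.start()
        time.sleep(1)                 # stands in for the training done between checkpoints
        if not decoder.is_alive():    # a short decode usually finishes before this point
            print("Decoder no longer alive...")
        print("Decoder-11 finished:", results.get())
        decoder.join()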

@tdomhan
Contributor

tdomhan commented Oct 2, 2018

btw, would you have normally hit the issue at this point, or is it too early to say?

@franckbrl
Contributor Author

I'm running frequent validations and the decoder has been successfully started for the 22nd time. I'll let it run all night and tell you how it went tomorrow.

@tdomhan
Contributor

tdomhan commented Oct 2, 2018

perfect, thanks!

@franckbrl
Contributor Author

258 validations have been successfully run. I can't explain why and when the issue occurred before, so I can't tell you for sure that I would have encountered it with the master branch version of the code.

franckbrl reopened this Oct 3, 2018
@tdomhan
Contributor

tdomhan commented Oct 4, 2018

That's good to hear. We will work on integrating a version of this fix into master. Let us know in case you still run into issues.

@franckbrl
Contributor Author

I'm not sure this was really helpful. I repeated exactly what had brought me to this issue earlier: I went back to the master branch code and activated the virtual environment inside my bash script. The decoder worked fine for 234 validations. So I'm sorry to say I have no idea whether what you did on the forkserver branch actually helped.

@franckbrl
Contributor Author

The issue happened again with MxNet MKL version 1.3.0.post0 and Sockeye version 1.18.56. The virtual environment was manually activated, so my earlier observations on this are not relevant. @tdomhan Should I go back to experimenting on the forkserver branch?

tdomhan reopened this Oct 10, 2018
@tdomhan
Contributor

tdomhan commented Oct 10, 2018

That's unfortunate! If you could try the forkserver branch again, to see whether this fixes your issue, that would be highly appreciated. I'm currently still looking into this issue and trying to confirm that the forkserver method successfully fixes it. Given the difficulty of reproducing the issue, it is also difficult to confirm the fix. So any additional datapoints would be very helpful :)

@franckbrl
Contributor Author

Thank you, I'll use it and report here.

Here is one new thing I haven't seen before. When I kill Sockeye, I now get the following message:

/usr/lib/python3.5/multiprocessing/semaphore_tracker.py:129: UserWarning: semaphore_tracker: There appear to be 3 leaked semaphores to clean up at shutdown
  len(cache))

@franckbrl
Contributor Author

@tdomhan It's been nearly 2 months and I've trained several systems on different machines using the forkserver branch. The problem has not occurred once. So forkserver seems to have solved it. Were your experiments as satisfactory?

@tdomhan
Contributor

tdomhan commented Nov 26, 2018

That is great to hear. In the internal evaluation we ran, we also no longer observed this issue with the forkserver branch. We should now move ahead and integrate this change into the master branch :)

@franckbrl
Contributor Author

franckbrl commented Nov 27, 2018

Great! We'll be waiting for this one! Shall we close the issue now?

@tdomhan
Contributor

tdomhan commented Nov 27, 2018

let's leave the issue open while we still don't have this in master.

@tdomhan
Contributor

tdomhan commented Dec 12, 2018

I merged the change. Let us know if you have any issues. I will close the issue for now.

tdomhan closed this as completed Dec 12, 2018