Sockeye freezes at new validation start [v1.18.54] #544

Closed
franckbrl opened this issue Sep 25, 2018 · 30 comments

@franckbrl
Contributor

For the third time in a few days, and on 2 independent training runs, I have observed that Sockeye freezes after starting a new validation: it does not crash and emits no warning, but simply stops making progress (0% CPU/GPU usage). Here are the last lines of my log file before the issue occurs:

[2018-09-24:21:45:33:INFO:sockeye.training:__call__] Epoch[3] Batch [270000]    Speed: 650.11 samples/sec 22445.47 tokens/sec 2.06 updates/sec  perplexity=3.546109
[2018-09-24:21:45:34:INFO:root:save_params_to_file] Saved params to "/run/work/generic_fr2en/model_baseline/params.00007"
[2018-09-24:21:45:34:INFO:sockeye.training:fit] Checkpoint [7]  Updates=270000 Epoch=3 Samples=81602144 Time-cost=4711.141 Updates/sec=2.123
[2018-09-24:21:45:34:INFO:sockeye.training:fit] Checkpoint [7]  Train-perplexity=3.546109
[2018-09-24:21:45:36:INFO:sockeye.training:fit] Checkpoint [7]  Validation-perplexity=3.752938
[2018-09-24:21:45:36:INFO:sockeye.utils:log_gpu_memory_usage] GPU 0: 10093/11178 MB (90.29%) GPU 1: 9791/11178 MB (87.59%) GPU 2: 9795/11178 MB (87.63%) GPU 3: 9789/11178 MB (87.57%)
[2018-09-24:21:45:36:INFO:sockeye.training:collect_results] Decoder-6 finished: {'rouge2-val': 0.4331754429258854, 'rouge1-val': 0.6335038896620699, 'decode-walltime-val': 3375.992604494095, 'rougel-val': 0.5947101830587342, 'avg-sec-per-sent-val': 1.794786073627908, 'chrf-val': 0.6585073715647153, 'bleu-val': 0.43439024563194745}
[2018-09-24:21:45:36:INFO:sockeye.training:start_decoder] Starting process: Decoder-7

So at this point, it has written params.00007. When I kill the Sockeye process and restart to continue training, it resumes after validation 6 (update 260000), later overwrites params.00007, starts Decoder-7 and continues training successfully.

I noted that the freezing occurs at the same moment as in #462, but I have no idea whether it is related to that case. After the issue, I checked all parameters of the last param file with numpy.isnan() and no NaNs were reported.
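
(A minimal sketch of how such a NaN check might look, assuming the standard MXNet parameter file format; the path is the one from the log above, and this is not necessarily the exact code that was used:)

    import mxnet as mx
    import numpy as np

    # Load the saved checkpoint parameters: a dict mapping names to NDArrays.
    params = mx.nd.load("/run/work/generic_fr2en/model_baseline/params.00007")
    for name, array in params.items():
        if np.isnan(array.asnumpy()).any():
            print("NaNs found in", name)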

@fhieber
Contributor

fhieber commented Sep 25, 2018

This sounds similar to #529.
Can you reproduce it when using --decode-and-evaluate-use-cpu? It'd be useful to know whether this is somehow related to another MXNet process on an already occupied GPU.

@franckbrl
Contributor Author

I'm always using --decode-and-evaluate-use-cpu. And it just happened again for the 4th time.

@fhieber
Contributor

fhieber commented Sep 25, 2018

Just fishing in the dark here, but I wonder if this could be related to https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods.
In training.py, we use the spawn method to create the CheckpointDecoder process. If I set this to 'fork', the main process waits indefinitely for the first decoder process to finish. 'spawn' and 'forkserver' work on my laptop, but maybe there is an issue with this on certain Unix systems? The multiprocessing documentation also mentions this:

On Unix using the spawn or forkserver start methods will also start a semaphore tracker process which tracks the unlinked named semaphores created by processes of the program. When all processes have exited the semaphore tracker unlinks any remaining semaphores. Usually there should be none, but if a process was killed by a signal there may be some “leaked” semaphores. (Unlinking the named semaphores is a serious matter since the system allows only a limited number, and they will not be automatically unlinked until the next reboot.)
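
(To make the start-method discussion concrete, here is a minimal sketch, not the actual Sockeye code; the worker function and names are placeholders:)

    import multiprocessing as mp

    def run_decoder(checkpoint):
        # stand-in for the checkpoint decoding work
        print("decoding checkpoint", checkpoint)

    if __name__ == "__main__":
        # 'spawn' is what training.py uses; 'fork' and 'forkserver' are the alternatives
        ctx = mp.get_context("spawn")
        decoder = ctx.Process(target=run_decoder, args=(7,), name="Decoder-7")
        decoder.start()   # training carries on right after this
        # ... the result is only collected at a later checkpoint ...
        decoder.join()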

@franckbrl
Contributor Author

In this case, I guess that once the decoder starts here, training should carry on regardless of what happens in the CheckpointDecoder process. We would then get stuck only when waiting for the decoder output, which would produce a warning, right?

@fhieber
Contributor

fhieber commented Sep 25, 2018

True, and in general this works fine, but I wonder if sometimes this logic runs into some corner case with system file descriptor limits, semaphores or whatever.

@franckbrl
Contributor Author

I'm not sure if this is relevant, but the issue happened to me at the start of the 3rd validation on a machine I had rebooted just before training began. On another machine that had not been rebooted for a while, it happened at the start of the 9th validation. Also, the former was CentOS and the latter Ubuntu.

@franckbrl
Contributor Author

I have always put the command starting sockeye.train in a bash script. Both Sockeye and MxNet were installed in a virtual environment that I activated manually BEFORE running the script (source /my/env/bin/activate). Recently, I started adding the virtualenv activation command line to the bash script, and that's when I repeatedly got the reported issue. Now I have removed the virtualenv activation from the bash script and started running it manually again. It's been 5 days and the issue has not occurred yet. I have no clue how to explain this behavior.

@fhieber
Contributor

fhieber commented Sep 30, 2018

Thanks for sharing more information! We are investigating this issue, and any additional details may be helpful.
No updates so far, unfortunately :/

@fhieber
Contributor

fhieber commented Sep 30, 2018

What version of mxnet are you using in your virtualenv? Mxnet-mkl? Are you also using an mkl-optimized version of numpy?

@franckbrl
Contributor Author

I had the issue with MxNet versions 1.2.1 and 1.3.0, and numpy version 1.14.5 in both cases.

@fhieber
Contributor

fhieber commented Oct 1, 2018

Thanks, those are the versions, but are you pip-installing mxnet-cuXmkl==1.3.0 or mxnet-cuX==1.3.0?
Similarly, is your numpy version MKL-optimized? If you use Anaconda as your Python distribution, you can check via conda list | grep numpy; if it says <pip> in the last column, it's not using MKL.

@franckbrl
Contributor Author

For MxNet, I pip-installed the requirements from here, which installs mxnet-cu90mkl.

As for Numpy, numpy.show_config() says:

blas_opt_info:
    libraries = ['openblas', 'openblas']
    define_macros = [('HAVE_CBLAS', None)]
    library_dirs = ['/usr/local/lib']
    language = c
blas_mkl_info:
  NOT AVAILABLE

@tdomhan
Contributor

tdomhan commented Oct 2, 2018

@franckbrl I have a work-around for a related MKL hang. Would you be able to test whether this would also fix your problem? I committed the change to:
https://github.com/awslabs/sockeye/tree/forkserver
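
(As far as one can tell from the branch name, the gist of the work-around is to create the CheckpointDecoder process from a 'forkserver' context instead of 'spawn'; a rough sketch of that kind of change, not the actual branch diff:)

    import multiprocessing as mp

    def run_decoder(checkpoint):
        print("decoding checkpoint", checkpoint)

    if __name__ == "__main__":
        ctx = mp.get_context("forkserver")   # instead of "spawn"
        decoder = ctx.Process(target=run_decoder, args=(8,), name="Decoder-8")
        decoder.start()
        decoder.join()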

@franckbrl
Contributor Author

It's going well so far, but I get a strange message about the decoder not being alive:

[INFO:sockeye.training] Decoder no longer alive...
[INFO:sockeye.training] Decoder-11 finished: {'bleu-val': 0.38399161973481205, 'chrf-val': 0.621943354132811, 'rouge1-val': 0.6120198908499329, 'rouge2-val': 0.4257106522131144, 'rougel-val': 0.5846460755885678, 'avg-sec-per-sent-val': 1.4005149364471436, 'decode-walltime-val': 14.005149364471436}
[INFO:sockeye.training] Starting process: Decoder-12

@tdomhan
Contributor

tdomhan commented Oct 2, 2018

that should be fine. I added this as a debug logging statement, but I think you can safely ignore it. It mainly means that your decoder finished before reaching the next checkpoint, which is to be expected.

Let's see how the rest of training goes. Keep us posted.
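
(Schematically, the situation described above looks roughly like this; a hypothetical sketch, not the actual training loop:)

    import multiprocessing as mp
    import time

    def decode(q):
        q.put({"bleu-val": 0.384})  # placeholder metrics

    if __name__ == "__main__":
        ctx = mp.get_context("forkserver")
        results = ctx.Queue()
        decoder = ctx.Process(target=decode, args=(results,), name="Decoder-11")
        decoder.start()
        time.sleep(1)                 # stands in for the training done between checkpoints
        if not decoder.is_alive():    # a short decode usually finishes before this point
            print("Decoder no longer alive...")
        print("Decoder-11 finished:", results.get())
        decoder.join()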

@tdomhan
Contributor

tdomhan commented Oct 2, 2018

btw, would you have normally hit the issue at this point, or is it too early to say?

@franckbrl
Contributor Author

I'm running frequent validations and the decoder has been successfully started for the 22nd time. I'll let it run all night and tell you how it went tomorrow.

@tdomhan
Contributor

tdomhan commented Oct 2, 2018

perfect, thanks!

@franckbrl
Contributor Author

258 validations have been successfully run. I can't explain why and when the issue occurred before, so I can't tell you for sure that I would have encountered it with the master branch version of the code.

franckbrl reopened this Oct 3, 2018
@tdomhan
Contributor

tdomhan commented Oct 4, 2018

That's good to hear. We will work on integrating a version of this fix into master. Let us know in case you still run into issues.

@franckbrl
Contributor Author

I'm not sure this was really helpful. I repeated exactly what had brought me to this issue earlier: I went back to the master branch code and activated the virtual environment inside my bash script. The decoder worked fine for 234 validations. So I'm sorry to say I have no idea whether what you did on the forkserver branch actually helped.

@franckbrl
Contributor Author

The issue happened again with MxNet MKL version 1.3.0.post0 and Sockeye version 1.18.56. The virtual environment was manually activated, so my earlier observations on this are not relevant. @tdomhan Should I go back to experimenting on the forkserver branch?

tdomhan reopened this Oct 10, 2018
@tdomhan
Contributor

tdomhan commented Oct 10, 2018

That's unfortunate! If you could try the forkserver branch again, to see whether this fixes your issue, that would be highly appreciated. I'm currently still looking into this issue and trying to confirm that the forkserver method successfully fixes it. Given the difficulty of reproducing the issue, it is also difficult to confirm the fix. So any additional datapoints would be very helpful :)

@franckbrl
Contributor Author

Thank you, I'll use it and report here.

Here is one new thing I haven't seen before. When I kill Sockeye, I now get the following message:

/usr/lib/python3.5/multiprocessing/semaphore_tracker.py:129: UserWarning: semaphore_tracker: There appear to be 3 leaked semaphores to clean up at shutdown
  len(cache))

@franckbrl
Contributor Author

@tdomhan It's been nearly 2 months and I've trained several systems on different machines using the forkserver branch. The problem has not occurred once. So forkserver seems to have solved it. Were your experiments as satisfactory?

@tdomhan
Contributor

tdomhan commented Nov 26, 2018

That is great to hear. In the internal evaluation we ran, we also no longer observed this issue with the forkserver branch. We should now move ahead and integrate this change into the master branch :)

@franckbrl
Contributor Author

franckbrl commented Nov 27, 2018

Great! We'll be waiting for this one! Shall we close the issue now?

@tdomhan
Contributor

tdomhan commented Nov 27, 2018

let's leave the issue open while we still don't have this in master.

@tdomhan
Contributor

tdomhan commented Dec 12, 2018

I merged the change. Let us know if you have any issues. I will close the issue for now.

tdomhan closed this as completed Dec 12, 2018