
Notebook erroring out when running many iterations #928

Closed
raphpapercup opened this issue Jun 26, 2019 · 3 comments
Labels
bug Something isn't working question Questions for the JAX team

Comments

@raphpapercup

When running the following notebook: https://github.com/ericjang/nf-jax/blob/c6636a010bb744a48185eb3f622d32336e929990/nf-tutorial-jax.ipynb, I have found that if you run all 1e4 iterations in a row and then evaluate y.max(), the notebook crashes. However, if you instead run, say, 1000 iterations, then y.max(), then another 1000 iterations, then y.max(), and so on until reaching 1e4 iterations, the notebook does not crash and everything works as expected.
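For context, the failing pattern can be sketched roughly like this (the `step` function, shapes, and iteration body are illustrative stand-ins, not code from the notebook):

```python
import jax
import jax.numpy as jnp

@jax.jit
def step(y):
    # Hypothetical stand-in for one iteration of work in the notebook.
    return y * 0.999 + 0.001

y = jnp.ones((1000,))
for _ in range(10_000):   # all 1e4 iterations in a row
    y = step(y)           # enqueues work on the device; does not wait
print(y.max())            # forces all queued work to complete
```

Because JAX dispatches asynchronously, the loop itself returns almost immediately; the work only completes when something (like `y.max()`) forces the result.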

@mattjj
Collaborator

mattjj commented Jun 26, 2019

@hawkinsp could this be related to async execution?

@raphpapercup which backend is this on (CPU or GPU)? Could you try running y.block_until_ready() every 1000 iterations, instead of y.max(), just to help us diagnose the bug? See the async dispatch docs for an explanation of why it might be related.
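The suggested diagnostic might look something like this (again with a hypothetical `step` function standing in for the notebook's iteration):

```python
import jax
import jax.numpy as jnp

@jax.jit
def step(y):
    # Hypothetical stand-in for one iteration of work.
    return y * 0.999 + 0.001

y = jnp.ones((1000,))
for i in range(10_000):
    y = step(y)
    if (i + 1) % 1000 == 0:
        # Wait for the device to drain its queue without adding any
        # extra computation (unlike y.max(), which enqueues a reduction).
        y.block_until_ready()
```

If this version does not crash, it points at unbounded async dispatch (too many enqueued operations) rather than the `y.max()` reduction itself.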

@hawkinsp
Collaborator

If this is running on CPU, I can well believe it is related to async execution; you could easily run out of memory. We probably need to limit how far ahead the Python code can run relative to the device. Could you please verify which backend you were using? Thanks!

@hawkinsp hawkinsp added bug Something isn't working question Questions for the JAX team labels Jun 26, 2019
mahak pushed a commit to mahak/tensorflow that referenced this issue Jul 11, 2019
Currently on CPU and GPU there is no limit to how many operations the host can enqueue on the device stream.

On GPU this doesn't usually cause a problem because the allocator is logically synchronized to the tail of the compute stream, and so we can free and reuse memory for operations enqueued on the stream.

On CPU, the allocator is logically synchronized to the head of the compute stream, which means that the allocator cannot reuse buffers between operations enqueued on the stream. This means that the memory usage is proportional to the number of enqueued operations, which can rapidly blow up.

Add a semaphore class and use it to set a moderate limit on the depth of the queue (32). The existing "synchronous" mode, used on TPU at present, is a special case of this support where the queue depth is 1.

This may help with jax-ml/jax#928.

PiperOrigin-RevId: 257606960
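As a rough illustration of the mechanism (a toy Python model with hypothetical names, not the actual C++ change in the runtime), a semaphore bounding the queue depth works like this:

```python
import threading
import queue

class BoundedDispatcher:
    """Toy model: the host may run at most `depth` operations
    ahead of the device "stream" (here, a worker thread)."""
    def __init__(self, depth=32):
        self._sem = threading.Semaphore(depth)  # queue-depth limit
        self._q = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            op = self._q.get()
            if op is None:
                return
            op()                  # "execute" the enqueued operation
            self._sem.release()   # free a slot: host may enqueue more

    def enqueue(self, op):
        self._sem.acquire()       # blocks once `depth` ops are in flight
        self._q.put(op)

    def close(self):
        self._q.put(None)
        self._worker.join()

results = []
d = BoundedDispatcher(depth=32)
for i in range(10_000):
    d.enqueue(lambda i=i: results.append(i))
d.close()
```

With `depth=1` this degenerates into the fully synchronous mode described above for TPU: the host cannot enqueue an operation until the previous one has finished.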
@hawkinsp
Collaborator

I believe this was fixed by the stream pacing mechanism referenced above.
