-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Event leaks #1250
Comments
This looks almost like a design feature: HSAOp's live in a ring buffer HSAQueue::asyncOps which is set to 16384 entries by default, and no attempt to do resource reclamation is made until the pointer gets all the way around the buffer. It is a trivial change to turn that std::vector into std::list and to add garbage collection (any time a new op is pushed to the tail of the list, remove all finished ops from its head), which drops the maximum number of simultaneously coexisting ops in my runs from 60,000 to 4,000. There does not seem to be much need for that buffer in the first place, since no one ever does anything with entries in it, apart from checking the last enqueued op and counting the number of pending ops. |
And, in fact, I see that there's a change in master (but not in the 2.7 tag I've been working with) that goes even further and simply empties the entire queue on every wait. However, things are not as peachy as I'd like, because, with my change, I am now getting odd crashes:
They are random and fairly rare (took me several tries and 30+ min of training to capture one in gdb). I don't see how it could be my doing - I suspect an unrelated bug in resource refcounting that got exposed because objects are getting released sooner than before. Going to rebuild with 346267d#diff-447101ed1788631fe8f3e89117d084f2 and see if that takes care of it. |
You cannot simply stop allocating signals for all kernels. In the current ring buffer implementation, if the buffer loops back around and overwrites a kernel op before it has launched, the device-allocated kernel argument buffer will be free'd before the kernel uses it. We are aware of the large number of signals being used and we have plans to fix it. In the meantime, perhaps try reducing the size of the ring buffer using HCC_ASYNCOPS_SIZE env var. The default is 16K. Or if you're comfortable rebuilding HCC yourself, try cherry-picking the queue flush on wait. a9eec7b |
I've noticed that ROCm uses an astounding number of event objects when used as backend for https://github.com/tensorflow/tensor2tensor. It easily overshoots the kernel default of 4096, triggering warnings in dmesg (see ROCm/ROCK-Kernel-Driver#80) and, according to my traces, the number of InterruptSignal (ROCR-Runtime/src/core/runtime/interrupt_signal.cpp) objects stabilizes somewhere in the neighborhood of 60,000. Which is very hard to explain as anything but a resource leak.
Most of those are created via hsa_signal_create or Kalmar::ctx->getSignal(), and owned by HSAOp objects created in lib/hsa/mcwamp_hsa.cpp, of which there are, likewise, tens of thousands at any given time. It seems that they don't get released promptly when they are done executing, maybe because they form daisy chains of references.
The text was updated successfully, but these errors were encountered: