-
Notifications
You must be signed in to change notification settings - Fork 674
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implements NDScope to automatically close NDArray in the scope #2321
Implements NDScope to automatically close NDArray in the scope #2321
Conversation
This PR can be used to solve the problem #2210. As an illustration:
To repair the memory leak (red situation in the picture) the only code that had to be changed is introducing a RCScope (green situation in the picture) :
@KexinFeng The temporary solution #2304 that fixed the performance impact of this memory leak could potentially be removed after this PR. |
Garbage collection and reference queues do not provide efficient real-time guarantee of unreachability, while reference-counting provides an alternative mechanism at the cost of slight inconvenience. [source ... from netty] One cost of slight inconvenience: With the high level interface
|
Codecov ReportBase: 72.08% // Head: 74.36% // Increases project coverage by
📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more Additional details and impacted files@@ Coverage Diff @@
## master #2321 +/- ##
============================================
+ Coverage 72.08% 74.36% +2.27%
- Complexity 5126 6773 +1647
============================================
Files 473 664 +191
Lines 21970 29403 +7433
Branches 2351 3053 +702
============================================
+ Hits 15838 21866 +6028
- Misses 4925 6048 +1123
- Partials 1207 1489 +282
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
Performance looks good now. I have looked for hot spots using JProfiler. The following screenshot from an example application that heavily uses the reference counting code (by using the high-level UI). None of the methods that consume more than 1% of the time belong to the reference counting code. |
Thanks for this contribution. Very cool idea. Also thanks for mentioning this. But the issue PR #2304 originates from the issue #2024 regarding the setting of gradient accumulation: add or overwrite. Regarding the subsequent issue #2210 you opened, the inefficiency there is due to that the searching scope is too large. So probably this PR still wouldn't heal it. |
@KexinFeng Thx for looking into it. I really love how this reference counting deterministically works and solves the memory leak problem ... but the idea comes from your team: @lanking520 referenced almson-refcount and @frankfliu referenced the JavaCPP with the easy to use Scope approach. The approach can work completely without relying on GC and therefore likely avoids problems @zachgk mentioned. Concerning PR #2304 ... you are the expert. I just wanted to mention it because I had the impression that the too large searching scope did not hurt until the memory-leaking NDArrays came into play. |
@enpasos A quick question about
does this PR also solves the efficiency problem that was found in that same issue #2210? If searching scope being too large were really the root cause, then probably the efficiency issue would still be there. But if not, there might be a different root cause. |
@KexinFeng Thx for asking. I didn't test it before ... and with the test I ran into a bug in the
My interpretation: With the "temporary solution" the performance impact is fixed but the causing memory leak is still there. With "reference counting but not temporary solution" the memory leak is fixed and the caused performance impact is fixed, too. The performance numbers of "temporary solution" are slightly better then "reference counting but not temporary solution", so "reference counting with temporary solution" should give the best in terms of memory leak and performance impact. But if there are other reasons you would like to remove the "temporary solution" again, it is not strictly needed to solve issue #2210. |
Thanks so much for these test results! It looks like reference counting without temporary solution indeeds solves the issue, the reason of which probably is not known until the true root cause of the previous issue is found. From these test results, there are also other bizzare issues. The first is that, temporary the solution #2304 is nothing but the revertion of the features PR 2111 and PR2101. So the results indicate that DJL has memory leak issues already before PR 2111 and PR2101. One check is whether it is due to the PtNDArray indexing memory leak issue, which was solved later than this in #2300. On the other hand, if this was indeed the root cause, or if anyother root cause were indentified, then how the reference counting featue solves these issues would be an interesting example to demonstrate its advantage. The second is that from the efficiency plot, it looks like the global searching doesn't necessarily increase the duration. Probably it is relavant to memory leak, so when reference counting feature is applied, it is solved at the same time. |
@KexinFeng The introductory DJL Doc Example as I put it together as an example is for my understanding a from scratch and not a very sophisticated approach - but the memory leaking mistakes made there are on the level I make them - as an example for a DJL user. One "memory leak" there: the I think this memory leak is not related to the indexing memory leak issue, solved in #2300 |
Hi @enpasos, thanks again for your very nice contribution. We have had several team meetings to discuss your PRs, including this one and the preceding one #2273, and we were inspired by your memory management idea and the relevant Java language knowledge. More specifically, we do agree that your implementation of using So can we propose the following? We can force-push our simplified version to this PR, so that the feature is credited to you. But before this, you can first backup your current implementation, and if you still want, you can create a new PR and put them there. What do you think? |
a6dde78
to
6ae6600
Compare
Perfect! Many thanks to @KexinFeng, @frankfliu and the DJL team! The solution you found is just right to completely solve my memory leak problems. It is elegant per se and in its integration with DJL. |
I am missing the feature to find a specific NDArray when I get the error "Native resource has been released already". As this is a feature, I would follow @KexinFeng's suggestion to add a small PR when this PR is on the master. |
…ows an Exception - fixed
When I first tested your solution I took the two files from KexinFeng's link and hooked them in by NDScope.register from the constructors of PtNDArray. Then I run my application and :-) ... no exceptions and the memory leak was away. Then I recognized there already was the integrated solution pushed. I checked it out, ran with my application into minor bugs which I fixed. But then I recognized that the memory leak was back again. The leak was in the detach Method of PtNDArray - somehow the To really get rid of memory leaks where the user has no clue where they come from, it is in my opinion crucial that NDScope closes the NDArrays created inside it, no matter what is done inside NDScope, unless So I think the |
} | ||
|
||
detached = manager.create(new int[] {1}); | ||
detached.detach(); // detached from NDManager |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- In most cases, when
NDArray.detach()
is called, user want to return the NDArray from the function. - And usually, NDManager is at the out side if NDScope
Which means user has to call both function in most of time. And we documented that once NDArray is detached, user must manually close it.
So I feel we should unregister from the NDScope when NDArray.detach()
is called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we are abusing NDArray.detach()
in many internal place (attache()
, they should only detach from NDManager only. Introducing two detach(boolean)
function seems too much. Requires user to call both should be fine for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is your intention to introduce NDScope?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My intention would be: Implement it because you need it to close a memory leak that you cannot solve otherwise. So I would not use NDScope if I did not have to. But if I have a leak - which I unfortunately do - I would put the NDScope where I think no NDArray should leak through - no matter if inside or outside of NDManager and no matter how they are used. And in that case I would not want NDArrays to leak through for whatever reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we are abusing
NDArray.detach()
in many internal place (attache()
, they should only detach from NDManager only. Introducing twodetach(boolean)
function seems too much. Requires user to call both should be fine for now.
I agree that if there was a convenient method where the user knew exactly that he/she was also unregistering from NDScope, there should be no problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we leave it as it is for this PR?
if (((NDArray) obj).isReleased()) { | ||
return this == obj; | ||
} | ||
if (((NDArray) obj).getManager() != manager) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
equals()
compares the content of two NDArray, if their NDManager are different, their content still can be equals
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NDScope.unregister calls resources.remove which uses equals to identify the object.
I got an exception from unregister in two cases: one object (not the one to be unregistered) in NDScope was on a different Device and one was already closed. The lines I added - maybe not perfect - were intended to solve both cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, we should remove the NDArray by it's uid
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, we should remove the NDArray by it's uid
I think this is not save because the uid from the handle is only unique as long as the handle lives. If you create a uuid by uid plus handleCreationTimestamp it would do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, uid meant to be unique, but it's possible have a clash if the NDArray is closed and the same memory get reused. We should add a timestamp to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problems I experience where Exceptions coming from the equals-Method. That is not part of the equals method contract - it should answer yes or no.
I agree that if you want to have the contract on NDScope on the same instance of the NDArray - a clear contract - you need to solve it in NDScope.
A simple wrapper of NDArray would do, which is only used inside NDScope to put into the List. It should just inherit the equals implementation from Object to detect the instance on removing from the List.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's separate in different issues:
- NDScope should track the object itself, the ArrayList might not be the right collection, we can use
IdentityHashMap
to resolve this problem - uid is not unique. we can solve this problem in a different PR
equals()
function:isReleased()
is a bug, we can fixed it here or in a different PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I comitted a suggestion, but I think it is not correct ... NDArrayWrapper and NDArrayList it a problem ... I will fix it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx. IdentityHashMap works nicely. Better solution than my wrapper suggestion ... and using NDArrayList with the wrapper is simply wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Point 1. I think done.
Points 2. and 3. in a separate PR sounds good.
Co-authored-by: Frank Liu <[email protected]>
This PR introduces - as suggest by @frankfliu - reference counting to prevent or get rid of memory leaks on the native side of NDArrays. As suggested by @frankfliu the implementation is derived from the well established reference counting code of JavaCPP. It is a deterministic approach and does not rely on Java garbage collection as the approach #2273.
The user interface - in this first attempt - is already easy to use thanks to the design by JavaCPP's author Samuel Audet:
RCObject
(= reference counted object).RCScope
(= reference counter scope).The approach works loosely together with the NDManager hierarchy - with potential for closer integration.
At the moment the feature is implemented for the PyTorch
NativeResource
PtNDArray
. The extension to otherNativeResource
s should be simple.