fix PooledArray performance bottleneck #2733
Conversation
Performance. This PR:
```julia
if x isa PooledVector || allunique(refpool)
    return refpool, refarray
else
    return nothing, nothing
end
```
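For readers outside the PR, here is a minimal sketch of what the refpool/refarray accessors return (the example values are illustrative and not part of this PR):

```julia
using DataAPI, PooledArrays

x = PooledArray(["a", "b", "a", "c"])

# A pooled vector exposes its (small) pool of values and the integer
# reference codes without copying, so a branch like the one above can
# work on the pool instead of scanning the full column.
refpool  = DataAPI.refpool(x)   # pool of values, e.g. ["a", "b", "c"]
refarray = DataAPI.refarray(x)  # integer codes indexing into refpool

# A plain Vector has no pool: refpool falls back to nothing and
# refarray is the array itself, so the fast path is skipped.
DataAPI.refpool(["a", "b"]) === nothing
```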
@quinnj - can Arrow.jl give us an example of a pooled vector with duplicates in refpool?
It's technically allowed to have duplicates in the dictionary pool; but we could also call allunique in the DataAPI.refpool method in Arrow.jl to avoid this. Perhaps we should make it formally part of the refpool contract?
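If uniqueness became part of the refpool contract, an implementer could guard its method roughly like this. This is a hypothetical toy type for illustration only, not Arrow.jl's actual code:

```julia
using DataAPI

# Toy dict-encoded vector type, purely for illustrating the contract.
struct ToyDictVector{T} <: AbstractVector{T}
    pool::Vector{T}    # may contain duplicates in general
    refs::Vector{Int}  # indices into pool
end
Base.size(v::ToyDictVector) = size(v.refs)
Base.getindex(v::ToyDictVector, i::Int) = v.pool[v.refs[i]]

DataAPI.refarray(v::ToyDictVector) = v.refs
# Only advertise the pool when its entries are unique, so consumers can
# rely on uniqueness without having to call allunique themselves.
DataAPI.refpool(v::ToyDictVector) = allunique(v.pool) ? v.pool : nothing
```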
That way we ensure other refpool-enabled array types get the same optimization w/o needing to hard-code PooledVector.
Yes that's a possibility. But are we sure no existing implementations have duplicates and that it couldn't be useful? Another solution would be to add a new function to DataAPI to check whether entries are unique. That may be overkill if having duplicates is not useful.
I don't know, I can't see duplicate values in a pool being useful at all. It's our API, so I say we enforce it. All people would have to do if they allow duplicates is call allunique on their refpool. I think it just comes back to the fact that all our current uses/designs for refpool revolve around it being a unique pool, so we might as well enforce that.
Looks good at least as a short-term measure until we find a general solution.
I think we have a general solution, but @quinnj would have to confirm that it is OK with Arrow.jl.
Which implies that, in this case, the question to @quinnj is:
@nalimilan - just to be 100% sure - for CategoricalArray.jl both refpool and invrefpool are cheap to compute?
Good idea! Then we already have everything we need. CategoricalArray indeed just wraps its internal dict, so it's very cheap. And what about types for which computing invrefpool is expensive?
They can do it lazily and dynamically decide whether it is worth computing. Let me wait for @quinnj to comment on whether he is OK with this; if yes, I will merge and make a patch release.
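As a sketch of the lazy approach mentioned above (a hypothetical LazyPooled wrapper, not how CategoricalArrays.jl or Arrow.jl actually implement it), the value-to-code mapping is built only on first request and then cached:

```julia
using DataAPI

mutable struct LazyPooled{T} <: AbstractVector{T}
    pool::Vector{T}
    refs::Vector{Int}
    invpool::Union{Dict{T,Int},Nothing}  # built lazily
end
LazyPooled(pool::Vector{T}, refs::Vector{Int}) where {T} =
    LazyPooled{T}(pool, refs, nothing)
Base.size(v::LazyPooled) = size(v.refs)
Base.getindex(v::LazyPooled, i::Int) = v.pool[v.refs[i]]

DataAPI.refarray(v::LazyPooled) = v.refs
DataAPI.refpool(v::LazyPooled) = v.pool
function DataAPI.invrefpool(v::LazyPooled)
    if v.invpool === nothing
        # Compute the inverse mapping only when someone actually asks for it.
        v.invpool = Dict(val => i for (i, val) in pairs(v.pool))
    end
    return v.invpool
end
```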
Yes, I'm ok w/ this approach. I think we can figure things out on the Arrow.jl side; I don't think it would be bad even to enforce the uniqueness in Arrow.jl specifically, since the benefit would be great. We could do a uniqueness check/scrub when we do the initial record batch processing. I'll need to think a bit more about invrefpool; I'll ping @dmbates as well since he was the one who implemented the DataAPI methods in Arrow.jl, though I'm not signing him up for any additional work if he doesn't want it 😛
Co-authored-by: Milan Bouchet-Valat <[email protected]>
Thank you!
avoid using allunique for PooledArray.