-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow using dictionary arrays as filters #12382
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me that a DictionaryArray
with Boolean values is likely to only ever perform worse than a native BooleanArray
The rationale is that each element of a DictionaryArray takes at least 1 byte (for the index into the dictionary when indexed by Int8
/UInt8
) where each element in a BooleanArray
consumes a single bit
So while I think this code is fine and we could merge it, the feature seems suspicious to me 🤔
I agree, I was thinking the same thing. The context is a kernel that operates on arrays, the output of which may or may not be used in a filter. I guess the kernel could do the conversion but I figured if we can support it here Filter could do the conversion internally (if it makes sense, maybe it already is?) |
I think this conversion is typically done in coercion -- like in https://docs.rs/datafusion/latest/datafusion/logical_expr/type_coercion/binary/fn.comparison_coercion.html (which basically means DataFusion would apply a cast internally to cast |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense to allow Dictionary as a filter (even though I think subsequent processing will effectively cast it to a BooleanArray
)
I think that functions like |
Yeah I agree, that's probably the better way to do things but does it hurt to also add this support? I guess it could be a performance foot gun to allow this, although I don't know how bad it would be. The alternative of not implementing this is the foot gun of: you implement a kernel, write some tests for it but not this specific case and then discovering via a user reporting a bug that DataFusion has this limitation and have to go patch your kernel to protect against this case. |
Yeah, I think the potential issue would be that your code creates these If DataFusion errors, then you know to go try and fix it, but if it doesn't error, then you may never know That being said, I think this particular change is fine and that we could merge it for the reasons discussed above. |
Thanks again @adriangb and @Dandandan |
See #12511 which is closely related |
Fixes #12380