You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Is your feature request related to a problem? Please describe. Rlike was an very expensive cudf operation compared to other operator like compare. One option is mentioned in #10600 that we can replace some cases with cheaper expression. Another option we could have is to introduce short-circuit evaluation to skip some regexp by prioritizing to evaluate those cheaper evaluations.
Describe the solution you'd like
record_name regexpr (.*[0-9]{2}) AND date BTWEEN `20201111` AND `20201112` AND length(store_names) > 15
For condition mentioned above, we could optimize it by:
Evaluate other conditions firstly other than regexp
Support null masking filter on the fisrst 2 conditions and evaluate null masked data using regexp
(nice to have) reorder date and length conditions on the fly based on its selectivity
The text was updated successfully, but these errors were encountered:
On the GPU the problems typically show up around thread divergence and non-coalesed memory access patterns.
I am not 100% sure about this so we should run some experiments and see, but I don't think it is a clear win every time. If the string is long, then it will always have a bad memory access pattern, and the length of time that the kernel takes to run is likely the amount of time it takes for the longest string to be processed by a warp. If the string is short then the memory access pattern is likely good, and even though we might still have thread divergence it is decently fast.
Replacing string values with nulls requires a memory copy, and copy_if_else for strings is not known to be that great. So we might need some kind of a heuristic to see what matters. I think in really bad cases today this could be a big win, especially if we can free up some warps early to process more data, and there are enough input strings to make that be a win.
Is your feature request related to a problem? Please describe.
Rlike
was an very expensive cudf operation compared to other operator like compare. One option is mentioned in #10600 that we can replace some cases with cheaper expression. Another option we could have is to introduce short-circuit evaluation to skip some regexp by prioritizing to evaluate those cheaper evaluations.Describe the solution you'd like
For condition mentioned above, we could optimize it by:
The text was updated successfully, but these errors were encountered: