[FEA] COLLECT window aggregation should support null_policy::EXCLUDE #7258

Closed
mythrocks opened this issue Jan 30, 2021 · 0 comments · Fixed by #7264
Assignees: mythrocks
Labels: feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code), Spark (Functionality that helps Spark RAPIDS)

Comments


mythrocks commented Jan 30, 2021

#7189 implements `COLLECT` aggregations for window functions. Null input rows are handled consistently with CUDF semantics, i.e. nulls are retained in the output lists.
E.g.

```c++
auto input_col = fixed_width_column_wrapper<int32_t>{70, ∅, 72, 73, 74};
auto output_col = cudf::rolling_window(input_col, 2, 1, 1, collect_aggr);
            // == [ [70,∅], [70,∅,72], [∅,72,73], [72,73,74], [73,74] ]
```

Note that the null element (`∅`) is replicated in the first 3 rows of the output.

SparkSQL (like Hive and other big-data SQL systems) has different semantics: all null elements are purged from the collected lists. The same operation should therefore yield the following output:

```c++
auto sparkish_output_col = cudf::rolling_window(input_col, 2, 1, 1, collect_aggr);
            // == [ [70], [70,72], [72,73], [72,73,74], [73,74] ]
```

CUDF should allow the `COLLECT` aggregation to be constructed with an optional `null_policy` argument (defaulting to `INCLUDE`). The `COLLECT` window function should check the policy and filter out null list-elements _a posteriori_.
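A minimal sketch of how the proposed construction might be used is below. The `null_policy` parameter on `cudf::make_collect_aggregation()` is the behaviour being requested here, not an existing API at the time of filing, so treat the exact factory and `rolling_window` signatures as illustrative assumptions.

```c++
// Sketch only: the null_policy parameter on make_collect_aggregation() is the
// *proposed* API from this issue; exact signatures are assumptions.
#include <cudf/aggregation.hpp>
#include <cudf/rolling.hpp>
#include <cudf_test/column_wrapper.hpp>

void collect_excluding_nulls_example()
{
  using cudf::test::fixed_width_column_wrapper;

  // Same input as above; the second row is null (validity mask = {1,0,1,1,1}).
  auto input_col = fixed_width_column_wrapper<int32_t>{{70, 0, 72, 73, 74},
                                                       {1, 0, 1, 1, 1}};

  // Default: nulls are kept in the gathered lists (current behaviour).
  auto collect_incl = cudf::make_collect_aggregation(cudf::null_policy::INCLUDE);

  // Proposed: purge nulls from each list, matching Spark/Hive semantics.
  auto collect_excl = cudf::make_collect_aggregation(cudf::null_policy::EXCLUDE);

  auto with_nulls    = cudf::rolling_window(input_col, 2, 1, 1, collect_incl);
  auto without_nulls = cudf::rolling_window(input_col, 2, 1, 1, collect_excl);
  // with_nulls    == [ [70,∅], [70,∅,72], [∅,72,73], [72,73,74], [73,74] ]
  // without_nulls == [ [70],   [70,72],   [72,73],   [72,73,74], [73,74] ]
}
```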

mythrocks added the feature request and Needs Triage labels on Jan 30, 2021
mythrocks self-assigned this on Jan 30, 2021
kkraus14 added the libcudf and Spark labels and removed the Needs Triage label on Jan 30, 2021
rapids-bot (bot) pushed a commit that referenced this issue on Feb 18, 2021
Closes #7258.

Authors:
  - MithunR (@mythrocks)

Approvers:
  - Ram (Ramakrishna Prabhu) (@rgsl888prabhu)
  - AJ Schmidt (@ajschmidt8)
  - Vukasin Milovanovic (@vuule)
  - Jake Hemstad (@jrhemstad)

URL: #7264