[FEA] COLLECT window aggregation should support null_policy::EXCLUDE #7258

Closed
mythrocks opened this issue Jan 30, 2021 · 0 comments · Fixed by #7264
Assignees: mythrocks
Labels: feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code), Spark (Functionality that helps Spark RAPIDS)

Comments


mythrocks commented Jan 30, 2021

#7189 implements `COLLECT` aggregations for window functions. Null input rows are handled consistently with CUDF semantics, i.e. nulls are retained in the output lists.
E.g.

```c++
auto input_col = fixed_width_column_wrapper<int32_t>{70, ∅, 72, 73, 74};
auto output_col = cudf::rolling_window(input_col, 2, 1, 1, collect_aggr);
            // == [ [70,∅], [70,∅,72], [∅,72,73], [72,73,74], [73,74] ]
```

Note that the null element (`∅`) is replicated in the first 3 rows of the output.

SparkSQL (like Hive and other big-data SQL systems) has different semantics: all null elements are purged from the collected lists. The same operation should therefore yield the following output:

```c++
auto sparkish_output_col = cudf::rolling_window(input_col, 2, 1, 1, collect_aggr);
            // == [ [70], [70,72], [72,73], [72,73,74], [73,74] ]
```

CUDF should allow the `COLLECT` aggregation to be constructed with an optional `null_policy` argument (defaulting to `INCLUDE`). The `COLLECT` window function should check the policy and filter out null list-elements _a posteriori_.
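A minimal sketch of how the proposed construction might be used is below. The `null_policy` parameter on `cudf::make_collect_aggregation()` is the behaviour being requested here, not an existing API at the time of filing, so treat the exact factory and `rolling_window` signatures as illustrative assumptions.

```c++
// Sketch only: the null_policy parameter on make_collect_aggregation() is the
// *proposed* API from this issue; exact signatures are assumptions.
#include <cudf/aggregation.hpp>
#include <cudf/rolling.hpp>
#include <cudf_test/column_wrapper.hpp>

void collect_excluding_nulls_example()
{
  using cudf::test::fixed_width_column_wrapper;

  // Same input as above; the second row is null (validity mask = {1,0,1,1,1}).
  auto input_col = fixed_width_column_wrapper<int32_t>{{70, 0, 72, 73, 74},
                                                       {1, 0, 1, 1, 1}};

  // Default: nulls are kept in the gathered lists (current behaviour).
  auto collect_incl = cudf::make_collect_aggregation(cudf::null_policy::INCLUDE);

  // Proposed: purge nulls from each list, matching Spark/Hive semantics.
  auto collect_excl = cudf::make_collect_aggregation(cudf::null_policy::EXCLUDE);

  auto with_nulls    = cudf::rolling_window(input_col, 2, 1, 1, collect_incl);
  auto without_nulls = cudf::rolling_window(input_col, 2, 1, 1, collect_excl);
  // with_nulls    == [ [70,∅], [70,∅,72], [∅,72,73], [72,73,74], [73,74] ]
  // without_nulls == [ [70],   [70,72],   [72,73],   [72,73,74], [73,74] ]
}
```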

mythrocks added the feature request and Needs Triage labels on Jan 30, 2021
mythrocks self-assigned this on Jan 30, 2021
kkraus14 added the libcudf and Spark labels and removed the Needs Triage label on Jan 30, 2021
rapids-bot (bot) pushed a commit that referenced this issue on Feb 18, 2021
Closes #7258.

Authors:
  - MithunR (@mythrocks)

Approvers:
  - Ram (Ramakrishna Prabhu) (@rgsl888prabhu)
  - AJ Schmidt (@ajschmidt8)
  - Vukasin Milovanovic (@vuule)
  - Jake Hemstad (@jrhemstad)

URL: #7264