Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing issue with explode_outer position not nulling position entries of null rows #7754

Merged

Conversation

hyperbolic2346
Copy link
Contributor

explode_outer supports writing a position column, but if the row was null it would incorrectly set the position to 0 and the row valid. Instead, it should null that position row as well. Luckily the null column matches 100% with the null column of the exploded column, so we can just copy it after it is created.

Fixes #7721

@hyperbolic2346 hyperbolic2346 requested a review from a team as a code owner March 29, 2021 20:30
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Mar 29, 2021
@hyperbolic2346 hyperbolic2346 added 3 - Ready for Review Ready for review by team bug Something isn't working non-breaking Non-breaking change and removed libcudf Affects libcudf (C++/CUDA) code. labels Mar 29, 2021
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Mar 29, 2021
@codecov
Copy link

codecov bot commented Mar 29, 2021

Codecov Report

Merging #7754 (e0315dc) into branch-0.19 (7871e7a) will increase coverage by 0.43%.
The diff coverage is n/a.

❗ Current head e0315dc differs from pull request most recent head 7756567. Consider uploading reports for the commit 7756567 to get more accurate results
Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.19    #7754      +/-   ##
===============================================
+ Coverage        81.86%   82.30%   +0.43%     
===============================================
  Files              101      101              
  Lines            16884    17053     +169     
===============================================
+ Hits             13822    14035     +213     
+ Misses            3062     3018      -44     
Impacted Files Coverage Δ
python/cudf/cudf/core/column/lists.py 87.21% <0.00%> (-4.18%) ⬇️
python/cudf/cudf/core/column/decimal.py 93.84% <0.00%> (-1.03%) ⬇️
python/cudf/cudf/utils/utils.py 85.06% <0.00%> (-0.38%) ⬇️
python/cudf/cudf/core/column/column.py 87.43% <0.00%> (-0.33%) ⬇️
python/cudf/cudf/core/scalar.py 86.91% <0.00%> (-0.19%) ⬇️
python/cudf/cudf/utils/ioutils.py 78.71% <0.00%> (ø)
python/cudf/cudf/utils/cudautils.py 50.38% <0.00%> (ø)
python/cudf/cudf/core/tools/datetimes.py 84.44% <0.00%> (ø)
python/cudf/cudf/core/column/numerical.py 95.02% <0.00%> (ø)
python/cudf/cudf/core/join/casting_logic.py
... and 17 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cddafd9...7756567. Read the comment docs.

@sperlingxx
Copy link
Contributor

This change looks good in terms of spark-rapids.

@jrhemstad
Copy link
Contributor

@kkraus14 is there a Pandas equivalent for "explode_outer" that we should be concerned with matching its semantics?

@kkraus14
Copy link
Collaborator

@kkraus14 is there a Pandas equivalent for "explode_outer" that we should be concerned with matching its semantics?

Yes, it's just called explode in Pandas. The semantics are empty lists --> null, nulls --> null.

import pandas as pd

pdf = pd.DataFrame({'a': [[1, 2, 3], [4, 5], [], pd.NA], 'b': [1, 2, 3, 4]})
pdf.explode('a')
      a  b
0     1  1
0     2  1
0     3  1
1     4  2
1     5  2
2   NaN  3
3  <NA>  4

Ignore the NaN value where that should just be <NA> as well, which is Pandas representation for null.

@jrhemstad
Copy link
Contributor

So cudf::explode_outer -> pd.explode, cudf::explode -> pd.?

@jrhemstad
Copy link
Contributor

  a  b

0 1 1
0 2 1
0 3 1
1 4 2
1 5 2
2 NaN 3
3 4

Ah, but this is different behavior. The change @hyperbolic2346 just made would result in:

  a  b

0 1 1
0 2 1
0 3 1
1 4 2
1 5 2
null NaN 3
null 4

i.e., the position value for null/empty lists is also null.

@kkraus14
Copy link
Collaborator

cudf::explode -> pd.?

We don't have a need for an equivalent of a non-outer explode at this time.

@jrhemstad
Copy link
Contributor

What's more, @hyperbolic2346's original implementation would have resulted in:

  a  b

0 1 1
0 2 1
0 3 1
1 4 2
1 5 2
0 NaN 3
0 4

i.e., zeros in the empty/null list positions. So that didn't do what Pandas wanted either.

@jrhemstad
Copy link
Contributor

In fact, @kkraus14 the example you gave is totally different.

For input data like:

  • [[5,null,15], 100]
  • [null, 200]
  • [[], 300]

@hyperbolic2346 's original implementation returns

  • [0, 5, 100]
  • [1, null, 100]
  • [2, 15, 100]
  • [0, null, 200]
  • [0, null, 300]

The new one would do:

  • [0, 5, 100]
  • [1, null, 100]
  • [2, 15, 100]
  • [null, null, 200]
  • [null, null, 300]

@jrhemstad
Copy link
Contributor

Ah, perhaps Pandas doesn't need the cudf::explode_outer_position. You could just do explode_outer and prepend the index as an additional column that would yield the result from @kkraus14 's example.

@kkraus14
Copy link
Collaborator

Ah, perhaps Pandas doesn't need the cudf::explode_outer_position. You could just do explode_outer and prepend the index as an additional column that would yield the result from @kkraus14 's example.

Yes, we do not need position handling, just explode_outer and we'll pass the index column(s) as needed.

Copy link
Contributor

@codereport codereport left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

cpp/src/lists/explode.cu Outdated Show resolved Hide resolved
@hyperbolic2346
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 635dc9c into rapidsai:branch-0.19 Mar 30, 2021
@hyperbolic2346 hyperbolic2346 deleted the mwilson/explode_outer_pos_nulls branch March 30, 2021 23:12
rapids-bot bot pushed a commit that referenced this pull request Mar 31, 2021
After #7754 the Java explode outer unit tests were not updated to expect the nulls.

Authors:
  - Jason Lowe (@jlowe)

Approvers:
  - Robert (Bobby) Evans (@revans2)

URL: #7782
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] explode_outer_position doesn't match to Spark's counterpart
5 participants