Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Add equi-join support for structs #7543

Closed
hyperbolic2346 opened this issue Mar 9, 2021 · 3 comments · Fixed by #7720
Closed

[FEA] Add equi-join support for structs #7543

hyperbolic2346 opened this issue Mar 9, 2021 · 3 comments · Fixed by #7720
Assignees
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@hyperbolic2346
Copy link
Contributor

There is no current way to run a join using a struct column. It would be nice to support this in cudf. This will probably leverage some of the work from #7422.

Describe the solution you'd like
It would certainly be nice to support arbitrary depths, but based on #7422 it might require a first-pass at a single depth and a later PR to add support for more depths.

Of particular concern is handling of NaN, Inf, -Inf, etc. These should be handled exactly the same as if the comparison was made between columns of the types inside the structs, so this should be fairly straight forward. The other concern is null handling. Join has the option of nulls comparing equal or not based on incoming parameters. There is some potential for issues around nulls here and care needs to be taken with the implementation.

A null struct would not be the same as a struct with all null entries.

@hyperbolic2346 hyperbolic2346 added feature request New feature or request Needs Triage Need team to review and classify labels Mar 9, 2021
@jrhemstad
Copy link
Contributor

It would certainly be nice to support arbitrary depths, but based on #7422 it might require a first-pass at a single depth

We should clarify what this means. "Arbitrary depths" should not mean a struct that contains arbitrarily nested list type(s). But multiple levels of struct, e.g., struct< int, struct<int, float> >.

Arbitrary levels of structs can be handled trivially in the same way as #7422 by exploding the struct parent columns' null masks into bool columns and then handling the struct column as extra columns in the table.

@nvdbaranec
Copy link
Contributor

Do you even need to expand the null masks into seperate columns? If a struct has a null mask, all of it's children will also have a null mask with everything pushed down. Seems like all you need to do is just flatten all the columns into one big table_view and use that for comparison.

@jrhemstad
Copy link
Contributor

Do you even need to expand the null masks into seperate columns? If a struct has a null mask, all of it's children will also have a null mask with everything pushed down. Seems like all you need to do is just flatten all the columns into one big table_view and use that for comparison.

I had the same idea in sorting, but Bobby reminded me why that won't work.

The key is that you need to be able to differentiate null from {null, null, null}, i.e., a null struct vs a struct of all nulls.

Though actually... I wonder if we actually have the same problem in join, especially if nulls are considered unequal.

@hyperbolic2346 hyperbolic2346 self-assigned this Mar 25, 2021
@kkraus14 kkraus14 added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Mar 26, 2021
@rapids-bot rapids-bot bot closed this as completed in #7720 Apr 1, 2021
rapids-bot bot pushed a commit that referenced this issue Apr 1, 2021
Adds support for equijoin on structs.

This PR is leveraging the [struct PR](#7422) and the [rewrite for join API](#7454). It enables equijoin on structs by flattening the struct for the hash calculation.

closes #7543

Authors:
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Ashwin Srinath (https://github.com/shwina)
  - Karthikeyan (https://github.com/karthikeyann)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Alessandro Bellina (https://github.com/abellina)
  - Devavret Makkar (https://github.com/devavret)
  - David Wendt (https://github.com/davidwendt)
  - Liangcai Li (https://github.com/firestarman)
  - Paul Taylor (https://github.com/trxcllnt)
  - Kumar Aatish (https://github.com/kaatish)
  - Jason Lowe (https://github.com/jlowe)
  - Dillon Cullinan (https://github.com/dillon-cullinan)
  - Raza Jafri (https://github.com/razajafri)
  - https://github.com/rwlee
  - Michael Wang (https://github.com/isVoid)
  - Dante Gama Dessavre (https://github.com/dantegd)
  - Keith Kraus (https://github.com/kkraus14)
  - Robert Maynard (https://github.com/robertmaynard)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - https://github.com/ChrisJar
  - AJ Schmidt (https://github.com/ajschmidt8)
  - https://github.com/nvdbaranec
  - Nghia Truong (https://github.com/ttnghia)
  - https://github.com/chenrui17
  - Conor Hoekstra (https://github.com/codereport)
  - Mike Wendt (https://github.com/mike-wendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Devavret Makkar (https://github.com/devavret)
  - Jake Hemstad (https://github.com/jrhemstad)

URL: #7720
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants