Add JNI for splitting groups in a table after groupby [skip-ci] #7954

Merged
merged 4 commits into from
Apr 20, 2021

firestarman
Contributor

@firestarman firestarman commented Apr 14, 2021

This PR adds a JNI API named contiguousSplitGroups, along with its unit tests. The API splits a table into its groups after a groupby operation, instead of executing an aggregation on each group.

Some Spark operators (e.g. Python UDFs) will use this API to process the data group by group.
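
To illustrate the idea of splitting a table into groups rather than aggregating them, here is a minimal conceptual sketch in plain Java. It does not use the cuDF Java API and is not the implementation in this PR: the class `GroupSplitSketch`, the method `splitGroups`, and the row representation (`int[]` per row) are all hypothetical names chosen for illustration. It only shows that each group keeps all of its rows intact.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is NOT the cuDF contiguousSplitGroups API.
public class GroupSplitSketch {

    // Splits rows into one sub-table per distinct key value.
    // Rows are kept whole (no aggregation); groups appear in first-seen key order.
    public static List<List<int[]>> splitGroups(List<int[]> rows, int keyIndex) {
        Map<Integer, List<int[]>> groups = new LinkedHashMap<>();
        for (int[] row : rows) {
            groups.computeIfAbsent(row[keyIndex], k -> new ArrayList<>()).add(row);
        }
        return new ArrayList<>(groups.values());
    }

    public static void main(String[] args) {
        // Each row is {key, value}.
        List<int[]> table = Arrays.asList(
            new int[]{1, 10},
            new int[]{2, 20},
            new int[]{1, 30},
            new int[]{3, 40},
            new int[]{2, 50});

        // A consumer (for example a per-group UDF) could process each sub-table in turn.
        for (List<int[]> group : splitGroups(table, 0)) {
            System.out.println("key=" + group.get(0)[0] + " rows=" + group.size());
        }
    }
}
```

An aggregation would reduce each group to a single value; here every group is returned as its own list of rows, which is the shape a group-by-group consumer needs.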

Other changes:

  • Renames AggregateOperation to GroupByOperation, which is a better name since it is returned from a groupby call.
  • Adds some additional fields to GroupByOptions, which the native groupby will use to probably achieve better performance.

Signed-off-by: Firestarman [email protected]

Along with its unit tests.

Signed-off-by: Firestarman <[email protected]>
@firestarman firestarman requested a review from a team as a code owner April 14, 2021 10:05
@firestarman firestarman changed the title Add JNI for the splitting groups in a table after groupby. Add JNI for the splitting groups in a table after groupby [skip ci] Apr 14, 2021
@github-actions github-actions bot added the Java Affects Java cuDF API. label Apr 14, 2021
@firestarman firestarman changed the title Add JNI for the splitting groups in a table after groupby [skip ci] Add JNI for splitting groups in a table after groupby [skip ci] Apr 14, 2021
@firestarman firestarman changed the title Add JNI for splitting groups in a table after groupby [skip ci] Add JNI for splitting groups in a table after groupby [skip-ci] Apr 14, 2021
@firestarman firestarman added feature request New feature or request non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS labels Apr 14, 2021
@firestarman firestarman linked an issue Apr 14, 2021 that may be closed by this pull request
@codecov

codecov bot commented Apr 14, 2021

Codecov Report

Merging #7954 (e665b4a) into branch-0.20 (599f62d) will increase coverage by 0.41%.
The diff coverage is 92.73%.

❗ Current head e665b4a differs from pull request most recent head a167264. Consider uploading reports for the commit a167264 to get more accurate results

@@               Coverage Diff               @@
##           branch-0.20    #7954      +/-   ##
===============================================
+ Coverage        82.30%   82.71%   +0.41%     
===============================================
  Files              101      103       +2     
  Lines            17053    17711     +658     
===============================================
+ Hits             14035    14650     +615     
- Misses            3018     3061      +43     
| Impacted Files | Coverage Δ |
| --- | --- |
| python/cudf/cudf/utils/dtypes.py | 83.44% <ø> (-6.45%) ⬇️ |
| python/cudf/cudf/utils/utils.py | 83.25% <ø> (-1.81%) ⬇️ |
| python/dask_cudf/dask_cudf/backends.py | 89.58% <ø> (-0.05%) ⬇️ |
| python/cudf/cudf/core/groupby/groupby.py | 92.41% <78.57%> (-1.04%) ⬇️ |
| python/cudf/cudf/core/column/lists.py | 87.41% <80.00%> (+0.19%) ⬆️ |
| python/cudf/cudf/core/column/struct.py | 96.29% <86.66%> (-3.71%) ⬇️ |
| python/cudf/cudf/core/index.py | 93.04% <88.09%> (+0.01%) ⬆️ |
| python/cudf/cudf/core/column/column.py | 87.86% <88.57%> (+0.43%) ⬆️ |
| python/cudf/cudf/core/column/decimal.py | 92.92% <91.48%> (-0.92%) ⬇️ |
| python/cudf/cudf/core/column/interval.py | 91.11% <92.30%> (+0.48%) ⬆️ |
| ... and 69 more | |

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d1c3245...a167264. Read the comment docs.

@firestarman
Contributor Author

rerun tests

@firestarman firestarman requested review from revans2 and jlowe April 15, 2021 03:06
@firestarman firestarman added the 3 - Ready for Review Ready for review by team label Apr 15, 2021
@firestarman firestarman changed the title Add JNI for splitting groups in a table after groupby [skip-ci] Add JNI for splitting groups in a table after groupby Apr 15, 2021
@firestarman
Contributor Author

rerun tests

@firestarman firestarman changed the title Add JNI for splitting groups in a table after groupby Add JNI for splitting groups in a table after groupby [skip-ci] Apr 15, 2021
Contributor

@revans2 revans2 left a comment

Overall the code looks good. My only other real concern is that there is no code reuse between the new API and the existing aggregations API.

Signed-off-by: Firestarman <[email protected]>
@firestarman
Contributor Author

firestarman commented Apr 16, 2021

Overall the code looks good. My only other real concern is that there is no code reuse between the new API and the existing aggregations API.

The new API and the existing 'aggregate()' now at least share the groupby options. Is that OK, @revans2?

@firestarman firestarman added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Apr 20, 2021
@firestarman
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 6fb7909 into rapidsai:branch-0.20 Apr 20, 2021
@firestarman firestarman deleted the split-groups branch April 20, 2021 01:51
Labels
3 - Ready for Review Ready for review by team 5 - Ready to Merge Testing and reviews complete, ready to merge feature request New feature or request Java Affects Java cuDF API. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Development

Successfully merging this pull request may close these issues.

[FEA] JNI: Support splitting groups in a table after a groupby
3 participants