-
Notifications
You must be signed in to change notification settings - Fork 920
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use List of Columns as Input for drop_nulls
, gather
and drop_duplicates
#9558
Use List of Columns as Input for drop_nulls
, gather
and drop_duplicates
#9558
Conversation
…dropna_list_interface
Codecov Report
@@ Coverage Diff @@
## branch-22.02 #9558 +/- ##
===============================================
Coverage ? 10.47%
===============================================
Files ? 119
Lines ? 20335
Branches ? 0
===============================================
Hits ? 2130
Misses ? 18205
Partials ? 0 Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this is a nice improvement in performance. I have some suggestions for improving implementation and developer documentation, since some methods' assumptions are not very clear to me.
Co-authored-by: Bradley Dice <[email protected]>
…ropna_list_interface
I have added a few It should be ready for another round of review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! Just a couple minor suggestions, otherwise LGTM 😄
Co-authored-by: Bradley Dice <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One minor comment, otherwise I think this is good enough as is. Let's retarget to 22.02 as we discussed. Also, since #9516 is removing the old _from_columns
and I think @shwina was happy to merge that pretty much as is, let's wait until that gets merged so that you don't have to worry about the methods conflicting.
Also @isVoid you'll need to fix style :) |
rerun tests |
@gpucibot merge |
This PR brings changes from #9558 to `apply_boolean_mask` and removes the `as_frame` -> `as_column` round trip. Benchmark the column method: ``` ------------------------------------- benchmark 'col0': 2 tests ------------------------------------- Name (time in us) Min Max Mean ----------------------------------------------------------------------------------------------------- column_apply_boolean_mask[col0] (afte) 87.0090 (1.0) 132.8980 (1.0) 95.8815 (1.0) column_apply_boolean_mask[col0] (befo) 210.4580 (2.42) 307.8270 (2.32) 225.4821 (2.35) ----------------------------------------------------------------------------------------------------- ------------------------------------- benchmark 'col1': 2 tests ------------------------------------- Name (time in us) Min Max Mean ----------------------------------------------------------------------------------------------------- column_apply_boolean_mask[col1] (afte) 74.2240 (1.0) 110.0600 (1.0) 75.6356 (1.0) column_apply_boolean_mask[col1] (befo) 172.5240 (2.32) 278.5250 (2.53) 176.5672 (2.33) ----------------------------------------------------------------------------------------------------- ------------------------------------- benchmark 'col2': 2 tests ------------------------------------- Name (time in us) Min Max Mean ----------------------------------------------------------------------------------------------------- column_apply_boolean_mask[col2] (afte) 101.5740 (1.0) 141.8850 (1.0) 110.2334 (1.0) column_apply_boolean_mask[col2] (befo) 234.1140 (2.30) 312.7140 (2.20) 245.5453 (2.23) ----------------------------------------------------------------------------------------------------- ------------------------------------- benchmark 'col3': 2 tests ------------------------------------- Name (time in us) Min Max Mean ----------------------------------------------------------------------------------------------------- column_apply_boolean_mask[col3] (afte) 88.7710 (1.0) 142.7500 (1.0) 90.5082 (1.0) column_apply_boolean_mask[col3] (befo) 195.0980 (2.20) 303.1020 (2.12) 199.8368 (2.21) ----------------------------------------------------------------------------------------------------- ``` Dataframe benchmark ``` ----------------------------------- benchmark '100': 2 tests ----------------------------------- Name (time in us) Min Max Mean ------------------------------------------------------------------------------------------------ df_apply_boolean_mask[100] (afte) 380.6770 (1.05) 654.7080 (1.18) 389.3374 (1.03) df_apply_boolean_mask[100] (befo) 362.3220 (1.0) 554.6130 (1.0) 378.7087 (1.0) ------------------------------------------------------------------------------------------------ ----------------------------------- benchmark '10000': 2 tests ----------------------------------- Name (time in us) Min Max Mean -------------------------------------------------------------------------------------------------- df_apply_boolean_mask[10000] (afte) 399.5240 (1.05) 461.6310 (1.0) 405.1225 (1.04) df_apply_boolean_mask[10000] (befo) 379.4080 (1.0) 564.5770 (1.22) 389.6990 (1.0) -------------------------------------------------------------------------------------------------- ``` Authors: - Michael Wang (https://github.com/isVoid) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Vyas Ramasubramani (https://github.com/vyasr) URL: #9832
…d create their `_base_index` counterparts (#9807) This PR is a follow up of #9558 (Part 1 of 3) One remaining problem from #9558 is that `Frame` is index agnostic, however the above functions needs to perform index-aware operations when building the list of columns to pass to libcudf. For example, to remove duplicates of `BaseIndex`, it should only construct the list with all its columns. But in a dataframe, it would need to pass in all data columns plus the index columns, while specifying the indices of the data columns to consider duplicates. This complicates for `_gather` which supports `keep_index` argument. This PR moves aforementioned functions to `IndexedFrames`, and create its counterparts in `_base_index`. A couple noteworthy changes: - Merge object added with two new arguments `l(r)hs_is_index` - DataFrame/Series.take `keep_index` argument is removed. For internal usage it's more advised to use `_gather`. (And thus this PR is labeled breaking) Authors: - Michael Wang (https://github.com/isVoid) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) - https://github.com/brandon-b-miller URL: #9807
Follow up to #9558 On a return trip from libcudf, it is a common pattern for cudf frame to apply its own metadata to the columns. This PR generalizes this procedure as a new factory function `_from_colums_like_self` Authors: - Michael Wang (https://github.com/isVoid) Approvers: - Ram (Ramakrishna Prabhu) (https://github.com/rgsl888prabhu) - Paul Taylor (https://github.com/trxcllnt) - Vyas Ramasubramani (https://github.com/vyasr) URL: #10022
Currently, there are several APIs that accepts a
Frame
object as input, in corresponding to their libcudf counterparts that accepts atable_view
. To make some also work for columns, currently we pass them throughas_frame
and return with_as_column
. This PR changes the cython API to accept a list of columns and greatly reduces the overhead of column roundtrip (see benchmark for column APIs below).Starting as a pilot study of standardizing cython calling convention for table APIs, some decisions were made in this PR:
list
as the container for the collection of the columns. Ideally, an iterable is most pythonic, but may lose some type safety.Gather/Take Benchmark
Dropna Benchmark
Unique/Drop_duplicate Benchmark