Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use List of Columns as Input for drop_nulls, gather and drop_duplicates #9558

Merged
merged 46 commits into from
Nov 19, 2021

Conversation

isVoid
Copy link
Contributor

@isVoid isVoid commented Oct 29, 2021

Currently, there are several APIs that accepts a Frame object as input, in corresponding to their libcudf counterparts that accepts a table_view. To make some also work for columns, currently we pass them through as_frame and return with _as_column. This PR changes the cython API to accept a list of columns and greatly reduces the overhead of column roundtrip (see benchmark for column APIs below).

Starting as a pilot study of standardizing cython calling convention for table APIs, some decisions were made in this PR:

  1. Use list as the container for the collection of the columns. Ideally, an iterable is most pythonic, but may lose some type safety.
  2. The column collection is agnostic to index/data columns, libcudf handle index column separately either. This helps simplify cython logics.
Gather/Take Benchmark
----------------------------------- benchmark '100-random-False': 4 tests ------------------------------------
Name (time in us)                                      Min                   Max                Mean          
--------------------------------------------------------------------------------------------------------------
gather_single_column[100-random-False] (afte)     420.4372 (1.0)        552.7758 (1.0)      428.8227 (1.0)    
gather_single_column[100-random-False] (befo)     597.7047 (1.42)       811.8181 (1.47)     606.3709 (1.41)   
take_multiple_column[100-random-False] (afte)     849.6591 (2.02)     6,339.7521 (11.47)    870.1292 (2.03)   
take_multiple_column[100-random-False] (befo)     864.0001 (2.06)     1,091.5170 (1.97)     872.8270 (2.04)   
--------------------------------------------------------------------------------------------------------------

------------------------------------ benchmark '100-random-True': 4 tests -----------------------------------
Name (time in us)                                     Min                   Max                Mean          
-------------------------------------------------------------------------------------------------------------
gather_single_column[100-random-True] (afte)     141.4879 (1.0)      3,144.3723 (2.64)     145.7316 (1.0)    
gather_single_column[100-random-True] (befo)     291.5259 (2.06)     3,083.7669 (2.59)     299.2343 (2.05)   
take_multiple_column[100-random-True] (afte)     958.2350 (6.77)     1,295.6643 (1.09)     971.2230 (6.66)   
take_multiple_column[100-random-True] (befo)     967.4439 (6.84)     1,191.7809 (1.0)      976.4725 (6.70)   
-------------------------------------------------------------------------------------------------------------

------------------------------------ benchmark '100-reverse-False': 4 tests -----------------------------------
Name (time in us)                                       Min                   Max                Mean          
---------------------------------------------------------------------------------------------------------------
gather_single_column[100-reverse-False] (afte)     414.2257 (1.0)      6,856.2678 (2.05)     426.5804 (1.0)    
gather_single_column[100-reverse-False] (befo)     589.7889 (1.42)     3,387.3413 (1.01)     602.0794 (1.41)   
take_multiple_column[100-reverse-False] (afte)     849.6824 (2.05)     4,650.7069 (1.39)     862.7702 (2.02)   
take_multiple_column[100-reverse-False] (befo)     863.7700 (2.09)     3,348.6579 (1.0)      877.5145 (2.06)   
---------------------------------------------------------------------------------------------------------------

----------------------------------- benchmark '100-reverse-True': 4 tests ------------------------------------
Name (time in us)                                      Min                   Max                Mean          
--------------------------------------------------------------------------------------------------------------
gather_single_column[100-reverse-True] (afte)     141.5601 (1.0)        292.0129 (1.0)      144.5997 (1.0)    
gather_single_column[100-reverse-True] (befo)     286.7738 (2.03)     4,374.5530 (14.98)    297.3910 (2.06)   
take_multiple_column[100-reverse-True] (afte)     960.0958 (6.78)     1,354.3908 (4.64)     973.7589 (6.73)   
take_multiple_column[100-reverse-True] (befo)     963.5990 (6.81)     1,175.8050 (4.03)     975.9332 (6.75)   
--------------------------------------------------------------------------------------------------------------

----------------------------------- benchmark '100-sequence-False': 4 tests ------------------------------------
Name (time in us)                                        Min                   Max                Mean          
----------------------------------------------------------------------------------------------------------------
gather_single_column[100-sequence-False] (afte)     418.4479 (1.0)      4,602.9259 (2.09)     436.3953 (1.0)    
gather_single_column[100-sequence-False] (befo)     589.5318 (1.41)     4,665.3422 (2.12)     605.6177 (1.39)   
take_multiple_column[100-sequence-False] (afte)     851.3979 (2.03)     5,037.6062 (2.29)     866.8329 (1.99)   
take_multiple_column[100-sequence-False] (befo)     858.9821 (2.05)     2,197.5730 (1.0)      872.5517 (2.00)   
----------------------------------------------------------------------------------------------------------------

------------------------------------ benchmark '100-sequence-True': 4 tests -----------------------------------
Name (time in us)                                       Min                   Max                Mean          
---------------------------------------------------------------------------------------------------------------
gather_single_column[100-sequence-True] (afte)     145.0991 (1.0)        229.3726 (1.0)      148.7882 (1.0)    
gather_single_column[100-sequence-True] (befo)     289.9761 (2.00)       363.9143 (1.59)     295.9855 (1.99)   
take_multiple_column[100-sequence-True] (afte)     961.4970 (6.63)     1,028.0283 (4.48)     969.3146 (6.51)   
take_multiple_column[100-sequence-True] (befo)     962.7347 (6.64)     1,048.2450 (4.57)     973.8807 (6.55)   
---------------------------------------------------------------------------------------------------------------

----------------------------------- benchmark '10000-random-False': 4 tests ------------------------------------
Name (time in us)                                        Min                   Max                Mean          
----------------------------------------------------------------------------------------------------------------
gather_single_column[10000-random-False] (afte)     419.3909 (1.0)        669.2931 (1.0)      427.0140 (1.0)    
gather_single_column[10000-random-False] (befo)     600.0311 (1.43)     2,198.0200 (3.28)     610.3418 (1.43)   
take_multiple_column[10000-random-False] (afte)     862.4257 (2.06)     4,764.4433 (7.12)     880.1974 (2.06)   
take_multiple_column[10000-random-False] (befo)     873.0851 (2.08)     1,024.1494 (1.53)     881.4482 (2.06)   
----------------------------------------------------------------------------------------------------------------

------------------------------------ benchmark '10000-random-True': 4 tests -----------------------------------
Name (time in us)                                       Min                   Max                Mean          
---------------------------------------------------------------------------------------------------------------
gather_single_column[10000-random-True] (afte)     134.2846 (1.0)      4,995.3298 (12.11)    139.0623 (1.0)    
gather_single_column[10000-random-True] (befo)     284.2899 (2.12)       412.4213 (1.0)      289.8005 (2.08)   
take_multiple_column[10000-random-True] (afte)     960.2159 (7.15)     1,361.8441 (3.30)     973.4057 (7.00)   
take_multiple_column[10000-random-True] (befo)     965.8998 (7.19)     1,140.6899 (2.77)     976.9224 (7.03)   
---------------------------------------------------------------------------------------------------------------

------------------------------------ benchmark '10000-reverse-False': 4 tests -----------------------------------
Name (time in us)                                         Min                   Max                Mean          
-----------------------------------------------------------------------------------------------------------------
gather_single_column[10000-reverse-False] (afte)     419.7811 (1.0)        634.7937 (1.0)      428.2997 (1.0)    
gather_single_column[10000-reverse-False] (befo)     600.3999 (1.43)       762.5762 (1.20)     608.6369 (1.42)   
take_multiple_column[10000-reverse-False] (afte)     856.1970 (2.04)     1,138.3081 (1.79)     870.1638 (2.03)   
take_multiple_column[10000-reverse-False] (befo)     869.8748 (2.07)     3,184.0033 (5.02)     889.7182 (2.08)   
-----------------------------------------------------------------------------------------------------------------

----------------------------------- benchmark '10000-reverse-True': 4 tests ------------------------------------
Name (time in us)                                        Min                   Max                Mean          
----------------------------------------------------------------------------------------------------------------
gather_single_column[10000-reverse-True] (afte)     135.4842 (1.0)      3,634.2950 (7.81)     140.8658 (1.0)    
gather_single_column[10000-reverse-True] (befo)     284.9372 (2.10)       465.4219 (1.0)      292.6105 (2.08)   
take_multiple_column[10000-reverse-True] (afte)     957.0192 (7.06)     1,240.3540 (2.67)     966.7779 (6.86)   
take_multiple_column[10000-reverse-True] (befo)     967.6940 (7.14)     1,062.0849 (2.28)     975.9307 (6.93)   
----------------------------------------------------------------------------------------------------------------

----------------------------------- benchmark '10000-sequence-False': 4 tests ------------------------------------
Name (time in us)                                          Min                   Max                Mean          
------------------------------------------------------------------------------------------------------------------
gather_single_column[10000-sequence-False] (afte)     420.3622 (1.0)        555.1544 (1.0)      427.4441 (1.0)    
gather_single_column[10000-sequence-False] (befo)     601.7918 (1.43)     3,534.9689 (6.37)     613.6190 (1.44)   
take_multiple_column[10000-sequence-False] (afte)     858.0340 (2.04)     1,166.5919 (2.10)     868.6121 (2.03)   
take_multiple_column[10000-sequence-False] (befo)     871.3542 (2.07)     1,118.0961 (2.01)     881.9761 (2.06)   
------------------------------------------------------------------------------------------------------------------

------------------------------------ benchmark '10000-sequence-True': 4 tests -----------------------------------
Name (time in us)                                         Min                   Max                Mean          
-----------------------------------------------------------------------------------------------------------------
gather_single_column[10000-sequence-True] (afte)     135.8581 (1.0)      3,894.4702 (3.55)     141.3496 (1.0)    
gather_single_column[10000-sequence-True] (befo)     284.5018 (2.09)     2,703.6560 (2.47)     290.8583 (2.06)   
take_multiple_column[10000-sequence-True] (afte)     957.4448 (7.05)     1,096.1141 (1.0)      966.4487 (6.84)   
take_multiple_column[10000-sequence-True] (befo)     966.2341 (7.11)     1,242.0323 (1.13)     978.3753 (6.92)   
-----------------------------------------------------------------------------------------------------------------
Dropna Benchmark
------------------------------------ benchmark '100-False': 2 tests -----------------------------------
Name (time in us)                               Min                   Max                Mean          
-------------------------------------------------------------------------------------------------------
dropna_single_column[100-False] (afte)     143.9294 (1.0)      6,808.9343 (1.58)     150.8468 (1.0)    
dropna_single_column[100-False] (befo)     306.3441 (2.13)     4,297.9000 (1.0)      315.3899 (2.09)   
-------------------------------------------------------------------------------------------------------

---------------------------------- benchmark '100-True': 2 tests -----------------------------------
Name (time in us)                              Min                 Max                Mean          
----------------------------------------------------------------------------------------------------
dropna_single_column[100-True] (afte)     275.7823 (1.0)      327.2779 (1.0)      279.8443 (1.0)    
dropna_single_column[100-True] (befo)     548.6836 (1.99)     692.2791 (2.12)     557.9867 (1.99)   
----------------------------------------------------------------------------------------------------

------------------------------------ benchmark '10000-False': 2 tests -----------------------------------
Name (time in us)                                 Min                   Max                Mean          
---------------------------------------------------------------------------------------------------------
dropna_single_column[10000-False] (afte)     164.9209 (1.0)      5,742.9820 (1.61)     170.0143 (1.0)    
dropna_single_column[10000-False] (befo)     328.6479 (1.99)     3,565.7589 (1.0)      336.6208 (1.98)   
---------------------------------------------------------------------------------------------------------

----------------------------------- benchmark '10000-True': 2 tests ------------------------------------
Name (time in us)                                Min                   Max                Mean          
--------------------------------------------------------------------------------------------------------
dropna_single_column[10000-True] (afte)     304.6701 (1.0)        441.9931 (1.0)      309.9858 (1.0)    
dropna_single_column[10000-True] (befo)     571.9690 (1.88)     5,526.0560 (12.50)    586.4943 (1.89)   
--------------------------------------------------------------------------------------------------------
Unique/Drop_duplicate Benchmark
------------------------------------ benchmark '100': 4 tests -----------------------------------
Name (time in us)                         Min                   Max                Mean          
-------------------------------------------------------------------------------------------------
drop_duplicate_df[100] (afte)        891.9560 (2.77)     1,151.0071 (2.76)     904.5752 (2.74)   
drop_duplicate_df[100] (befo)        880.9832 (2.74)     5,528.1101 (13.23)    896.1535 (2.72)   
unique_single_column[100] (afte)     322.0579 (1.0)        417.7210 (1.0)      329.5932 (1.0)    
unique_single_column[100] (befo)     480.7310 (1.49)     4,470.7772 (10.70)    491.7183 (1.49)   
-------------------------------------------------------------------------------------------------

-------------------------------- benchmark '10000': 4 tests -------------------------------
Name (time in ms)                         Min               Max              Mean          
-------------------------------------------------------------------------------------------
drop_duplicate_df[10000] (afte)        1.0108 (2.23)     3.9981 (4.72)     1.0280 (2.17)   
drop_duplicate_df[10000] (befo)        1.0021 (2.21)     3.5031 (4.14)     1.0177 (2.15)   
unique_single_column[10000] (afte)     0.4534 (1.0)      4.5188 (5.33)     0.4740 (1.0)    
unique_single_column[10000] (befo)     0.6095 (1.34)     0.8471 (1.0)      0.6332 (1.34)   
-------------------------------------------------------------------------------------------

@isVoid isVoid requested a review from a team as a code owner October 29, 2021 00:29
@github-actions github-actions bot added the Python Affects Python cuDF API. label Oct 29, 2021
@isVoid isVoid added improvement Improvement / enhancement to an existing function Performance Performance related issue non-breaking Non-breaking change labels Oct 29, 2021
@codecov
Copy link

codecov bot commented Oct 29, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.02@31f92d7). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##             branch-22.02    #9558   +/-   ##
===============================================
  Coverage                ?   10.47%           
===============================================
  Files                   ?      119           
  Lines                   ?    20335           
  Branches                ?        0           
===============================================
  Hits                    ?     2130           
  Misses                  ?    18205           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 31f92d7...c047f60. Read the comment docs.

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this is a nice improvement in performance. I have some suggestions for improving implementation and developer documentation, since some methods' assumptions are not very clear to me.

python/cudf/cudf/_lib/copying.pyx Outdated Show resolved Hide resolved
python/cudf/cudf/_lib/stream_compaction.pyx Outdated Show resolved Hide resolved
python/cudf/cudf/_lib/utils.pxd Outdated Show resolved Hide resolved
python/cudf/cudf/_lib/utils.pyx Outdated Show resolved Hide resolved
python/cudf/cudf/_lib/utils.pyx Outdated Show resolved Hide resolved
python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/indexed_frame.py Show resolved Hide resolved
python/cudf/cudf/core/indexed_frame.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/multiindex.py Outdated Show resolved Hide resolved
python/cudf/cudf/utils/utils.py Outdated Show resolved Hide resolved
@isVoid
Copy link
Contributor Author

isVoid commented Nov 9, 2021

I have added a few TODO in this PR to address open questions spawned from this PR:
#9558 (comment)
#9558 (comment)
#9558 (comment)

It should be ready for another round of review.

@isVoid isVoid requested review from vyasr and bdice November 9, 2021 20:33
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Just a couple minor suggestions, otherwise LGTM 😄

python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/indexed_frame.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/indexed_frame.py Outdated Show resolved Hide resolved
python/cudf/cudf/utils/utils.py Outdated Show resolved Hide resolved
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor comment, otherwise I think this is good enough as is. Let's retarget to 22.02 as we discussed. Also, since #9516 is removing the old _from_columns and I think @shwina was happy to merge that pretty much as is, let's wait until that gets merged so that you don't have to worry about the methods conflicting.

python/cudf/cudf/core/column/column.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved
@vyasr
Copy link
Contributor

vyasr commented Nov 12, 2021

Also @isVoid you'll need to fix style :)

@isVoid isVoid changed the base branch from branch-21.12 to branch-22.02 November 15, 2021 17:32
@vyasr
Copy link
Contributor

vyasr commented Nov 19, 2021

rerun tests

@vyasr
Copy link
Contributor

vyasr commented Nov 19, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 05dd541 into rapidsai:branch-22.02 Nov 19, 2021
rapids-bot bot pushed a commit that referenced this pull request Jan 11, 2022
This PR brings changes from #9558 to `apply_boolean_mask` and removes the `as_frame` -> `as_column` round trip. Benchmark the column method:

```
------------------------------------- benchmark 'col0': 2 tests -------------------------------------
Name (time in us)                               Min                 Max                Mean          
-----------------------------------------------------------------------------------------------------
column_apply_boolean_mask[col0] (afte)      87.0090 (1.0)      132.8980 (1.0)       95.8815 (1.0)    
column_apply_boolean_mask[col0] (befo)     210.4580 (2.42)     307.8270 (2.32)     225.4821 (2.35)   
-----------------------------------------------------------------------------------------------------

------------------------------------- benchmark 'col1': 2 tests -------------------------------------
Name (time in us)                               Min                 Max                Mean          
-----------------------------------------------------------------------------------------------------
column_apply_boolean_mask[col1] (afte)      74.2240 (1.0)      110.0600 (1.0)       75.6356 (1.0)    
column_apply_boolean_mask[col1] (befo)     172.5240 (2.32)     278.5250 (2.53)     176.5672 (2.33)   
-----------------------------------------------------------------------------------------------------

------------------------------------- benchmark 'col2': 2 tests -------------------------------------
Name (time in us)                               Min                 Max                Mean          
-----------------------------------------------------------------------------------------------------
column_apply_boolean_mask[col2] (afte)     101.5740 (1.0)      141.8850 (1.0)      110.2334 (1.0)    
column_apply_boolean_mask[col2] (befo)     234.1140 (2.30)     312.7140 (2.20)     245.5453 (2.23)   
-----------------------------------------------------------------------------------------------------

------------------------------------- benchmark 'col3': 2 tests -------------------------------------
Name (time in us)                               Min                 Max                Mean          
-----------------------------------------------------------------------------------------------------
column_apply_boolean_mask[col3] (afte)      88.7710 (1.0)      142.7500 (1.0)       90.5082 (1.0)    
column_apply_boolean_mask[col3] (befo)     195.0980 (2.20)     303.1020 (2.12)     199.8368 (2.21)   
-----------------------------------------------------------------------------------------------------
```

Dataframe benchmark
```
----------------------------------- benchmark '100': 2 tests -----------------------------------
Name (time in us)                          Min                 Max                Mean          
------------------------------------------------------------------------------------------------
df_apply_boolean_mask[100] (afte)     380.6770 (1.05)     654.7080 (1.18)     389.3374 (1.03)   
df_apply_boolean_mask[100] (befo)     362.3220 (1.0)      554.6130 (1.0)      378.7087 (1.0)    
------------------------------------------------------------------------------------------------

----------------------------------- benchmark '10000': 2 tests -----------------------------------
Name (time in us)                            Min                 Max                Mean          
--------------------------------------------------------------------------------------------------
df_apply_boolean_mask[10000] (afte)     399.5240 (1.05)     461.6310 (1.0)      405.1225 (1.04)   
df_apply_boolean_mask[10000] (befo)     379.4080 (1.0)      564.5770 (1.22)     389.6990 (1.0)    
--------------------------------------------------------------------------------------------------
```

Authors:
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #9832
rapids-bot bot pushed a commit that referenced this pull request Jan 19, 2022
…d create their `_base_index` counterparts (#9807)

This PR is a follow up of #9558 (Part 1 of 3)

One remaining problem from #9558 is that `Frame` is index agnostic, however the above functions needs to perform index-aware operations when building the list of columns to pass to libcudf. For example, to remove duplicates of `BaseIndex`, it should only construct the list with all its columns. But in a dataframe, it would need to pass in all data columns plus the index columns, while specifying the indices of the data columns to consider duplicates. This complicates for `_gather` which supports `keep_index` argument. This PR moves aforementioned functions to `IndexedFrames`, and create its counterparts in `_base_index`.

A couple noteworthy changes:
- Merge object added with two new arguments `l(r)hs_is_index`
- DataFrame/Series.take `keep_index` argument is removed. For internal usage it's more advised to use `_gather`. (And thus this PR is labeled breaking)

Authors:
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - https://github.com/brandon-b-miller

URL: #9807
rapids-bot bot pushed a commit that referenced this pull request Jan 20, 2022
Follow up to #9558

On a return trip from libcudf, it is a common pattern for cudf frame to apply its own metadata to the columns. This PR generalizes this procedure as a new factory function `_from_colums_like_self`

Authors:
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - Ram (Ramakrishna Prabhu) (https://github.com/rgsl888prabhu)
  - Paul Taylor (https://github.com/trxcllnt)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #10022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Performance Performance related issue Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants