Skip to content

Commit

Permalink
Reduce cudf library size (#7583)
Browse files Browse the repository at this point in the history
This PR combines two major changes that improve cudf's final binary size.

The optimizations are:

- Explicitly telling fatbin to always compress device code
- Not generating asserts with `__FILE__`, `__LINE__`, etc when in release mode.

With this PR we take cudf when built for `ALL` archs from  `1.3GB` to `329MB`.

Since this has such a dramatic change, I did some analysis on library sizes, compile times, runtime startup times, and
runtime performance.

### Library sizes
| variant | release 1 arch | release all archs |
| -- | -- | -- |
| branch-0.19 				| 335MB          | 1.3GB             |
| compression				| 133MB          | 476MB 			 |
| compression + no_assert	| 110MB          | 329MB 			 |
| rdc + comp + no_assert    | 150MB          | Not done 		 |

We see that compression and the removal of the asserts while staying with whole compilation ( no `-rdc` ) is best for performance. RDC taking more space makes sense to me as it has a less agressive optimizer and therefore generates more
sass/ptx.


### Compile times

| variant | release 1 arch | release all archs |
| -- | -- | -- |
| branch-0.19 				| 15m50.411s     | 44m12.849s        |
| compression				| 15m14.021s     | 44m11.658s 		 |
| compression + no_assert	| 12m53.049s     | 33m59.388s		 |
| rdc + comp + no_assert    | 12m58.550s     | Not done			 |

Compile times are interesting as we see that compression has no negative effect. I expect that the reduced IO ( from smaller binaries ) offsets the time spent compressing.


## Runtime starup times for `all archs`:
| variant | branch-0.19	| compression + no_assert	|
| -- | -- | -- |
| ctest -j1 				| 4m0.326s 		| 3m57.500s                 |
| DISPATCHER_TEST * 12      | 0m3.441s  	| 0m3.520s 					|


The goal here was to time the tests to see if we had any measurable differences.
The `DISPATCHER_TEST * 12` was selected as the runtime `~0.3s` would hopefully highligh any performance differences, and therefore running it 12 times would
magnify the difference and wash out any system load impacts.

I think from these rather crude measurements we are safe to assume no massive runtime loading time costs.


## Runtime performance for `all archs`:

I used the REDUCTION_BENCH as my baseline for detecting any performance changes. The benchmarks executed on a V100 on a shared lab machine.

<details>
<summary>REDUCTION_BENCH ./compare benchmarks branch_0.19 this_compression_branch</summary>

```
Comparing ./reduction_0.19 to ./reduction_compressed
Benchmark                                                                Time             CPU      Time Old      Time New       CPU Old       CPU New
-----------------------------------------------------------------------------------------------------------------------------------------------------
Reduction/bool_all/10000/manual_time                                  +0.1038         +0.0785         16621         18346         37366         40299
Reduction/bool_all/100000/manual_time                                 +0.0969         +0.0779         17057         18710         37738         40676
Reduction/bool_all/1000000/manual_time                                +0.0709         +0.0694         18175         19463         38256         40913
Reduction/bool_all/10000000/manual_time                               +0.0472         +0.0479         35489         37165         53792         56368
Reduction/bool_all/100000000/manual_time                              +0.0026         +0.0120        145630        146014        164349        166325
Reduction/int8_t_all/10000/manual_time                                +0.1415         +0.1098         16286         18591         36729         40762
Reduction/int8_t_all/100000/manual_time                               +0.0457         +0.0500         17180         17966         37798         39688
Reduction/int8_t_all/1000000/manual_time                              +0.0656         +0.0565         18297         19498         38522         40700
Reduction/int8_t_all/10000000/manual_time                             +0.0322         +0.0345         36117         37279         54645         56528
Reduction/int8_t_all/100000000/manual_time                            +0.0041         +0.0132        146222        146819        165053        167232
Reduction/int32_t_all/10000/manual_time                               +0.0931         +0.0781         16849         18417         37471         40397
Reduction/int32_t_all/100000/manual_time                              +0.0841         +0.0775         17371         18831         37772         40701
Reduction/int32_t_all/1000000/manual_time                             +0.0660         +0.0709         22323         23797         41633         44584
Reduction/int32_t_all/10000000/manual_time                            +0.0171         +0.0324         68303         69470         87006         89822
Reduction/int32_t_all/100000000/manual_time                           +0.0028         +0.0065        472041        473370        490693        493869
Reduction/float_all/10000/manual_time                                 +0.0847         +0.0734         16951         18387         37708         40474
Reduction/float_all/100000/manual_time                                +0.0962         +0.0880         16939         18569         37229         40504
Reduction/float_all/1000000/manual_time                               +0.1879         +0.1679         22348         26547         41549         48526
Reduction/float_all/10000000/manual_time                              +0.0465         +0.0652         68851         72054         87438         93139
Reduction/float_all/100000000/manual_time                             +0.0081         +0.0126        472210        476037        491030        497241
Reduction/bool_any/10000/manual_time                                  +0.0764         +0.0743         16292         17537         36830         39568
Reduction/bool_any/100000/manual_time                                 +0.0568         +0.0591         16633         17578         37197         39394
Reduction/bool_any/1000000/manual_time                                +0.0771         +0.0593         18024         19414         38439         40720
Reduction/bool_any/10000000/manual_time                               +0.0388         +0.0452         34990         36347         53210         55614
Reduction/bool_any/100000000/manual_time                              +0.0012         +0.0096        145729        145903        164726        166315
Reduction/int8_t_any/10000/manual_time                                +0.1207         +0.0954         15686         17579         36213         39669
Reduction/int8_t_any/100000/manual_time                               +0.0546         +0.0580         16597         17503         37103         39254
Reduction/int8_t_any/1000000/manual_time                              +0.0801         +0.0669         17938         19374         38027         40572
Reduction/int8_t_any/10000000/manual_time                             +0.0492         +0.0541         34726         36434         52812         55671
Reduction/int8_t_any/100000000/manual_time                            +0.0028         +0.0119        145924        146340        164711        166670
Reduction/int32_t_any/10000/manual_time                               +0.0489         +0.0539         16805         17626         37587         39614
Reduction/int32_t_any/100000/manual_time                              +0.0824         +0.0717         16985         18384         37432         40114
Reduction/int32_t_any/1000000/manual_time                             +0.0588         +0.0590         22338         23652         41842         44311
Reduction/int32_t_any/10000000/manual_time                            +0.0203         +0.0339         67672         69045         86353         89280
Reduction/int32_t_any/100000000/manual_time                           +0.0023         +0.0054        471708        472770        490461        493130
Reduction/float_any/10000/manual_time                                 +0.0454         +0.0494         16730         17490         37526         39382
Reduction/float_any/100000/manual_time                                +0.0895         +0.0739         16872         18381         37349         40110
Reduction/float_any/1000000/manual_time                               +0.0554         +0.0527         22402         23643         42014         44228
Reduction/float_any/10000000/manual_time                              +0.0026         +0.0081         68877         69060         88443         89159
Reduction/float_any/100000000/manual_time                             +0.0022         +0.0049        471781        472801        490753        493170
ReductionDictionary/int32_t_all/10000/manual_time                     +0.0503         +0.0535         30648         32190         51412         54160
ReductionDictionary/int32_t_all/100000/manual_time                    +0.0696         +0.0642         31076         33240         51545         54855
ReductionDictionary/int32_t_all/1000000/manual_time                   +0.9359         +0.4786         38709         74935         58597         86644
ReductionDictionary/int32_t_all/10000000/manual_time                  +0.0289         +0.0625         95351         98111        113516        120611
ReductionDictionary/int32_t_all/100000000/manual_time                 -0.0963         -0.0910        645050        582915        663356        602971
ReductionDictionary/float_all/10000/manual_time                       +0.0476         +0.0504         30177         31615         50814         53374
ReductionDictionary/float_all/100000/manual_time                      -0.0305         -0.0166         34208         33166         55546         54626
ReductionDictionary/float_all/1000000/manual_time                     -0.0258         -0.0137         40372         39329         61080         60242
ReductionDictionary/float_all/10000000/manual_time                    -0.0366         -0.0210         95681         92175        113937        111545
ReductionDictionary/float_all/100000000/manual_time                   -0.0784         -0.0733        644233        593744        662793        614231
ReductionDictionary/int32_t_any/10000/manual_time                     +0.0718         +0.0680         29882         32026         50723         54172
ReductionDictionary/int32_t_any/100000/manual_time                    +0.0776         +0.0704         30316         32670         50640         54203
ReductionDictionary/int32_t_any/1000000/manual_time                   +0.0291         +0.0372         37311         38397         57039         59159
ReductionDictionary/int32_t_any/10000000/manual_time                  -0.0497         -0.0285         95928         91163        113924        110675
ReductionDictionary/int32_t_any/100000000/manual_time                 -0.1160         -0.1109        655803        579715        673920        599181
ReductionDictionary/float_any/10000/manual_time                       +0.0561         +0.0563         29905         31583         50566         53413
ReductionDictionary/float_any/100000/manual_time                      +0.0622         +0.0581         30431         32322         50788         53740
ReductionDictionary/float_any/1000000/manual_time                     +0.0403         +0.0473         36955         38444         56562         59237
ReductionDictionary/float_any/10000000/manual_time                    -0.0626         -0.0421         96949         90881        115099        110252
ReductionDictionary/float_any/100000000/manual_time                   -0.1203         -0.1154        661222        581658        679363        600965
ReductionDictionary/int32_t_min/10000/manual_time                     +0.0539         +0.0525         45796         48265         66802         70306
ReductionDictionary/int32_t_min/100000/manual_time                    +0.0955         +0.0868         46170         50579         66882         72687
ReductionDictionary/int32_t_min/1000000/manual_time                   +0.0683         +0.0647         47495         50738         67291         71646
ReductionDictionary/int32_t_min/10000000/manual_time                  +0.0065         +0.0160         97469         98105        115997        117850
ReductionDictionary/int32_t_min/100000000/manual_time                 +0.0045         +0.0070        501086        503357        519638        523253
ReductionDictionary/float_min/10000/manual_time                       +0.0721         +0.0653         45551         48834         66485         70829
ReductionDictionary/float_min/100000/manual_time                      +0.0966         +0.0865         44953         49295         65457         71120
ReductionDictionary/float_min/1000000/manual_time                     +0.0618         +0.0582         47227         50145         67004         70900
ReductionDictionary/float_min/10000000/manual_time                    +0.0052         +0.0167         97166         97669        115559        117490
ReductionDictionary/float_min/100000000/manual_time                   +0.0060         +0.0083        501044        504029        519674        523988
ReductionDictionary/int32_t_max/10000/manual_time                     +0.0596         +0.0558         44987         47667         65991         69670
ReductionDictionary/int32_t_max/100000/manual_time                    +0.0735         +0.0625         44640         47921         65423         69515
ReductionDictionary/int32_t_max/1000000/manual_time                   +0.0610         +0.0569         46224         49042         65957         69707
ReductionDictionary/int32_t_max/10000000/manual_time                  -0.0100         -0.0009         97078         96105        115787        115685
ReductionDictionary/int32_t_max/100000000/manual_time                 +0.0021         +0.0041        500466        501504        518959        521082
ReductionDictionary/float_max/10000/manual_time                       +0.0815         +0.0716         44141         47738         65134         69798
ReductionDictionary/float_max/100000/manual_time                      +0.0481         +0.0466         46378         48609         67284         70422
ReductionDictionary/float_max/1000000/manual_time                     +0.0091         +0.0180         48140         48579         67837         69055
ReductionDictionary/float_max/10000000/manual_time                    -0.0000         +0.0063         96823         96820        115498        116228
ReductionDictionary/float_max/100000000/manual_time                   +0.0044         +0.0061        501050        503259        519938        523126
ReductionDictionary/int32_t_mean/10000/manual_time                    -0.0259         -0.0018         45053         43885         66162         66041
ReductionDictionary/int32_t_mean/100000/manual_time                   +0.0448         +0.0484         44776         46782         65662         68843
ReductionDictionary/int32_t_mean/1000000/manual_time                  +0.0134         +0.0263         52069         52766         72163         74061
ReductionDictionary/int32_t_mean/10000000/manual_time                 -0.1592         -0.1321        135248        113712        153825        133504
ReductionDictionary/int32_t_mean/100000000/manual_time                -0.2259         -0.2180        856793        663232        875288        684472
ReductionDictionary/float_mean/10000/manual_time                      +0.1263         +0.1185         44256         49847         65389         73139
ReductionDictionary/float_mean/100000/manual_time                     +0.1304         +0.1231         44038         49780         64743         72713
ReductionDictionary/float_mean/1000000/manual_time                    +0.0792         +0.0887         51678         55771         71756         78124
ReductionDictionary/float_mean/10000000/manual_time                   -0.1397         -0.1058        136928        117802        155376        138931
ReductionDictionary/float_mean/100000000/manual_time                  -0.2322         -0.2256        861014        661106        879508        681082
Reduction/bool_sum/10000/manual_time                                  +0.1048         +0.0881         16256         17960         36958         40216
Reduction/bool_sum/100000/manual_time                                 +0.0424         +0.0473         17106         17832         37986         39781
Reduction/bool_sum/1000000/manual_time                                +0.0837         +0.0620         17928         19428         38311         40688
Reduction/bool_sum/10000000/manual_time                               +0.0408         +0.0395         34927         36352         53394         55502
Reduction/bool_sum/100000000/manual_time                              +0.0030         +0.0102        145074        145516        164126        165798
Reduction/int8_t_sum/10000/manual_time                                +0.1105         +0.0832         15704         17438         36410         39440
Reduction/int8_t_sum/100000/manual_time                               +0.0694         +0.0629         16539         17688         37233         39575
Reduction/int8_t_sum/1000000/manual_time                              +0.0783         +0.0574         17880         19279         38316         40517
Reduction/int8_t_sum/10000000/manual_time                             +0.0416         +0.0406         34809         36256         53305         55467
Reduction/int8_t_sum/100000000/manual_time                            +0.0010         +0.0098        144214        144358        163105        164700
Reduction/int32_t_sum/10000/manual_time                               +0.0074         +0.0264         17459         17588         38551         39570
Reduction/int32_t_sum/100000/manual_time                              +0.0791         +0.0684         17086         18438         37705         40286
Reduction/int32_t_sum/1000000/manual_time                             -0.0064         -0.0027         23751         23599         44384         44263
Reduction/int32_t_sum/10000000/manual_time                            +0.0238         +0.0358         67606         69212         86499         89593
Reduction/int32_t_sum/100000000/manual_time                           +0.0002         -0.0043        472981        473093        495882        493743
Reduction/int64_t_sum/10000/manual_time                               +0.0340         +0.0383         17246         17833         38306         39772
Reduction/int64_t_sum/100000/manual_time                              +0.0755         +0.0719         17841         19188         38242         40990
Reduction/int64_t_sum/1000000/manual_time                             +0.0576         +0.0635         28702         30354         47263         50266
Reduction/int64_t_sum/10000000/manual_time                            +0.0119         +0.0141        112911        114253        132783        134658
Reduction/int64_t_sum/100000000/manual_time                           +0.0013         +0.0021        920797        921986        940549        942485
Reduction/float_sum/10000/manual_time                                 +0.0882         +0.0707         16479         17932         37421         40067
Reduction/float_sum/100000/manual_time                                +0.0903         +0.0774         16677         18183         37163         40041
Reduction/float_sum/1000000/manual_time                               +0.0571         +0.0622         22393         23673         41960         44569
Reduction/float_sum/10000000/manual_time                              +0.0179         +0.0345         67968         69184         86691         89678
Reduction/float_sum/100000000/manual_time                             +0.0015         -0.0015        472247        472973        494236        493474
Reduction/double_sum/10000/manual_time                                +0.0982         +0.0794         16036         17611         36657         39568
Reduction/double_sum/100000/manual_time                               +0.0153         +0.0340         18453         18735         38955         40281
Reduction/double_sum/1000000/manual_time                              +0.0596         +0.0636         28623         30328         47153         50154
Reduction/double_sum/10000000/manual_time                             +0.0089         +0.0130        113426        114441        132746        134476
Reduction/double_sum/100000000/manual_time                            +0.0026         +0.0044        920131        922540        939374        943549
Reduction/int32_t_product/10000/manual_time                           +0.1183         +0.0902         16570         18529         37443         40819
Reduction/int32_t_product/100000/manual_time                          +0.1097         +0.0905         17076         18948         37741         41158
Reduction/int32_t_product/1000000/manual_time                         +0.0655         +0.0622         22023         23465         41604         44194
Reduction/int32_t_product/10000000/manual_time                        +0.0237         +0.0374         67629         69233         86354         89583
Reduction/int32_t_product/100000000/manual_time                       +0.0003         +0.0024        473110        473265        492461        493666
Reduction/float_product/10000/manual_time                             +0.0839         +0.0615         16972         18395         38034         40374
Reduction/float_product/100000/manual_time                            +0.0787         +0.0718         17091         18436         37602         40302
Reduction/float_product/1000000/manual_time                           +0.0686         +0.0714         22282         23810         41856         44844
Reduction/float_product/10000000/manual_time                          +0.0135         +0.0191         68313         69233         87798         89474
Reduction/float_product/100000000/manual_time                         +0.0016         -0.0034        472681        473450        495621        493925
Reduction/int64_t_min/10000/manual_time                               +0.0885         +0.0697         16886         18380         37763         40393
Reduction/int64_t_min/100000/manual_time                              +0.0490         +0.0534         17988         18869         38269         40311
Reduction/int64_t_min/1000000/manual_time                             +0.0087         +0.0218         30355         30618         49407         50485
Reduction/int64_t_min/10000000/manual_time                            +0.0092         +0.0179        113906        114949        132977        135352
Reduction/int64_t_min/100000000/manual_time                           +0.0011         +0.0020        921089        922086        940543        942460
Reduction/double_min/10000/manual_time                                +0.0818         +0.0662         16848         18226         37674         40169
Reduction/double_min/100000/manual_time                               +0.0839         +0.0761         17470         18936         37537         40394
Reduction/double_min/1000000/manual_time                              +0.0582         +0.0629         28967         30653         47503         50490
Reduction/double_min/10000000/manual_time                             +0.0135         +0.0246        113413        114940        132250        135504
Reduction/double_min/100000000/manual_time                            +0.0017         +0.0027        920812        922344        940237        942769
Reduction/timestamp_ms_min/10000/manual_time                          +0.0910         +0.0673         16926         18466         37673         40210
Reduction/timestamp_ms_min/100000/manual_time                         +0.0830         +0.0738         17894         19379         38112         40925
Reduction/timestamp_ms_min/1000000/manual_time                        +0.0339         +0.0429         30113         31133         48915         51012
Reduction/timestamp_ms_min/10000000/manual_time                       +0.0109         +0.0189        114185        115433        133132        135650
Reduction/timestamp_ms_min/100000000/manual_time                      +0.0053         +0.0076        920924        925845        940096        947215
Reduction/int8_t_mean/10000/manual_time                               +0.0489         +0.0517         29025         30445         50078         52670
Reduction/int8_t_mean/100000/manual_time                              +0.0401         +0.0493         29218         30389         50050         52515
Reduction/int8_t_mean/1000000/manual_time                             -0.0215         -0.0036         33316         32600         54162         53965
Reduction/int8_t_mean/10000000/manual_time                            +0.0387         +0.0407         46726         48536         64936         67581
Reduction/int8_t_mean/100000000/manual_time                           +0.0007         +0.0058        161956        162077        180866        181918
Reduction/float_mean/10000/manual_time                                +0.0916         +0.0827         28959         31611         49748         53860
Reduction/float_mean/100000/manual_time                               +0.0369         +0.0485         30720         31854         51489         53988
Reduction/float_mean/1000000/manual_time                              +0.0218         +0.0310         36650         37448         56777         58539
Reduction/float_mean/10000000/manual_time                             +0.0266         +0.0402         81361         83528         99648        103653
Reduction/float_mean/100000000/manual_time                            -0.0036         -0.0003        489644        487869        508151        507999
Reduction/int32_t_variance/10000/manual_time                          +0.0337         +0.0343         29718         30718         50967         52713
Reduction/int32_t_variance/100000/manual_time                         -0.0506         -0.0189         33774         32065         55094         54054
Reduction/int32_t_variance/1000000/manual_time                        +0.0364         +0.0316         36062         37374         56511         58297
Reduction/int32_t_variance/10000000/manual_time                       +0.0107         +0.0197         83950         84849        102507        104529
Reduction/int32_t_variance/100000000/manual_time                      -0.0003         +0.0018        491601        491463        510405        511323
Reduction/double_variance/10000/manual_time                           +0.0536         +0.0457         29157         30721         50266         52565
Reduction/double_variance/100000/manual_time                          +0.0367         +0.0415         30552         31674         51014         53130
Reduction/double_variance/1000000/manual_time                         +0.0238         +0.0334         43737         44780         62726         64821
Reduction/double_variance/10000000/manual_time                        +0.0046         +0.0084        129382        129980        148232        149476
Reduction/double_variance/100000000/manual_time                       +0.0027         +0.0042        934085        936595        952680        956708
Reduction/int64_t_std/10000/manual_time                               +0.0638         +0.0592         28956         30803         49847         52798
Reduction/int64_t_std/100000/manual_time                              -0.0742         -0.0356         34976         32380         56179         54180
Reduction/int64_t_std/1000000/manual_time                             -0.1048         -0.0836         49651         44446         70034         64177
Reduction/int64_t_std/10000000/manual_time                            +0.0077         +0.0118        129373        130373        148033        149784
Reduction/int64_t_std/100000000/manual_time                           +0.0025         +0.0037        934055        936408        952668        956168
Reduction/float_std/10000/manual_time                                 +0.0754         +0.0699         28573         30728         49316         52764
Reduction/float_std/100000/manual_time                                +0.0945         +0.0846         28682         31394         49008         53155
Reduction/float_std/1000000/manual_time                               +0.0356         +0.0437         36195         37485         56074         58526
Reduction/float_std/10000000/manual_time                              +0.0175         +0.0292         83490         84951        101651        104623
Reduction/float_std/100000000/manual_time                             +0.0018         +0.0049        490773        491670        509017        511496
ReductionScan/int8_no_nulls/10000/manual_time                         +0.0523         +0.0511         15603         16420         36157         38005
ReductionScan/int8_no_nulls/100000/manual_time                        +0.0181         +0.0295         17358         17672         37950         39070
ReductionScan/int8_no_nulls/1000000/manual_time                       +0.0251         +0.0390         26876         27550         45671         47451
ReductionScan/int8_no_nulls/10000000/manual_time                      +0.0166         +0.0278         89319         90800        109601        112649
ReductionScan/int8_no_nulls/100000000/manual_time                     +0.0007         +0.0019        735210        735689        756198        757634
ReductionScan/int32_no_nulls/10000/manual_time                        +0.0515         +0.0559         15234         16019         35716         37712
ReductionScan/int32_no_nulls/100000/manual_time                       +0.0593         +0.0583         16808         17805         36980         39134
ReductionScan/int32_no_nulls/1000000/manual_time                      +0.0390         +0.0469         26405         27436         44665         46759
ReductionScan/int32_no_nulls/10000000/manual_time                     +0.0103         +0.0196        114948        116137        135580        138232
ReductionScan/int32_no_nulls/100000000/manual_time                    +0.0032         +0.0056       1016197       1019401       1037281       1043090
ReductionScan/uint64_no_nulls/10000/manual_time                       +0.0162         +0.0308         16222         16485         36939         38077
ReductionScan/uint64_no_nulls/100000/manual_time                      +0.0588         +0.0685         18751         19852         38174         40789
ReductionScan/uint64_no_nulls/1000000/manual_time                     +0.0444         +0.0604         36339         37953         56819         60253
ReductionScan/uint64_no_nulls/10000000/manual_time                    -0.0072         +0.0016        217943        216376        238710        239081
ReductionScan/uint64_no_nulls/100000000/manual_time                   -0.0009         -0.0007       2001645       1999848       2023397       2021947
ReductionScan/float_no_nulls/10000/manual_time                        +0.0298         +0.0399         15775         16246         36567         38028
ReductionScan/float_no_nulls/100000/manual_time                       +0.0534         +0.0505         16888         17789         37162         39037
ReductionScan/float_no_nulls/1000000/manual_time                      +0.0339         +0.0416         26655         27558         45083         46960
ReductionScan/float_no_nulls/10000000/manual_time                     +0.0071         +0.0155        115565        116383        136602        138716
ReductionScan/float_no_nulls/100000000/manual_time                    -0.0019         -0.0006       1019440       1017475       1040591       1040016
ReductionScan/int16_nulls/10000/manual_time                           +0.0510         +0.0442         35588         37402         57805         60363
ReductionScan/int16_nulls/100000/manual_time                          +0.0442         +0.0450         37789         39458         59383         62055
ReductionScan/int16_nulls/1000000/manual_time                         -0.0270         -0.0174         48836         47519         69731         68521
ReductionScan/int16_nulls/10000000/manual_time                        -0.0046         +0.0034        122849        122281        145028        145520
ReductionScan/int16_nulls/100000000/manual_time                       -0.0025         -0.0107        864450        862291        892772        883246
ReductionScan/uint32_nulls/10000/manual_time                          -0.0178         -0.0017         37685         37016         59988         59883
ReductionScan/uint32_nulls/100000/manual_time                         +0.0478         +0.0494         37349         39134         58805         61708
ReductionScan/uint32_nulls/1000000/manual_time                        +0.0085         +0.0130         49350         49771         69758         70663
ReductionScan/uint32_nulls/10000000/manual_time                       +0.0013         +0.0086        146704        146896        169127        170574
ReductionScan/uint32_nulls/100000000/manual_time                      +0.0085         +0.0111       1112967       1122427       1133374       1145988
ReductionScan/double_nulls/10000/manual_time                          +0.0623         +0.0611         35538         37753         57332         60833
ReductionScan/double_nulls/100000/manual_time                         +0.0602         +0.0635         38508         40827         59025         62775
ReductionScan/double_nulls/1000000/manual_time                        +0.0202         +0.0275         55522         56645         77436         79567
ReductionScan/double_nulls/10000000/manual_time                       +0.0084         +0.0118        243996        246039        265992        269142
ReductionScan/double_nulls/100000000/manual_time                      -0.0013         -0.0008       2120668       2117888       2140996       2139376
Reduction/bool_minmax/10000/manual_time                               -0.1377         -0.0650         22410         19325         43937         41081
Reduction/bool_minmax/100000/manual_time                              +0.0523         +0.0537         18947         19937         39665         41795
Reduction/bool_minmax/1000000/manual_time                             +0.0734         +0.0617         19146         20551         39413         41846
Reduction/bool_minmax/10000000/manual_time                            +0.0358         +0.0438         37836         39189         56031         58483
Reduction/bool_minmax/100000000/manual_time                           +0.0000         -0.0063        149637        149641        170798        169728
Reduction/int8_t_minmax/10000/manual_time                             +0.0361         +0.0438         18831         19512         39671         41408
Reduction/int8_t_minmax/100000/manual_time                            +0.0629         +0.0571         18429         19588         39093         41327
Reduction/int8_t_minmax/1000000/manual_time                           +0.0874         +0.0696         18956         20613         39054         41772
Reduction/int8_t_minmax/10000000/manual_time                          +0.0369         +0.0413         37781         39175         55998         58312
Reduction/int8_t_minmax/100000000/manual_time                         -0.0002         +0.0059        149816        149783        168898        169894
Reduction/int32_t_minmax/10000/manual_time                            +0.0813         +0.0613         18423         19921         39271         41676
Reduction/int32_t_minmax/100000/manual_time                           +0.0848         +0.0668         18560         20133         39161         41779
Reduction/int32_t_minmax/1000000/manual_time                          +0.0630         +0.0610         23265         24730         42732         45337
Reduction/int32_t_minmax/10000000/manual_time                         +0.0107         +0.0196         69505         70248         88260         89989
Reduction/int32_t_minmax/100000000/manual_time                        -0.0004         +0.0009        474594        474410        493944        494395
Reduction/timestamp_ms_minmax/10000/manual_time                       +0.0716         +0.0605         19241         20618         40084         42509
Reduction/timestamp_ms_minmax/100000/manual_time                      +0.0418         +0.0513         19638         20459         39825         41867
Reduction/timestamp_ms_minmax/1000000/manual_time                     +0.0559         +0.0635         30346         32042         48851         51954
Reduction/timestamp_ms_minmax/10000000/manual_time                    +0.0030         +0.0025        117085        117437        136688        137033
Reduction/timestamp_ms_minmax/100000000/manual_time                   +0.0010         -0.0037        921557        922453        945924        942449
Reduction/float_minmax/10000/manual_time                              +0.0650         +0.0521         18507         19710         39467         41526
Reduction/float_minmax/100000/manual_time                             +0.0740         +0.0619         18504         19874         39105         41525
Reduction/float_minmax/1000000/manual_time                            +0.0499         +0.0373         23385         24552         43561         45185
Reduction/float_minmax/10000000/manual_time                           +0.0163         +0.0314         69400         70529         88043         90807
Reduction/float_minmax/100000000/manual_time                          +0.0000         +0.0025        474346        474361        493215        494461
```
</details>

It looks like the compression has a minor cost? I would have thought the removal of the `release_assert` would have improved performance. But maybe `reduce` isn't using it?

Authors:
  - Robert Maynard (@robertmaynard)

Approvers:
  - Keith Kraus (@kkraus14)
  - Mark Harris (@harrism)
  - Jake Hemstad (@jrhemstad)

URL: #7583
  • Loading branch information
robertmaynard authored Mar 23, 2021
1 parent 2bf22d1 commit 0f8035d
Show file tree
Hide file tree
Showing 18 changed files with 61 additions and 54 deletions.
3 changes: 3 additions & 0 deletions cpp/cmake/Modules/ConfigureCUDA.cmake
Original file line number Diff line number Diff line change
Expand Up @@ -46,6 +46,9 @@ if(DISABLE_DEPRECATION_WARNING)
list(APPEND CUDF_CUDA_FLAGS -Xcompiler=-Wno-deprecated-declarations)
endif()

# make sure we produce smallest binary size
list(APPEND CUDF_CUDA_FLAGS -Xfatbin=-compress-all)

# Option to enable line info in CUDA device compilation to allow introspection when profiling / memchecking
if(CUDA_ENABLE_LINEINFO)
list(APPEND CUDF_CUDA_FLAGS -lineinfo)
Expand Down
12 changes: 6 additions & 6 deletions cpp/include/cudf/ast/detail/operators.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -187,7 +187,7 @@ CUDA_HOST_DEVICE_CALLABLE constexpr void ast_operator_dispatcher(ast_operator op
#ifndef __CUDA_ARCH__
CUDF_FAIL("Invalid operator.");
#else
release_assert(false && "Invalid operator.");
cudf_assert(false && "Invalid operator.");
#endif
break;
}
Expand Down Expand Up @@ -784,7 +784,7 @@ struct double_dispatch_binary_operator_types {
#ifndef __CUDA_ARCH__
CUDF_FAIL("Invalid binary operation.");
#else
release_assert(false && "Invalid binary operation.");
cudf_assert(false && "Invalid binary operation.");
#endif
}
};
Expand Down Expand Up @@ -819,7 +819,7 @@ struct single_dispatch_binary_operator_types {
#ifndef __CUDA_ARCH__
CUDF_FAIL("Invalid binary operation.");
#else
release_assert(false && "Invalid binary operation.");
cudf_assert(false && "Invalid binary operation.");
#endif
}
};
Expand Down Expand Up @@ -924,7 +924,7 @@ struct dispatch_unary_operator_types {
#ifndef __CUDA_ARCH__
CUDF_FAIL("Invalid unary operation.");
#else
release_assert(false && "Invalid unary operation.");
cudf_assert(false && "Invalid unary operation.");
#endif
}
};
Expand Down Expand Up @@ -996,7 +996,7 @@ struct return_type_functor {
#ifndef __CUDA_ARCH__
CUDF_FAIL("Invalid binary operation. Return type cannot be determined.");
#else
release_assert(false && "Invalid binary operation. Return type cannot be determined.");
cudf_assert(false && "Invalid binary operation. Return type cannot be determined.");
#endif
}

Expand Down Expand Up @@ -1024,7 +1024,7 @@ struct return_type_functor {
#ifndef __CUDA_ARCH__
CUDF_FAIL("Invalid unary operation. Return type cannot be determined.");
#else
release_assert(false && "Invalid unary operation. Return type cannot be determined.");
cudf_assert(false && "Invalid unary operation. Return type cannot be determined.");
#endif
}
};
Expand Down
8 changes: 4 additions & 4 deletions cpp/include/cudf/ast/detail/transform.cuh
Original file line number Diff line number Diff line change
Expand Up @@ -87,7 +87,7 @@ struct unary_row_output : public row_output {
Input input,
detail::device_data_reference output) const
{
release_assert(false && "Invalid unary dispatch operator for the provided input.");
cudf_assert(false && "Invalid unary dispatch operator for the provided input.");
}
};

Expand Down Expand Up @@ -116,7 +116,7 @@ struct binary_row_output : public row_output {
RHS rhs,
detail::device_data_reference output) const
{
release_assert(false && "Invalid binary dispatch operator for the provided input.");
cudf_assert(false && "Invalid binary dispatch operator for the provided input.");
}
};

Expand Down Expand Up @@ -239,7 +239,7 @@ struct row_evaluator {
detail::device_data_reference rhs,
detail::device_data_reference output) const
{
release_assert(false && "Invalid binary dispatch operator for the provided input.");
cudf_assert(false && "Invalid binary dispatch operator for the provided input.");
}

private:
Expand Down Expand Up @@ -311,7 +311,7 @@ __device__ void evaluate_row_expression(detail::row_evaluator const& evaluator,
output,
op);
} else {
release_assert(false && "Invalid operator arity.");
cudf_assert(false && "Invalid operator arity.");
}
}
}
Expand Down
2 changes: 1 addition & 1 deletion cpp/include/cudf/column/column_device_view.cuh
Original file line number Diff line number Diff line change
Expand Up @@ -774,7 +774,7 @@ struct index_element_fn {
std::is_unsigned<IndexType>::value)>* = nullptr>
__device__ size_type operator()(Args&&... args)
{
release_assert(false and "dictionary indices must be an unsigned integral type");
cudf_assert(false and "dictionary indices must be an unsigned integral type");
return 0;
}
};
Expand Down
4 changes: 2 additions & 2 deletions cpp/include/cudf/detail/aggregation/aggregation.cuh
Original file line number Diff line number Diff line change
Expand Up @@ -19,8 +19,8 @@
#include <cudf/aggregation.hpp>
#include <cudf/column/column_device_view.cuh>
#include <cudf/detail/aggregation/aggregation.hpp>
#include <cudf/detail/utilities/assert.cuh>
#include <cudf/detail/utilities/device_atomics.cuh>
#include <cudf/detail/utilities/release_assert.cuh>
#include <cudf/dictionary/dictionary_column_view.hpp>
#include <cudf/table/table_device_view.cuh>

Expand Down Expand Up @@ -103,7 +103,7 @@ struct update_target_element {
column_device_view source,
size_type source_index) const noexcept
{
release_assert(false and "Invalid source type and aggregation combination.");
cudf_assert(false and "Invalid source type and aggregation combination.");
}
};

Expand Down
4 changes: 2 additions & 2 deletions cpp/include/cudf/detail/aggregation/aggregation.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@
#pragma once

#include <cudf/aggregation.hpp>
#include <cudf/detail/utilities/release_assert.cuh>
#include <cudf/detail/utilities/assert.cuh>
#include <cudf/types.hpp>
#include <cudf/utilities/error.hpp>
#include <cudf/utilities/traits.hpp>
Expand Down Expand Up @@ -627,7 +627,7 @@ CUDA_HOST_DEVICE_CALLABLE decltype(auto) aggregation_dispatcher(aggregation::Kin
#ifndef __CUDA_ARCH__
CUDF_FAIL("Unsupported aggregation.");
#else
release_assert(false && "Unsupported aggregation.");
cudf_assert(false && "Unsupported aggregation.");

// The following code will never be reached, but the compiler generates a
// warning if there isn't a return value.
Expand Down
2 changes: 1 addition & 1 deletion cpp/include/cudf/detail/gather.cuh
Original file line number Diff line number Diff line change
Expand Up @@ -18,8 +18,8 @@
#include <cudf/detail/copy.hpp>
#include <cudf/detail/indexalator.cuh>
#include <cudf/detail/null_mask.hpp>
#include <cudf/detail/utilities/assert.cuh>
#include <cudf/detail/utilities/cuda.cuh>
#include <cudf/detail/utilities/release_assert.cuh>
#include <cudf/detail/valid_if.cuh>
#include <cudf/dictionary/dictionary_column_view.hpp>
#include <cudf/dictionary/dictionary_factories.hpp>
Expand Down
4 changes: 2 additions & 2 deletions cpp/include/cudf/detail/indexalator.cuh
Original file line number Diff line number Diff line change
Expand Up @@ -268,7 +268,7 @@ struct input_indexalator : base_indexalator<input_indexalator> {
template <typename T, std::enable_if_t<not is_index_type<T>()>* = nullptr>
__device__ size_type operator()(void const* tp)
{
release_assert(false and "only index types are supported");
cudf_assert(false and "only index types are supported");
return 0;
}
};
Expand Down Expand Up @@ -366,7 +366,7 @@ struct output_indexalator : base_indexalator<output_indexalator> {
template <typename T, std::enable_if_t<not is_index_type<T>()>* = nullptr>
__device__ void operator()(void* tp, size_type const value)
{
release_assert(false and "only index types are supported");
cudf_assert(false and "only index types are supported");
}
};

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -27,11 +27,11 @@
*
* Relies on the `__PRETTY_FUNCTION__` macro which is specific to GCC and Clang.
*/
#if defined(__CUDA_ARCH__) && (defined(__clang__) || defined(__GNUC__))
#if !defined(NDEBUG) && defined(__CUDA_ARCH__) && (defined(__clang__) || defined(__GNUC__))
#define __ASSERT_STR_HELPER(x) #x
#define release_assert(e) \
#define cudf_assert(e) \
((e) ? static_cast<void>(0) \
: __assert_fail(__ASSERT_STR_HELPER(e), __FILE__, __LINE__, __PRETTY_FUNCTION__))
#else
#define release_assert(e) (static_cast<void>(0))
#define cudf_assert(e) (static_cast<void>(0))
#endif
14 changes: 7 additions & 7 deletions cpp/include/cudf/detail/utilities/hash_functions.cuh
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@
#pragma once

#include <cudf/column/column_device_view.cuh>
#include <cudf/detail/utilities/release_assert.cuh>
#include <cudf/detail/utilities/assert.cuh>
#include <cudf/strings/string_view.cuh>
#include <hash/hash_constants.hpp>

Expand Down Expand Up @@ -154,7 +154,7 @@ struct MD5ListHasher {
size_type offset_end,
md5_intermediate_data* hash_state) const
{
release_assert(false && "MD5 Unsupported chrono type column");
cudf_assert(false && "MD5 Unsupported chrono type column");
}

template <typename T, std::enable_if_t<!is_fixed_width<T>()>* = nullptr>
Expand All @@ -163,7 +163,7 @@ struct MD5ListHasher {
size_type offset_end,
md5_intermediate_data* hash_state) const
{
release_assert(false && "MD5 Unsupported non-fixed-width type column");
cudf_assert(false && "MD5 Unsupported non-fixed-width type column");
}

template <typename T, std::enable_if_t<is_floating_point<T>()>* = nullptr>
Expand Down Expand Up @@ -273,15 +273,15 @@ struct MD5Hash {
size_type row_index,
md5_intermediate_data* hash_state) const
{
release_assert(false && "MD5 Unsupported chrono type column");
cudf_assert(false && "MD5 Unsupported chrono type column");
}

template <typename T, std::enable_if_t<!is_fixed_width<T>()>* = nullptr>
void __device__ operator()(column_device_view col,
size_type row_index,
md5_intermediate_data* hash_state) const
{
release_assert(false && "MD5 Unsupported non-fixed-width type column");
cudf_assert(false && "MD5 Unsupported non-fixed-width type column");
}

template <typename T, std::enable_if_t<is_floating_point<T>()>* = nullptr>
Expand Down Expand Up @@ -344,7 +344,7 @@ void CUDA_DEVICE_CALLABLE MD5Hash::operator()<list_view>(column_device_view col,
column_device_view offsets = col.child(offsets_column_index);
column_device_view data = col.child(data_column_index);

if (data.type().id() == type_id::LIST) release_assert(false && "Nested list unsupported");
if (data.type().id() == type_id::LIST) cudf_assert(false && "Nested list unsupported");

cudf::type_dispatcher(data.type(),
MD5ListHasher{},
Expand Down Expand Up @@ -724,7 +724,7 @@ struct IdentityHash {
CUDA_HOST_DEVICE_CALLABLE std::enable_if_t<!std::is_arithmetic<Key>::value, return_type>
operator()(Key const& key) const
{
release_assert(false && "IdentityHash does not support this data type");
cudf_assert(false && "IdentityHash does not support this data type");
return 0;
}

Expand Down
4 changes: 2 additions & 2 deletions cpp/include/cudf/fixed_point/fixed_point.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,7 @@

#pragma once

#include <cudf/detail/utilities/release_assert.cuh>
#include <cudf/detail/utilities/assert.cuh>
#include <cudf/types.hpp>

// Note: The <cuda/std/*> versions are used in order for Jitify to work with our fixed_point type.
Expand Down Expand Up @@ -91,7 +91,7 @@ template <typename Rep,
is_supported_representation_type<Rep>())>* = nullptr>
CUDA_HOST_DEVICE_CALLABLE Rep ipow(T exponent)
{
release_assert(exponent >= 0 && "integer exponentiation with negative exponent is not possible.");
cudf_assert(exponent >= 0 && "integer exponentiation with negative exponent is not possible.");
if (exponent == 0) return static_cast<Rep>(1);
auto extra = static_cast<Rep>(1);
auto square = static_cast<Rep>(Base);
Expand Down
14 changes: 7 additions & 7 deletions cpp/include/cudf/lists/list_device_view.cuh
Original file line number Diff line number Diff line change
Expand Up @@ -37,12 +37,12 @@ class list_device_view {
: lists_column(lists_column), _row_index(row_index)
{
column_device_view const& offsets = lists_column.offsets();
release_assert(row_index >= 0 && row_index < lists_column.size() &&
row_index < offsets.size() && "row_index out of bounds");
cudf_assert(row_index >= 0 && row_index < lists_column.size() && row_index < offsets.size() &&
"row_index out of bounds");

begin_offset = offsets.element<size_type>(row_index);
release_assert(begin_offset >= 0 && begin_offset <= lists_column.child().size() &&
"begin_offset out of bounds.");
cudf_assert(begin_offset >= 0 && begin_offset <= lists_column.child().size() &&
"begin_offset out of bounds.");
_size = offsets.element<size_type>(row_index + 1) - begin_offset;
}

Expand Down Expand Up @@ -71,7 +71,7 @@ class list_device_view {
*/
CUDA_DEVICE_CALLABLE size_type element_offset(size_type idx) const
{
release_assert(idx >= 0 && idx < size() && "idx out of bounds");
cudf_assert(idx >= 0 && idx < size() && "idx out of bounds");
return begin_offset + idx;
}

Expand All @@ -93,7 +93,7 @@ class list_device_view {
*/
CUDA_DEVICE_CALLABLE bool is_null(size_type idx) const
{
release_assert(idx >= 0 && idx < size() && "Index out of bounds.");
cudf_assert(idx >= 0 && idx < size() && "Index out of bounds.");
auto element_offset = begin_offset + idx;
return lists_column.child().is_null(element_offset);
}
Expand Down Expand Up @@ -294,7 +294,7 @@ struct list_size_functor {
CUDA_HOST_DEVICE_CALLABLE list_size_functor(column_device_view const& d_col) : d_column(d_col)
{
#if defined(__CUDA_ARCH__)
release_assert(d_col.type().id() == type_id::LIST && "Only list type column is supported");
cudf_assert(d_col.type().id() == type_id::LIST && "Only list type column is supported");
#else
CUDF_EXPECTS(d_col.type().id() == type_id::LIST, "Only list type column is supported");
#endif
Expand Down
6 changes: 3 additions & 3 deletions cpp/include/cudf/table/row_operators.cuh
Original file line number Diff line number Diff line change
Expand Up @@ -17,8 +17,8 @@
#pragma once

#include <cudf/column/column_device_view.cuh>
#include <cudf/detail/utilities/assert.cuh>
#include <cudf/detail/utilities/hash_functions.cuh>
#include <cudf/detail/utilities/release_assert.cuh>
#include <cudf/sorting.hpp>
#include <cudf/table/table_device_view.cuh>
#include <cudf/utilities/traits.hpp>
Expand Down Expand Up @@ -190,7 +190,7 @@ class element_equality_comparator {
std::enable_if_t<not cudf::is_equality_comparable<Element, Element>()>* = nullptr>
__device__ bool operator()(size_type lhs_element_index, size_type rhs_element_index)
{
release_assert(false && "Attempted to compare elements of uncomparable types.");
cudf_assert(false && "Attempted to compare elements of uncomparable types.");
return false;
}

Expand Down Expand Up @@ -291,7 +291,7 @@ class element_relational_comparator {
std::enable_if_t<not cudf::is_relationally_comparable<Element, Element>()>* = nullptr>
__device__ weak_ordering operator()(size_type lhs_element_index, size_type rhs_element_index)
{
release_assert(false && "Attempted to compare elements of uncomparable types.");
cudf_assert(false && "Attempted to compare elements of uncomparable types.");
return weak_ordering::LESS;
}

Expand Down
4 changes: 2 additions & 2 deletions cpp/include/cudf/utilities/type_dispatcher.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,7 @@

#pragma once

#include <cudf/detail/utilities/release_assert.cuh>
#include <cudf/detail/utilities/assert.cuh>
#include <cudf/fixed_point/fixed_point.hpp>
#include <cudf/types.hpp>
#include <cudf/utilities/error.hpp>
Expand Down Expand Up @@ -501,7 +501,7 @@ CUDA_HOST_DEVICE_CALLABLE constexpr decltype(auto) type_dispatcher(cudf::data_ty
#ifndef __CUDA_ARCH__
CUDF_FAIL("Unsupported type_id.");
#else
release_assert(false && "Unsupported type_id.");
cudf_assert(false && "Unsupported type_id.");

// The following code will never be reached, but the compiler generates a
// warning if there isn't a return value.
Expand Down
4 changes: 2 additions & 2 deletions cpp/src/groupby/hash/multi_pass_kernels.cuh
Original file line number Diff line number Diff line change
Expand Up @@ -20,8 +20,8 @@
#include <cudf/aggregation.hpp>
#include <cudf/column/column_device_view.cuh>
#include <cudf/detail/aggregation/aggregation.hpp>
#include <cudf/detail/utilities/assert.cuh>
#include <cudf/detail/utilities/device_atomics.cuh>
#include <cudf/detail/utilities/release_assert.cuh>
#include <cudf/table/table_device_view.cuh>
#include <cudf/utilities/type_dispatcher.hpp>

Expand Down Expand Up @@ -65,7 +65,7 @@ struct var_hash_functor {
size_type source_index,
size_type target_index) noexcept
{
release_assert(false and "Invalid source type for std, var aggregation combination.");
cudf_assert(false and "Invalid source type for std, var aggregation combination.");
}

template <typename Source>
Expand Down
2 changes: 1 addition & 1 deletion cpp/src/io/parquet/page_data.cu
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@
#include <io/utilities/block_utils.cuh>
#include <io/utilities/column_buffer.hpp>

#include <cudf/detail/utilities/release_assert.cuh>
#include <cudf/detail/utilities/assert.cuh>
#include <cudf/utilities/bit.hpp>

#include <rmm/cuda_stream_view.hpp>
Expand Down
Loading

0 comments on commit 0f8035d

Please sign in to comment.