From 0f8035d68cfe45fb8da7cc1bca9b71e35aa05fe4 Mon Sep 17 00:00:00 2001
From: Robert Maynard <rmaynard@nvidia.com>
Date: Mon, 22 Mar 2021 23:12:48 -0400
Subject: [PATCH] Reduce cudf library size (#7583)

This PR combines two major changes that improve cudf's final binary size.

The optimizations are:

- Explicitly telling fatbin to always compress device code
- Not generating asserts with `__FILE__`, `__LINE__`, etc when in release mode.

With this PR we take cudf when built for `ALL` archs from  `1.3GB` to `329MB`.

Since this has such a dramatic change, I did some analysis on library sizes, compile times, runtime startup times, and
runtime performance.

### Library sizes
| variant | release 1 arch | release all archs |
| -- | -- | -- |
| branch-0.19 				| 335MB          | 1.3GB             |
| compression				| 133MB          | 476MB 			 |
| compression + no_assert	| 110MB          | 329MB 			 |
| rdc + comp + no_assert    | 150MB          | Not done 		 |

We see that compression and the removal of the asserts while staying with whole compilation ( no `-rdc` ) is best for performance. RDC taking more space makes sense to me as it has a less agressive optimizer and therefore generates more
sass/ptx.


### Compile times

| variant | release 1 arch | release all archs |
| -- | -- | -- |
| branch-0.19 				| 15m50.411s     | 44m12.849s        |
| compression				| 15m14.021s     | 44m11.658s 		 |
| compression + no_assert	| 12m53.049s     | 33m59.388s		 |
| rdc + comp + no_assert    | 12m58.550s     | Not done			 |

Compile times are interesting as we see that compression has no negative effect. I expect that the reduced IO ( from smaller binaries ) offsets the time spent compressing.


## Runtime starup times for `all archs`:
| variant | branch-0.19	| compression + no_assert	|
| -- | -- | -- |
| ctest -j1 				| 4m0.326s 		| 3m57.500s                 |
| DISPATCHER_TEST * 12      | 0m3.441s  	| 0m3.520s 					|


The goal here was to time the tests to see if we had any measurable differences.
The `DISPATCHER_TEST * 12` was selected as the runtime `~0.3s` would hopefully highligh any performance differences, and therefore running it 12 times would
magnify the difference and wash out any system load impacts.

I think from these rather crude measurements we are safe to assume no massive runtime loading time costs.


## Runtime performance for `all archs`:

I used the REDUCTION_BENCH as my baseline for detecting any performance changes. The benchmarks executed on a V100 on a shared lab machine.

<details>
<summary>REDUCTION_BENCH ./compare benchmarks branch_0.19 this_compression_branch</summary>

```
Comparing ./reduction_0.19 to ./reduction_compressed
Benchmark                                                                Time             CPU      Time Old      Time New       CPU Old       CPU New
-----------------------------------------------------------------------------------------------------------------------------------------------------
Reduction/bool_all/10000/manual_time                                  +0.1038         +0.0785         16621         18346         37366         40299
Reduction/bool_all/100000/manual_time                                 +0.0969         +0.0779         17057         18710         37738         40676
Reduction/bool_all/1000000/manual_time                                +0.0709         +0.0694         18175         19463         38256         40913
Reduction/bool_all/10000000/manual_time                               +0.0472         +0.0479         35489         37165         53792         56368
Reduction/bool_all/100000000/manual_time                              +0.0026         +0.0120        145630        146014        164349        166325
Reduction/int8_t_all/10000/manual_time                                +0.1415         +0.1098         16286         18591         36729         40762
Reduction/int8_t_all/100000/manual_time                               +0.0457         +0.0500         17180         17966         37798         39688
Reduction/int8_t_all/1000000/manual_time                              +0.0656         +0.0565         18297         19498         38522         40700
Reduction/int8_t_all/10000000/manual_time                             +0.0322         +0.0345         36117         37279         54645         56528
Reduction/int8_t_all/100000000/manual_time                            +0.0041         +0.0132        146222        146819        165053        167232
Reduction/int32_t_all/10000/manual_time                               +0.0931         +0.0781         16849         18417         37471         40397
Reduction/int32_t_all/100000/manual_time                              +0.0841         +0.0775         17371         18831         37772         40701
Reduction/int32_t_all/1000000/manual_time                             +0.0660         +0.0709         22323         23797         41633         44584
Reduction/int32_t_all/10000000/manual_time                            +0.0171         +0.0324         68303         69470         87006         89822
Reduction/int32_t_all/100000000/manual_time                           +0.0028         +0.0065        472041        473370        490693        493869
Reduction/float_all/10000/manual_time                                 +0.0847         +0.0734         16951         18387         37708         40474
Reduction/float_all/100000/manual_time                                +0.0962         +0.0880         16939         18569         37229         40504
Reduction/float_all/1000000/manual_time                               +0.1879         +0.1679         22348         26547         41549         48526
Reduction/float_all/10000000/manual_time                              +0.0465         +0.0652         68851         72054         87438         93139
Reduction/float_all/100000000/manual_time                             +0.0081         +0.0126        472210        476037        491030        497241
Reduction/bool_any/10000/manual_time                                  +0.0764         +0.0743         16292         17537         36830         39568
Reduction/bool_any/100000/manual_time                                 +0.0568         +0.0591         16633         17578         37197         39394
Reduction/bool_any/1000000/manual_time                                +0.0771         +0.0593         18024         19414         38439         40720
Reduction/bool_any/10000000/manual_time                               +0.0388         +0.0452         34990         36347         53210         55614
Reduction/bool_any/100000000/manual_time                              +0.0012         +0.0096        145729        145903        164726        166315
Reduction/int8_t_any/10000/manual_time                                +0.1207         +0.0954         15686         17579         36213         39669
Reduction/int8_t_any/100000/manual_time                               +0.0546         +0.0580         16597         17503         37103         39254
Reduction/int8_t_any/1000000/manual_time                              +0.0801         +0.0669         17938         19374         38027         40572
Reduction/int8_t_any/10000000/manual_time                             +0.0492         +0.0541         34726         36434         52812         55671
Reduction/int8_t_any/100000000/manual_time                            +0.0028         +0.0119        145924        146340        164711        166670
Reduction/int32_t_any/10000/manual_time                               +0.0489         +0.0539         16805         17626         37587         39614
Reduction/int32_t_any/100000/manual_time                              +0.0824         +0.0717         16985         18384         37432         40114
Reduction/int32_t_any/1000000/manual_time                             +0.0588         +0.0590         22338         23652         41842         44311
Reduction/int32_t_any/10000000/manual_time                            +0.0203         +0.0339         67672         69045         86353         89280
Reduction/int32_t_any/100000000/manual_time                           +0.0023         +0.0054        471708        472770        490461        493130
Reduction/float_any/10000/manual_time                                 +0.0454         +0.0494         16730         17490         37526         39382
Reduction/float_any/100000/manual_time                                +0.0895         +0.0739         16872         18381         37349         40110
Reduction/float_any/1000000/manual_time                               +0.0554         +0.0527         22402         23643         42014         44228
Reduction/float_any/10000000/manual_time                              +0.0026         +0.0081         68877         69060         88443         89159
Reduction/float_any/100000000/manual_time                             +0.0022         +0.0049        471781        472801        490753        493170
ReductionDictionary/int32_t_all/10000/manual_time                     +0.0503         +0.0535         30648         32190         51412         54160
ReductionDictionary/int32_t_all/100000/manual_time                    +0.0696         +0.0642         31076         33240         51545         54855
ReductionDictionary/int32_t_all/1000000/manual_time                   +0.9359         +0.4786         38709         74935         58597         86644
ReductionDictionary/int32_t_all/10000000/manual_time                  +0.0289         +0.0625         95351         98111        113516        120611
ReductionDictionary/int32_t_all/100000000/manual_time                 -0.0963         -0.0910        645050        582915        663356        602971
ReductionDictionary/float_all/10000/manual_time                       +0.0476         +0.0504         30177         31615         50814         53374
ReductionDictionary/float_all/100000/manual_time                      -0.0305         -0.0166         34208         33166         55546         54626
ReductionDictionary/float_all/1000000/manual_time                     -0.0258         -0.0137         40372         39329         61080         60242
ReductionDictionary/float_all/10000000/manual_time                    -0.0366         -0.0210         95681         92175        113937        111545
ReductionDictionary/float_all/100000000/manual_time                   -0.0784         -0.0733        644233        593744        662793        614231
ReductionDictionary/int32_t_any/10000/manual_time                     +0.0718         +0.0680         29882         32026         50723         54172
ReductionDictionary/int32_t_any/100000/manual_time                    +0.0776         +0.0704         30316         32670         50640         54203
ReductionDictionary/int32_t_any/1000000/manual_time                   +0.0291         +0.0372         37311         38397         57039         59159
ReductionDictionary/int32_t_any/10000000/manual_time                  -0.0497         -0.0285         95928         91163        113924        110675
ReductionDictionary/int32_t_any/100000000/manual_time                 -0.1160         -0.1109        655803        579715        673920        599181
ReductionDictionary/float_any/10000/manual_time                       +0.0561         +0.0563         29905         31583         50566         53413
ReductionDictionary/float_any/100000/manual_time                      +0.0622         +0.0581         30431         32322         50788         53740
ReductionDictionary/float_any/1000000/manual_time                     +0.0403         +0.0473         36955         38444         56562         59237
ReductionDictionary/float_any/10000000/manual_time                    -0.0626         -0.0421         96949         90881        115099        110252
ReductionDictionary/float_any/100000000/manual_time                   -0.1203         -0.1154        661222        581658        679363        600965
ReductionDictionary/int32_t_min/10000/manual_time                     +0.0539         +0.0525         45796         48265         66802         70306
ReductionDictionary/int32_t_min/100000/manual_time                    +0.0955         +0.0868         46170         50579         66882         72687
ReductionDictionary/int32_t_min/1000000/manual_time                   +0.0683         +0.0647         47495         50738         67291         71646
ReductionDictionary/int32_t_min/10000000/manual_time                  +0.0065         +0.0160         97469         98105        115997        117850
ReductionDictionary/int32_t_min/100000000/manual_time                 +0.0045         +0.0070        501086        503357        519638        523253
ReductionDictionary/float_min/10000/manual_time                       +0.0721         +0.0653         45551         48834         66485         70829
ReductionDictionary/float_min/100000/manual_time                      +0.0966         +0.0865         44953         49295         65457         71120
ReductionDictionary/float_min/1000000/manual_time                     +0.0618         +0.0582         47227         50145         67004         70900
ReductionDictionary/float_min/10000000/manual_time                    +0.0052         +0.0167         97166         97669        115559        117490
ReductionDictionary/float_min/100000000/manual_time                   +0.0060         +0.0083        501044        504029        519674        523988
ReductionDictionary/int32_t_max/10000/manual_time                     +0.0596         +0.0558         44987         47667         65991         69670
ReductionDictionary/int32_t_max/100000/manual_time                    +0.0735         +0.0625         44640         47921         65423         69515
ReductionDictionary/int32_t_max/1000000/manual_time                   +0.0610         +0.0569         46224         49042         65957         69707
ReductionDictionary/int32_t_max/10000000/manual_time                  -0.0100         -0.0009         97078         96105        115787        115685
ReductionDictionary/int32_t_max/100000000/manual_time                 +0.0021         +0.0041        500466        501504        518959        521082
ReductionDictionary/float_max/10000/manual_time                       +0.0815         +0.0716         44141         47738         65134         69798
ReductionDictionary/float_max/100000/manual_time                      +0.0481         +0.0466         46378         48609         67284         70422
ReductionDictionary/float_max/1000000/manual_time                     +0.0091         +0.0180         48140         48579         67837         69055
ReductionDictionary/float_max/10000000/manual_time                    -0.0000         +0.0063         96823         96820        115498        116228
ReductionDictionary/float_max/100000000/manual_time                   +0.0044         +0.0061        501050        503259        519938        523126
ReductionDictionary/int32_t_mean/10000/manual_time                    -0.0259         -0.0018         45053         43885         66162         66041
ReductionDictionary/int32_t_mean/100000/manual_time                   +0.0448         +0.0484         44776         46782         65662         68843
ReductionDictionary/int32_t_mean/1000000/manual_time                  +0.0134         +0.0263         52069         52766         72163         74061
ReductionDictionary/int32_t_mean/10000000/manual_time                 -0.1592         -0.1321        135248        113712        153825        133504
ReductionDictionary/int32_t_mean/100000000/manual_time                -0.2259         -0.2180        856793        663232        875288        684472
ReductionDictionary/float_mean/10000/manual_time                      +0.1263         +0.1185         44256         49847         65389         73139
ReductionDictionary/float_mean/100000/manual_time                     +0.1304         +0.1231         44038         49780         64743         72713
ReductionDictionary/float_mean/1000000/manual_time                    +0.0792         +0.0887         51678         55771         71756         78124
ReductionDictionary/float_mean/10000000/manual_time                   -0.1397         -0.1058        136928        117802        155376        138931
ReductionDictionary/float_mean/100000000/manual_time                  -0.2322         -0.2256        861014        661106        879508        681082
Reduction/bool_sum/10000/manual_time                                  +0.1048         +0.0881         16256         17960         36958         40216
Reduction/bool_sum/100000/manual_time                                 +0.0424         +0.0473         17106         17832         37986         39781
Reduction/bool_sum/1000000/manual_time                                +0.0837         +0.0620         17928         19428         38311         40688
Reduction/bool_sum/10000000/manual_time                               +0.0408         +0.0395         34927         36352         53394         55502
Reduction/bool_sum/100000000/manual_time                              +0.0030         +0.0102        145074        145516        164126        165798
Reduction/int8_t_sum/10000/manual_time                                +0.1105         +0.0832         15704         17438         36410         39440
Reduction/int8_t_sum/100000/manual_time                               +0.0694         +0.0629         16539         17688         37233         39575
Reduction/int8_t_sum/1000000/manual_time                              +0.0783         +0.0574         17880         19279         38316         40517
Reduction/int8_t_sum/10000000/manual_time                             +0.0416         +0.0406         34809         36256         53305         55467
Reduction/int8_t_sum/100000000/manual_time                            +0.0010         +0.0098        144214        144358        163105        164700
Reduction/int32_t_sum/10000/manual_time                               +0.0074         +0.0264         17459         17588         38551         39570
Reduction/int32_t_sum/100000/manual_time                              +0.0791         +0.0684         17086         18438         37705         40286
Reduction/int32_t_sum/1000000/manual_time                             -0.0064         -0.0027         23751         23599         44384         44263
Reduction/int32_t_sum/10000000/manual_time                            +0.0238         +0.0358         67606         69212         86499         89593
Reduction/int32_t_sum/100000000/manual_time                           +0.0002         -0.0043        472981        473093        495882        493743
Reduction/int64_t_sum/10000/manual_time                               +0.0340         +0.0383         17246         17833         38306         39772
Reduction/int64_t_sum/100000/manual_time                              +0.0755         +0.0719         17841         19188         38242         40990
Reduction/int64_t_sum/1000000/manual_time                             +0.0576         +0.0635         28702         30354         47263         50266
Reduction/int64_t_sum/10000000/manual_time                            +0.0119         +0.0141        112911        114253        132783        134658
Reduction/int64_t_sum/100000000/manual_time                           +0.0013         +0.0021        920797        921986        940549        942485
Reduction/float_sum/10000/manual_time                                 +0.0882         +0.0707         16479         17932         37421         40067
Reduction/float_sum/100000/manual_time                                +0.0903         +0.0774         16677         18183         37163         40041
Reduction/float_sum/1000000/manual_time                               +0.0571         +0.0622         22393         23673         41960         44569
Reduction/float_sum/10000000/manual_time                              +0.0179         +0.0345         67968         69184         86691         89678
Reduction/float_sum/100000000/manual_time                             +0.0015         -0.0015        472247        472973        494236        493474
Reduction/double_sum/10000/manual_time                                +0.0982         +0.0794         16036         17611         36657         39568
Reduction/double_sum/100000/manual_time                               +0.0153         +0.0340         18453         18735         38955         40281
Reduction/double_sum/1000000/manual_time                              +0.0596         +0.0636         28623         30328         47153         50154
Reduction/double_sum/10000000/manual_time                             +0.0089         +0.0130        113426        114441        132746        134476
Reduction/double_sum/100000000/manual_time                            +0.0026         +0.0044        920131        922540        939374        943549
Reduction/int32_t_product/10000/manual_time                           +0.1183         +0.0902         16570         18529         37443         40819
Reduction/int32_t_product/100000/manual_time                          +0.1097         +0.0905         17076         18948         37741         41158
Reduction/int32_t_product/1000000/manual_time                         +0.0655         +0.0622         22023         23465         41604         44194
Reduction/int32_t_product/10000000/manual_time                        +0.0237         +0.0374         67629         69233         86354         89583
Reduction/int32_t_product/100000000/manual_time                       +0.0003         +0.0024        473110        473265        492461        493666
Reduction/float_product/10000/manual_time                             +0.0839         +0.0615         16972         18395         38034         40374
Reduction/float_product/100000/manual_time                            +0.0787         +0.0718         17091         18436         37602         40302
Reduction/float_product/1000000/manual_time                           +0.0686         +0.0714         22282         23810         41856         44844
Reduction/float_product/10000000/manual_time                          +0.0135         +0.0191         68313         69233         87798         89474
Reduction/float_product/100000000/manual_time                         +0.0016         -0.0034        472681        473450        495621        493925
Reduction/int64_t_min/10000/manual_time                               +0.0885         +0.0697         16886         18380         37763         40393
Reduction/int64_t_min/100000/manual_time                              +0.0490         +0.0534         17988         18869         38269         40311
Reduction/int64_t_min/1000000/manual_time                             +0.0087         +0.0218         30355         30618         49407         50485
Reduction/int64_t_min/10000000/manual_time                            +0.0092         +0.0179        113906        114949        132977        135352
Reduction/int64_t_min/100000000/manual_time                           +0.0011         +0.0020        921089        922086        940543        942460
Reduction/double_min/10000/manual_time                                +0.0818         +0.0662         16848         18226         37674         40169
Reduction/double_min/100000/manual_time                               +0.0839         +0.0761         17470         18936         37537         40394
Reduction/double_min/1000000/manual_time                              +0.0582         +0.0629         28967         30653         47503         50490
Reduction/double_min/10000000/manual_time                             +0.0135         +0.0246        113413        114940        132250        135504
Reduction/double_min/100000000/manual_time                            +0.0017         +0.0027        920812        922344        940237        942769
Reduction/timestamp_ms_min/10000/manual_time                          +0.0910         +0.0673         16926         18466         37673         40210
Reduction/timestamp_ms_min/100000/manual_time                         +0.0830         +0.0738         17894         19379         38112         40925
Reduction/timestamp_ms_min/1000000/manual_time                        +0.0339         +0.0429         30113         31133         48915         51012
Reduction/timestamp_ms_min/10000000/manual_time                       +0.0109         +0.0189        114185        115433        133132        135650
Reduction/timestamp_ms_min/100000000/manual_time                      +0.0053         +0.0076        920924        925845        940096        947215
Reduction/int8_t_mean/10000/manual_time                               +0.0489         +0.0517         29025         30445         50078         52670
Reduction/int8_t_mean/100000/manual_time                              +0.0401         +0.0493         29218         30389         50050         52515
Reduction/int8_t_mean/1000000/manual_time                             -0.0215         -0.0036         33316         32600         54162         53965
Reduction/int8_t_mean/10000000/manual_time                            +0.0387         +0.0407         46726         48536         64936         67581
Reduction/int8_t_mean/100000000/manual_time                           +0.0007         +0.0058        161956        162077        180866        181918
Reduction/float_mean/10000/manual_time                                +0.0916         +0.0827         28959         31611         49748         53860
Reduction/float_mean/100000/manual_time                               +0.0369         +0.0485         30720         31854         51489         53988
Reduction/float_mean/1000000/manual_time                              +0.0218         +0.0310         36650         37448         56777         58539
Reduction/float_mean/10000000/manual_time                             +0.0266         +0.0402         81361         83528         99648        103653
Reduction/float_mean/100000000/manual_time                            -0.0036         -0.0003        489644        487869        508151        507999
Reduction/int32_t_variance/10000/manual_time                          +0.0337         +0.0343         29718         30718         50967         52713
Reduction/int32_t_variance/100000/manual_time                         -0.0506         -0.0189         33774         32065         55094         54054
Reduction/int32_t_variance/1000000/manual_time                        +0.0364         +0.0316         36062         37374         56511         58297
Reduction/int32_t_variance/10000000/manual_time                       +0.0107         +0.0197         83950         84849        102507        104529
Reduction/int32_t_variance/100000000/manual_time                      -0.0003         +0.0018        491601        491463        510405        511323
Reduction/double_variance/10000/manual_time                           +0.0536         +0.0457         29157         30721         50266         52565
Reduction/double_variance/100000/manual_time                          +0.0367         +0.0415         30552         31674         51014         53130
Reduction/double_variance/1000000/manual_time                         +0.0238         +0.0334         43737         44780         62726         64821
Reduction/double_variance/10000000/manual_time                        +0.0046         +0.0084        129382        129980        148232        149476
Reduction/double_variance/100000000/manual_time                       +0.0027         +0.0042        934085        936595        952680        956708
Reduction/int64_t_std/10000/manual_time                               +0.0638         +0.0592         28956         30803         49847         52798
Reduction/int64_t_std/100000/manual_time                              -0.0742         -0.0356         34976         32380         56179         54180
Reduction/int64_t_std/1000000/manual_time                             -0.1048         -0.0836         49651         44446         70034         64177
Reduction/int64_t_std/10000000/manual_time                            +0.0077         +0.0118        129373        130373        148033        149784
Reduction/int64_t_std/100000000/manual_time                           +0.0025         +0.0037        934055        936408        952668        956168
Reduction/float_std/10000/manual_time                                 +0.0754         +0.0699         28573         30728         49316         52764
Reduction/float_std/100000/manual_time                                +0.0945         +0.0846         28682         31394         49008         53155
Reduction/float_std/1000000/manual_time                               +0.0356         +0.0437         36195         37485         56074         58526
Reduction/float_std/10000000/manual_time                              +0.0175         +0.0292         83490         84951        101651        104623
Reduction/float_std/100000000/manual_time                             +0.0018         +0.0049        490773        491670        509017        511496
ReductionScan/int8_no_nulls/10000/manual_time                         +0.0523         +0.0511         15603         16420         36157         38005
ReductionScan/int8_no_nulls/100000/manual_time                        +0.0181         +0.0295         17358         17672         37950         39070
ReductionScan/int8_no_nulls/1000000/manual_time                       +0.0251         +0.0390         26876         27550         45671         47451
ReductionScan/int8_no_nulls/10000000/manual_time                      +0.0166         +0.0278         89319         90800        109601        112649
ReductionScan/int8_no_nulls/100000000/manual_time                     +0.0007         +0.0019        735210        735689        756198        757634
ReductionScan/int32_no_nulls/10000/manual_time                        +0.0515         +0.0559         15234         16019         35716         37712
ReductionScan/int32_no_nulls/100000/manual_time                       +0.0593         +0.0583         16808         17805         36980         39134
ReductionScan/int32_no_nulls/1000000/manual_time                      +0.0390         +0.0469         26405         27436         44665         46759
ReductionScan/int32_no_nulls/10000000/manual_time                     +0.0103         +0.0196        114948        116137        135580        138232
ReductionScan/int32_no_nulls/100000000/manual_time                    +0.0032         +0.0056       1016197       1019401       1037281       1043090
ReductionScan/uint64_no_nulls/10000/manual_time                       +0.0162         +0.0308         16222         16485         36939         38077
ReductionScan/uint64_no_nulls/100000/manual_time                      +0.0588         +0.0685         18751         19852         38174         40789
ReductionScan/uint64_no_nulls/1000000/manual_time                     +0.0444         +0.0604         36339         37953         56819         60253
ReductionScan/uint64_no_nulls/10000000/manual_time                    -0.0072         +0.0016        217943        216376        238710        239081
ReductionScan/uint64_no_nulls/100000000/manual_time                   -0.0009         -0.0007       2001645       1999848       2023397       2021947
ReductionScan/float_no_nulls/10000/manual_time                        +0.0298         +0.0399         15775         16246         36567         38028
ReductionScan/float_no_nulls/100000/manual_time                       +0.0534         +0.0505         16888         17789         37162         39037
ReductionScan/float_no_nulls/1000000/manual_time                      +0.0339         +0.0416         26655         27558         45083         46960
ReductionScan/float_no_nulls/10000000/manual_time                     +0.0071         +0.0155        115565        116383        136602        138716
ReductionScan/float_no_nulls/100000000/manual_time                    -0.0019         -0.0006       1019440       1017475       1040591       1040016
ReductionScan/int16_nulls/10000/manual_time                           +0.0510         +0.0442         35588         37402         57805         60363
ReductionScan/int16_nulls/100000/manual_time                          +0.0442         +0.0450         37789         39458         59383         62055
ReductionScan/int16_nulls/1000000/manual_time                         -0.0270         -0.0174         48836         47519         69731         68521
ReductionScan/int16_nulls/10000000/manual_time                        -0.0046         +0.0034        122849        122281        145028        145520
ReductionScan/int16_nulls/100000000/manual_time                       -0.0025         -0.0107        864450        862291        892772        883246
ReductionScan/uint32_nulls/10000/manual_time                          -0.0178         -0.0017         37685         37016         59988         59883
ReductionScan/uint32_nulls/100000/manual_time                         +0.0478         +0.0494         37349         39134         58805         61708
ReductionScan/uint32_nulls/1000000/manual_time                        +0.0085         +0.0130         49350         49771         69758         70663
ReductionScan/uint32_nulls/10000000/manual_time                       +0.0013         +0.0086        146704        146896        169127        170574
ReductionScan/uint32_nulls/100000000/manual_time                      +0.0085         +0.0111       1112967       1122427       1133374       1145988
ReductionScan/double_nulls/10000/manual_time                          +0.0623         +0.0611         35538         37753         57332         60833
ReductionScan/double_nulls/100000/manual_time                         +0.0602         +0.0635         38508         40827         59025         62775
ReductionScan/double_nulls/1000000/manual_time                        +0.0202         +0.0275         55522         56645         77436         79567
ReductionScan/double_nulls/10000000/manual_time                       +0.0084         +0.0118        243996        246039        265992        269142
ReductionScan/double_nulls/100000000/manual_time                      -0.0013         -0.0008       2120668       2117888       2140996       2139376
Reduction/bool_minmax/10000/manual_time                               -0.1377         -0.0650         22410         19325         43937         41081
Reduction/bool_minmax/100000/manual_time                              +0.0523         +0.0537         18947         19937         39665         41795
Reduction/bool_minmax/1000000/manual_time                             +0.0734         +0.0617         19146         20551         39413         41846
Reduction/bool_minmax/10000000/manual_time                            +0.0358         +0.0438         37836         39189         56031         58483
Reduction/bool_minmax/100000000/manual_time                           +0.0000         -0.0063        149637        149641        170798        169728
Reduction/int8_t_minmax/10000/manual_time                             +0.0361         +0.0438         18831         19512         39671         41408
Reduction/int8_t_minmax/100000/manual_time                            +0.0629         +0.0571         18429         19588         39093         41327
Reduction/int8_t_minmax/1000000/manual_time                           +0.0874         +0.0696         18956         20613         39054         41772
Reduction/int8_t_minmax/10000000/manual_time                          +0.0369         +0.0413         37781         39175         55998         58312
Reduction/int8_t_minmax/100000000/manual_time                         -0.0002         +0.0059        149816        149783        168898        169894
Reduction/int32_t_minmax/10000/manual_time                            +0.0813         +0.0613         18423         19921         39271         41676
Reduction/int32_t_minmax/100000/manual_time                           +0.0848         +0.0668         18560         20133         39161         41779
Reduction/int32_t_minmax/1000000/manual_time                          +0.0630         +0.0610         23265         24730         42732         45337
Reduction/int32_t_minmax/10000000/manual_time                         +0.0107         +0.0196         69505         70248         88260         89989
Reduction/int32_t_minmax/100000000/manual_time                        -0.0004         +0.0009        474594        474410        493944        494395
Reduction/timestamp_ms_minmax/10000/manual_time                       +0.0716         +0.0605         19241         20618         40084         42509
Reduction/timestamp_ms_minmax/100000/manual_time                      +0.0418         +0.0513         19638         20459         39825         41867
Reduction/timestamp_ms_minmax/1000000/manual_time                     +0.0559         +0.0635         30346         32042         48851         51954
Reduction/timestamp_ms_minmax/10000000/manual_time                    +0.0030         +0.0025        117085        117437        136688        137033
Reduction/timestamp_ms_minmax/100000000/manual_time                   +0.0010         -0.0037        921557        922453        945924        942449
Reduction/float_minmax/10000/manual_time                              +0.0650         +0.0521         18507         19710         39467         41526
Reduction/float_minmax/100000/manual_time                             +0.0740         +0.0619         18504         19874         39105         41525
Reduction/float_minmax/1000000/manual_time                            +0.0499         +0.0373         23385         24552         43561         45185
Reduction/float_minmax/10000000/manual_time                           +0.0163         +0.0314         69400         70529         88043         90807
Reduction/float_minmax/100000000/manual_time                          +0.0000         +0.0025        474346        474361        493215        494461
```
</details>

It looks like the compression has a minor cost? I would have thought the removal of the `release_assert` would have improved performance. But maybe `reduce` isn't using it?

Authors:
  - Robert Maynard (@robertmaynard)

Approvers:
  - Keith Kraus (@kkraus14)
  - Mark Harris (@harrism)
  - Jake Hemstad (@jrhemstad)

URL: https://github.com/rapidsai/cudf/pull/7583
---
 cpp/cmake/Modules/ConfigureCUDA.cmake              |  3 +++
 cpp/include/cudf/ast/detail/operators.hpp          | 12 ++++++------
 cpp/include/cudf/ast/detail/transform.cuh          |  8 ++++----
 cpp/include/cudf/column/column_device_view.cuh     |  2 +-
 .../cudf/detail/aggregation/aggregation.cuh        |  4 ++--
 .../cudf/detail/aggregation/aggregation.hpp        |  4 ++--
 cpp/include/cudf/detail/gather.cuh                 |  2 +-
 cpp/include/cudf/detail/indexalator.cuh            |  4 ++--
 .../utilities/{release_assert.cuh => assert.cuh}   |  6 +++---
 .../cudf/detail/utilities/hash_functions.cuh       | 14 +++++++-------
 cpp/include/cudf/fixed_point/fixed_point.hpp       |  4 ++--
 cpp/include/cudf/lists/list_device_view.cuh        | 14 +++++++-------
 cpp/include/cudf/table/row_operators.cuh           |  6 +++---
 cpp/include/cudf/utilities/type_dispatcher.hpp     |  4 ++--
 cpp/src/groupby/hash/multi_pass_kernels.cuh        |  4 ++--
 cpp/src/io/parquet/page_data.cu                    |  2 +-
 cpp/src/quantiles/quantiles_util.hpp               |  8 ++++----
 cpp/tests/error/error_handling_test.cu             | 14 +++++++++-----
 18 files changed, 61 insertions(+), 54 deletions(-)
 rename cpp/include/cudf/detail/utilities/{release_assert.cuh => assert.cuh} (87%)
diff --git a/cpp/cmake/Modules/ConfigureCUDA.cmake b/cpp/cmake/Modules/ConfigureCUDA.cmake
index d4be6e65021..142bb84f413 100644
--- a/cpp/cmake/Modules/ConfigureCUDA.cmake
+++ b/cpp/cmake/Modules/ConfigureCUDA.cmake
@@ -46,6 +46,9 @@ if(DISABLE_DEPRECATION_WARNING)
     list(APPEND CUDF_CUDA_FLAGS -Xcompiler=-Wno-deprecated-declarations)
 endif()
 
+# make sure we produce smallest binary size
+list(APPEND CUDF_CUDA_FLAGS -Xfatbin=-compress-all)
+
 # Option to enable line info in CUDA device compilation to allow introspection when profiling / memchecking
 if(CUDA_ENABLE_LINEINFO)
     list(APPEND CUDF_CUDA_FLAGS -lineinfo)
diff --git a/cpp/include/cudf/ast/detail/operators.hpp b/cpp/include/cudf/ast/detail/operators.hpp
index 8ec26cf5eb7..27bcb0d320b 100644
--- a/cpp/include/cudf/ast/detail/operators.hpp
+++ b/cpp/include/cudf/ast/detail/operators.hpp
@@ -187,7 +187,7 @@ CUDA_HOST_DEVICE_CALLABLE constexpr void ast_operator_dispatcher(ast_operator op
 #ifndef __CUDA_ARCH__
       CUDF_FAIL("Invalid operator.");
 #else
-      release_assert(false && "Invalid operator.");
+      cudf_assert(false && "Invalid operator.");
 #endif
       break;
   }
@@ -784,7 +784,7 @@ struct double_dispatch_binary_operator_types {
 #ifndef __CUDA_ARCH__
     CUDF_FAIL("Invalid binary operation.");
 #else
-    release_assert(false && "Invalid binary operation.");
+    cudf_assert(false && "Invalid binary operation.");
 #endif
   }
 };
@@ -819,7 +819,7 @@ struct single_dispatch_binary_operator_types {
 #ifndef __CUDA_ARCH__
     CUDF_FAIL("Invalid binary operation.");
 #else
-    release_assert(false && "Invalid binary operation.");
+    cudf_assert(false && "Invalid binary operation.");
 #endif
   }
 };
@@ -924,7 +924,7 @@ struct dispatch_unary_operator_types {
 #ifndef __CUDA_ARCH__
     CUDF_FAIL("Invalid unary operation.");
 #else
-    release_assert(false && "Invalid unary operation.");
+    cudf_assert(false && "Invalid unary operation.");
 #endif
   }
 };
@@ -996,7 +996,7 @@ struct return_type_functor {
 #ifndef __CUDA_ARCH__
     CUDF_FAIL("Invalid binary operation. Return type cannot be determined.");
 #else
-    release_assert(false && "Invalid binary operation. Return type cannot be determined.");
+    cudf_assert(false && "Invalid binary operation. Return type cannot be determined.");
 #endif
   }
 
@@ -1024,7 +1024,7 @@ struct return_type_functor {
 #ifndef __CUDA_ARCH__
     CUDF_FAIL("Invalid unary operation. Return type cannot be determined.");
 #else
-    release_assert(false && "Invalid unary operation. Return type cannot be determined.");
+    cudf_assert(false && "Invalid unary operation. Return type cannot be determined.");
 #endif
   }
 };
diff --git a/cpp/include/cudf/ast/detail/transform.cuh b/cpp/include/cudf/ast/detail/transform.cuh
index ee08742d871..2719a8b5077 100644
--- a/cpp/include/cudf/ast/detail/transform.cuh
+++ b/cpp/include/cudf/ast/detail/transform.cuh
@@ -87,7 +87,7 @@ struct unary_row_output : public row_output {
                              Input input,
                              detail::device_data_reference output) const
   {
-    release_assert(false && "Invalid unary dispatch operator for the provided input.");
+    cudf_assert(false && "Invalid unary dispatch operator for the provided input.");
   }
 };
 
@@ -116,7 +116,7 @@ struct binary_row_output : public row_output {
                              RHS rhs,
                              detail::device_data_reference output) const
   {
-    release_assert(false && "Invalid binary dispatch operator for the provided input.");
+    cudf_assert(false && "Invalid binary dispatch operator for the provided input.");
   }
 };
 
@@ -239,7 +239,7 @@ struct row_evaluator {
                              detail::device_data_reference rhs,
                              detail::device_data_reference output) const
   {
-    release_assert(false && "Invalid binary dispatch operator for the provided input.");
+    cudf_assert(false && "Invalid binary dispatch operator for the provided input.");
   }
 
  private:
@@ -311,7 +311,7 @@ __device__ void evaluate_row_expression(detail::row_evaluator const& evaluator,
                       output,
                       op);
     } else {
-      release_assert(false && "Invalid operator arity.");
+      cudf_assert(false && "Invalid operator arity.");
     }
   }
 }
diff --git a/cpp/include/cudf/column/column_device_view.cuh b/cpp/include/cudf/column/column_device_view.cuh
index b2f152180b0..5a02f5bbe55 100644
--- a/cpp/include/cudf/column/column_device_view.cuh
+++ b/cpp/include/cudf/column/column_device_view.cuh
@@ -774,7 +774,7 @@ struct index_element_fn {
                                  std::is_unsigned<IndexType>::value)>* = nullptr>
   __device__ size_type operator()(Args&&... args)
   {
-    release_assert(false and "dictionary indices must be an unsigned integral type");
+    cudf_assert(false and "dictionary indices must be an unsigned integral type");
     return 0;
   }
 };
diff --git a/cpp/include/cudf/detail/aggregation/aggregation.cuh b/cpp/include/cudf/detail/aggregation/aggregation.cuh
index 3d006449044..3f5f5a91632 100644
--- a/cpp/include/cudf/detail/aggregation/aggregation.cuh
+++ b/cpp/include/cudf/detail/aggregation/aggregation.cuh
@@ -19,8 +19,8 @@
 #include <cudf/aggregation.hpp>
 #include <cudf/column/column_device_view.cuh>
 #include <cudf/detail/aggregation/aggregation.hpp>
+#include <cudf/detail/utilities/assert.cuh>
 #include <cudf/detail/utilities/device_atomics.cuh>
-#include <cudf/detail/utilities/release_assert.cuh>
 #include <cudf/dictionary/dictionary_column_view.hpp>
 #include <cudf/table/table_device_view.cuh>
 
@@ -103,7 +103,7 @@ struct update_target_element {
                              column_device_view source,
                              size_type source_index) const noexcept
   {
-    release_assert(false and "Invalid source type and aggregation combination.");
+    cudf_assert(false and "Invalid source type and aggregation combination.");
   }
 };
 
diff --git a/cpp/include/cudf/detail/aggregation/aggregation.hpp b/cpp/include/cudf/detail/aggregation/aggregation.hpp
index 1cafad25c9c..1a4847dad12 100644
--- a/cpp/include/cudf/detail/aggregation/aggregation.hpp
+++ b/cpp/include/cudf/detail/aggregation/aggregation.hpp
@@ -17,7 +17,7 @@
 #pragma once
 
 #include <cudf/aggregation.hpp>
-#include <cudf/detail/utilities/release_assert.cuh>
+#include <cudf/detail/utilities/assert.cuh>
 #include <cudf/types.hpp>
 #include <cudf/utilities/error.hpp>
 #include <cudf/utilities/traits.hpp>
@@ -627,7 +627,7 @@ CUDA_HOST_DEVICE_CALLABLE decltype(auto) aggregation_dispatcher(aggregation::Kin
 #ifndef __CUDA_ARCH__
       CUDF_FAIL("Unsupported aggregation.");
 #else
-      release_assert(false && "Unsupported aggregation.");
+      cudf_assert(false && "Unsupported aggregation.");
 
       // The following code will never be reached, but the compiler generates a
       // warning if there isn't a return value.
diff --git a/cpp/include/cudf/detail/gather.cuh b/cpp/include/cudf/detail/gather.cuh
index 87f5c9251c7..73647ac2292 100644
--- a/cpp/include/cudf/detail/gather.cuh
+++ b/cpp/include/cudf/detail/gather.cuh
@@ -18,8 +18,8 @@
 #include <cudf/detail/copy.hpp>
 #include <cudf/detail/indexalator.cuh>
 #include <cudf/detail/null_mask.hpp>
+#include <cudf/detail/utilities/assert.cuh>
 #include <cudf/detail/utilities/cuda.cuh>
-#include <cudf/detail/utilities/release_assert.cuh>
 #include <cudf/detail/valid_if.cuh>
 #include <cudf/dictionary/dictionary_column_view.hpp>
 #include <cudf/dictionary/dictionary_factories.hpp>
diff --git a/cpp/include/cudf/detail/indexalator.cuh b/cpp/include/cudf/detail/indexalator.cuh
index 8568bd68bfd..8bbd0d1aada 100644
--- a/cpp/include/cudf/detail/indexalator.cuh
+++ b/cpp/include/cudf/detail/indexalator.cuh
@@ -268,7 +268,7 @@ struct input_indexalator : base_indexalator<input_indexalator> {
     template <typename T, std::enable_if_t<not is_index_type<T>()>* = nullptr>
     __device__ size_type operator()(void const* tp)
     {
-      release_assert(false and "only index types are supported");
+      cudf_assert(false and "only index types are supported");
       return 0;
     }
   };
@@ -366,7 +366,7 @@ struct output_indexalator : base_indexalator<output_indexalator> {
     template <typename T, std::enable_if_t<not is_index_type<T>()>* = nullptr>
     __device__ void operator()(void* tp, size_type const value)
     {
-      release_assert(false and "only index types are supported");
+      cudf_assert(false and "only index types are supported");
     }
   };
 
diff --git a/cpp/include/cudf/detail/utilities/release_assert.cuh b/cpp/include/cudf/detail/utilities/assert.cuh
similarity index 87%
rename from cpp/include/cudf/detail/utilities/release_assert.cuh
rename to cpp/include/cudf/detail/utilities/assert.cuh
index e0db88d8fcb..69f9e2d3791 100644
--- a/cpp/include/cudf/detail/utilities/release_assert.cuh
+++ b/cpp/include/cudf/detail/utilities/assert.cuh
@@ -27,11 +27,11 @@
  *
  * Relies on the `__PRETTY_FUNCTION__` macro which is specific to GCC and Clang.
  */
-#if defined(__CUDA_ARCH__) && (defined(__clang__) || defined(__GNUC__))
+#if !defined(NDEBUG) && defined(__CUDA_ARCH__) && (defined(__clang__) || defined(__GNUC__))
 #define __ASSERT_STR_HELPER(x) #x
-#define release_assert(e)     \
+#define cudf_assert(e)        \
   ((e) ? static_cast<void>(0) \
        : __assert_fail(__ASSERT_STR_HELPER(e), __FILE__, __LINE__, __PRETTY_FUNCTION__))
 #else
-#define release_assert(e) (static_cast<void>(0))
+#define cudf_assert(e) (static_cast<void>(0))
 #endif
diff --git a/cpp/include/cudf/detail/utilities/hash_functions.cuh b/cpp/include/cudf/detail/utilities/hash_functions.cuh
index e9d66d125dd..a2e7d6d4aae 100644
--- a/cpp/include/cudf/detail/utilities/hash_functions.cuh
+++ b/cpp/include/cudf/detail/utilities/hash_functions.cuh
@@ -17,7 +17,7 @@
 #pragma once
 
 #include <cudf/column/column_device_view.cuh>
-#include <cudf/detail/utilities/release_assert.cuh>
+#include <cudf/detail/utilities/assert.cuh>
 #include <cudf/strings/string_view.cuh>
 #include <hash/hash_constants.hpp>
 
@@ -154,7 +154,7 @@ struct MD5ListHasher {
                              size_type offset_end,
                              md5_intermediate_data* hash_state) const
   {
-    release_assert(false && "MD5 Unsupported chrono type column");
+    cudf_assert(false && "MD5 Unsupported chrono type column");
   }
 
   template <typename T, std::enable_if_t<!is_fixed_width<T>()>* = nullptr>
@@ -163,7 +163,7 @@ struct MD5ListHasher {
                              size_type offset_end,
                              md5_intermediate_data* hash_state) const
   {
-    release_assert(false && "MD5 Unsupported non-fixed-width type column");
+    cudf_assert(false && "MD5 Unsupported non-fixed-width type column");
   }
 
   template <typename T, std::enable_if_t<is_floating_point<T>()>* = nullptr>
@@ -273,7 +273,7 @@ struct MD5Hash {
                              size_type row_index,
                              md5_intermediate_data* hash_state) const
   {
-    release_assert(false && "MD5 Unsupported chrono type column");
+    cudf_assert(false && "MD5 Unsupported chrono type column");
   }
 
   template <typename T, std::enable_if_t<!is_fixed_width<T>()>* = nullptr>
@@ -281,7 +281,7 @@ struct MD5Hash {
                              size_type row_index,
                              md5_intermediate_data* hash_state) const
   {
-    release_assert(false && "MD5 Unsupported non-fixed-width type column");
+    cudf_assert(false && "MD5 Unsupported non-fixed-width type column");
   }
 
   template <typename T, std::enable_if_t<is_floating_point<T>()>* = nullptr>
@@ -344,7 +344,7 @@ void CUDA_DEVICE_CALLABLE MD5Hash::operator()<list_view>(column_device_view col,
   column_device_view offsets = col.child(offsets_column_index);
   column_device_view data    = col.child(data_column_index);
 
-  if (data.type().id() == type_id::LIST) release_assert(false && "Nested list unsupported");
+  if (data.type().id() == type_id::LIST) cudf_assert(false && "Nested list unsupported");
 
   cudf::type_dispatcher(data.type(),
                         MD5ListHasher{},
@@ -724,7 +724,7 @@ struct IdentityHash {
   CUDA_HOST_DEVICE_CALLABLE std::enable_if_t<!std::is_arithmetic<Key>::value, return_type>
   operator()(Key const& key) const
   {
-    release_assert(false && "IdentityHash does not support this data type");
+    cudf_assert(false && "IdentityHash does not support this data type");
     return 0;
   }
 
diff --git a/cpp/include/cudf/fixed_point/fixed_point.hpp b/cpp/include/cudf/fixed_point/fixed_point.hpp
index 8f8e2b7394c..eb752a8a0ea 100644
--- a/cpp/include/cudf/fixed_point/fixed_point.hpp
+++ b/cpp/include/cudf/fixed_point/fixed_point.hpp
@@ -16,7 +16,7 @@
 
 #pragma once
 
-#include <cudf/detail/utilities/release_assert.cuh>
+#include <cudf/detail/utilities/assert.cuh>
 #include <cudf/types.hpp>
 
 // Note: The <cuda/std/*> versions are used in order for Jitify to work with our fixed_point type.
@@ -91,7 +91,7 @@ template <typename Rep,
                                            is_supported_representation_type<Rep>())>* = nullptr>
 CUDA_HOST_DEVICE_CALLABLE Rep ipow(T exponent)
 {
-  release_assert(exponent >= 0 && "integer exponentiation with negative exponent is not possible.");
+  cudf_assert(exponent >= 0 && "integer exponentiation with negative exponent is not possible.");
   if (exponent == 0) return static_cast<Rep>(1);
   auto extra  = static_cast<Rep>(1);
   auto square = static_cast<Rep>(Base);
diff --git a/cpp/include/cudf/lists/list_device_view.cuh b/cpp/include/cudf/lists/list_device_view.cuh
index 3afafe9d1fa..4f207474526 100644
--- a/cpp/include/cudf/lists/list_device_view.cuh
+++ b/cpp/include/cudf/lists/list_device_view.cuh
@@ -37,12 +37,12 @@ class list_device_view {
     : lists_column(lists_column), _row_index(row_index)
   {
     column_device_view const& offsets = lists_column.offsets();
-    release_assert(row_index >= 0 && row_index < lists_column.size() &&
-                   row_index < offsets.size() && "row_index out of bounds");
+    cudf_assert(row_index >= 0 && row_index < lists_column.size() && row_index < offsets.size() &&
+                "row_index out of bounds");
 
     begin_offset = offsets.element<size_type>(row_index);
-    release_assert(begin_offset >= 0 && begin_offset <= lists_column.child().size() &&
-                   "begin_offset out of bounds.");
+    cudf_assert(begin_offset >= 0 && begin_offset <= lists_column.child().size() &&
+                "begin_offset out of bounds.");
     _size = offsets.element<size_type>(row_index + 1) - begin_offset;
   }
 
@@ -71,7 +71,7 @@ class list_device_view {
    */
   CUDA_DEVICE_CALLABLE size_type element_offset(size_type idx) const
   {
-    release_assert(idx >= 0 && idx < size() && "idx out of bounds");
+    cudf_assert(idx >= 0 && idx < size() && "idx out of bounds");
     return begin_offset + idx;
   }
 
@@ -93,7 +93,7 @@ class list_device_view {
    */
   CUDA_DEVICE_CALLABLE bool is_null(size_type idx) const
   {
-    release_assert(idx >= 0 && idx < size() && "Index out of bounds.");
+    cudf_assert(idx >= 0 && idx < size() && "Index out of bounds.");
     auto element_offset = begin_offset + idx;
     return lists_column.child().is_null(element_offset);
   }
@@ -294,7 +294,7 @@ struct list_size_functor {
   CUDA_HOST_DEVICE_CALLABLE list_size_functor(column_device_view const& d_col) : d_column(d_col)
   {
 #if defined(__CUDA_ARCH__)
-    release_assert(d_col.type().id() == type_id::LIST && "Only list type column is supported");
+    cudf_assert(d_col.type().id() == type_id::LIST && "Only list type column is supported");
 #else
     CUDF_EXPECTS(d_col.type().id() == type_id::LIST, "Only list type column is supported");
 #endif
diff --git a/cpp/include/cudf/table/row_operators.cuh b/cpp/include/cudf/table/row_operators.cuh
index d9840e78be2..04d215ff7cb 100644
--- a/cpp/include/cudf/table/row_operators.cuh
+++ b/cpp/include/cudf/table/row_operators.cuh
@@ -17,8 +17,8 @@
 #pragma once
 
 #include <cudf/column/column_device_view.cuh>
+#include <cudf/detail/utilities/assert.cuh>
 #include <cudf/detail/utilities/hash_functions.cuh>
-#include <cudf/detail/utilities/release_assert.cuh>
 #include <cudf/sorting.hpp>
 #include <cudf/table/table_device_view.cuh>
 #include <cudf/utilities/traits.hpp>
@@ -190,7 +190,7 @@ class element_equality_comparator {
             std::enable_if_t<not cudf::is_equality_comparable<Element, Element>()>* = nullptr>
   __device__ bool operator()(size_type lhs_element_index, size_type rhs_element_index)
   {
-    release_assert(false && "Attempted to compare elements of uncomparable types.");
+    cudf_assert(false && "Attempted to compare elements of uncomparable types.");
     return false;
   }
 
@@ -291,7 +291,7 @@ class element_relational_comparator {
             std::enable_if_t<not cudf::is_relationally_comparable<Element, Element>()>* = nullptr>
   __device__ weak_ordering operator()(size_type lhs_element_index, size_type rhs_element_index)
   {
-    release_assert(false && "Attempted to compare elements of uncomparable types.");
+    cudf_assert(false && "Attempted to compare elements of uncomparable types.");
     return weak_ordering::LESS;
   }
 
diff --git a/cpp/include/cudf/utilities/type_dispatcher.hpp b/cpp/include/cudf/utilities/type_dispatcher.hpp
index 26c51d0435a..bd9ea015a32 100644
--- a/cpp/include/cudf/utilities/type_dispatcher.hpp
+++ b/cpp/include/cudf/utilities/type_dispatcher.hpp
@@ -16,7 +16,7 @@
 
 #pragma once
 
-#include <cudf/detail/utilities/release_assert.cuh>
+#include <cudf/detail/utilities/assert.cuh>
 #include <cudf/fixed_point/fixed_point.hpp>
 #include <cudf/types.hpp>
 #include <cudf/utilities/error.hpp>
@@ -501,7 +501,7 @@ CUDA_HOST_DEVICE_CALLABLE constexpr decltype(auto) type_dispatcher(cudf::data_ty
 #ifndef __CUDA_ARCH__
       CUDF_FAIL("Unsupported type_id.");
 #else
-      release_assert(false && "Unsupported type_id.");
+      cudf_assert(false && "Unsupported type_id.");
 
       // The following code will never be reached, but the compiler generates a
       // warning if there isn't a return value.
diff --git a/cpp/src/groupby/hash/multi_pass_kernels.cuh b/cpp/src/groupby/hash/multi_pass_kernels.cuh
index a491b50478a..24de22705a9 100644
--- a/cpp/src/groupby/hash/multi_pass_kernels.cuh
+++ b/cpp/src/groupby/hash/multi_pass_kernels.cuh
@@ -20,8 +20,8 @@
 #include <cudf/aggregation.hpp>
 #include <cudf/column/column_device_view.cuh>
 #include <cudf/detail/aggregation/aggregation.hpp>
+#include <cudf/detail/utilities/assert.cuh>
 #include <cudf/detail/utilities/device_atomics.cuh>
-#include <cudf/detail/utilities/release_assert.cuh>
 #include <cudf/table/table_device_view.cuh>
 #include <cudf/utilities/type_dispatcher.hpp>
 
@@ -65,7 +65,7 @@ struct var_hash_functor {
                                                                   size_type source_index,
                                                                   size_type target_index) noexcept
   {
-    release_assert(false and "Invalid source type for std, var aggregation combination.");
+    cudf_assert(false and "Invalid source type for std, var aggregation combination.");
   }
 
   template <typename Source>
diff --git a/cpp/src/io/parquet/page_data.cu b/cpp/src/io/parquet/page_data.cu
index 6e8937607b9..437d4b56e22 100644
--- a/cpp/src/io/parquet/page_data.cu
+++ b/cpp/src/io/parquet/page_data.cu
@@ -18,7 +18,7 @@
 #include <io/utilities/block_utils.cuh>
 #include <io/utilities/column_buffer.hpp>
 
-#include <cudf/detail/utilities/release_assert.cuh>
+#include <cudf/detail/utilities/assert.cuh>
 #include <cudf/utilities/bit.hpp>
 
 #include <rmm/cuda_stream_view.hpp>
diff --git a/cpp/src/quantiles/quantiles_util.hpp b/cpp/src/quantiles/quantiles_util.hpp
index 67323126751..1df0a4ab41a 100644
--- a/cpp/src/quantiles/quantiles_util.hpp
+++ b/cpp/src/quantiles/quantiles_util.hpp
@@ -15,7 +15,7 @@
  */
 
 #include <cmath>
-#include <cudf/detail/utilities/release_assert.cuh>
+#include <cudf/detail/utilities/assert.cuh>
 #include <cudf/types.hpp>
 #include <cudf/utilities/error.hpp>
 
@@ -144,7 +144,7 @@ select_quantile(ValueAccessor get_value, size_type size, double q, interpolation
 
     default:
 #if defined(__CUDA_ARCH__)
-      release_assert(false && "Invalid interpolation operation for quantiles");
+      cudf_assert(false && "Invalid interpolation operation for quantiles");
       return Result();
 #else
       CUDF_FAIL("Invalid interpolation operation for quantiles.");
@@ -173,7 +173,7 @@ select_quantile_data(Iterator begin, size_type size, double q, interpolation int
   }
 
 #if defined(__CUDA_ARCH__)
-  release_assert(false && "Invalid interpolation operation for quantiles");
+  cudf_assert(false && "Invalid interpolation operation for quantiles");
   return Result();
 #else
   CUDF_FAIL("Invalid interpolation operation for quantiles.");
@@ -200,7 +200,7 @@ CUDA_HOST_DEVICE_CALLABLE bool select_quantile_validity(Iterator begin,
   }
 
 #if defined(__CUDA_ARCH__)
-  release_assert(false && "Invalid interpolation operation for quantiles");
+  cudf_assert(false && "Invalid interpolation operation for quantiles");
   return false;
 #else
   CUDF_FAIL("Invalid interpolation operation for quantiles.");
diff --git a/cpp/tests/error/error_handling_test.cu b/cpp/tests/error/error_handling_test.cu
index debf540ea8e..da9509e94a6 100644
--- a/cpp/tests/error/error_handling_test.cu
+++ b/cpp/tests/error/error_handling_test.cu
@@ -83,11 +83,13 @@ TEST(StreamCheck, CatchFailedKernel)
                             "invalid configuration argument");
 }
 
-__global__ void assert_false_kernel() { release_assert(false && "this kernel should die"); }
+#ifndef NDEBUG
+
+__global__ void assert_false_kernel() { cudf_assert(false && "this kernel should die"); }
 
-__global__ void assert_true_kernel() { release_assert(true && "this kernel should live"); }
+__global__ void assert_true_kernel() { cudf_assert(true && "this kernel should live"); }
 
-TEST(ReleaseAssertDeathTest, release_assert_false)
+TEST(DebugAssertDeathTest, cudf_assert_false)
 {
   testing::FLAGS_gtest_death_test_style = "threadsafe";
 
@@ -100,19 +102,21 @@ TEST(ReleaseAssertDeathTest, release_assert_false)
     // each attempted kernel launch
     if (cudaErrorAssert == cudaDeviceSynchronize()) { std::abort(); }
 
-    // If we reach this point, the release_assert didn't work so we exit normally, which will cause
+    // If we reach this point, the cudf_assert didn't work so we exit normally, which will cause
     // EXPECT_DEATH to fail.
   };
 
   EXPECT_DEATH(call_kernel(), "this kernel should die");
 }
 
-TEST(ReleaseAssert, release_assert_true)
+TEST(DebugAssert, cudf_assert_true)
 {
   assert_true_kernel<<<1, 1>>>();
   ASSERT_EQ(cudaSuccess, cudaDeviceSynchronize());
 }
 
+#endif
+
 // These tests don't use CUDF_TEST_PROGRAM_MAIN because :
 // 1.) They don't need the RMM Pool
 // 2.) The RMM Pool interferes with the death test