-
Notifications
You must be signed in to change notification settings - Fork 917
/
DEVELOPER_GUIDE.md
1513 lines (1169 loc) · 69.5 KB
/
DEVELOPER_GUIDE.md
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# libcudf C++ Developer Guide {#DEVELOPER_GUIDE}
This document serves as a guide for contributors to libcudf C++ code. Developers should also refer
to these additional files for further documentation of libcudf best practices.
* [Documentation Guide](DOCUMENTATION.md) for guidelines on documenting libcudf code.
* [Testing Guide](TESTING.md) for guidelines on writing unit tests.
* [Benchmarking Guide](BENCHMARKING.md) for guidelines on writing unit benchmarks.
# Overview
libcudf is a C++ library that provides GPU-accelerated data-parallel algorithms for processing
column-oriented tabular data. libcudf provides algorithms including slicing, filtering, sorting,
various types of aggregations, and database-type operations such as grouping and joins. libcudf
serves a number of clients via multiple language interfaces, including Python and Java. Users may
also use libcudf directly from C++ code.
## Lexicon
This section defines terminology used within libcudf.
### Column
A column is an array of data of a single type. Along with Tables, columns are the fundamental data
structures used in libcudf. Most libcudf algorithms operate on columns. Columns may have a validity
mask representing whether each element is valid or null (invalid). Columns of nested types are
supported, meaning that a column may have child columns. A column is the C++ equivalent to a cuDF
Python [Series](https://docs.rapids.ai/api/cudf/stable/api_docs/series.html).
### Element
An individual data item within a column. Also known as a row.
### Scalar
A type representing a single element of a data type.
### Table
A table is a collection of columns with equal number of elements. A table is the C++ equivalent to
a cuDF Python [DataFrame](https://docs.rapids.ai/api/cudf/stable/api_docs/dataframe.html).
### View
A view is a non-owning object that provides zero-copy access (possibly with slicing or offsets) to
data owned by another object. Examples are column views and table views.
# Directory Structure and File Naming
External/public libcudf APIs are grouped based on functionality into an appropriately titled
header file in `cudf/cpp/include/cudf/`. For example, `cudf/cpp/include/cudf/copying.hpp`
contains the APIs for functions related to copying from one column to another. Note the `.hpp`
file extension used to indicate a C++ header file.
External/public libcudf C++ API header files need to mark all symbols inside of them with `CUDF_EXPORT`.
This is done by placing the macro on the `namespace cudf` as seen below. Markup on namespace
require them not to be nested, so the `cudf` namespace must be kept by itself.
```c++
#pragma once
namespace CUDF_EXPORT cudf {
namespace lists {
...
} // namespace lists
} // namespace CUDF_EXPORT cudf
```
The naming of external API headers should be consistent with the name of the folder that contains
the source files that implement the API. For example, the implementation of the APIs found in
`cudf/cpp/include/cudf/copying.hpp` are located in `cudf/src/copying`. Likewise, the unit tests for
the APIs reside in `cudf/tests/copying/`.
Internal API headers containing `detail` namespace definitions that are either used across translation
units inside libcudf should be placed in `include/cudf/detail`. Just like the public C++ API headers, any
internal C++ API header requires `CUDF_EXPORT` markup on the `cudf` namespace so that the functions can be tested.
All headers in cudf should use `#pragma once` for include guards.
## File extensions
- `.hpp` : C++ header files
- `.cpp` : C++ source files
- `.cu` : CUDA C++ source files
- `.cuh` : Headers containing CUDA device code
Only use `.cu` and `.cuh` if necessary. A good indicator is the inclusion of `__device__` and other
symbols that are only recognized by `nvcc`. Another indicator is Thrust algorithm APIs with a device
execution policy (always `rmm::exec_policy` in libcudf).
## Code and Documentation Style and Formatting
libcudf code uses [snake_case](https://en.wikipedia.org/wiki/Snake_case) for all names except in a
few cases: template parameters, unit tests and test case names may use Pascal case, aka
[UpperCamelCase](https://en.wikipedia.org/wiki/Camel_case). We do not use
[Hungarian notation](https://en.wikipedia.org/wiki/Hungarian_notation), except sometimes when naming
device data variables and their corresponding host copies. Private member variables are typically
prefixed with an underscore.
```c++
template <typename IteratorType>
void algorithm_function(int x, rmm::cuda_stream_view s, rmm::device_async_resource_ref mr)
{
...
}
class utility_class
{
...
private:
int _rating{};
std::unique_ptr<cudf::column> _column{};
}
TYPED_TEST_SUITE(RepeatTypedTestFixture, cudf::test::FixedWidthTypes);
TYPED_TEST(RepeatTypedTestFixture, RepeatScalarCount)
{
...
}
```
C++ formatting is enforced using `clang-format`. You should configure `clang-format` on your
machine to use the `cudf/cpp/.clang-format` configuration file, and run `clang-format` on all
changed code before committing it. The easiest way to do this is to configure your editor to
"format on save."
Aspects of code style not discussed in this document and not automatically enforceable are typically
caught during code review, or not enforced.
### C++ Guidelines
In general, we recommend following
[C++ Core Guidelines](https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines). We also
recommend watching Sean Parent's [C++ Seasoning talk](https://www.youtube.com/watch?v=W2tWOdzgXHA),
and we try to follow his rules: "No raw loops. No raw pointers. No raw synchronization primitives."
* Prefer algorithms from STL and Thrust to raw loops.
* Prefer libcudf and RMM [owning data structures and views](#libcudf-data-structures) to raw
pointers and raw memory allocation.
* libcudf doesn't have a lot of CPU-thread concurrency, but there is some. And currently libcudf
does use raw synchronization primitives. So we should revisit Parent's third rule and improve
here.
Additional style guidelines for libcudf code:
* Prefer "east const", placing `const` after the type. This is not
automatically enforced by `clang-format` because the option
`QualifierAlignment: Right` has been observed to produce false negatives and
false positives.
* [NL.11: Make Literals
Readable](https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#nl11-make-literals-readable):
Decimal values should use integer separators every thousands place, like
`1'234'567`. Hexadecimal values should use separators every 4 characters,
like `0x0123'ABCD`.
Documentation is discussed in the [Documentation Guide](DOCUMENTATION.md).
### Includes
The following guidelines apply to organizing `#include` lines.
* Group includes by library (e.g. cuDF, RMM, Thrust, STL). `clang-format` will respect the
groupings and sort the individual includes within a group lexicographically.
* Separate groups by a blank line.
* Order the groups from "nearest" to "farthest". In other words, local includes, then includes
from other RAPIDS libraries, then includes from related libraries, like `<thrust/...>`, then
includes from dependencies installed with cuDF, and then standard headers (for example
`<string>`, `<iostream>`).
* We use clang-format for grouping and sorting headers automatically. See the
`cudf/cpp/.clang-format` file for specifics.
* Use `<>` for all includes except for internal headers that are not in the `include`
directory. In other words, if it is a cuDF internal header (e.g. in the `src` or `test`
directory), the path will not start with `cudf` (e.g. `#include <cudf/some_header.hpp>`) so it
should use quotes. Example: `#include "io/utilities/hostdevice_vector.hpp"`.
* `cudf_test` and `nvtext` are separate libraries within the `libcudf` repo. As such, they have
public headers in `include` that should be included with `<>`.
* Tools like `clangd` often auto-insert includes when they can, but they usually get the grouping
and brackets wrong. Correct the usage of quotes or brackets and then run clang-format to correct
the grouping.
* Always check that includes are only necessary for the file in which they are included.
Try to avoid excessive including especially in header files. Double check this when you remove
code.
* Avoid relative paths with `..` when possible. Paths with `..` are necessary when including
(internal) headers from source paths not in the same directory as the including file,
because source paths are not passed with `-I`.
* Avoid including library internal headers from non-internal files. For example, try not to include
headers from libcudf `src` directories in tests or in libcudf public headers. If you find
yourself doing this, start a discussion about moving (parts of) the included internal header
to a public header.
# libcudf Data Structures
Application data in libcudf is contained in Columns and Tables, but there are a variety of other
data structures you will use when developing libcudf code.
## Views and Ownership
Resource ownership is an essential concept in libcudf. In short, an "owning" object owns a
resource (such as device memory). It acquires that resource during construction and releases the
resource in destruction ([RAII](https://en.cppreference.com/w/cpp/language/raii)). A "non-owning"
object does not own resources. Any class in libcudf with the `*_view` suffix is non-owning. For more
detail see the [`libcudf` presentation.](https://docs.google.com/presentation/d/1zKzAtc1AWFKfMhiUlV5yRZxSiPLwsObxMlWRWz_f5hA/edit?usp=sharing)
libcudf functions typically take views as input (`column_view` or `table_view`)
and produce `unique_ptr`s to owning objects as output. For example,
```c++
std::unique_ptr<table> sort(table_view const& input);
```
## Memory Resources
libcudf allocates all device memory via RMM memory resources (MR) or CUDA MRs. Either type
can be passed to libcudf functions via `rmm::device_async_resource_ref` parameters. See the
[RMM documentation](https://github.com/rapidsai/rmm/blob/main/README.md) for details.
### Current Device Memory Resource
RMM provides a "default" memory resource for each device and functions to access and set it. libcudf
provides wrappers for these functions in `cpp/include/cudf/utilities/memory_resource.hpp`.
All memory resource parameters should be defaulted to use the return value of
`cudf::get_current_device_resource_ref()`.
### Resource Refs
Memory resources are passed via resource ref parameters. A resource ref is a memory resource wrapper
that enables consumers to specify properties of resources that they expect. These are defined
in the `cuda::mr` namespace of libcu++, but RMM provides some convenience aliases in
`rmm/resource_ref.hpp`.
- `rmm::device_resource_ref` accepts a memory resource that provides synchronous allocation
of device-accessible memory.
- `rmm::device_async_resource_ref` accepts a memory resource that provides stream-ordered allocation
of device-accessible memory.
- `rmm::host_resource_ref` accepts a memory resource that provides synchronous allocation of host-
accessible memory.
- `rmm::host_async_resource_ref` accepts a memory resource that provides stream-ordered allocation
of host-accessible memory.
- `rmm::host_device_resource_ref` accepts a memory resource that provides synchronous allocation of
host- and device-accessible memory.
- `rmm::host_async_resource_ref` accepts a memory resource that provides stream-ordered allocation
of host- and device-accessible memory.
See the libcu++ [docs on `resource_ref`](https://nvidia.github.io/cccl/libcudacxx/extended_api/memory_resource/resource_ref.html)
for more information.
## cudf::column
`cudf::column` is a core owning data structure in libcudf. Most libcudf public APIs produce either
a `cudf::column` or a `cudf::table` as output. A `column` contains `device_buffer`s which own the
device memory for the elements of a column and an optional null indicator bitmask.
Implicitly convertible to `column_view` and `mutable_column_view`.
Movable and copyable. A copy performs a deep copy of the column's contents, whereas a move moves
the contents from one column to another.
Example:
```c++
cudf::column col{...};
cudf::column copy{col}; // Copies the contents of `col`
cudf::column const moved_to{std::move(col)}; // Moves contents from `col`
column_view v = moved_to; // Implicit conversion to non-owning column_view
// mutable_column_view m = moved_to; // Cannot create mutable view to const column
```
A `column` may have nested (child) columns, depending on the data type of the column. For example,
`LIST`, `STRUCT`, and `STRING` type columns.
### cudf::column_view
`cudf::column_view` is a core non-owning data structure in libcudf. It is an immutable,
non-owning view of device memory as a column. Most libcudf public APIs take views as inputs.
A `column_view` may be a view of a "slice" of a column. For example, it might view rows 75-150 of a
column with 1000 rows. The `size()` of this `column_view` would be `75`, and accessing index `0` of
the view would return the element at index `75` of the owning `column`. Internally, this is
implemented by storing in the view a pointer, an offset, and a size. `column_view::data<T>()`
returns a pointer iterator to `column_view::head<T>() + offset`.
### cudf::mutable_column_view
A *mutable*, non-owning view of device memory as a column. Used for detail APIs and (rare) public
APIs that modify columns in place.
### cudf::column_device_view
An immutable, non-owning view of device data as a column of elements that is trivially copyable and
usable in CUDA device code. Used to pass `column_view` data as input to CUDA kernels and device
functions (including Thrust algorithms)
### cudf::mutable_column_device_view
A mutable, non-owning view of device data as a column of elements that is trivially copyable and
usable in CUDA device code. Used to pass `column_view` data to be modified on the device by CUDA
kernels and device functions (including Thrust algorithms).
## cudf::table
Owning class for a set of `cudf::column`s all with equal number of elements. This is the C++
equivalent to a data frame.
Implicitly convertible to `cudf::table_view` and `cudf::mutable_table_view`
Movable and copyable. A copy performs a deep copy of all columns, whereas a move moves all columns
from one table to another.
### cudf::table_view
An *immutable*, non-owning view of a table.
### cudf::mutable_table_view
A *mutable*, non-owning view of a table.
## cudf::size_type
The `cudf::size_type` is the type used for the number of elements in a column, offsets to elements
within a column, indices to address specific elements, segments for subsets of column elements, etc.
It is equivalent to a signed, 32-bit integer type and therefore has a maximum value of 2147483647.
Some APIs also accept negative index values and those functions support a minimum value of
-2147483648. This fundamental type also influences output values not just for column size limits
but for counting elements as well.
## Spans
libcudf provides `span` classes that mimic C++20 `std::span`, which is a lightweight
view of a contiguous sequence of objects. libcudf provides two classes, `host_span` and
`device_span`, which can be constructed from multiple container types, or from a pointer
(host or device, respectively) and size, or from iterators. `span` types are useful for defining
generic (internal) interfaces which work with multiple input container types. `device_span` can be
constructed from `thrust::device_vector`, `rmm::device_vector`, or `rmm::device_uvector`.
`host_span` can be constructed from `thrust::host_vector`, `std::vector`, or `std::basic_string`.
If you are defining internal (detail) functions that operate on vectors, use spans for the input
vector parameters rather than a specific vector type, to make your functions more widely applicable.
When a `span` refers to immutable elements, use `span<T const>`, not `span<T> const`. Since a span
is lightweight view, it does not propagate `const`-ness. Therefore, `const` should be applied to
the template type parameter, not to the `span` itself. Also, `span` should be passed by value
because it is a lightweight view. APIS in libcudf that take spans as input will look like the
following function that copies device data to a host `std::vector`.
```c++
template <typename T>
std::vector<T> make_std_vector_async(device_span<T const> v, rmm::cuda_stream_view stream)
```
## cudf::scalar
A `cudf::scalar` is an object that can represent a singular, nullable value of any of the types
currently supported by cudf. Each type of value is represented by a separate type of scalar class
which are all derived from `cudf::scalar`. e.g. A `numeric_scalar` holds a single numerical value,
a `string_scalar` holds a single string. The data for the stored value resides in device memory.
A `list_scalar` holds the underlying data of a single list. This means the underlying data can be
any type that cudf supports. For example, a `list_scalar` representing a list of integers stores a
`cudf::column` of type `INT32`. A `list_scalar` representing a list of lists of integers stores a
`cudf::column` of type `LIST`, which in turn stores a column of type `INT32`.
|Value type|Scalar class|Notes|
|-|-|-|
|fixed-width|`fixed_width_scalar<T>`| `T` can be any fixed-width type|
|numeric|`numeric_scalar<T>` | `T` can be `int8_t`, `int16_t`, `int32_t`, `int_64_t`, `float` or `double`|
|fixed-point|`fixed_point_scalar<T>` | `T` can be `numeric::decimal32` or `numeric::decimal64`|
|timestamp|`timestamp_scalar<T>` | `T` can be `timestamp_D`, `timestamp_s`, etc.|
|duration|`duration_scalar<T>` | `T` can be `duration_D`, `duration_s`, etc.|
|string|`string_scalar`| This class object is immutable|
|list|`list_scalar`| Underlying data can be any type supported by cudf |
### Construction
`scalar`s can be created using either their respective constructors or using factory functions like
`make_numeric_scalar()`, `make_timestamp_scalar()` or `make_string_scalar()`.
### Casting
All the factory methods return a `unique_ptr<scalar>` which needs to be statically downcasted to
its respective scalar class type before accessing its value. Their validity (nullness) can be
accessed without casting. Generally, the value needs to be accessed from a function that is aware
of the value type e.g. a functor that is dispatched from `type_dispatcher`. To cast to the
requisite scalar class type given the value type, use the mapping utility `scalar_type_t` provided
in `type_dispatcher.hpp` :
```c++
//unique_ptr<scalar> s = make_numeric_scalar(...);
using ScalarType = cudf::scalar_type_t<T>;
// ScalarType is now numeric_scalar<T>
auto s1 = static_cast<ScalarType *>(s.get());
```
### Passing to device
Each scalar type, except `list_scalar`, has a corresponding non-owning device view class which
allows access to the value and its validity from the device. This can be obtained using the function
`get_scalar_device_view(ScalarType s)`. Note that a device view is not provided for a base scalar
object, only for the derived typed scalar class objects.
The underlying data for `list_scalar` can be accessed via `view()` method. For non-nested data,
the device view can be obtained via function `column_device_view::create(column_view)`. For nested
data, a specialized device view for list columns can be constructed via
`lists_column_device_view(column_device_view)`.
# libcudf Policies and Design Principles
`libcudf` is designed to provide thread-safe, single-GPU accelerated algorithm primitives for
solving a wide variety of problems that arise in data science. APIs are written to execute on the
default GPU, which can be controlled by the caller through standard CUDA device APIs or environment
variables like `CUDA_VISIBLE_DEVICES`. Our goal is to enable diverse use cases like Spark or Pandas
to benefit from the performance of GPUs, and libcudf relies on these higher-level layers like Spark
or Dask to orchestrate multi-GPU tasks.
To best satisfy these use-cases, libcudf prioritizes performance and flexibility, which sometimes
may come at the cost of convenience. While we welcome users to use libcudf directly, we design with
the expectation that most users will be consuming libcudf through higher-level layers like Spark or
cuDF Python that handle some of details that direct users of libcudf must handle on their own. We
document these policies and the reasons behind them here.
## libcudf does not introspect data
libcudf APIs generally do not perform deep introspection and validation of input data.
There are numerous reasons for this:
1. It violates the single responsibility principle: validation is separate from execution.
2. Since libcudf data structures store data on the GPU, any validation incurs _at minimum_ the
overhead of a kernel launch, and may in general be prohibitively expensive.
3. API promises around data introspection often significantly complicate implementation.
Users are therefore responsible for passing valid data into such APIs.
_Note that this policy does not mean that libcudf performs no validation whatsoever_.
libcudf APIs should still perform any validation that does not require introspection.
To give some idea of what should or should not be validated, here are (non-exhaustive) lists of
examples.
**Things that libcudf should validate**:
- Input column/table sizes or data types
**Things that libcudf should not validate**:
- Integer overflow
- Ensuring that outputs will not exceed the [2GB size](#cudfsize_type) limit for a given set of
inputs
## libcudf expects nested types to have sanitized null masks
Various libcudf APIs accepting columns of nested data types (such as `LIST` or `STRUCT`) may assume
that these columns have been sanitized. In this context, sanitization refers to ensuring that the
null elements in a column with a nested dtype are compatible with the elements of nested columns.
Specifically:
- Null elements of list columns should also be empty. The starting offset of a null element should
be equal to the ending offset.
- Null elements of struct columns should also be null elements in the underlying structs.
- For compound columns, nulls should only be present at the level of the parent column. Child
columns should not contain nulls.
- Slice operations on nested columns do not propagate offsets to child columns.
libcudf APIs _should_ promise to never return "dirty" columns, i.e. columns containing unsanitized
data. Therefore, the only problem is if users construct input columns that are not correctly
sanitized and then pass those into libcudf APIs.
## Treat libcudf APIs as if they were asynchronous
libcudf APIs called on the host do not guarantee that the stream is synchronized before returning.
Work in libcudf occurs on `cudf::get_default_stream().value`, which defaults to the CUDA default
stream (stream 0). Note that the stream 0 behavior differs if [per-thread default stream is
enabled](https://docs.nvidia.com/cuda/cuda-runtime-api/stream-sync-behavior.html) via
`CUDF_USE_PER_THREAD_DEFAULT_STREAM`. Any data provided to or returned by libcudf that uses a
separate non-blocking stream requires synchronization with the default libcudf stream to ensure
stream safety.
## libcudf generally does not make ordering guarantees
Functions like merge or groupby in libcudf make no guarantees about the order of entries in the
output. Promising deterministic ordering is not, in general, conducive to fast parallel algorithms.
Calling code is responsible for performing sorts after the fact if sorted outputs are needed.
## libcudf does not promise specific exception messages
libcudf documents the exceptions that will be thrown by an API for different kinds of invalid
inputs. The types of those exceptions (e.g. `cudf::logic_error`) are part of the public API.
However, the explanatory string returned by the `what` method of those exceptions is not part of the
API and is subject to change. Calling code should not rely on the contents of libcudf error
messages to determine the nature of the error. For information on the types of exceptions that
libcudf throws under different circumstances, see the [section on error handling](#errors).
# libcudf API and Implementation
## Streams {#streams}
libcudf is in the process of adding support for asynchronous execution using
CUDA streams. In order to facilitate the usage of streams, all new libcudf APIs
that allocate device memory or execute a kernel should accept an
`rmm::cuda_stream_view` parameter at the end with a default value of
`cudf::get_default_stream()`. There is one exception to this rule: if the API
also accepts a memory resource parameter, the stream parameter should be placed
just *before* the memory resource. This API should then forward the call to a
corresponding `detail` API with an identical signature, except that the
`detail` API should not have a default parameter for the stream ([detail APIs
should always avoid default parameters](#default-parameters)). The
implementation should be wholly contained in the `detail` API definition and
use only asynchronous versions of CUDA APIs with the stream parameter.
In order to make the `detail` API callable from other libcudf functions, it should be exposed in a
header placed in the `cudf/cpp/include/detail/` directory.
The declaration is not necessary if no other libcudf functions call the `detail` function.
For example:
```c++
// cpp/include/cudf/header.hpp
void external_function(...,
rmm::cuda_stream_view stream = cudf::get_default_stream(),
rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref());
// cpp/include/cudf/detail/header.hpp
namespace detail{
void external_function(..., rmm::cuda_stream_view stream, rmm::device_async_resource_ref mr)
} // namespace detail
// cudf/src/implementation.cpp
namespace detail{
// Use the stream parameter in the detail implementation.
void external_function(..., rmm::cuda_stream_view stream, rmm::device_async_resource_ref mr){
// Implementation uses the stream with async APIs.
rmm::device_buffer buff(..., stream, mr);
CUDF_CUDA_TRY(cudaMemcpyAsync(...,stream.value()));
kernel<<<..., stream>>>(...);
thrust::algorithm(rmm::exec_policy(stream), ...);
}
} // namespace detail
void external_function(..., rmm::cuda_stream_view stream, rmm::device_async_resource_ref mr)
{
CUDF_FUNC_RANGE(); // Generates an NVTX range for the lifetime of this function.
detail::external_function(..., stream, mr);
}
```
**Note:** It is important to synchronize the stream if *and only if* it is necessary. For example,
when a non-pointer value is returned from the API that is the result of an asynchronous
device-to-host copy, the stream used for the copy should be synchronized before returning. However,
when a column is returned, the stream should not be synchronized because doing so will break
asynchrony.
**Note:** `cudaDeviceSynchronize()` should *never* be used.
This limits the ability to do any multi-stream/multi-threaded work with libcudf APIs.
### Stream Creation
There may be times in implementing libcudf features where it would be advantageous to use streams
*internally*, i.e., to accomplish overlap in implementing an algorithm. However, dynamically
creating a stream can be expensive. RMM has a stream pool class to help avoid dynamic stream
creation. However, this is not yet exposed in libcudf, so for the time being, libcudf features
should avoid creating streams (even if it is slightly less efficient). It is a good idea to leave a
`// TODO:` note indicating where using a stream would be beneficial.
## Memory Allocation
Device [memory resources](#rmmdevice_memory_resource) are used in libcudf to abstract and control
how device memory is allocated.
### Output Memory
Any libcudf API that allocates memory that is *returned* to a user must accept a
`rmm::device_async_resource_ref` as the last parameter. Inside the API, this memory resource must
be used to allocate any memory for returned objects. It should therefore be passed into functions
whose outputs will be returned. Example:
```c++
// Returned `column` contains newly allocated memory,
// therefore the API must accept a memory resource pointer
std::unique_ptr<column> returns_output_memory(
..., rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref());
// This API does not allocate any new *output* memory, therefore
// a memory resource is unnecessary
void does_not_allocate_output_memory(...);
```
This rule automatically applies to all detail APIs that allocate memory. Any detail API may be
called by any public API, and therefore could be allocating memory that is returned to the user.
To support such uses cases, all detail APIs allocating memory resources should accept an `mr`
parameter. Callers are responsible for either passing through a provided `mr` or
`cudf::get_current_device_resource_ref()` as needed.
### Temporary Memory
Not all memory allocated within a libcudf API is returned to the caller. Often algorithms must
allocate temporary, scratch memory for intermediate results. Always use the default resource
obtained from `cudf::get_current_device_resource_ref()` for temporary memory allocations. Example:
```c++
rmm::device_buffer some_function(
..., rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref()) {
rmm::device_buffer returned_buffer(..., mr); // Returned buffer uses the passed in MR
...
rmm::device_buffer temporary_buffer(...); // Temporary buffer uses default MR
...
return returned_buffer;
}
```
### Memory Management
libcudf code generally eschews raw pointers and direct memory allocation. Use RMM classes built to
use memory resources for device memory allocation with automated lifetime management.
#### rmm::device_buffer
Allocates a specified number of bytes of untyped, uninitialized device memory using a
memory resource. If no `rmm::device_async_resource_ref` is explicitly provided, it uses
`cudf::get_current_device_resource_ref()`.
`rmm::device_buffer` is movable and copyable on a stream. A copy performs a deep copy of the
`device_buffer`'s device memory on the specified stream, whereas a move moves ownership of the
device memory from one `device_buffer` to another.
```c++
// Allocates at least 100 bytes of uninitialized device memory
// using the specified resource and stream
rmm::device_buffer buff(100, stream, mr);
void * raw_data = buff.data(); // Raw pointer to underlying device memory
// Deep copies `buff` into `copy` on `stream`
rmm::device_buffer copy(buff, stream);
// Moves contents of `buff` into `moved_to`
rmm::device_buffer moved_to(std::move(buff));
custom_memory_resource *mr...;
// Allocates 100 bytes from the custom_memory_resource
rmm::device_buffer custom_buff(100, mr, stream);
```
#### rmm::device_scalar<T>
Allocates a single element of the specified type initialized to the specified value. Use this for
scalar input/outputs into device kernels, e.g., reduction results, null count, etc. This is
effectively a convenience wrapper around a `rmm::device_vector<T>` of length 1.
```c++
// Allocates device memory for a single int using the specified resource and stream
// and initializes the value to 42
rmm::device_scalar<int> int_scalar{42, stream, mr};
// scalar.data() returns pointer to value in device memory
kernel<<<...>>>(int_scalar.data(),...);
// scalar.value() synchronizes the scalar's stream and copies the
// value from device to host and returns the value
int host_value = int_scalar.value();
```
#### rmm::device_vector<T>
Allocates a specified number of elements of the specified type. If no initialization value is
provided, all elements are default initialized (this incurs a kernel launch).
**Note**: We have removed all usage of `rmm::device_vector` and `thrust::device_vector` from
libcudf, and you should not use it in new code in libcudf without careful consideration. Instead,
use `rmm::device_uvector` along with the utility factories in `device_factories.hpp`. These
utilities enable creation of `uvector`s from host-side vectors, or creating zero-initialized
`uvector`s, so that they are as convenient to use as `device_vector`. Avoiding `device_vector` has
a number of benefits, as described in the following section on `rmm::device_uvector`.
#### rmm::device_uvector<T>
Similar to a `device_vector`, allocates a contiguous set of elements in device memory but with key
differences:
- As an optimization, elements are uninitialized and no synchronization occurs at construction.
This limits the types `T` to trivially copyable types.
- All operations are stream ordered (i.e., they accept a `cuda_stream_view` specifying the stream
on which the operation is performed). This improves safety when using non-default streams.
- `device_uvector.hpp` does not include any `__device__` code, unlike `thrust/device_vector.hpp`,
which means `device_uvector`s can be used in `.cpp` files, rather than just in `.cu` files.
```c++
cuda_stream s;
// Allocates uninitialized storage for 100 `int32_t` elements on stream `s` using the
// default resource
rmm::device_uvector<int32_t> v(100, s);
// Initializes the elements to 0
thrust::uninitialized_fill(thrust::cuda::par.on(s.value()), v.begin(), v.end(), int32_t{0});
auto mr = new my_custom_resource{...};
// Allocates uninitialized storage for 100 `int32_t` elements on stream `s` using the resource `mr`
rmm::device_uvector<int32_t> v2{100, s, mr};
```
## Default Parameters
While public libcudf APIs are free to include default function parameters, detail functions should
not. Default memory resource parameters make it easy for developers to accidentally allocate memory
using the incorrect resource. Avoiding default memory resources forces developers to consider each
memory allocation carefully.
While streams are not currently exposed in libcudf's API, we plan to do so eventually. As a result,
the same reasons for memory resources also apply to streams. Public APIs default to using
`cudf::get_default_stream()`. However, including the same default in detail APIs opens the door for
developers to forget to pass in a user-provided stream if one is passed to a public API. Forcing
every detail API call to explicitly pass a stream is intended to prevent such mistakes.
The memory resources (and eventually, the stream) are the final parameters for essentially all
public APIs. For API consistency, the same is true throughout libcudf's internals. Therefore, a
consequence of not allowing default streams or MRs is that no parameters in detail APIs may have
defaults.
## NVTX Ranges
In order to aid in performance optimization and debugging, all compute intensive libcudf functions
should have a corresponding NVTX range. Choose between `CUDF_FUNC_RANGE` or `cudf::scoped_range`
for declaring NVTX ranges in the current scope:
- Use the `CUDF_FUNC_RANGE()` macro if you want to use the name of the function as the name of the
NVTX range
- Use `cudf::scoped_range rng{"custom_name"};` to provide a custom name for the current scope's
NVTX range
For more information about NVTX, see [here](https://github.com/NVIDIA/NVTX/tree/dev/c).
## Input/Output Style
The preferred style for how inputs are passed in and outputs are returned is the following:
- Inputs
- Columns:
- `column_view const&`
- Tables:
- `table_view const&`
- Scalar:
- `scalar const&`
- Everything else:
- Trivial or inexpensively copied types
- Pass by value
- Non-trivial or expensive to copy types
- Pass by `const&`
- In/Outs
- Columns:
- `mutable_column_view&`
- Tables:
- `mutable_table_view&`
- Everything else:
- Pass by via raw pointer
- Outputs
- Outputs should be *returned*, i.e., no output parameters
- Columns:
- `std::unique_ptr<column>`
- Tables:
- `std::unique_ptr<table>`
- Scalars:
- `std::unique_ptr<scalar>`
### Multiple Return Values
Sometimes it is necessary for functions to have multiple outputs. There are a few ways this can be
done in C++ (including creating a `struct` for the output). One convenient way to do this is
using `std::tie` and `std::pair`. Note that objects passed to `std::pair` will invoke
either the copy constructor or the move constructor of the object, and it may be preferable to move
non-trivially copyable objects (and required for types with deleted copy constructors, like
`std::unique_ptr`).
```c++
std::pair<table, table> return_two_tables(void){
cudf::table out0;
cudf::table out1;
...
// Do stuff with out0, out1
// Return a std::pair of the two outputs
return std::pair(std::move(out0), std::move(out1));
}
cudf::table out0;
cudf::table out1;
std::tie(out0, out1) = cudf::return_two_outputs();
```
Note: `std::tuple` _could_ be used if not for the fact that Cython does not support
`std::tuple`. Therefore, libcudf APIs must use `std::pair`, and are therefore limited to return
only two objects of different types. Multiple objects of the same type may be returned via a
`std::vector<T>`.
Alternatively, with C++17 (supported from cudf v0.20),
[structured binding](https://en.cppreference.com/w/cpp/language/structured_binding)
may be used to disaggregate multiple return values:
```c++
auto [out0, out1] = cudf::return_two_outputs();
```
Note that the compiler might not support capturing aliases defined in a structured binding
in a lambda. One may work around this by using a capture with an initializer instead:
```c++
auto [out0, out1] = cudf::return_two_outputs();
// Direct capture of alias from structured binding might fail with:
// "error: structured binding cannot be captured"
// auto foo = [out0]() {...};
// Use an initializing capture:
auto foo = [&out0 = out0] {
// Use out0 to compute something.
// ...
};
```
## Iterator-based interfaces
Increasingly, libcudf is moving toward internal (`detail`) APIs with iterator parameters rather
than explicit `column`/`table`/`scalar` parameters. As with STL, iterators enable generic
algorithms to be applied to arbitrary containers. A good example of this is `cudf::copy_if_else`.
This function takes two inputs, and a Boolean mask. It copies the corresponding element from the
first or second input depending on whether the mask at that index is `true` or `false`. Implementing
`copy_if_else` for all combinations of `column` and `scalar` parameters is simplified by using
iterators in the `detail` API.
```c++
template <typename FilterFn, typename LeftIter, typename RightIter>
std::unique_ptr<column> copy_if_else(
bool nullable,
LeftIter lhs_begin,
LeftIter lhs_end,
RightIter rhs,
FilterFn filter,
...);
```
`LeftIter` and `RightIter` need only implement the necessary interface for an iterator. libcudf
provides a number of iterator types and utilities that are useful with iterator-based APIs from
libcudf as well as Thrust algorithms. Most are defined in `include/detail/iterator.cuh`.
### Pair iterator
The pair iterator is used to access elements of nullable columns as a pair containing an element's
value and validity. `cudf::detail::make_pair_iterator` can be used to create a pair iterator from a
`column_device_view` or a `cudf::scalar`. `make_pair_iterator` is not available for
`mutable_column_device_view`.
### Null-replacement iterator
This iterator replaces the null/validity value for each element with a specified constant (`true` or
`false`). Created using `cudf::detail::make_null_replacement_iterator`.
### Validity iterator
This iterator returns the validity of the underlying element (`true` or `false`). Created using
`cudf::detail::make_validity_iterator`.
### Index-normalizing iterators
The proliferation of data types supported by libcudf can result in long compile times. One area
where compile time was a problem is in types used to store indices, which can be any integer type.
The "indexalator", or index-normalizing iterator (`include/cudf/detail/indexalator.cuh`), can be
used for index types (integers) without requiring a type-specific instance. It can be used for any
iterator interface for reading an array of integer values of type `int8`, `int16`, `int32`,
`int64`, `uint8`, `uint16`, `uint32`, or `uint64`. Reading specific elements always returns a
[`cudf::size_type`](#cudfsize_type) integer.
Use the `indexalator_factory` to create an appropriate input iterator from a column_view. Example
input iterator usage:
```c++
auto begin = indexalator_factory::create_input_iterator(gather_map);
auto end = begin + gather_map.size();
auto result = detail::gather( source, begin, end, IGNORE, stream, mr );
```
Example output iterator usage:
```c++
auto result_itr = indexalator_factory::create_output_iterator(indices->mutable_view());
thrust::lower_bound(rmm::exec_policy(stream),
input->begin<Element>(),
input->end<Element>(),
values->begin<Element>(),
values->end<Element>(),
result_itr,
thrust::less<Element>());
```
### Offset-normalizing iterators
Like the [indexalator](#index-normalizing-iterators),
the "offsetalator", or offset-normalizing iterator (`include/cudf/detail/offsetalator.cuh`), can be
used for offset column types (`INT32` or `INT64` only) without requiring a type-specific instance.
This is helpful when reading or building [strings columns](#strings-columns).
The normalized type is `int64` which means an `input_offsetsalator` will return `int64` type values
for both `INT32` and `INT64` offsets columns.
Likewise, an `output_offselator` can accept `int64` type values to store into either an
`INT32` or `INT64` output offsets column created appropriately.
Use the `cudf::detail::offsetalator_factory` to create an appropriate input or output iterator from an offsets column_view.
Example input iterator usage:
```c++
// convert the sizes to offsets
auto [offsets, char_bytes] = cudf::strings::detail::make_offsets_child_column(
output_sizes.begin(), output_sizes.end(), stream, mr);
auto d_offsets =
cudf::detail::offsetalator_factory::make_input_iterator(offsets->view());
// use d_offsets to address the output row bytes
```
Example output iterator usage:
```c++
// create offsets column as either INT32 or INT64 depending on the number of bytes
auto offsets_column = cudf::strings::detail::create_offsets_child_column(total_bytes,
offsets_count,
stream, mr);
auto d_offsets =
cudf::detail::offsetalator_factory::make_output_iterator(offsets_column->mutable_view());
// write appropriate offset values to d_offsets
```
## Namespaces
### External
All public libcudf APIs should be placed in the `cudf` namespace. Example:
```c++
namespace cudf{
void public_function(...);
} // namespace cudf
```
The top-level `cudf` namespace is sufficient for most of the public API. However, to logically
group a broad set of functions, further namespaces may be used. For example, there are numerous
functions that are specific to columns of Strings. These functions reside in the `cudf::strings::`
namespace. Similarly, functionality used exclusively for unit testing is in the `cudf::test::`
namespace.
The public function is expected to contain a call to `CUDF_FUNC_RANGE()` followed by a call to
a `detail` function with same name and parameters as the public function.
See the [Streams](#streams) section for an example of this pattern.
### Internal
Many functions are not meant for public use, so place them in either the `detail` or an *anonymous*
namespace, depending on the situation.
#### detail namespace
Functions or objects that will be used across *multiple* translation units (i.e., source files),
should be exposed in an internal header file and placed in the `detail` namespace. Example:
```c++
// some_utilities.hpp
namespace cudf{
namespace detail{
void reusable_helper_function(...);
} // namespace detail
} // namespace cudf
```
#### Anonymous namespace
Functions or objects that will only be used in a *single* translation unit should be defined in an
*anonymous* namespace in the source file where it is used. Example:
```c++
// some_file.cpp
namespace{
void isolated_helper_function(...);
} // anonymous namespace
```
[**Anonymous namespaces should *never* be used in a header file.**](https://wiki.sei.cmu.edu/confluence/display/cplusplus/DCL59-CPP.+Do+not+define+an+unnamed+namespace+in+a+header+file)
# Deprecating and Removing Code
libcudf is constantly evolving to improve performance and better meet our users' needs. As a
result, we occasionally need to break or entirely remove APIs to respond to new and improved
understanding of the functionality we provide. Remaining free to do this is essential to making
libcudf an agile library that can rapidly accommodate our users needs. As a result, we do not
always provide a warning or any lead time prior to releasing breaking changes. On a best effort
basis, the libcudf team will notify users of changes that we expect to have significant or
widespread effects.
Where possible, indicate pending API removals using the
[deprecated](https://en.cppreference.com/w/cpp/language/attributes/deprecated) attribute and
document them using Doxygen's
[deprecated](https://www.doxygen.nl/manual/commands.html#cmddeprecated) command prior to removal.
When a replacement API is available for a deprecated API, mention the replacement in both the
deprecation message and the deprecation documentation. Pull requests that introduce deprecations
should be labeled "deprecation" to facilitate discovery and removal in the subsequent release.
Advertise breaking changes by labeling any pull request that breaks or removes an existing API with
the "breaking" tag. This ensures that the "Breaking" section of the release notes includes a
description of what has broken from the past release. Label pull requests that contain deprecations
with the "non-breaking" tag.
# Error Handling {#errors}
libcudf follows conventions (and provides utilities) enforcing compile-time and run-time