Initial draft of policies and guidelines for libcudf usage. (#11853)

This PR adds a section to the developer documentation about various libcudf design decisions that affect users. These policies are important for us to document and communicate consistently. I am not sure what the best place for this information is, but I think the developer docs are a good place to start since until we address #11481 we don't have a great way to publish any non-API user-facing libcudf documentation. I've created this draft PR to solicit feedback from other libcudf devs about other policies that we should be documenting in a similar manner. Once everyone is happy with the contents, I would suggest that we merge this into the dev docs for now and then revisit a better place once we've tackled #11481.

Partly addresses #5505, #1781. Resolves #4511.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)
  - Bradley Dice (https://github.com/bdice)
  - David Wendt (https://github.com/davidwendt)

URL: #11853
rapidsai · Oct 18, 2022 · 5d57159
1 parent 1effe19
commit 5d57159
Showing 1 changed file with 57 additions and 0 deletions.
diff --git a/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md b/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
@@ -346,6 +346,63 @@ the device view can be obtained via function `column_device_view::create(column_
 data, a specialized device view for list columns can be constructed via
 `lists_column_device_view(column_device_view)`.
 
+# libcudf Policies and Design Principles
+
+`libcudf` is designed to provide thread-safe, single-GPU accelerated algorithm primitives for solving a wide variety of problems that arise in data science.
+APIs are written to execute on the default GPU, which can be controlled by the caller through standard CUDA device APIs or environment variables like `CUDA_VISIBLE_DEVICES`.
+Our goal is to enable diverse use cases like Spark or Pandas to benefit from the performance of GPUs, and libcudf relies on higher-level layers like Spark or Dask to orchestrate multi-GPU tasks.
+
+To best satisfy these use cases, libcudf prioritizes performance and flexibility, which may sometimes come at the cost of convenience.
+While we welcome users to use libcudf directly, we design with the expectation that most users will be consuming libcudf through higher-level layers like Spark or cuDF Python that handle some of the details that direct users of libcudf must handle on their own.
+We document these policies and the reasons behind them here.
+
+## libcudf does not introspect data
+
+libcudf APIs generally do not perform deep introspection and validation of input data.
+There are numerous reasons for this:
+1. It violates the single responsibility principle: validation is separate from execution.
+2. Since libcudf data structures store data on the GPU, any validation incurs _at minimum_ the overhead of a kernel launch, and may in general be prohibitively expensive.
+3. API promises around data introspection often significantly complicate implementation.
+
+Users are therefore responsible for passing valid data into such APIs.
+_Note that this policy does not mean that libcudf performs no validation whatsoever_.
+libcudf APIs should still perform any validation that does not require introspection.
+To give some idea of what should or should not be validated, here are (non-exhaustive) lists of examples.
+
+**Things that libcudf should validate**:
+- Input column/table sizes or dtypes
+
+**Things that libcudf should not validate**:
+- Integer overflow
+- Ensuring that outputs will not exceed the 2GB size limit for a given set of inputs
+
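A minimal sketch of the distinction drawn above, written in plain C++ rather than against libcudf itself. The names `dtype`, `column_meta`, `expect_compatible`, and `is_compatible` are hypothetical stand-ins and are not libcudf types or APIs. The check reads only host-side metadata (dtype and row count) and never touches element values, which is the kind of validation that requires no kernel launch.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical stand-ins for column metadata; these are NOT libcudf types.
enum class dtype { INT32, FLOAT64 };

struct column_meta {
  dtype type;        // element type, known on the host
  std::size_t size;  // number of rows, known on the host
};

// Validates metadata only. Element values are never read, so a check like
// this would need no kernel launch and no device-to-host copy.
void expect_compatible(column_meta const& lhs, column_meta const& rhs)
{
  if (lhs.type != rhs.type) { throw std::invalid_argument("dtype mismatch"); }
  if (lhs.size != rhs.size) { throw std::invalid_argument("size mismatch"); }
}

// Convenience wrapper that reports the result without throwing.
bool is_compatible(column_meta const& lhs, column_meta const& rhs)
{
  try {
    expect_compatible(lhs, rhs);
    return true;
  } catch (std::invalid_argument const&) {
    return false;
  }
}

// Usage sketch: a metadata mismatch is detected without reading any values.
void demo_metadata_checks()
{
  column_meta const a{dtype::INT32, 100};
  column_meta const b{dtype::INT32, 100};
  column_meta const c{dtype::FLOAT64, 100};
  assert(is_compatible(a, b));
  assert(!is_compatible(a, c));
}
```

By contrast, detecting integer overflow would require inspecting every element, which is why it falls on the "should not validate" side of the policy.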
+
+## libcudf expects nested types to have sanitized null masks
+
+Various libcudf APIs accepting columns of nested dtypes (such as `LIST` or `STRUCT`) may assume that these columns have been sanitized.
+In this context, sanitization refers to ensuring that the null elements in a column with a nested dtype are compatible with the elements of nested columns.
+Specifically:
+- Null elements of list columns should also be empty. The starting offset of a null element should be equal to the ending offset.
+- Null elements of struct columns should also be null elements in the underlying structs.
+- For compound columns, nulls should only be present at the level of the parent column. Child columns should not contain nulls.
+- Slice operations on nested columns do not propagate offsets to child columns.
+
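The first rule in the list above (null list elements must also be empty) can be illustrated with a host-side model. This is only a sketch under stated assumptions: `host_list_column` and `null_rows_are_empty` are hypothetical names, and the struct keeps offsets and validity in host vectors, whereas real libcudf columns hold them in device memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side model of a list column, NOT a libcudf type.
// Row i spans [offsets[i], offsets[i + 1]) in the child column.
struct host_list_column {
  std::vector<std::int32_t> offsets;  // one more entry than there are rows
  std::vector<bool> valid;            // valid[i] == false means row i is null
};

// Returns true if every null row is also empty, i.e. its starting offset
// equals its ending offset.
bool null_rows_are_empty(host_list_column const& col)
{
  assert(col.offsets.size() == col.valid.size() + 1);
  for (std::size_t i = 0; i < col.valid.size(); ++i) {
    if (!col.valid[i] && col.offsets[i] != col.offsets[i + 1]) { return false; }
  }
  return true;
}
```

With offsets `{0, 2, 2, 5}` and a null second row, the column counts as sanitized because the null row spans no child elements. With offsets `{0, 2, 4, 5}` the same null row still covers two child elements, so the column would be "dirty".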
+libcudf APIs _should_ promise to never return "dirty" columns, i.e. columns containing unsanitized data.
+Therefore, problems should only arise when users construct input columns that are not correctly sanitized and then pass them into libcudf APIs.
+
+## Treat libcudf APIs as if they were asynchronous
+
+libcudf APIs called on the host do not guarantee that the stream is synchronized before returning.
+Work in libcudf occurs on `cudf::get_default_stream().value()`, which defaults to the CUDA default stream (stream 0).
+Note that the stream 0 behavior differs if the [per-thread default stream is enabled](https://docs.nvidia.com/cuda/cuda-runtime-api/stream-sync-behavior.html) via `CUDF_USE_PER_THREAD_DEFAULT_STREAM`.
+Any data provided to or returned by libcudf that uses a separate non-blocking stream requires synchronization with the default libcudf stream to ensure stream safety.
+
+## libcudf generally does not make ordering guarantees
+
+Functions like merge or groupby in libcudf make no guarantees about the order of entries in the output.
+Promising deterministic ordering is not, in general, conducive to fast parallel algorithms.
+Calling code is responsible for performing sorts after the fact if sorted outputs are needed.
+
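The caller-side pattern described above can be sketched in plain C++. This is an analogy and not libcudf code: `key_sum`, `groupby_sum`, and `sorted_groupby_sum` are hypothetical names, and the unspecified iteration order of `std::unordered_map` stands in for an API that makes no ordering guarantee.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using key_sum = std::pair<std::string, int>;

// Hash-based group-by sum. The order of the returned entries is unspecified,
// mirroring an API that makes no ordering guarantee.
std::vector<key_sum> groupby_sum(std::vector<key_sum> const& rows)
{
  std::unordered_map<std::string, int> sums;
  for (auto const& row : rows) { sums[row.first] += row.second; }
  return std::vector<key_sum>(sums.begin(), sums.end());
}

// Caller-side step: impose a deterministic order only when it is needed.
std::vector<key_sum> sorted_groupby_sum(std::vector<key_sum> const& rows)
{
  auto result = groupby_sum(rows);
  std::sort(result.begin(), result.end());
  assert(std::is_sorted(result.begin(), result.end()));
  return result;
}
```

Callers that do not need a particular order can use the unsorted result directly and avoid paying for the sort.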
 # libcudf++ API and Implementation
 
 ## Streams