
[ENH]: Reimagining cudf user-input boundaries #13544

Open · wence- opened this issue Jun 9, 2023 · 1 comment
Labels: improvement, Performance

@wence- (Contributor) commented Jun 9, 2023

tl;dr: If you squint, everything is really a compiler problem in disguise.

Overview

The pandas API is very broad and has numerous places where "best effort" attempts are made to DWIM (do what I mean). One example is during dataframe indexing, where a significant effort is made to "desugar" the user input into a canonical form.

For example, turning df.loc[[1, 2, 3]] into df.loc[[1, 2, 3], :] (that is, a single indexing key on a dataframe is shorthand for selecting those rows and all the columns).
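To make the desugaring concrete, here is a minimal sketch of what that canonicalisation step might look like; canonicalize_loc_key is a hypothetical helper, not pandas or cudf internals:

```python
# Hypothetical sketch of .loc key desugaring; the real implementations differ.
import pandas as pd

def canonicalize_loc_key(key):
    """Desugar a user-supplied .loc key into an explicit (rows, columns) pair."""
    if isinstance(key, tuple) and len(key) == 2:
        return key                 # already canonical: df.loc[rows, cols]
    return (key, slice(None))      # df.loc[rows] means "all the columns"

df = pd.DataFrame({"a": [10, 20, 30]}, index=[1, 2, 3])
rows, cols = canonicalize_loc_key([1, 2, 3])
assert df.loc[rows, cols].equals(df.loc[[1, 2, 3]])
```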

This pattern is pervasive. To take another example, read_parquet accepts any of the following as the "file" object (a dispatch sketch follows the list):

  1. A string (naming a file to open)
  2. A Path object
  3. Raw bytes
  4. Anything with a read method
  5. A list (maybe sequence?) of the above
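As promised above, a hedged sketch of how that multi-modal argument might be normalised at the boundary; normalize_source is a hypothetical helper, not cudf's actual I/O layer:

```python
# Hypothetical normalization of the "file" argument into binary file-like
# objects; cudf's real I/O layer is considerably more involved.
import io
import os

def normalize_source(source):
    if isinstance(source, (list, tuple)):                           # 5. sequence
        return [fh for s in source for fh in normalize_source(s)]
    if isinstance(source, (str, os.PathLike)):                      # 1. and 2.
        return [open(os.fspath(source), "rb")]
    if isinstance(source, (bytes, bytearray)):                      # 3. raw bytes
        return [io.BytesIO(source)]
    if hasattr(source, "read"):                                     # 4. file-like
        return [source]
    raise TypeError(f"Unsupported source type: {type(source).__name__}")
```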

The status quo

To handle this kind of multi-modal unstructured input, the user-visible API of cudf carries out inspection and validation of the inputs before dispatching to appropriate lower-level routines. By the time we reach libcudf, all decisions must have been made.

This works, but has a number of problems:

  1. It is not always clear whose job it is to do the validation. Is any "pseudo-public" method required to validate correctness of its inputs? Or is it always the job of the definitively public API to validate all inputs before handing off? One sees this in private Frame methods like _gather, which have an explicit check_bounds flag (one that often is not set when it could be, because the only safe default is True); see the sketch after this list.
  2. Validation just checks for valid inputs and then returns types unchanged (so defensive programming requires consumers of the input to assume worst-case scenarios and check things again).
  3. Consequently, validation and inspection often occur (on any given input):
    i. More than once
    ii. Inconsistently (generally this is not deliberate; it is just hard to keep track of).
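To illustrate point 1 with a toy example (all names hypothetical, not cudf's real code), note how neither layer can trust the other, so the same bounds check runs twice:

```python
# Status quo, sketched: both layers validate because the types carry no
# information about whether validation already happened.
def public_take(df, indices):
    if any(i < -len(df) or i >= len(df) for i in indices):
        raise IndexError("gather map out of bounds")
    # The check above cannot be communicated through the types, so the only
    # safe thing to do is request (or default to) a second check downstream.
    return _gather(df, indices, check_bounds=True)

def _gather(df, indices, check_bounds=True):
    if check_bounds:  # True is the only safe default: callers may not have checked
        if any(i < -len(df) or i >= len(df) for i in indices):
            raise IndexError("gather map out of bounds")
    ...  # hand off to libcudf
```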

Proposal

I propose that we take a leaf out of the type-driven design crowd's book and treat user-facing input validation as a parsing problem rather than a validation problem. What does this mean? Rather than just checking inputs in user-facing API functions and dispatching to internal functions that receive the same types, we should tighten the types as we go, "parsing" the unstructured user input into progressively more structured data as it transits the call stack.
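For contrast with the earlier desugaring sketch, which returned the same loose types it received, here is a hypothetical "parse, don't validate" version in which the boundary produces a dedicated structured type (LocSpec and parse_loc_key are made-up names):

```python
# Minimal "parse, don't validate" sketch: the raw key is interpreted exactly
# once, at the API boundary, and internal layers only ever see the tight type.
from dataclasses import dataclass

@dataclass(frozen=True)
class LocSpec:
    rows: tuple     # explicit row labels
    columns: tuple  # explicit column labels, never an implicit "all"

def parse_loc_key(key, all_columns):
    rows, cols = key if isinstance(key, tuple) and len(key) == 2 else (key, slice(None))
    cols = tuple(all_columns) if cols == slice(None) else tuple(cols)
    return LocSpec(tuple(rows), cols)

def _loc_impl(df, spec: LocSpec):
    ...  # accepts only LocSpec; no re-inspection of raw user input

spec = parse_loc_key([1, 2, 3], all_columns=("a", "b"))
assert spec == LocSpec(rows=(1, 2, 3), columns=("a", "b"))
```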

This has, I think, a number of advantages:

  1. It clarifies where in the implementation validation and dispatch take place (any time a type changes), separating the business logic of making sense of the input from the concrete implementations of the different paths.
  2. It makes cross-calling between internal APIs that don't do validation safe. For example, rather than _gather having the signature a -> Column -> Maybe a for some dataframe type a and always having to check that the provided column is in bounds as a gather map, we would tighten the type to a -> InBoundsFor a Column -> a. Now the gather map comes with a "proof" that it is in bounds for the dataframe we are gathering, and we therefore do not have to do bounds checking[^1] (see the sketch after this list). Now, we can't statically enforce this (in the sense that in the implementation, someone will be able to conjure the "proof" out of thin air if they really want to), but type checking will at least indicate when we don't promise an in-bounds column.
  3. By separating the business logic from the backend dispatch we keep the good work we've done in producing a pandas-like API and make it easier (in the long term) to slot in other backends (for example, a distributed, multi-node backend underneath the cudf front-facing API).
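Here is what point 2 might look like in Python; InBoundsColumn and parse_gather_map are hypothetical names, and as noted above the "proof" is by convention rather than statically enforced:

```python
# A proof-carrying gather map, sketched: parse_gather_map is the sanctioned
# way to build an InBoundsColumn, so holding one is evidence that bounds
# checking already happened.
from dataclasses import dataclass

@dataclass(frozen=True)
class InBoundsColumn:
    indices: tuple  # by construction, every index lies in [-num_rows, num_rows)

def parse_gather_map(indices, num_rows):
    indices = tuple(indices)
    if any(i < -num_rows or i >= num_rows for i in indices):
        raise IndexError("gather map out of bounds")
    return InBoundsColumn(indices)

def _gather(df, gather_map: InBoundsColumn):
    # No check_bounds flag: the argument's type is the promise.
    ...
```

A type checker will flag call sites that pass a bare list here, even though nothing stops a determined caller from constructing InBoundsColumn directly.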

Further reading

Footnotes

[^1]: In a statically typed language with rank-N types you can actually make these proofs part of the type system, though we can't here in Python.

@vyasr (Contributor) commented Jun 9, 2023

100% agree. I've made very similar points on multiple occasions (everywhere from design documents to issues), specifically focused on: 1) the performance impact of repeated validation, 2) the correctness impact of inconsistent and hard-to-track validation, and 3) the readability impact of the resulting spaghetti code. One of the most pervasive issues is simply tracking calls to as_column, which on certain particularly convoluted code paths can happen a dozen times for the same inputs.
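A toy picture of that as_column problem (not cudf code; the call counter is purely for illustration): when the types don't record that normalization has happened, every layer calls the normalizer defensively, and the calls multiply along the path:

```python
# Toy illustration of defensive re-normalization stacking up across layers.
class Column:
    def __init__(self, data):
        self.data = list(data)

CALLS = 0

def as_column(obj):
    global CALLS
    CALLS += 1
    return obj if isinstance(obj, Column) else Column(obj)

def public_binop(lhs, rhs):
    return _binop_impl(as_column(lhs), as_column(rhs))

def _binop_impl(lhs, rhs):
    # Worst-case assumption: normalize again, because nothing says the
    # caller already did.
    lhs, rhs = as_column(lhs), as_column(rhs)
    ...

public_binop([1, 2], [3, 4])
assert CALLS == 4  # each input normalized twice on a single call path
```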

I like the approach that you're proposing. It is analogous in many ways to the approach I've taken in various specific aspects of refactoring. For instance, here are a handful of PRs around an extended binops refactor that essentially implement more internal APIs that are allowed to make assumptions about input sanitization: #10509, #10563, #10421, #8598. Certainly not sufficient, and not entirely the same as what you're proposing, but along the same lines and aimed at tackling the same types of problems. Using types to formalize the structures would go a long way to expressing the dispatch cleanly.

@wence- added the code quality, Performance, and improvement labels on Jun 13, 2023