
[ENH]: Reimagining cudf user-input boundaries #13544

Open · wence- opened this issue Jun 9, 2023 · 1 comment
Labels: improvement, Performance

@wence- (Contributor) commented Jun 9, 2023

tl;dr: If you squint, everything is really a compiler problem in disguise.

Overview

The pandas API is very broad and has numerous places where "best effort" attempts are made to DWIM (do what I mean). One example is during dataframe indexing, where a significant effort is made to "desugar" the user input into a canonical form.

For example, turning df.loc[[1, 2, 3]] into df.loc[[1, 2, 3], :] (that is, a single indexing key on a dataframe is shorthand for selecting those rows and all the columns).
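To make the desugaring concrete, here is a minimal sketch of what that canonicalisation step might look like; canonicalize_loc_key is a hypothetical helper, not pandas or cudf internals:

```python
# Hypothetical sketch of .loc key desugaring; the real implementations differ.
import pandas as pd

def canonicalize_loc_key(key):
    """Desugar a user-supplied .loc key into an explicit (rows, columns) pair."""
    if isinstance(key, tuple) and len(key) == 2:
        return key                 # already canonical: df.loc[rows, cols]
    return (key, slice(None))      # df.loc[rows] means "all the columns"

df = pd.DataFrame({"a": [10, 20, 30]}, index=[1, 2, 3])
rows, cols = canonicalize_loc_key([1, 2, 3])
assert df.loc[rows, cols].equals(df.loc[[1, 2, 3]])
```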

This pattern is pervasive. To take another example, read_parquet accepts any of the following as the "file" object (a dispatch sketch follows the list):

  1. A string (naming a file to open)
  2. A Path object
  3. Raw bytes
  4. Anything with a read method
  5. A list (maybe sequence?) of the above
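As promised above, a hedged sketch of how that multi-modal argument might be normalised at the boundary; normalize_source is a hypothetical helper, not cudf's actual I/O layer:

```python
# Hypothetical normalization of the "file" argument into binary file-like
# objects; cudf's real I/O layer is considerably more involved.
import io
import os

def normalize_source(source):
    if isinstance(source, (list, tuple)):                           # 5. sequence
        return [fh for s in source for fh in normalize_source(s)]
    if isinstance(source, (str, os.PathLike)):                      # 1. and 2.
        return [open(os.fspath(source), "rb")]
    if isinstance(source, (bytes, bytearray)):                      # 3. raw bytes
        return [io.BytesIO(source)]
    if hasattr(source, "read"):                                     # 4. file-like
        return [source]
    raise TypeError(f"Unsupported source type: {type(source).__name__}")
```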

The status quo

To handle this kind of multi-modal unstructured input, the user-visible API of cudf carries out inspection and validation of the inputs before dispatching to appropriate lower-level routines. By the time we reach libcudf, all decisions must have been made.

This works, but has a number of problems:

  1. It is not always clear whose job it is to do the validation. Is any "pseudo-public" method required to validate correctness of its inputs? Or is it always the job of the definitively public API to validate all inputs before handing off? One sees this in private Frame methods like _gather, which have an explicit check_bounds flag (one that often is not set when it could be, because the only safe default is True); see the sketch after this list.
  2. Validation just checks for valid inputs and then returns types unchanged (so defensive programming requires consumers of the input to assume worst-case scenarios and check things again).
  3. Consequently, validation and inspection often occur (on any given input):
    i. More than once
    ii. Inconsistently (generally this is not deliberate; it is just hard to keep track of).
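To illustrate point 1 with a toy example (all names hypothetical, not cudf's real code), note how neither layer can trust the other, so the same bounds check runs twice:

```python
# Status quo, sketched: both layers validate because the types carry no
# information about whether validation already happened.
def public_take(df, indices):
    if any(i < -len(df) or i >= len(df) for i in indices):
        raise IndexError("gather map out of bounds")
    # The check above cannot be communicated through the types, so the only
    # safe thing to do is request (or default to) a second check downstream.
    return _gather(df, indices, check_bounds=True)

def _gather(df, indices, check_bounds=True):
    if check_bounds:  # True is the only safe default: callers may not have checked
        if any(i < -len(df) or i >= len(df) for i in indices):
            raise IndexError("gather map out of bounds")
    ...  # hand off to libcudf
```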

Proposal

I propose that we take a leaf out of the type-driven design crowd's book and treat user-facing input validation as a parsing problem rather than a validation problem. What does this mean? Rather than just checking inputs in user-facing API functions and dispatching to internal functions that receive the same types, we should tighten the types as we go, "parsing" the unstructured user input into progressively more structured data as it transits the call stack.
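For contrast with the earlier desugaring sketch, which returned the same loose types it received, here is a hypothetical "parse, don't validate" version in which the boundary produces a dedicated structured type (LocSpec and parse_loc_key are made-up names):

```python
# Minimal "parse, don't validate" sketch: the raw key is interpreted exactly
# once, at the API boundary, and internal layers only ever see the tight type.
from dataclasses import dataclass

@dataclass(frozen=True)
class LocSpec:
    rows: tuple     # explicit row labels
    columns: tuple  # explicit column labels, never an implicit "all"

def parse_loc_key(key, all_columns):
    rows, cols = key if isinstance(key, tuple) and len(key) == 2 else (key, slice(None))
    cols = tuple(all_columns) if cols == slice(None) else tuple(cols)
    return LocSpec(tuple(rows), cols)

def _loc_impl(df, spec: LocSpec):
    ...  # accepts only LocSpec; no re-inspection of raw user input

spec = parse_loc_key([1, 2, 3], all_columns=("a", "b"))
assert spec == LocSpec(rows=(1, 2, 3), columns=("a", "b"))
```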

This has, I think, a number of advantages:

  1. It clarifies where in the implementation validation and dispatch take place (any time a type changes), separating the business logic of making sense of the input from the concrete implementations of the different paths.
  2. It makes cross-calling between internal APIs that don't do validation safe. For example, rather than _gather having the signature a -> Column -> Maybe a for some dataframe type a and always having to check that the provided column is in bounds as a gather map, we would tighten the type to a -> InBoundsFor a Column -> a. Now the gather map comes with a "proof" that it is in bounds for the dataframe we are gathering, and we therefore do not have to do bounds checking[^1] (see the sketch after this list). Now, we can't statically enforce this (in the sense that in the implementation, someone will be able to conjure the "proof" out of thin air if they really want to), but type checking will at least indicate when we don't promise an in-bounds column.
  3. By separating the business logic from the backend dispatch we keep the good work we've done in producing a pandas-like API and make it easier (in the long term) to slot in other backends (for example, a distributed, multi-node backend underneath the cudf front-facing API).
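Here is what point 2 might look like in Python; InBoundsColumn and parse_gather_map are hypothetical names, and as noted above the "proof" is by convention rather than statically enforced:

```python
# A proof-carrying gather map, sketched: parse_gather_map is the sanctioned
# way to build an InBoundsColumn, so holding one is evidence that bounds
# checking already happened.
from dataclasses import dataclass

@dataclass(frozen=True)
class InBoundsColumn:
    indices: tuple  # by construction, every index lies in [-num_rows, num_rows)

def parse_gather_map(indices, num_rows):
    indices = tuple(indices)
    if any(i < -num_rows or i >= num_rows for i in indices):
        raise IndexError("gather map out of bounds")
    return InBoundsColumn(indices)

def _gather(df, gather_map: InBoundsColumn):
    # No check_bounds flag: the argument's type is the promise.
    ...
```

A type checker will flag call sites that pass a bare list here, even though nothing stops a determined caller from constructing InBoundsColumn directly.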

Further reading

Footnotes

[^1]: In a statically typed language with rank-N types you can actually make these proofs part of the type system, though we can't here in Python.

@vyasr (Contributor) commented Jun 9, 2023

100% agree. I've made very similar points on multiple occasions (everywhere from design documents to issues), specifically focused on: 1) the performance impact of repeated validation, 2) the correctness impact of inconsistent and hard-to-track validation, and 3) the readability impact of the resulting spaghetti code. One of the most pervasive issues is simply tracking calls to as_column, which on certain particularly convoluted code paths can happen a dozen times for the same inputs.
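A toy picture of that as_column problem (not cudf code; the call counter is purely for illustration): when the types don't record that normalization has happened, every layer calls the normalizer defensively, and the calls multiply along the path:

```python
# Toy illustration of defensive re-normalization stacking up across layers.
class Column:
    def __init__(self, data):
        self.data = list(data)

CALLS = 0

def as_column(obj):
    global CALLS
    CALLS += 1
    return obj if isinstance(obj, Column) else Column(obj)

def public_binop(lhs, rhs):
    return _binop_impl(as_column(lhs), as_column(rhs))

def _binop_impl(lhs, rhs):
    # Worst-case assumption: normalize again, because nothing says the
    # caller already did.
    lhs, rhs = as_column(lhs), as_column(rhs)
    ...

public_binop([1, 2], [3, 4])
assert CALLS == 4  # each input normalized twice on a single call path
```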

I like the approach that you're proposing. It is analogous in many ways to the approach I've taken in various specific aspects of refactoring. For instance, here are a handful of PRs around an extended binops refactor that essentially implement more internal APIs that are allowed to make assumptions about input sanitization: #10509, #10563, #10421, #8598. Certainly not sufficient, and not entirely the same as what you're proposing, but along the same lines and aimed at tackling the same types of problems. Using types to formalize the structures would go a long way to expressing the dispatch cleanly.

@wence- added the code quality, Performance, and improvement labels on Jun 13, 2023