tl;dr: If you squint, everything is really a compiler problem in disguise.
Overview
The pandas API is very broad and has numerous places where "best effort" attempts are made to DWIM (do what I mean). One example is during dataframe indexing, where a significant effort is made to "desugar" the user input into a canonical form.
For example, turning df.loc[[1, 2, 3]] into df.loc[[1, 2, 3], :] (that is, a single indexer on a dataframe is equivalent to selecting those rows and all of the columns).
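This is ordinary pandas behaviour, illustrated here with a toy dataframe:

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30, 40], "b": [1.0, 2.0, 3.0, 4.0]})

# A lone row indexer is desugared to (row_indexer, all_columns),
# i.e. "give me these rows and every column".
assert df.loc[[1, 2, 3]].equals(df.loc[[1, 2, 3], :])
```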
This pattern is pervasive. To take another example, read_parquet accepts any of the following as the "file" object (a sketch of normalizing them follows the list):
a string (indicating a file to open)
a Path object
raw bytes
Anything with a read method
A list (maybe sequence?) of the above
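As a hedged sketch (the helper name and the choice of canonical form, a list of binary file-like objects, are mine and not cudf's actual code), "parsing" that argument might look like this:

```python
import io
from pathlib import Path
from typing import IO, Union

FileLike = Union[str, Path, bytes, IO[bytes]]

def _normalize_sources(source: Union[FileLike, list]) -> list[IO[bytes]]:
    """Parse the user-facing "file" argument into a list of readable,
    binary file-like objects so downstream code handles exactly one case."""
    if isinstance(source, list):
        # Flatten a list of sources by normalizing each element in turn.
        return [f for item in source for f in _normalize_sources(item)]
    if isinstance(source, (str, Path)):
        return [open(source, "rb")]
    if isinstance(source, bytes):
        return [io.BytesIO(source)]
    if hasattr(source, "read"):
        return [source]
    raise TypeError(f"unsupported source type: {type(source)!r}")
```

Once the input has been funneled through something like this, no later layer ever needs to ask "but what if it was a string?" again.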
The status quo
To handle this kind of multi-modal unstructured input, the user-visible API of cudf carries out inspection and validation of the inputs before dispatching to appropriate lower-level routines. By the time we reach libcudf, all decisions must have been made.
This works, but has a number of problems:
It is not always clear whose job it is to do the validation. Is any "pseudo-public" method required to validate the correctness of its inputs? Or is it always the job of the definitively public API to validate all inputs before handing off? One sees this in private Frame methods like _gather, which have an explicit check_bounds flag (which often is not set when it could be, because the only safe default is True).
Validation just checks that the inputs are valid and then returns them with their types unchanged (so defensive programming requires consumers of the input to assume worst-case scenarios and check things again).
Consequently, validation and inspection often occur (on any given input):
i. More than once
ii. Inconsistently (generally this is not deliberate; it is just hard to keep track of).
Proposal
I propose that we take a leaf out of the type-driven design crowd's book and treat user-facing input handling as a parsing problem rather than a validation problem. What does this mean? Rather than just checking input in user-facing API functions and then dispatching to internal functions that receive the same types, we should tighten the types as we go, "parsing" the unstructured user input into more structured data as we move down the call stack.
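As a concrete (and entirely hypothetical; none of these names exist in cudf) illustration using the .loc example from above, the public indexing entry point would parse the raw key into a small structured type exactly once, and internal code would only ever see that type:

```python
from dataclasses import dataclass
from typing import Any

# Sentinel standing in for the implicit ":" (all columns).
ALL_COLUMNS = slice(None)

@dataclass(frozen=True)
class LocKey:
    """A .loc key parsed into canonical (rows, columns) form."""
    rows: Any
    columns: Any

def parse_loc_key(key: Any) -> LocKey:
    """Turn whatever the user passed to .loc[...] into a LocKey.

    The "was that a (rows, columns) tuple or a bare row indexer?"
    question is answered here, once, at the public boundary."""
    if isinstance(key, tuple) and len(key) == 2:
        rows, columns = key
        return LocKey(rows, columns)
    return LocKey(key, ALL_COLUMNS)

assert parse_loc_key([1, 2, 3]) == LocKey([1, 2, 3], ALL_COLUMNS)
assert parse_loc_key(([1, 2], "a")) == LocKey([1, 2], "a")
```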
This has, I think, a number of advantages:
It clarifies where in the implementation validation and dispatch take place (any time a type changes), separating the business logic of making sense of the input from the concrete implementation of the different paths.
It makes cross-calling between internal APIs that don't do validation safe. For example, rather than _gather having the signature a -> Column -> Maybe a for some dataframe type a and always having to check that the provided column is in bounds as a gather map, we would tighten the type to a -> InBoundsFor a Column -> a. Now the gather map comes with a "proof" that it is in bounds for the dataframe we are gathering, and we therefore do not have to do bounds checking¹. We can't statically enforce this (in the sense that, in the implementation, someone will be able to conjure the "proof" out of thin air if they really want to), but type checking will at least indicate when we don't promise an in-bounds column (a sketch of this approach follows the list).
By separating the business logic from the backend dispatch we keep the good work we've done in producing a pandas-like API and make it easier (in the long term) to slot in other backends (for example a distributed, multi-node backend underneath the cudf front-facing API).
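Python cannot express InBoundsFor a Column directly, but the "proof" can be approximated by a wrapper type that can only (by convention) be obtained by validating. A minimal sketch, assuming a toy row-list frame and invented names (BoundedGatherMap and this _gather are illustrative, not the real cudf internals):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BoundedGatherMap:
    """A gather map plus the row count it was checked against.

    Constructing one via parse() *is* the bounds check; holding one is
    the "proof" that the indices are in bounds."""
    indices: tuple[int, ...]
    num_rows: int

    @classmethod
    def parse(cls, indices, num_rows: int) -> "BoundedGatherMap":
        indices = tuple(indices)
        if any(not (0 <= i < num_rows) for i in indices):
            raise IndexError("gather map is out of bounds")
        return cls(indices, num_rows)

def _gather(rows: list, gather_map: BoundedGatherMap) -> list:
    """Internal gather: no check_bounds flag and no re-validation.

    The type of gather_map already promises its indices are in bounds
    for a frame with gather_map.num_rows rows."""
    return [rows[i] for i in gather_map.indices]

rows = [["a", 1], ["b", 2], ["c", 3]]
gmap = BoundedGatherMap.parse([2, 0], num_rows=len(rows))
assert _gather(rows, gmap) == [["c", 3], ["a", 1]]
```

As noted above, nothing stops someone from calling BoundedGatherMap(...) directly and conjuring the proof out of thin air, but the annotation makes the promise explicit and greppable.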
100% agree. I've made very similar points on multiple occasions (everywhere from design documents to issues), specifically focused on: 1) the performance impact of repeated validation, 2) the correctness impact of inconsistent and hard-to-track validation, and 3) the readability impact of the resulting spaghetti code. One of the most pervasive issues is simply tracking calls to as_column, which on certain particularly convoluted code paths can happen a dozen times on the same inputs.
I like the approach that you're proposing. It is analogous in many ways to the approach I've taken in various specific aspects of refactoring. For instance, here are a handful of PRs around an extended binops refactor that essentially implement more internal APIs that are allowed to make assumptions about input sanitization: #10509, #10563, #10421, #8598. Certainly not sufficient, and not entirely the same as what you're proposing, but along the same lines and aimed at tackling the same types of problems. Using types to formalize the structures would go a long way to expressing the dispatch cleanly.
Footnotes
1. In a statically typed language with rank-N types you can actually make these proofs part of the type system, though we can't here in Python.