[FEA] Support more flexible construction of nested columns in pylibcudf #17192

Open
vyasr opened this issue Oct 28, 2024 · 2 comments
Labels: feature request, libcudf, pylibcudf, Python

Comments

vyasr commented Oct 28, 2024

Is your feature request related to a problem? Please describe.
Currently pylibcudf exposes a subset of the factories in libcudf. When they were added in #15257, we omitted the factories for nested types because of difficulties around ownership and around what inputs columns should be constructible from. We also have not seriously considered how to create pylibcudf columns of list or string types whose underlying data and offset arrays are views into other arrays. This type of construction can be done by manually creating a column_view in libcudf, but doing so requires a thorough understanding of Arrow data layouts as well as their implementation in libcudf (especially for strings after the large strings refactor). These gaps are particularly problematic because strings, lists, and structs are the data types for which pylibcudf may have the most to offer: beyond providing a higher-performance, low-level API that cudf users can reach for when necessary, for these types pylibcudf can expose libcudf functionality that has no home in cudf at all. Making it possible to work with these types transparently in pylibcudf is therefore a high priority, since it would address use cases for which we have no satisfactory solution at present.
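For reference, the Arrow-style layout mentioned above stores a list or string column as a flat values (or character) buffer plus an offsets array with one more entry than there are rows. Below is a minimal sketch in plain Python, with lists and bytes standing in for device buffers. The helper names are hypothetical and are not pylibcudf or libcudf APIs.

```python
# Illustrative only: plain Python objects stand in for device buffers.
# decode_lists and decode_strings are hypothetical helpers, not library APIs.


def decode_lists(values, offsets):
    """Row i of a list column is values[offsets[i]:offsets[i + 1]]."""
    return [values[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]


def decode_strings(chars, offsets):
    """Row i of a string column is the UTF-8 bytes chars[offsets[i]:offsets[i + 1]]."""
    return [
        chars[offsets[i]:offsets[i + 1]].decode("utf-8")
        for i in range(len(offsets) - 1)
    ]


# Two list rows backed by one flat values buffer.
list_rows = decode_lists([1, 2, 3], [0, 2, 3])

# Three string rows (the middle one empty) backed by one character buffer.
string_rows = decode_strings(b"abcde", [0, 2, 2, 5])
```

A view-based constructor would need to accept exactly these two pieces (values and offsets) without copying either one.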

Describe the solution you'd like
We should investigate the best ways to enable construction of pylibcudf columns of nested types, including from other data sources like pairs of cupy arrays, and we should make these constructors as easy to use as possible.

Additional context
Where appropriate, we should consider adding constructors directly to libcudf as well. While it is possible to do everything we need with low-level libcudf APIs, one of the major synergies I anticipate between pylibcudf and libcudf is that pylibcudf will motivate usability improvements in libcudf that might otherwise have little impetus behind them. This is one such case where improving constructors directly in libcudf to help pylibcudf users can help a wider range of users, so we should seize the opportunity if it presents itself.

bdice commented Nov 26, 2024

I'm discussing this with @Matt711. Here are some thoughts so far:

  • By default all nested factories would be non-owning, and would have a parameter copy=False
  • If copy=True or if the data cannot be "viewed" from GPU (e.g. an input that is a NumPy array on CPU) then we would return an owning object (for all parts? or should it just own offsets / just own values?)
  • copy is passed recursively, so all children are copied if some parent column had to be copied
  • Lists should be constructible from list-column-like objects, e.g. host lists-of-lists [[1, 2], [3]] or PyTorch ragged tensors
  • Lists should be constructible from values and offsets, where offsets can be any array-like of integers (validated by default so that it's 0 <= offsets <= len(values) and monotonically increasing, but perhaps we could have a "no-check" option -- I am unsure if pylibcudf tries to avoid introspection like libcudf)
    • Name this something like pylibcudf.make_lists_column_from_values_and_offsets? Should it be a free function or a class method of something? We don't have overloads like in C++ and I don't really want to have a single-argument function that parses a tuple of (values, offsets) as an input, since it's a bit ambiguous.
  • Strings should be constructible from a string buffer and offsets (could be zero-copy if the string buffer is a CuPy array of bytes?), or from an array-like of strings (always has to copy/own). Offsets would be validated like they are for lists, perhaps with a "no-check" option.
  • Structs should be constructible from a dictionary-like of array-like objects with the same length (so dict objects like {"a": [1, 2], "b": ["c", "d"]} would count)
  • PyArrow types and cuDF types should be usable as inputs for pylibcudf constructors
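A rough sketch of the input handling described in the list above, in plain Python: flattening host lists-of-lists into values and offsets, validating offsets, and checking that struct children have equal lengths. All function names are hypothetical, and "monotonically increasing" is assumed here to mean non-decreasing so that empty rows are allowed.

```python
# Hypothetical helpers sketching the proposed validation; not pylibcudf APIs.


def flatten_lists(rows):
    """Turn host lists-of-lists into a flat values list plus offsets."""
    values, offsets = [], [0]
    for row in rows:
        values.extend(row)
        offsets.append(len(values))
    return values, offsets


def validate_offsets(offsets, num_values):
    """Check 0 <= offsets <= num_values and that offsets never decrease."""
    if len(offsets) == 0:
        raise ValueError("offsets must contain at least one entry")
    previous = offsets[0]
    for offset in offsets:
        if not 0 <= offset <= num_values:
            raise ValueError(f"offset {offset} is outside [0, {num_values}]")
        if offset < previous:
            raise ValueError("offsets must be monotonically increasing")
        previous = offset


def check_struct_children(children):
    """Require every child of a dictionary-like input to have the same length."""
    lengths = {len(column) for column in children.values()}
    if len(lengths) > 1:
        raise ValueError("all struct children must have the same length")
    return lengths.pop() if lengths else 0


values, offsets = flatten_lists([[1, 2], [3]])
validate_offsets(offsets, len(values))
num_rows = check_struct_children({"a": [1, 2], "b": ["c", "d"]})
```

A "no-check" option would amount to skipping the `validate_offsets` call, which matters most when the offsets live on the device and validation would require a kernel launch or a copy to host.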

Are there other requirements or constraints to consider?

vyasr commented Dec 10, 2024

That almost all sounds good. I'm mostly curious about

"By default all nested factories would be non-owning, and would have a parameter copy=False"

Could you elaborate on how you expect nested factories to differ from those for primitive types, particularly w.r.t. sources like cupy arrays or other data that's already resident on device? You obviously must copy when coming from the host. Do you intend for those to be separate constructors?
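To make the question concrete, here is one hypothetical way a single constructor could decide between viewing and copying under the proposed copy=False default, with the decision passed down to children. It assumes device residency can be detected through the `__cuda_array_interface__` attribute that CuPy arrays expose. `ColumnSpec`, `plan_copies`, and `FakeDeviceArray` are illustrative names only, and nothing here reflects existing pylibcudf behavior.

```python
from dataclasses import dataclass, field


class FakeDeviceArray:
    """Stand-in for a GPU array; only mimics the attribute being checked."""

    __cuda_array_interface__ = {}


@dataclass
class ColumnSpec:
    """Hypothetical description of a nested column: one buffer plus children."""

    data: object
    children: list = field(default_factory=list)


def is_device_resident(obj):
    # Assumption: device arrays advertise the CUDA Array Interface.
    return hasattr(obj, "__cuda_array_interface__")


def plan_copies(spec, copy=False):
    """Return which parts would be copied (owned) versus viewed."""
    # Copy if the caller asked for it or if the data cannot be viewed from GPU.
    copy_here = copy or not is_device_resident(spec.data)
    return {
        "copy": copy_here,
        # The decision is passed recursively: a copied parent copies its children.
        "children": [plan_copies(child, copy=copy_here) for child in spec.children],
    }


# Host offsets force a copy, which then propagates to the device-resident child.
host_parent = plan_copies(ColumnSpec([0, 2, 3], [ColumnSpec(FakeDeviceArray())]))

# Everything already on device: the whole column can be a non-owning view.
device_parent = plan_copies(
    ColumnSpec(FakeDeviceArray(), [ColumnSpec(FakeDeviceArray())])
)
```

Under this sketch, host and device inputs share one entry point and differ only in the resulting ownership. Separate constructors would instead make that difference explicit in the API.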
