[FEA] Support more flexible construction of nested columns in pylibcudf #17192

vyasr · 2024-10-28T17:16:31Z

Is your feature request related to a problem? Please describe.
Currently pylibcudf exposes a subset of the factories in libcudf. When they were added in #15257, we omitted the factories for nested types due to various difficulties around ownership and around what inputs columns should be constructible from. We also have not seriously considered how to create pylibcudf columns of list or string types whose underlying data and offset arrays are views into other arrays. This type of construction could be done by manual column_view creation in libcudf, but it requires a thorough understanding of Arrow data layouts as well as their implementation in libcudf (especially for strings after the large strings refactor). All of these gaps are particularly problematic because strings, lists, and structs are the data types for which pylibcudf may have the most to offer: beyond simply providing a higher-performance, low-level API that cudf users could reach for when necessary, for these types pylibcudf can offer libcudf functionality that has no home in cudf at all. Making it possible to work with these types transparently in pylibcudf is therefore important for use cases that have no satisfactory solution at present.

Describe the solution you'd like
We should investigate the best ways to enable construction of pylibcudf columns of nested types, including from other data sources like pairs of cupy arrays, and we should make these constructors as easy to use as possible.
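For readers less familiar with the Arrow layout mentioned above, here is a minimal sketch of how a values buffer and an offsets buffer together describe a list or string column. Plain Python lists and bytes stand in for device buffers such as cupy arrays, and both helper names are hypothetical; nothing here is pylibcudf API.

```python
# Illustrative only: plain Python containers stand in for device buffers.


def lists_from_values_and_offsets(values, offsets):
    """Materialize an Arrow-style list column as Python lists.

    Row i is values[offsets[i]:offsets[i + 1]], so a column with n rows
    needs n + 1 offsets.
    """
    return [values[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]


def strings_from_chars_and_offsets(chars, offsets):
    """Same layout for strings: one UTF-8 byte buffer plus offsets."""
    return [
        chars[offsets[i]:offsets[i + 1]].decode("utf-8")
        for i in range(len(offsets) - 1)
    ]


list_rows = lists_from_values_and_offsets([1, 2, 3], [0, 2, 3])
string_rows = strings_from_chars_and_offsets(b"abcde", [0, 2, 2, 5])
print(list_rows)    # [[1, 2], [3]]
print(string_rows)  # ['ab', '', 'cde']
```

Because each row is just a slice, a column built this way can in principle be a view over existing buffers, which is the kind of construction this issue asks for.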

Additional context
Where appropriate, we should consider adding constructors directly to libcudf as well. While it is possible to do everything we need with low-level libcudf APIs, one of the major synergies I anticipate between pylibcudf and libcudf is that pylibcudf will motivate usability improvements in libcudf that might otherwise have little impetus behind them. This is one such case where improving constructors directly in libcudf to help pylibcudf users can help a wider range of users, so we should seize the opportunity if it presents itself.

bdice · 2024-11-26T15:38:32Z

I'm discussing this with @Matt711. Here are some thoughts so far:

- By default all nested factories would be non-owning, and would have a parameter copy=False
- If copy=True or if the data cannot be "viewed" from GPU (e.g. an input that is a NumPy array on CPU) then we would return an owning object (for all parts? or should it just own offsets / just own values?)
- copy is passed recursively, so all children are copied if some parent column had to be copied
- Lists should be constructible from list-column-like objects, e.g. host lists-of-lists [[1, 2], [3]] or PyTorch ragged tensors
- Lists should be constructible from values and offsets, where offsets can be any array-like of integers (validated by default so that it's 0 <= offsets <= len(values) and monotonically increasing, but perhaps we could have a "no-check" option -- I am unsure if pylibcudf tries to avoid introspection like libcudf)
  - Name this something like pylibcudf.make_lists_column_from_values_and_offsets? Should it be a free function or a class method of something? We don't have overloads like in C++ and I don't really want to have a single-argument function that parses a tuple of (values, offsets) as an input, since it's a bit ambiguous.
- Strings should be constructible from a string buffer and offsets (could be zero-copy if the string buffer is a CuPy array of bytes?), or from an array-like of strings (always has to copy/own). Offsets would be validated like they are for lists, perhaps with a "no-check" option.
- Structs should be constructible from a dictionary-like of array-like objects with the same length (so dict objects like {"a": [1, 2], "b": ["c", "d"]} would count)
- PyArrow types and cuDF types should be usable as inputs for pylibcudf constructors
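The validation rules in the list above could be sketched roughly as follows. This is a pure-Python illustration, not a proposed implementation. The function name is borrowed from the suggestion above, but its signature, the `check` flag, and the dict it returns are all assumptions made for the example. The sketch also assumes equal consecutive offsets are allowed, since in the Arrow layout they denote an empty row.

```python
def validate_offsets(offsets, num_values):
    """Check the offsets invariants: non-empty, within [0, num_values],
    and never decreasing (equal neighbors mean an empty row)."""
    offsets = list(offsets)
    if not offsets:
        raise ValueError("offsets must contain at least one element")
    for off in offsets:
        if not 0 <= off <= num_values:
            raise ValueError(f"offset {off} is outside [0, {num_values}]")
    for prev, cur in zip(offsets, offsets[1:]):
        if cur < prev:
            raise ValueError("offsets must not decrease")
    return offsets


def make_lists_column_from_values_and_offsets(values, offsets, check=True):
    """Hypothetical stand-in for the factory discussed above.

    Returns a plain dict instead of a real column; check=False models the
    "no-check" option.
    """
    if check:
        offsets = validate_offsets(offsets, len(values))
    return {"values": values, "offsets": list(offsets)}


def validate_struct_fields(fields):
    """Require every child of a struct-like dict to have the same length."""
    lengths = {name: len(col) for name, col in fields.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"struct children differ in length: {lengths}")
    return next(iter(lengths.values()), 0)


col = make_lists_column_from_values_and_offsets([1, 2, 3], [0, 2, 3])
num_rows = validate_struct_fields({"a": [1, 2], "b": ["c", "d"]})
```

Taking values and offsets as two separate arguments, as here, avoids the ambiguity of parsing a single (values, offsets) tuple.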

Are there other requirements or constraints to consider?

vyasr · 2024-12-10T01:24:37Z

That almost all sounds good. I'm mostly curious about

> By default all nested factories would be non-owning, and would have a parameter copy=False

Could you elaborate on how you expect nested factories to differ from those for primitive types, particularly w.r.t. sources like cupy arrays or other data that's already resident on device? You obviously must copy when coming from the host; do you intend for those to be separate constructors?
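To make the question concrete, here is one possible reading of the copy rule proposed earlier in the thread (copy=False by default, a forced copy for host inputs, and copy passed down to children). It is only a sketch: the `Source` class and `plan_copies` function are hypothetical. It also assumes that a copy propagates only downward from parent to children, which the thread leaves open.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """Hypothetical description of one input buffer or nested column."""
    name: str
    on_device: bool = True
    children: list = field(default_factory=list)


def plan_copies(source, copy=False):
    """Return {name: bool} saying which parts would be copied (owned).

    A node is copied if a copy was requested, if its parent was copied,
    or if its data lives on the host and cannot be viewed from the GPU.
    """
    must_copy = copy or not source.on_device
    plan = {source.name: must_copy}
    for child in source.children:
        plan.update(plan_copies(child, copy=must_copy))
    return plan


# Host offsets paired with device values: only the offsets are copied.
mixed = Source("list", children=[Source("offsets", on_device=False), Source("values")])
mixed_plan = plan_copies(mixed)

# A host-resident parent forces copies of every child.
host = Source("list", on_device=False, children=[Source("offsets"), Source("values")])
host_plan = plan_copies(host)
```

Under this reading a single constructor could serve both host and device inputs, with the host case simply forcing the owning path.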

vyasr added the feature request New feature or request label Oct 28, 2024

vyasr assigned bdice Oct 28, 2024

vyasr added libcudf Affects libcudf (C++/CUDA) code. pylibcudf Issues specific to the pylibcudf package labels Oct 28, 2024

github-project-automation bot added this to cuDF Python Oct 28, 2024

github-project-automation bot moved this to Todo in cuDF Python Oct 28, 2024

vyasr added the Python Affects Python cuDF API. label Oct 28, 2024

bdice assigned Matt711 Nov 26, 2024

bdice mentioned this issue Dec 2, 2024

[DOC] can I rely on Series. _from_column ? #17483

Open
