[C++] Add Substrait support for arrow-specific types (non-parameterized) #40695

westonpace · 2024-03-20T23:37:54Z

Describe the enhancement requested

The Arrow<->Substrait conversion currently only works with the types that are supported by both Arrow and Substrait. I would like to use Substrait expression conversion for filter pushdown (polars can convert to a pyarrow.compute expression, and datafusion can consume a substrait expression, and I would like to bridge the two).

This is currently blocked by the fact that polars uses large_string by default and pyarrow.compute expressions fail to serialize if they contain a large_string type.

Since I know that both the source and destination are arrow I should be able to use the arrow-specific types (substrait will consider them user defined types).

To simplify things, this request only asks for support for non-parameterized types. Arrow-specific parameterized types (e.g. decimal256, large_list, etc.) can come in a future request.

Component(s)

C++

### Rationale for this change

See #40695

### What changes are included in this PR?

This PR does a few things:

* Substrait is upgraded to the latest version
* Support is added for the parameterized timestamp type (but not literals due to substrait-io/substrait#611).
* Support is added for the following arrow-specific types:
  * fp16
  * date_millis
  * time_seconds
  * time_millis
  * time_nanos
  * large_string
  * large_binary

When adding support for the new timestamp types I also relaxed the restrictions on the time zone column. Substrait puts time zone information in the function and not the type. In other words, to print the "America/New York" value of a column of instants one would do something like `to_char(my_timestamp, "America/New York")` instead of `to_char(cast(my_timestamp, timestamp("nanos", "America/New York")))`. However, the current implementation makes it impossible to produce or consume a plan with `to_char(my_timestamp, "America/New York")` because it rejects any type that has a non-UTC time zone. With this latest change, we treat any non-empty timezone as a timestamp_tz type.

In addition, I have enabled conversions from "encoded types" to their unencoded representation. E.g. a type of `DICTIONARY<INT32>` will convert to `INT32`. From a logical expression / plan perspective these encodings are irrelevant. If anything, they may belong in a more physical plan representation. Should a need for them arise we can dig into it more later. However, I believe it is better to err on the side of generating "something" rather than failing in these cases. I don't consider this last change critical and can back it out if need be.

### Are these changes tested?

Yes, I added new unit tests

### Are there any user-facing changes?

Yes, via the Substrait conversion. These changes should be backwards compatible in that they only add functionality in places that previously reported "Not Supported".

* GitHub Issue: #40695

Lead-authored-by: Weston Pace <[email protected]>
Co-authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Weston Pace <[email protected]>
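The three conversion rules described in the PR text can be sketched as a small toy model. This is only an illustration in plain Python, not the Arrow C++ implementation: `ToyType`, `to_substrait_name`, and the `user_defined:` prefix are hypothetical names invented here, and standard types simply pass through unchanged. Only the list of arrow-specific type names and the `timestamp_tz` rule come from the description above.

```python
from dataclasses import dataclass
from typing import Optional

# Arrow-specific type names listed in the PR description.
ARROW_SPECIFIC = {
    "fp16", "date_millis", "time_seconds", "time_millis",
    "time_nanos", "large_string", "large_binary",
}


@dataclass(frozen=True)
class ToyType:
    """Hypothetical stand-in for an Arrow data type."""
    name: str
    timezone: str = ""                       # only meaningful for timestamps
    value_type: Optional["ToyType"] = None   # set for dictionary-encoded types


def to_substrait_name(t: ToyType) -> str:
    """Return a toy label for the Substrait type that `t` would map to."""
    # Rule 1: encoded types convert to their unencoded representation.
    if t.name == "dictionary":
        if t.value_type is None:
            raise ValueError("dictionary type needs a value_type")
        return to_substrait_name(t.value_type)
    # Rule 2: any non-empty time zone is treated as timestamp_tz.
    if t.name == "timestamp":
        return "timestamp_tz" if t.timezone else "timestamp"
    # Rule 3: arrow-specific types are carried as user-defined types.
    if t.name in ARROW_SPECIFIC:
        return f"user_defined:{t.name}"
    # Everything else passes through unchanged in this toy model.
    return t.name


if __name__ == "__main__":
    print(to_substrait_name(ToyType("timestamp", timezone="America/New_York")))
    print(to_substrait_name(ToyType("dictionary", value_type=ToyType("int32"))))
    print(to_substrait_name(ToyType("large_string")))
```

In this sketch the dictionary rule is applied first, so a dictionary-encoded column of an arrow-specific type would still end up as a user-defined type after decoding.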

westonpace · 2024-04-11T02:48:12Z

Issue resolved by pull request 40696
#40696


westonpace added the Type: enhancement label Mar 20, 2024

github-actions bot added the Component: C++ label Mar 20, 2024

westonpace mentioned this issue Mar 20, 2024

GH-40695 [C++] Expand Substrait type support #40696

Merged

github-actions bot assigned westonpace Mar 20, 2024

This was referenced Mar 22, 2024

[C++] Add Substrait support for arrow-specific types (parameterized) #40740

Open

[C++] Add support for precision timestamp literals #40741

Open

westonpace added this to the 16.0.0 milestone Apr 11, 2024

westonpace closed this as completed Apr 11, 2024

attilajeges added a commit to attilajeges/arrow that referenced this issue Jan 4, 2025

apacheGH-40695 [C++] Reduce string inlining in Substrait serde

f905f99

attilajeges added a commit to attilajeges/arrow that referenced this issue Jan 4, 2025

apacheGH-40695 [C++] Reduce string inlining in Substrait serde

f4cfeff
