[FEA] Support for a custom DataSource V2 which supplies Arrow data #1072
Comments
After discussion: data source v1 doesn't support columnar reads, so we switched to data source v2. With data source v2, custom data sources just work, and we insert a HostColumnarToGpu transition to get the data onto the GPU. In this case the data will already be in the Arrow-format ArrowColumnVector, so we can investigate making HostColumnarToGpu smarter about getting the data onto the GPU.
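The reason an Arrow-backed column is worth special-casing: the Arrow layout is essentially a contiguous values buffer plus a validity bitmap per column, so a smarter HostColumnarToGpu could copy those buffers wholesale instead of reading values one at a time. A minimal, library-free Python sketch of that layout (the names here are illustrative, not the plugin's or Arrow's actual API):

```python
# Sketch of an Arrow-style column: contiguous values plus a validity
# bitmap. Illustrative only; real Arrow buffers are typed byte buffers.

class HostColumn:
    def __init__(self, values):
        # None marks a null; keep a placeholder in the values buffer
        # and record validity in a separate bitmap, as Arrow does.
        self.validity = [v is not None for v in values]
        self.values = [0 if v is None else v for v in values]

    def buffers(self):
        # A bulk host-to-device copy could transfer these two buffers
        # directly, instead of iterating value by value.
        return self.values, self.validity

col = HostColumn([1, None, 3])
values, validity = col.buffers()
print(values)    # [1, 0, 3]
print(validity)  # [True, False, True]
```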
Note that, looking at a couple of sample queries, they use round of a decimal (support for which is in progress) and also average of a decimal, which we don't support yet.
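For reference, Spark SQL's `round` on decimals uses HALF_UP rounding, which Python's `decimal` module can mimic; this sketch just illustrates the semantics the GPU implementation has to match, and is not the plugin's code:

```python
from decimal import Decimal, ROUND_HALF_UP

# Spark SQL's round() uses HALF_UP rounding. Average of decimals then
# additionally needs a widened intermediate sum and a divide, which is
# the part not yet supported on the GPU at the time of this comment.
def round_decimal(d: Decimal, scale: int) -> Decimal:
    return d.quantize(Decimal(1).scaleb(-scale), rounding=ROUND_HALF_UP)

print(round_decimal(Decimal("2.345"), 2))  # 2.35
print(round_decimal(Decimal("2.5"), 0))    # 3
```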
Note: for sample queries and data we can look at the NYC taxi ride dataset and queries: https://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/ These are the results from other solutions.
Note: we may also need percentile_approx here.
cuDF issue for percentile_approx: rapidsai/cudf#7170
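For context on what percentile_approx has to compute: an exact percentile with linear interpolation looks like the sketch below; the approximate version trades this exactness for bounded memory on large datasets. This is a generic illustration, not Spark's or cuDF's implementation:

```python
def percentile(sorted_vals, p):
    # Exact percentile over a sorted list, with linear interpolation
    # between the two nearest ranks.
    assert sorted_vals and 0.0 <= p <= 1.0
    pos = p * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

print(percentile([1, 2, 3, 4], 0.5))  # 2.5
```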
The main functionality to support a faster copy when a data source v2 supplies Arrow data was committed under #1622. It supports primitive types and strings; it does not support decimal or nested types yet.
Note: filed a separate issue for the write side: #1648.
…IDIA#1072) Signed-off-by: spark-rapids automation <[email protected]>
Is your feature request related to a problem? Please describe.
When I executed an aggregation query with our custom data source, the physical plan of the query looked like this.
This shows that InternalRows are built first and then transformed into ColumnarBatches by the GpuRowToColumnar plan. If the custom DataSource could provide RDD[ColumnarBatch] to spark-rapids directly, it would be more efficient because the conversion overhead would be removed.
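The conversion that GpuRowToColumnar performs is essentially a transpose of row-oriented records into per-column arrays, repeated for every batch. A rough Python sketch of that pass (schematic only, not the plugin's implementation):

```python
def rows_to_columns(rows):
    # Transpose row-oriented records into per-column arrays.
    # When the source already yields columnar batches, this whole
    # pass (and its per-value iteration) can be skipped.
    if not rows:
        return []
    return [list(col) for col in zip(*rows)]

rows = [(1, "a"), (2, "b"), (3, "c")]
print(rows_to_columns(rows))  # [[1, 2, 3], ['a', 'b', 'c']]
```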
Describe the solution you'd like
The changed physical plan would look like this.