Commit
This PR adds support for reading and using the `arrow:schema` struct from the serialized `arrow:ipc` message written to the key-value metadata section of the Parquet file under the `ARROW:schema` key. This allows cudf to read and interoperate with Arrow for non-standard Parquet types (`DurationType` in this PR).

Arrow uses Google flatbuffers (inside Schema.fbs) to serialize the `arrow:Schema` structure (containing column descriptors) and puts it (padded for 8-byte alignment) into the header of an empty `ipc:Message` (also a flatbuffer-serialized structure, defined in Message.fbs). The `ipc:Message` is prepended with two integers containing a `validity` message and the `size of the header` (the `arrow:Schema` + padding). The final message is encoded as a base64 string and written to the Parquet file footer key-value metadata using the `"ARROW:schema"` key.

In this PR, we base64-decode the `ipc:Message`, then decode the `validity` message and the header size, and offset the pointer to the `arrow:Schema` flatbuffer. We then use Flatbuffer structs to walk the `arrow:Schema` and collect information on columns of interest in an unordered_map (using the column name as key). This unordered_map is used inside the `select_columns` function to build cudf Table columns and get the correct `dtype`.

Closes #13410

Authors:
- Muhammad Haseeb (https://github.com/mhaseeb123)
- Vukasin Milovanovic (https://github.com/vuule)
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Yunsong Wang (https://github.com/PointKernel)
- Vukasin Milovanovic (https://github.com/vuule)
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Bradley Dice (https://github.com/bdice)

URL: #15617
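The decoding steps in the description (base64-decode, read the two leading integers, then offset to the flatbuffer) can be illustrated with a minimal Python sketch. This is not the cudf implementation, which is written in C++. It assumes the two integers are little-endian 32-bit values and that the `validity` integer is the `0xFFFFFFFF` continuation marker used by the Arrow IPC encapsulated message format. The function names and the demo payload are hypothetical, and walking the flatbuffer itself (which needs code generated from Schema.fbs and Message.fbs) is not shown.

```python
import base64
import struct

# Assumption: the leading "validity" integer is the Arrow IPC continuation marker.
CONTINUATION_MARKER = 0xFFFFFFFF
# Assumption: two little-endian 32-bit integers (marker, then payload size).
PREFIX = struct.Struct("<Ii")


def split_arrow_schema_value(b64_value: str) -> bytes:
    """Return the flatbuffer bytes carried by an ARROW:schema metadata value."""
    raw = base64.b64decode(b64_value)
    if len(raw) < PREFIX.size:
        raise ValueError("message is shorter than its 8-byte prefix")
    marker, size = PREFIX.unpack_from(raw, 0)
    if marker != CONTINUATION_MARKER:
        raise ValueError("unexpected validity marker")
    if size < 0 or PREFIX.size + size > len(raw):
        raise ValueError("declared size exceeds the decoded message")
    # Everything after the prefix, up to the declared size, is the flatbuffer.
    return raw[PREFIX.size:PREFIX.size + size]


def encode_for_demo(payload: bytes) -> str:
    """Hypothetical helper: wrap opaque bytes the same way, for the demo only."""
    padded = payload + b"\x00" * (-len(payload) % 8)  # pad to 8-byte alignment
    framed = PREFIX.pack(CONTINUATION_MARKER, len(padded)) + padded
    return base64.b64encode(framed).decode("ascii")


# Round trip with placeholder bytes standing in for a real flatbuffer.
demo_payload = b"not-a-real-flatbuffer"
decoded = split_arrow_schema_value(encode_for_demo(demo_payload))
```

The returned bytes are what a flatbuffer reader would then walk to build the column-name-keyed map described above. The bounds checks matter because the value comes from file metadata and cannot be trusted to be well formed.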
1 parent fa9d028, commit c5c95b7
Showing 18 changed files with 5,152 additions and 15 deletions.
@@ -0,0 +1,33 @@
# =============================================================================
# Copyright (c) 2024, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
# in compliance with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software distributed under the License
# is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
# or implied. See the License for the specific language governing permissions and limitations under
# the License.
# =============================================================================

# Use CPM to find or clone flatbuffers
function(find_and_configure_flatbuffers VERSION)

  rapids_cpm_find(
    flatbuffers ${VERSION}
    GLOBAL_TARGETS flatbuffers
    CPM_ARGS
    GIT_REPOSITORY https://github.com/google/flatbuffers.git
    GIT_TAG v${VERSION}
    GIT_SHALLOW TRUE
  )

  rapids_export_find_package_root(
    BUILD flatbuffers "${flatbuffers_BINARY_DIR}" EXPORT_SET cudf-exports
  )

endfunction()

find_and_configure_flatbuffers(24.3.25)