Adds GPU implementation of JSON-token-stream to JSON-tree #11518
Squashed commit of the following: commit 6e1bc75 Author: Karthikeyan Natarajan <[email protected]> Date: Fri Aug 12 03:06:30 2022 +0530 remove debug print in logical stack commit 8e75645 Author: Karthikeyan Natarajan <[email protected]> Date: Fri Aug 12 03:01:34 2022 +0530 remove duplicate renamed header commit 3b2acb2 Merge: 2b59b04 a67b718 Author: Karthikeyan Natarajan <[email protected]> Date: Fri Aug 12 02:59:01 2022 +0530 Merge branch 'branch-22.10' of https://github.com/rapidsai/cudf into json-tree commit 2b59b04 Merge: 12cf0be 2d214ea Author: Karthikeyan Natarajan <[email protected]> Date: Tue Jul 26 13:40:41 2022 +0530 Merge branch 'branch-22.08' of https://github.com/rapidsai/cudf into json-tree commit 12cf0be Author: Karthikeyan Natarajan <[email protected]> Date: Tue Jul 26 13:29:55 2022 +0530 fix clang-format style fix commit 3e756bb Author: Elias Stehle <[email protected]> Date: Mon Jul 18 08:17:03 2022 -0700 replaces tree return type from tuple to struct commit bef4fb1 Author: Elias Stehle <[email protected]> Date: Mon May 16 22:10:08 2022 -0700 moved debug print to detail ns commit ff90528 Author: Elias Stehle <[email protected]> Date: Fri May 13 09:52:20 2022 -0700 squash & rebase on latest tokenizer version commit 987699f Author: Elias Stehle <[email protected]> Date: Thu Jun 2 05:19:53 2022 -0700 fixes sg-count & uses rmm stream in fst tests commit 00a95eb Author: Elias Stehle <[email protected]> Date: Mon Apr 25 12:17:08 2022 -0700 put lookup tables into their own cudf file commit a8ac5fa Author: Elias Stehle <[email protected]> Date: Mon Apr 25 09:59:37 2022 -0700 refactored lookup tables commit f996ce9 Author: Elias Stehle <[email protected]> Date: Mon Apr 11 12:17:55 2022 -0700 squashed with bracket/brace test commit 671ce41 Author: Elias Stehle <[email protected]> Date: Tue Apr 12 22:55:00 2022 -0700 minor style changes addressing review comments commit f4ec994 Author: Elias Stehle <[email protected]> Date: Mon Apr 4 07:35:33 2022 -0700 
device_span commit d18238f Author: Elias Stehle <[email protected]> Date: Mon Apr 4 02:28:30 2022 -0700 renaming key-value store op to stack_op commit 62ddf66 Author: Elias Stehle <[email protected]> Date: Thu Mar 31 05:28:17 2022 -0700 switched to using rmm also inside algorithm commit 2f7b254 Author: Elias Stehle <[email protected]> Date: Thu Mar 31 04:11:44 2022 -0700 Added utility to debug print & instrumented code to use it commit 67f609d Author: Elias Stehle <[email protected]> Date: Thu Jul 14 04:15:11 2022 -0700 renames enums & moving from device_span to ptr params commit 01aef44 Author: Elias Stehle <[email protected]> Date: Wed Jul 13 07:22:52 2022 -0700 wraps if with stream params into detail ns commit 4aaf595 Author: Elias Stehle <[email protected]> Date: Wed Jul 13 05:45:49 2022 -0700 fixes for breaking downstream interface changes commit 237456d Author: Elias Stehle <[email protected]> Date: Thu Jun 2 08:19:37 2022 -0700 fixes breaking changes from dependent-FST-PR commit 7fc8619 Author: Elias Stehle <[email protected]> Date: Tue May 3 07:05:44 2022 -0700 rebase on latest FST commit 6d3eff2 Author: Elias Stehle <[email protected]> Date: Thu Jun 2 05:19:53 2022 -0700 fixes sg-count & uses rmm stream in fst tests commit 6548836 Author: Elias Stehle <[email protected]> Date: Mon Apr 25 12:17:08 2022 -0700 put lookup tables into their own cudf file commit 9dfd4ad Author: Elias Stehle <[email protected]> Date: Mon Apr 25 09:59:37 2022 -0700 refactored lookup tables commit fe06f0b Author: Elias Stehle <[email protected]> Date: Mon Apr 11 12:17:55 2022 -0700 squashed with bracket/brace test commit 36c8296 Author: Elias Stehle <[email protected]> Date: Tue Apr 12 22:55:00 2022 -0700 minor style changes addressing review comments commit 24dab9e Author: Elias Stehle <[email protected]> Date: Mon Apr 4 07:35:33 2022 -0700 device_span commit 49fa996 Author: Elias Stehle <[email protected]> Date: Mon Apr 4 02:28:30 2022 -0700 renaming key-value store op to 
stack_op commit b260610 Author: Elias Stehle <[email protected]> Date: Thu Mar 31 05:28:17 2022 -0700 switched to using rmm also inside algorithm commit 9b20d16 Author: Elias Stehle <[email protected]> Date: Thu Mar 31 04:11:44 2022 -0700 Added utility to debug print & instrumented code to use it commit 78dd893 Merge: 8a184e9 9627091 Author: Elias Stehle <[email protected]> Date: Fri Jul 15 23:06:55 2022 -0700 Merge remote-tracking branch 'upstream/branch-22.08' into feature/finite-state-transducer-trimmed commit 8a184e9 Author: Elias Stehle <[email protected]> Date: Fri Jul 15 22:51:18 2022 -0700 rephrases documentation on in-reg array commit bea2a02 Author: Elias Stehle <[email protected]> Date: Fri Jul 15 01:54:20 2022 -0700 replaces vanilla loop with iota commit cba1619 Author: Elias Stehle <[email protected]> Date: Thu Jul 14 09:31:12 2022 -0700 fixes style in dispatch dfa commit 3f47952 Author: Elias Stehle <[email protected]> Date: Thu Jul 14 09:22:03 2022 -0700 replaces gtest asserts with expects commit d351e5c Author: Elias Stehle <[email protected]> Date: Thu Jul 14 09:17:59 2022 -0700 addresses style review comments & fixes a todo commit 3038058 Author: Elias Stehle <[email protected]> Date: Thu Jul 14 09:17:09 2022 -0700 adds excplitis error checking commit f52e614 Author: Elias Stehle <[email protected]> Date: Thu Jul 14 09:16:18 2022 -0700 replaces enum with typed constexpr commit eb24962 Author: Elias Stehle <[email protected]> Date: Tue Jul 12 04:52:36 2022 -0700 fixes logical stack test includes commit a798852 Author: Elias Stehle <[email protected]> Date: Mon Jul 11 11:00:22 2022 -0700 adds check for state transition narrowing conversion commit e6f8def Author: Elias Stehle <[email protected]> Date: Mon Jul 11 09:06:01 2022 -0700 some west-const remainders & unifies StateIndexT commit 5f1c4b5 Author: Elias Stehle <[email protected]> Date: Mon Jul 11 06:26:47 2022 -0700 removes state vector-wrapper in favor of vanilla array commit 485a1c6 Author: 
Elias Stehle <[email protected]> Date: Fri Jul 8 22:49:57 2022 -0700 adopts c++17 namespaces declarations commit f656f49 Author: Elias Stehle <[email protected]> Date: Thu Jul 7 02:41:16 2022 -0700 adopts device-side test data gen commit 694a365 Author: Elias Stehle <[email protected]> Date: Wed Jun 15 04:28:51 2022 -0700 adopts suggested fst test changes commit 9fe8e4b Author: Elias Stehle <[email protected]> Date: Tue Jun 14 03:12:35 2022 -0700 minor doxygen fix commit eccf970 Author: Elias Stehle <[email protected]> Date: Thu Jun 2 05:19:53 2022 -0700 fixes sg-count & uses rmm stream in fst tests commit 6fdd24a Author: Elias Stehle <[email protected]> Date: Mon May 9 12:17:34 2022 -0700 refactor lut sanity check commit 17dcbfd Author: Elias Stehle <[email protected]> Date: Mon May 9 10:33:00 2022 -0700 making const vars const commit ea79a81 Author: Elias Stehle <[email protected]> Date: Mon May 9 10:32:17 2022 -0700 Adding hostdevice macros to in-reg array commit caf6195 Author: Elias Stehle <[email protected]> Date: Mon May 9 10:24:51 2022 -0700 unified usage of pragma unrolls commit e24a133 Author: Elias Stehle <[email protected]> Date: Wed May 4 07:29:00 2022 -0700 removing unused var post-cleanup commit 39cff80 Author: Elias Stehle <[email protected]> Date: Wed Apr 27 04:42:31 2022 -0700 Change interface for FST to not need temp storage commit 239f138 Author: Elias Stehle <[email protected]> Date: Mon Apr 25 12:17:08 2022 -0700 put lookup tables into their own cudf file commit 39a6b65 Author: Elias Stehle <[email protected]> Date: Mon Apr 25 09:59:37 2022 -0700 refactored lookup tables commit 355d1e4 Author: Elias Stehle <[email protected]> Date: Wed Apr 20 05:11:32 2022 -0700 clean up & addressing review comments commit 0557d41 Author: Elias Stehle <[email protected]> Date: Mon Apr 11 12:17:55 2022 -0700 squashed with bracket/brace test
Codecov Report
@@ Coverage Diff @@
## branch-22.10 #11518 +/- ##
===============================================
Coverage ? 86.39%
===============================================
Files ? 145
Lines ? 23014
Branches ? 0
===============================================
Hits ? 19883
Misses ? 3131
Partials ? 0
…fea-json-tree-gpu
Flushing a few more comments. Still digging into a few more algorithmic details, following up with another review pass shortly
@gpucibot merge
Adds JSON tree traversal algorithm in host and device. It generates column indices for the _record_ orient JSON format: a list of structs at the root, where each struct is a row.

- [x] column indices generation
- [x] row offset

Depends on PR #11518

### Tree Traversal

This algorithm assigns a unique column id to each node in the tree. The row offset is the row index of the node in that column id.

Algorithm:

1. Convert node_category+fieldname to node_type.
   a. Create a hashmap to hash field names and assign unique node ids as values.
   b. Convert the node categories to node types. The node type is the node category enum value if the node is not a field node; otherwise it is the unique node id assigned by the hashmap (value shifted by #NUM_CATEGORY).
2. Preprocessing: Translate parent node ids after sorting by level.
   a. sort by level
   b. get gather map of sorted indices
   c. translate parent_node_ids to new sorted indices
3. Find level boundaries: copy_if index of first unique values of sorted levels.
4. Per-Level Processing: Propagate parent node ids for each level. For each level,
   a. gather col_id from previous level results. input=col_id, gather_map is parent_indices.
   b. stable sort by {parent_col_id, node_type}
   c. scan sum of unique {parent_col_id, node_type}
   d. scatter the col_id back to stable node_level order (using scatter_indices)

   Restore original node_id order.
5. Generate row_offset.
   a. stable_sort by parent_col_id.
   b. scan_by_key {parent_col_id} (required only on nodes whose parent is a list)
   c. propagate to non-list leaves from parent list node by recursion

Authors:
- Karthikeyan (https://github.com/karthikeyann)

Approvers:
- Elias Stehle (https://github.com/elstehle)
- Tobias Ribizel (https://github.com/upsj)
- Yunsong Wang (https://github.com/PointKernel)
- David Wendt (https://github.com/davidwendt)

URL: #11610
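The goal of the column id assignment above can be illustrated with a small sequential sketch. This is not the sort/scan-based device algorithm from the PR; it is a host-side stand-in that assumes nodes are stored so that every parent precedes its children, and it numbers columns in first-seen order, which may differ from the ids the GPU version produces. The function name `assign_column_ids` and the integer node types are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Sequential stand-in for the per-level processing: nodes that share the same
// {parent column id, node type} pair end up in the same column.
// parent_ids[i] is the parent node id of node i, or -1 for the root.
std::vector<int32_t> assign_column_ids(std::vector<int32_t> const& parent_ids,
                                       std::vector<int32_t> const& node_types)
{
  assert(parent_ids.size() == node_types.size());
  // Maps {parent column id, node type} to the column id assigned to that pair
  std::map<std::pair<int32_t, int32_t>, int32_t> column_of;
  std::vector<int32_t> col_ids(parent_ids.size());
  for (std::size_t i = 0; i < parent_ids.size(); ++i) {
    int32_t const parent = parent_ids[i];
    // Assumption of this sketch: parents are stored before their children
    assert(parent < static_cast<int32_t>(i));
    int32_t const parent_col = parent < 0 ? -1 : col_ids[parent];
    auto const key           = std::make_pair(parent_col, node_types[i]);
    // emplace keeps the existing column id if the pair was seen before
    auto const it = column_of.emplace(key, static_cast<int32_t>(column_of.size())).first;
    col_ids[i]    = it->second;
  }
  return col_ids;
}
```

For an input such as `[{"a":1},{"a":2,"b":3}]`, both structs share one column, both `a` fields share another, and `b` gets its own column.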
This PR creates JSON columns from the traversed JSON tree. It has the following parts:

1. `reduce_to_column_tree` - Reduces the node tree into a column tree by aggregating each property of each column and the number of rows in each column.
2. `make_json_column2` - Creates the GPU JSON column tree structure from the tree and column info.
3. `json_column_to_cudf_column2` - Converts this GPU JSON column to a cudf column.
4. `parse_nested_json2` - Combines the JSON tokenizer, JSON tree generation, traversal, JSON column creation, and cudf column conversion. All steps run on device.

Depends on PR #11518 #11610

For code-review, use PR karthikeyann#5 which contains only these tree changes.

### Overview

- PR #11264 Tokenizes the JSON string to Tokens
- PR #11518 Converts Tokens to Nodes (tree representation)
- PR #11610 Traverses this node tree --> assigns column id and row index to each node.
- This PR #11714 Converts this traversed tree into JSON Column, which in turn is translated to `cudf::column`

JSON has 5 categories of nodes: STRUCT, LIST, FIELD, VALUE, STRING. STRUCT and LIST are nested types. FIELD nodes are struct columns' keys. A VALUE node is similar to a STRING column but without double quotes. Actual datatype conversion happens in `json_column_to_cudf_column2`.

Tree Representation `tree_meta_t` has 4 data members.

1. node categories
2. node parents' id
3. node level
4. node's string range {begin, end} (as 2 vectors)

Currently supported JSON formats are records orient and JSON lines.

### This PR - Detailed explanation

This PR has 3 steps, plus a top-level function that combines them.

1. `reduce_to_column_tree`
   - Required to compute total number of columns, column type, nested column structure, and number of rows in each column.
   - Generates `tree_meta_t` data members for column.
     - Sort node tree by col_id (stable sort)
     - reduce_by_key custom_op on node_categories, collapses to column category
     - unique_by_key_copy by col_id, copies first parent_node_id, string_ranges. This parent_node_id will be transformed to parent_column_id.
     - reduce_by_key max on row_offsets gives the maximum row offset in each column. Propagate list column children's max row offset to their children because sometimes structs may miss entries, so the parent list gives the correct count.
2. `make_json_column2`
   - Converts nodes to GPU JSON columns in tree structure
     - get column tree, transfer column names to host.
     - Create `d_json_column` for non-field columns.
     - if 2 columns occur on the same path, and one of them is nested and the other is a string column, discard the string column.
     - For STRUCT, LIST, VALUE, STRING nodes, set the validity bits, and copy string {begin, end} range to string_offsets and string length.
     - Compute list offset
     - Perform scan max operation on offsets (to fill 0's with previous offset value).
   - Now the `d_json_column` is nested, and contains offsets, validity bits, and unparsed, unconverted string information.
3. `json_column_to_cudf_column2` - converts this GPU JSON column to a cudf column.
   - Recursively goes over each `d_json_column` and converts it to `cudf::column` by inferring the type, parsing the string to that type, and setting validity bits further.
4. `parse_nested_json2` - combines the JSON tokenizer, JSON tree generation, traversal, JSON column creation, and cudf column conversion. All steps run on device.

Authors:
- Karthikeyan (https://github.com/karthikeyann)
- Elias Stehle (https://github.com/elstehle)
- Yunsong Wang (https://github.com/PointKernel)
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- Tobias Ribizel (https://github.com/upsj)
- https://github.com/nvdbaranec
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #11714
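Two of the reductions described above, the per-column maximum row offset and the max scan that fills zero offsets with the previous value, can be sketched on the host as follows. This is only an illustration under assumptions: the function names `max_row_offsets` and `fill_offsets` are hypothetical, column ids are assumed to be dense in `[0, num_columns)`, and the PR itself performs these steps with device-side reduce_by_key and scan operations rather than sequential loops.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Host-side equivalent of "reduce_by_key max on row_offsets": for each column
// id, keep the largest row offset seen among its nodes (-1 if it has no nodes).
std::vector<int32_t> max_row_offsets(std::vector<int32_t> const& col_ids,
                                     std::vector<int32_t> const& row_offsets,
                                     int32_t num_columns)
{
  assert(col_ids.size() == row_offsets.size());
  std::vector<int32_t> result(num_columns, -1);
  for (std::size_t i = 0; i < col_ids.size(); ++i) {
    assert(col_ids[i] >= 0 && col_ids[i] < num_columns);
    result[col_ids[i]] = std::max(result[col_ids[i]], row_offsets[i]);
  }
  return result;
}

// Host-side equivalent of the inclusive "scan max" on offsets: every entry is
// replaced by the running maximum, so a 0 left by a missing entry takes the
// previous offset value.
std::vector<int32_t> fill_offsets(std::vector<int32_t> offsets)
{
  std::partial_sum(offsets.begin(), offsets.end(), offsets.begin(),
                   [](int32_t a, int32_t b) { return std::max(a, b); });
  return offsets;
}
```

The max scan works here because offsets are non-decreasing wherever they are set, so the running maximum is always the most recent valid offset.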
Description
Adds GPU implementation of JSON-token-stream to JSON-tree
Depends on PR #11291 (Adds JSON-token-stream to JSON-tree)
This PR adds the stage of converting a JSON input into a tree representation, where each node represents either a struct, a list, a field name, a string value, a value, or an error node.
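As a rough illustration of what this stage computes, the sketch below builds parent ids and levels from a token stream sequentially, using a stack of open nodes. It is a host-side stand-in, not the GPU implementation in this PR. The token and category names are hypothetical, and the sketch assumes that a field-name node is a child of its struct and that the field's value is a child of the field-name node.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical token and category names, chosen for this sketch only
enum class Token { StructBegin, StructEnd, ListBegin, ListEnd, FieldName, String, Value };
enum class Category { Struct, List, Field, String, Value, Error };

struct Tree {
  std::vector<Category> categories;
  std::vector<int32_t> parent_ids;  // -1 marks the root
  std::vector<int32_t> levels;
};

Tree tokens_to_tree(std::vector<Token> const& tokens)
{
  Tree tree;
  std::vector<int32_t> open;  // stack of node ids that can still receive children

  // Appends a node under the innermost open node and returns its id
  auto add_node = [&](Category cat) {
    int32_t const parent = open.empty() ? -1 : open.back();
    int32_t const level  = static_cast<int32_t>(open.size());
    tree.categories.push_back(cat);
    tree.parent_ids.push_back(parent);
    tree.levels.push_back(level);
    return static_cast<int32_t>(tree.categories.size()) - 1;
  };

  // A field node stays open only until its single value has been consumed
  auto close_field = [&]() {
    if (!open.empty() && tree.categories[open.back()] == Category::Field) { open.pop_back(); }
  };

  for (Token const t : tokens) {
    switch (t) {
      case Token::StructBegin: open.push_back(add_node(Category::Struct)); break;
      case Token::ListBegin: open.push_back(add_node(Category::List)); break;
      case Token::FieldName: open.push_back(add_node(Category::Field)); break;
      case Token::String:
        add_node(Category::String);
        close_field();
        break;
      case Token::Value:
        add_node(Category::Value);
        close_field();
        break;
      case Token::StructEnd:
      case Token::ListEnd:
        assert(!open.empty());
        open.pop_back();  // close the struct or list itself
        close_field();    // the struct or list may have been a field's value
        break;
    }
  }
  return tree;
}
```

The sketch does not produce error nodes or string ranges; it only shows how nesting in the token stream maps to parent ids and levels.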
The PR is part of a multi-part PR-chain. Specifically, this PR builds on the JSON tokenizer PR.
This PR depends on:
⛓️ #11264
⛓️ #11242
⛓️ #11078
Each node has one of the following categories:
For each node, the tree representation stores the following information:
An example tree:
The following is just an example print of the information represented in the tree generated by the algorithm.
<[NODE_ID]:[NODE_CATEGORY]:[[RANGE_BEGIN],[RANGE_END]) '[STRING_FROM_RANGE]'>
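A line in this format could be produced by a helper like the one below. This is a minimal sketch: the `NodeInfo` struct, the `format_node` function, and the category spelling passed in are hypothetical and are not the debug-print utility from the PR. The range is treated as half-open, matching the `[begin,end)` notation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical per-node record used only for this sketch
struct NodeInfo {
  std::string category;     // printable category name
  std::size_t range_begin;  // first character of the node in the JSON input
  std::size_t range_end;    // one past the last character (exclusive)
};

// Formats one node as <ID:CATEGORY:[BEGIN,END) 'SUBSTRING'>
std::string format_node(std::size_t node_id, NodeInfo const& node, std::string const& json)
{
  assert(node.range_begin <= node.range_end && node.range_end <= json.size());
  return "<" + std::to_string(node_id) + ":" + node.category + ":[" +
         std::to_string(node.range_begin) + "," + std::to_string(node.range_end) + ") '" +
         json.substr(node.range_begin, node.range_end - node.range_begin) + "'>";
}
```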
The original JSON for this tree:
The tree:
The original JSON pretty-printed for this tree:
Checklist