Adds JSON-token-stream to JSON-tree #11291
Approve for cmake change
Codecov Report
Additional details and impacted files

Coverage diff:

| Metric | branch-22.10 | #11291 | +/- |
|---|---|---|---|
| Coverage | ? | 86.48% | |
| Files | ? | 145 | |
| Lines | ? | 22840 | |
| Branches | ? | 0 | |
| Hits | ? | 19753 | |
| Misses | ? | 3087 | |
| Partials | ? | 0 | |
Squashed commit of the following:

- commit 6e1bc75, Karthikeyan Natarajan, Fri Aug 12 03:06:30 2022 +0530: remove debug print in logical stack
- commit 8e75645, Karthikeyan Natarajan, Fri Aug 12 03:01:34 2022 +0530: remove duplicate renamed header
- commit 3b2acb2 (Merge: 2b59b04 a67b718), Karthikeyan Natarajan, Fri Aug 12 02:59:01 2022 +0530: Merge branch 'branch-22.10' of https://github.com/rapidsai/cudf into json-tree
- commit 2b59b04 (Merge: 12cf0be 2d214ea), Karthikeyan Natarajan, Tue Jul 26 13:40:41 2022 +0530: Merge branch 'branch-22.08' of https://github.com/rapidsai/cudf into json-tree
- commit 12cf0be, Karthikeyan Natarajan, Tue Jul 26 13:29:55 2022 +0530: fix clang-format style fix
- commit 3e756bb, Elias Stehle, Mon Jul 18 08:17:03 2022 -0700: replaces tree return type from tuple to struct
- commit bef4fb1, Elias Stehle, Mon May 16 22:10:08 2022 -0700: moved debug print to detail ns
- commit ff90528, Elias Stehle, Fri May 13 09:52:20 2022 -0700: squash & rebase on latest tokenizer version
- commit 987699f, Elias Stehle, Thu Jun 2 05:19:53 2022 -0700: fixes sg-count & uses rmm stream in fst tests
- commit 00a95eb, Elias Stehle, Mon Apr 25 12:17:08 2022 -0700: put lookup tables into their own cudf file
- commit a8ac5fa, Elias Stehle, Mon Apr 25 09:59:37 2022 -0700: refactored lookup tables
- commit f996ce9, Elias Stehle, Mon Apr 11 12:17:55 2022 -0700: squashed with bracket/brace test
- commit 671ce41, Elias Stehle, Tue Apr 12 22:55:00 2022 -0700: minor style changes addressing review comments
- commit f4ec994, Elias Stehle, Mon Apr 4 07:35:33 2022 -0700: device_span
- commit d18238f, Elias Stehle, Mon Apr 4 02:28:30 2022 -0700: renaming key-value store op to stack_op
- commit 62ddf66, Elias Stehle, Thu Mar 31 05:28:17 2022 -0700: switched to using rmm also inside algorithm
- commit 2f7b254, Elias Stehle, Thu Mar 31 04:11:44 2022 -0700: Added utility to debug print & instrumented code to use it
- commit 67f609d, Elias Stehle, Thu Jul 14 04:15:11 2022 -0700: renames enums & moving from device_span to ptr params
- commit 01aef44, Elias Stehle, Wed Jul 13 07:22:52 2022 -0700: wraps if with stream params into detail ns
- commit 4aaf595, Elias Stehle, Wed Jul 13 05:45:49 2022 -0700: fixes for breaking downstream interface changes
- commit 237456d, Elias Stehle, Thu Jun 2 08:19:37 2022 -0700: fixes breaking changes from dependent-FST-PR
- commit 7fc8619, Elias Stehle, Tue May 3 07:05:44 2022 -0700: rebase on latest FST
- commit 6d3eff2, Elias Stehle, Thu Jun 2 05:19:53 2022 -0700: fixes sg-count & uses rmm stream in fst tests
- commit 6548836, Elias Stehle, Mon Apr 25 12:17:08 2022 -0700: put lookup tables into their own cudf file
- commit 9dfd4ad, Elias Stehle, Mon Apr 25 09:59:37 2022 -0700: refactored lookup tables
- commit fe06f0b, Elias Stehle, Mon Apr 11 12:17:55 2022 -0700: squashed with bracket/brace test
- commit 36c8296, Elias Stehle, Tue Apr 12 22:55:00 2022 -0700: minor style changes addressing review comments
- commit 24dab9e, Elias Stehle, Mon Apr 4 07:35:33 2022 -0700: device_span
- commit 49fa996, Elias Stehle, Mon Apr 4 02:28:30 2022 -0700: renaming key-value store op to stack_op
- commit b260610, Elias Stehle, Thu Mar 31 05:28:17 2022 -0700: switched to using rmm also inside algorithm
- commit 9b20d16, Elias Stehle, Thu Mar 31 04:11:44 2022 -0700: Added utility to debug print & instrumented code to use it
- commit 78dd893 (Merge: 8a184e9 9627091), Elias Stehle, Fri Jul 15 23:06:55 2022 -0700: Merge remote-tracking branch 'upstream/branch-22.08' into feature/finite-state-transducer-trimmed
- commit 8a184e9, Elias Stehle, Fri Jul 15 22:51:18 2022 -0700: rephrases documentation on in-reg array
- commit bea2a02, Elias Stehle, Fri Jul 15 01:54:20 2022 -0700: replaces vanilla loop with iota
- commit cba1619, Elias Stehle, Thu Jul 14 09:31:12 2022 -0700: fixes style in dispatch dfa
- commit 3f47952, Elias Stehle, Thu Jul 14 09:22:03 2022 -0700: replaces gtest asserts with expects
- commit d351e5c, Elias Stehle, Thu Jul 14 09:17:59 2022 -0700: addresses style review comments & fixes a todo
- commit 3038058, Elias Stehle, Thu Jul 14 09:17:09 2022 -0700: adds excplitis error checking
- commit f52e614, Elias Stehle, Thu Jul 14 09:16:18 2022 -0700: replaces enum with typed constexpr
- commit eb24962, Elias Stehle, Tue Jul 12 04:52:36 2022 -0700: fixes logical stack test includes
- commit a798852, Elias Stehle, Mon Jul 11 11:00:22 2022 -0700: adds check for state transition narrowing conversion
- commit e6f8def, Elias Stehle, Mon Jul 11 09:06:01 2022 -0700: some west-const remainders & unifies StateIndexT
- commit 5f1c4b5, Elias Stehle, Mon Jul 11 06:26:47 2022 -0700: removes state vector-wrapper in favor of vanilla array
- commit 485a1c6, Elias Stehle, Fri Jul 8 22:49:57 2022 -0700: adopts c++17 namespaces declarations
- commit f656f49, Elias Stehle, Thu Jul 7 02:41:16 2022 -0700: adopts device-side test data gen
- commit 694a365, Elias Stehle, Wed Jun 15 04:28:51 2022 -0700: adopts suggested fst test changes
- commit 9fe8e4b, Elias Stehle, Tue Jun 14 03:12:35 2022 -0700: minor doxygen fix
- commit eccf970, Elias Stehle, Thu Jun 2 05:19:53 2022 -0700: fixes sg-count & uses rmm stream in fst tests
- commit 6fdd24a, Elias Stehle, Mon May 9 12:17:34 2022 -0700: refactor lut sanity check
- commit 17dcbfd, Elias Stehle, Mon May 9 10:33:00 2022 -0700: making const vars const
- commit ea79a81, Elias Stehle, Mon May 9 10:32:17 2022 -0700: Adding hostdevice macros to in-reg array
- commit caf6195, Elias Stehle, Mon May 9 10:24:51 2022 -0700: unified usage of pragma unrolls
- commit e24a133, Elias Stehle, Wed May 4 07:29:00 2022 -0700: removing unused var post-cleanup
- commit 39cff80, Elias Stehle, Wed Apr 27 04:42:31 2022 -0700: Change interface for FST to not need temp storage
- commit 239f138, Elias Stehle, Mon Apr 25 12:17:08 2022 -0700: put lookup tables into their own cudf file
- commit 39a6b65, Elias Stehle, Mon Apr 25 09:59:37 2022 -0700: refactored lookup tables
- commit 355d1e4, Elias Stehle, Wed Apr 20 05:11:32 2022 -0700: clean up & addressing review comments
- commit 0557d41, Elias Stehle, Mon Apr 11 12:17:55 2022 -0700: squashed with bracket/brace test
Adds GPU implementation of JSON-token-stream to JSON-tree

Depends on PR [Adds JSON-token-stream to JSON-tree](#11291) #11291

<details>

---

This PR adds the stage of converting a JSON input into a tree representation, where each node represents either a struct, a list, a field name, a string value, a value, or an error node.

The PR is part of a multi-part PR-chain. Specifically, this PR builds on the [JSON tokenizer PR](#11264).

**This PR depends on:**
⛓️ #11264
⛓️ #11242
⛓️ #11078

**Each node has one of the following categories:**

```
/// A node representing a struct
NC_STRUCT,
/// A node representing a list
NC_LIST,
/// A node representing a field name
NC_FN,
/// A node representing a string value
NC_STR,
/// A node representing a numeric or literal value (e.g., true, false, null)
NC_VAL,
/// A node representing a parser error
NC_ERR
```

**For each node, the tree representation stores the following information:**
- node category
- node level
- node range begin (index of the first character from the original JSON input that this node demarcates)
- node range end (index one past the last character from the original JSON input that this node demarcates)

**An example tree:**
The following is just an example print of the information represented in the tree generated by the algorithm.
- Each line prints the full path to the next node in the tree.
- For each node along the path, the format is: `<[NODE_ID]:[NODE_CATEGORY]:[[RANGE_BEGIN],[RANGE_END]) '[STRING_FROM_RANGE]'>`

**The original JSON for this tree:**

```
[{"category": "reference","index:": [4,12,42],"author": "Nigel Rees","title": "[Sayings of the Century]","price": 8.95},
{"category": "reference","index": [4,{},null,{"a":[{ }, {}] } ],"author": "Nigel Rees","title": "{}[], <=semantic-symbols-string","price": 8.95}]
```

**The tree:**

```
<0:LIST:[2, 3) '['>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <2:FN:[5, 13) 'category'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <2:FN:[5, 13) 'category'> -> <3:STR:[17, 26) 'reference'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <6:VAL:[39, 40) '4'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <7:VAL:[41, 43) '12'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <8:VAL:[44, 46) '42'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <9:FN:[49, 55) 'author'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <9:FN:[49, 55) 'author'> -> <10:STR:[59, 69) 'Nigel Rees'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <11:FN:[72, 77) 'title'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <11:FN:[72, 77) 'title'> -> <12:STR:[81, 105) '[Sayings of the Century]'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <13:FN:[108, 113) 'price'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <13:FN:[108, 113) 'price'> -> <14:VAL:[116, 120) '8.95'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <16:FN:[126, 134) 'category'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <16:FN:[126, 134) 'category'> -> <17:STR:[138, 147) 'reference'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <20:VAL:[159, 160) '4'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <21:STRUCT:[161, 162) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <22:VAL:[164, 168) 'null'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['> -> <26:STRUCT:[175, 176) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['> -> <27:STRUCT:[180, 181) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <28:FN:[189, 195) 'author'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <28:FN:[189, 195) 'author'> -> <29:STR:[199, 209) 'Nigel Rees'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <30:FN:[212, 217) 'title'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <30:FN:[212, 217) 'title'> -> <31:STR:[221, 252) '{}[], <=semantic-symbols-string'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <32:FN:[255, 260) 'price'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <32:FN:[255, 260) 'price'> -> <33:VAL:[263, 267) '8.95'>
```

**The original JSON pretty-printed for this tree:**

```
[
  {
    "category": "reference",
    "index:": [
      4,
      12,
      42
    ],
    "author": "Nigel Rees",
    "title": "[Sayings of the Century]",
    "price": 8.95
  },
  {
    "category": "reference",
    "index": [
      4,
      {},
      null,
      {
        "a": [
          {},
          {}
        ]
      }
    ],
    "author": "Nigel Rees",
    "title": "{}[], <=semantic-symbols-string",
    "price": 8.95
  }
]
```

</details>

---

Authors:
- Karthikeyan (https://github.com/karthikeyann)

Approvers:
- Michael Wang (https://github.com/isVoid)
- David Wendt (https://github.com/davidwendt)

URL: #11518
This PR adds the stage of converting a JSON input into a tree representation, where each node represents either a struct, a list, a field name, a string value, a value, or an error node.
The PR is part of a multi-part PR-chain. Specifically, this PR builds on the JSON tokenizer PR.
This PR depends on:
⛓️ #11264
⛓️ #11242
⛓️ #11078
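To make the token-stream-to-tree stage more concrete, here is a minimal sequential, host-side sketch. It is not the implementation from this PR: the `token_kind` enum, the `token` and `node` structs, and `tokens_to_tree` are hypothetical names introduced for illustration, and the real tokenizer's token set may differ. The sketch assumes each token carries a half-open character range, that a container node's range covers only its opening bracket, and that a value is placed one level below its field name.

```cpp
#include <cassert>
#include <vector>

// Node categories as listed in the PR description.
enum node_category { NC_STRUCT, NC_LIST, NC_FN, NC_STR, NC_VAL, NC_ERR };

// Hypothetical token kinds (assumption; the real tokenizer may emit others).
enum class token_kind {
  struct_begin, struct_end, list_begin, list_end, field_name, string_value, value
};

// Hypothetical token: kind plus the half-open character range [begin, end).
struct token { token_kind kind; int begin; int end; };

// Hypothetical node record: category, level, and character range.
struct node { node_category category; int level; int begin; int end; };

// Walks the tokens once and emits one node per struct, list, field name,
// string, or value, tracking the current nesting level.
inline std::vector<node> tokens_to_tree(std::vector<token> const& tokens)
{
  std::vector<node> nodes;
  std::vector<bool> opened_under_field;  // per open container: is it a field's value?
  int level        = 0;
  bool after_field = false;  // true between a field name and the start of its value

  for (token const& t : tokens) {
    switch (t.kind) {
      case token_kind::struct_begin:
      case token_kind::list_begin: {
        node_category const cat = (t.kind == token_kind::struct_begin) ? NC_STRUCT : NC_LIST;
        nodes.push_back({cat, level, t.begin, t.end});
        opened_under_field.push_back(after_field);
        after_field = false;
        ++level;  // children of the container live one level deeper
        break;
      }
      case token_kind::struct_end:
      case token_kind::list_end: {
        assert(!opened_under_field.empty());
        bool const was_field_value = opened_under_field.back();
        opened_under_field.pop_back();
        --level;                       // leave the container
        if (was_field_value) --level;  // also leave the field name above it
        break;
      }
      case token_kind::field_name:
        nodes.push_back({NC_FN, level, t.begin, t.end});
        ++level;  // the field's value becomes a child of the field name
        after_field = true;
        break;
      case token_kind::string_value:
      case token_kind::value: {
        node_category const cat = (t.kind == token_kind::string_value) ? NC_STR : NC_VAL;
        nodes.push_back({cat, level, t.begin, t.end});
        if (after_field) {  // a scalar field value closes its field immediately
          --level;
          after_field = false;
        }
        break;
      }
    }
  }
  return nodes;
}
```

For the input `{"a": [4, {}]}` this sketch yields five nodes (struct, field name, list, value, struct) at levels 0, 1, 2, 3, 3, which matches the nesting convention visible in the example tree of the description.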
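Each node has one of the following categories:
- `NC_STRUCT`: a node representing a struct
- `NC_LIST`: a node representing a list
- `NC_FN`: a node representing a field name
- `NC_STR`: a node representing a string value
- `NC_VAL`: a node representing a numeric or literal value (e.g., true, false, null)
- `NC_ERR`: a node representing a parser error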
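For each node, the tree representation stores the following information:
- node category
- node level
- node range begin (index of the first character from the original JSON input that this node demarcates)
- node range end (index one past the last character from the original JSON input that this node demarcates)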
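The per-node information (category, level, range begin, range end) can be pictured as a struct of arrays with one entry per node. The following is a host-side sketch under that assumption; `json_tree`, `add_node`, and `text` are hypothetical names for illustration and are not taken from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Node categories as listed in the PR description.
enum node_category : std::uint8_t { NC_STRUCT, NC_LIST, NC_FN, NC_STR, NC_VAL, NC_ERR };

// Hypothetical struct-of-arrays container: index i in every vector describes node i.
struct json_tree {
  std::vector<node_category> categories;
  std::vector<std::int32_t> levels;
  std::vector<std::int32_t> range_begin;  // index of the node's first character
  std::vector<std::int32_t> range_end;    // index one past the node's last character

  std::size_t size() const { return categories.size(); }

  // Appends one node; the range is half-open, so begin must not exceed end.
  void add_node(node_category category, std::int32_t level, std::int32_t begin, std::int32_t end)
  {
    assert(level >= 0 && begin >= 0 && begin <= end);
    categories.push_back(category);
    levels.push_back(level);
    range_begin.push_back(begin);
    range_end.push_back(end);
  }

  // Returns the slice of the original JSON input that the node demarcates.
  std::string text(std::string const& json, std::size_t node) const
  {
    assert(node < size() && static_cast<std::size_t>(range_end[node]) <= json.size());
    auto const begin = static_cast<std::size_t>(range_begin[node]);
    auto const end   = static_cast<std::size_t>(range_end[node]);
    return json.substr(begin, end - begin);
  }
};
```

Because the range end is exclusive, the length of a node's text is simply `range_end - range_begin`, and an empty range is representable without a special case.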
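An example tree:
The following is just an example print of the information represented in the tree generated by the algorithm.
- Each line prints the full path to the next node in the tree.
- For each node along the path, the format is: `<[NODE_ID]:[NODE_CATEGORY]:[[RANGE_BEGIN],[RANGE_END]) '[STRING_FROM_RANGE]'>`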
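As an illustration of this print format, the sketch below rebuilds each node's root-to-node path from the node levels alone and formats it. It assumes nodes are stored in the order they appear in the input, so that the parent of a node is the closest preceding node one level up; the `node` struct and the `format_node` and `format_paths` helpers are hypothetical and are not the debug-print utility from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Node categories as listed in the PR description.
enum node_category { NC_STRUCT, NC_LIST, NC_FN, NC_STR, NC_VAL, NC_ERR };

// Hypothetical node record: category, level, and half-open range [begin, end).
struct node { node_category category; int level; int begin; int end; };

// Category label as it appears in the printed tree (without the NC_ prefix).
inline char const* category_name(node_category c)
{
  switch (c) {
    case NC_STRUCT: return "STRUCT";
    case NC_LIST: return "LIST";
    case NC_FN: return "FN";
    case NC_STR: return "STR";
    case NC_VAL: return "VAL";
    default: return "ERR";
  }
}

// Formats one node as <ID:CATEGORY:[BEGIN, END) 'TEXT'>.
inline std::string format_node(std::string const& json,
                               std::vector<node> const& nodes,
                               std::size_t id)
{
  node const& n = nodes[id];
  assert(0 <= n.begin && n.begin <= n.end && static_cast<std::size_t>(n.end) <= json.size());
  return "<" + std::to_string(id) + ":" + category_name(n.category) + ":[" +
         std::to_string(n.begin) + ", " + std::to_string(n.end) + ") '" +
         json.substr(n.begin, n.end - n.begin) + "'>";
}

// Returns one line per node, each holding the full path from the root to that node.
inline std::vector<std::string> format_paths(std::string const& json,
                                             std::vector<node> const& nodes)
{
  std::vector<std::string> lines;
  std::vector<std::size_t> path;  // path[k] is the current ancestor at level k
  for (std::size_t id = 0; id < nodes.size(); ++id) {
    auto const level = static_cast<std::size_t>(nodes[id].level);
    assert(nodes[id].level >= 0 && level <= path.size());
    path.resize(level);  // drop everything at this level or deeper
    path.push_back(id);
    std::string line;
    for (std::size_t k = 0; k < path.size(); ++k) {
      if (k > 0) { line += " -> "; }
      line += format_node(json, nodes, path[k]);
    }
    lines.push_back(line);
  }
  return lines;
}
```

Keeping only one ancestor per level is enough here because a node at level `k` always replaces the previous level-`k` entry and invalidates everything deeper.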
The original JSON for this tree, the tree itself, and the pretty-printed JSON are listed in full in the commit message for #11518 above.