-
Notifications
You must be signed in to change notification settings - Fork 912
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds the Logical Stack algorithm #11078
Merged
rapids-bot
merged 22 commits into
rapidsai:branch-22.08
from
elstehle:feature/logical-stack
Jul 11, 2022
Merged
Adds the Logical Stack algorithm #11078
rapids-bot
merged 22 commits into
rapidsai:branch-22.08
from
elstehle:feature/logical-stack
Jul 11, 2022
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
github-actions
bot
added
CMake
CMake build issue
libcudf
Affects libcudf (C++/CUDA) code.
labels
Jun 8, 2022
elstehle
added
feature request
New feature or request
cuIO
cuIO issue
non-breaking
Non-breaking change
labels
Jun 8, 2022
PointKernel
requested changes
Jun 8, 2022
karthikeyann
requested changes
Jun 14, 2022
PointKernel
approved these changes
Jun 15, 2022
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
rerun tests |
rerun tests |
ttnghia
reviewed
Jul 5, 2022
ttnghia
reviewed
Jul 5, 2022
ttnghia
reviewed
Jul 5, 2022
ttnghia
reviewed
Jul 5, 2022
rerun tests |
karthikeyann
approved these changes
Jul 11, 2022
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bit of unused code. Rest all looks good.
@gpucibot merge |
This was referenced Jul 18, 2022
Merged
rapids-bot bot
pushed a commit
that referenced
this pull request
Aug 6, 2022
This PR builds on the _Finite-State Transducer_ (_FST_) algorithm and the _Logical Stack_ to implement a tokenizer that demarcates sections from the JSON input and assigns a category to each such section. **This PR builds on:** ⛓️ #11242 ⛓️ #11078 Specifically, the tokenizer comprises the following processing steps: 1. FST to emit sequence of stack operations (i.e., emit push(LIST), push(STRUCT), pop(), read()). This FST does transduce each occurrence of an opening semantic bracket or brace to the respective push(LIST) and push(STRUCT) operation, respectively. Each semantic closing bracket or brace is transduced to a pop() operation. All other input is transduced to a read() operation. 2. The sequence of stack operations from (1) is fed into the logical stack that resolves what is on top of the stack before each operation from (1) (i.e., STRUCT, LIST). After this stage, for every input character we know what is on top of the stack: either a STRUCT or LIST or ROOT, if there is no symbol on top of the stack. 3. We use the top-of-stack information from (2) for a second FST. This part can be considered a full pushdown or DVPA (because now, we also have stack context). State transitions are caused by the combination of the input character + the top-of-stack for that character. The output of this stage is the token stream: ({beginning-of, end-of}x{struct, list}, field name, value, etc. Authors: - Elias Stehle (https://github.com/elstehle) - Karthikeyan (https://github.com/karthikeyann) Approvers: - Robert Maynard (https://github.com/robertmaynard) - Tobias Ribizel (https://github.com/upsj) - Karthikeyan (https://github.com/karthikeyann) - Yunsong Wang (https://github.com/PointKernel) - Bradley Dice (https://github.com/bdice) URL: #11264
3 tasks
3 tasks
rapids-bot bot
pushed a commit
that referenced
this pull request
Sep 19, 2022
Adds GPU implementation of JSON-token-stream to JSON-tree Depends on PR [Adds JSON-token-stream to JSON-tree](#11291) #11291 <details> --- This PR adds the stage of converting a JSON input into a tree representation, where each node represents either a struct, a list, a field name, a string value, a value, or an error node. The PR is part of a multi-part PR-chain. Specifically, this PR builds on the [JSON tokenizer PR](#11264). **This PR depends on:** ⛓️ #11264 ⛓️ #11242 ⛓️ #11078 **Each node has one of the following category:** ``` /// A node representing a struct NC_STRUCT, /// A node representing a list NC_LIST, /// A node representing a field name NC_FN, /// A node representing a string value NC_STR, /// A node representing a numeric or literal value (e.g., true, false, null) NC_VAL, /// A node representing a parser error NC_ERR ``` **For each node, the tree representation stores the following information:** - node category - node level - node range begin (index of the first character from the original JSON input that this node demarcates) - node range end (index of one-past-the-last-character of the first character from the original JSON input that this node demarcates) **An example tree:** The following is just an example print of the information represented in the tree generated by the algorithm. - Each line is printing the full path to the next node in the tree. - For each node along the path we have the following format: `<[NODE_ID]:[NODE_CATEGORY]:[[RANGE_BEGIN],[RANGE_END]) '[STRING_FROM_RANGE]'>` **The original JSON for this tree:** ``` [{"category": "reference","index:": [4,12,42],"author": "Nigel Rees","title": "[Sayings of the Century]","price": 8.95}, {"category": "reference","index": [4,{},null,{"a":[{ }, {}] } ],"author": "Nigel Rees","title": "{}[], <=semantic-symbols-string","price": 8.95}] ``` **The tree:** ``` <0:LIST:[2, 3) '['> <0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> <0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <2:FN:[5, 13) 'category'> <0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <2:FN:[5, 13) 'category'> -> <3:STR:[17, 26) 'reference'> <0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> <0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> <0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <6:VAL:[39, 40) '4'> <0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <7:VAL:[41, 43) '12'> <0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <8:VAL:[44, 46) '42'> <0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <9:FN:[49, 55) 'author'> <0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <9:FN:[49, 55) 'author'> -> <10:STR:[59, 69) 'Nigel Rees'> <0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <11:FN:[72, 77) 'title'> <0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <11:FN:[72, 77) 'title'> -> <12:STR:[81, 105) '[Sayings of the Century]'> <0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <13:FN:[108, 113) 'price'> <0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <13:FN:[108, 113) 'price'> -> <14:VAL:[116, 120) '8.95'> <0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> <0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <16:FN:[126, 134) 'category'> <0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <16:FN:[126, 134) 'category'> -> <17:STR:[138, 147) 'reference'> <0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> <0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> <0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <20:VAL:[159, 160) '4'> <0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <21:STRUCT:[161, 162) '{'> <0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <22:VAL:[164, 168) 'null'> <0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> <0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> <0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['> <0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['> -> <26:STRUCT:[175, 176) '{'> <0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['> -> <27:STRUCT:[180, 181) '{'> <0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <28:FN:[189, 195) 'author'> <0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <28:FN:[189, 195) 'author'> -> <29:STR:[199, 209) 'Nigel Rees'> <0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <30:FN:[212, 217) 'title'> <0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <30:FN:[212, 217) 'title'> -> <31:STR:[221, 252) '{}[], <=semantic-symbols-string'> <0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <32:FN:[255, 260) 'price'> <0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <32:FN:[255, 260) 'price'> -> <33:VAL:[263, 267) '8.95'> ``` **The original JSON pretty-printed for this tree:** ``` [ { "category": "reference", "index:": [ 4, 12, 42 ], "author": "Nigel Rees", "title": "[Sayings of the Century]", "price": 8.95 }, { "category": "reference", "index": [ 4, {}, null, { "a": [ {}, {} ] } ], "author": "Nigel Rees", "title": "{}[], <=semantic-symbols-string", "price": 8.95 } ] ``` </details> --- Authors: - Karthikeyan (https://github.com/karthikeyann) Approvers: - Michael Wang (https://github.com/isVoid) - David Wendt (https://github.com/davidwendt) URL: #11518
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
3 - Ready for Review
Ready for review by team
CMake
CMake build issue
cuIO
cuIO issue
feature request
New feature or request
libcudf
Affects libcudf (C++/CUDA) code.
non-breaking
Non-breaking change
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description
This PR adds the Logical Stack, an algorithm required by the JSON parser. The Logical Stack takes a sequence of stack operations (i.e.,
push(X)
,pop()
,read()
) as if they were to be applied to a regularstack
data structure in the given order. For each operation within that sequence, the algorithm resolves the stack state and writes out the item that is on top of the stack before such operation is applied. As, for some operations, the stack may be empty, the algorithm uses a user-specified sentinel symbol to represent the "empty-stack" (i.e., there is no item on top of the stack).How the Logical Stack is implemented is illustrated in this presentation:
https://docs.google.com/presentation/d/16r-0SlQFd-7fH2R7I06tc_JqsAd_0GrTgh_q20sJ2ak/edit?usp=sharing
The only deviation from the algorithm presented in the slides is the optimisation of a sparse sequence of stack operations. That is, in case of the JSON Parser, we only pass symbols that actually push or pop (i.e.,
{
,[
,}
, and]
) along with the index at which that operation occurred. Symbols that follow a stack operation that pushes or pops are filled with the symbol that is inferred as top-of-stack symbol of such operation.Results from intermediate processing steps can be dumped to
stdout
by setting:For instance: