Adds the Finite-State Transducer algorithm #11242
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##           branch-22.08   #11242      +/-   ##
================================================
+ Coverage         86.34%   86.37%   +0.03%
================================================
  Files               144      144
  Lines             22826    22826
================================================
+ Hits              19708    19715       +7
+ Misses             3118     3111       -7
```
Continue to review full report at Codecov.
Absolute unit!
Actually, let me move the discussion from Slack here. One suggestion that came to mind: every DFA has something like an error state. Do you think it would be possible to integrate that with the transducer? Right now, it would spam the output with "error" symbols. If we extended the output offset prefix sum with a …
I think what you're describing can be achieved quite naturally by the user simply defining an error state accordingly: since outputs are attached to transitions rather than states, the error state's own transitions can be defined to emit nothing.
Ah thanks, I mixed up output on states vs. transitions: all transitions into the error state output an error symbol, but transitions inside the error state don't output anything.
Exactly 👍
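For concreteness, here is a hypothetical sketch of that scheme (states, symbol groups, and output symbols are illustrative, not taken from this PR): outputs live on transitions, so transitions *into* the error state emit an error symbol while transitions *within* it emit nothing.

```cpp
// Hypothetical illustration of the error-state idea discussed above.
// Outputs are attached to transitions: entering ERROR emits 'E' once;
// staying in ERROR emits nothing, so the output stream is not spammed
// with error symbols.
enum State { VALID = 0, ERROR = 1 };

struct Transition {
  State next;       // target state
  char const* out;  // symbols emitted by this transition ("" = none)
};

// transitions[state][symbol_group], with symbol group 0 = acceptable input,
// symbol group 1 = offending input
Transition const transitions[2][2] = {
  /* VALID */ {{VALID, "r"},  // good input: stay valid, emit a read symbol
               {ERROR, "E"}}, // bad input: enter error state, emit one error symbol
  /* ERROR */ {{ERROR, ""},   // inside the error state: emit nothing
               {ERROR, ""}},
};
```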
rerun tests |
@gpucibot merge
Depends on #11242 (Feature/finite state transducer).

Benchmark for Finite-State Transducer: parse and identify JSON symbols.
- [x] FST with output, output index, output str
- [x] FST without output index
- [x] FST without output
- [x] FST without output str

Look into elstehle#1 for files modified only in this PR (i.e., excluding the parent PR it depends on).

Authors:
- Karthikeyan (https://github.com/karthikeyann)
- Elias Stehle (https://github.com/elstehle)

Approvers:
- Yunsong Wang (https://github.com/PointKernel)
- Elias Stehle (https://github.com/elstehle)

URL: #11243
This PR builds on the _Finite-State Transducer_ (_FST_) algorithm and the _Logical Stack_ to implement a tokenizer that demarcates sections from the JSON input and assigns a category to each such section.

**This PR builds on:** ⛓️ #11242 ⛓️ #11078

Specifically, the tokenizer comprises the following processing steps (a small walk-through follows below):
1. An FST emits the sequence of stack operations (i.e., push(LIST), push(STRUCT), pop(), read()). This FST transduces each occurrence of a semantic opening bracket or brace to a push(LIST) or push(STRUCT) operation, respectively. Each semantic closing bracket or brace is transduced to a pop() operation. All other input is transduced to a read() operation.
2. The sequence of stack operations from (1) is fed into the logical stack, which resolves what is on top of the stack before each operation from (1). After this stage, for every input character we know what is on top of the stack: either STRUCT, LIST, or ROOT if there is no symbol on top of the stack.
3. We use the top-of-stack information from (2) for a second FST. This part can be considered a full pushdown automaton or DVPA (because now we also have stack context). State transitions are driven by the combination of the input character and the top-of-stack for that character. The output of this stage is the token stream: ({beginning-of, end-of} × {struct, list}, field name, value, etc.).

Authors:
- Elias Stehle (https://github.com/elstehle)
- Karthikeyan (https://github.com/karthikeyann)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- Tobias Ribizel (https://github.com/upsj)
- Karthikeyan (https://github.com/karthikeyann)
- Yunsong Wang (https://github.com/PointKernel)
- Bradley Dice (https://github.com/bdice)

URL: #11264
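To make the three stages concrete, here is a hedged walk-through on a toy input. The stage outputs are illustrative and the token names are descriptive, not necessarily the exact identifiers used in the PR:

```
Input char:      [    {    "    a    "    :    1    }    ]
(1) Stack op:    P(   P{   r    r    r    r    r    p    p     (P = push, p = pop, r = read)
(2) Top-of-stk:  R    (    {    {    {    {    {    {    (     (R = ROOT, i.e., empty stack)
(3) Tokens:      list-begin, struct-begin, field-name-begin, field-name-end,
                 value-begin, value-end, struct-end, list-end
```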
Adds GPU implementation of JSON-token-stream to JSON-tree.

Depends on PR [Adds JSON-token-stream to JSON-tree](#11291).

This PR adds the stage that converts a JSON input into a tree representation, where each node represents either a struct, a list, a field name, a string value, a value, or an error node. The PR is part of a multi-part PR-chain. Specifically, this PR builds on the [JSON tokenizer PR](#11264).

**This PR depends on:** ⛓️ #11264 ⛓️ #11242 ⛓️ #11078

**Each node has one of the following categories:**

```
/// A node representing a struct
NC_STRUCT,
/// A node representing a list
NC_LIST,
/// A node representing a field name
NC_FN,
/// A node representing a string value
NC_STR,
/// A node representing a numeric or literal value (e.g., true, false, null)
NC_VAL,
/// A node representing a parser error
NC_ERR
```

**For each node, the tree representation stores the following information:**
- node category
- node level
- node range begin (index of the first character from the original JSON input that this node demarcates)
- node range end (index one past the last character from the original JSON input that this node demarcates)

**An example tree:** The following is an example print of the information represented in the tree generated by the algorithm.
- Each line prints the full path to the next node in the tree.
- For each node along the path we have the following format: `<[NODE_ID]:[NODE_CATEGORY]:[[RANGE_BEGIN],[RANGE_END]) '[STRING_FROM_RANGE]'>`

**The original JSON for this tree:**

```
[{"category": "reference","index:": [4,12,42],"author": "Nigel Rees","title": "[Sayings of the Century]","price": 8.95}, {"category": "reference","index": [4,{},null,{"a":[{ }, {}] } ],"author": "Nigel Rees","title": "{}[], <=semantic-symbols-string","price": 8.95}]
```

**The tree:**

```
<0:LIST:[2, 3) '['>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <2:FN:[5, 13) 'category'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <2:FN:[5, 13) 'category'> -> <3:STR:[17, 26) 'reference'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <6:VAL:[39, 40) '4'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <7:VAL:[41, 43) '12'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <8:VAL:[44, 46) '42'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <9:FN:[49, 55) 'author'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <9:FN:[49, 55) 'author'> -> <10:STR:[59, 69) 'Nigel Rees'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <11:FN:[72, 77) 'title'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <11:FN:[72, 77) 'title'> -> <12:STR:[81, 105) '[Sayings of the Century]'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <13:FN:[108, 113) 'price'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <13:FN:[108, 113) 'price'> -> <14:VAL:[116, 120) '8.95'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <16:FN:[126, 134) 'category'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <16:FN:[126, 134) 'category'> -> <17:STR:[138, 147) 'reference'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <20:VAL:[159, 160) '4'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <21:STRUCT:[161, 162) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <22:VAL:[164, 168) 'null'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['> -> <26:STRUCT:[175, 176) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['> -> <27:STRUCT:[180, 181) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <28:FN:[189, 195) 'author'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <28:FN:[189, 195) 'author'> -> <29:STR:[199, 209) 'Nigel Rees'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <30:FN:[212, 217) 'title'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <30:FN:[212, 217) 'title'> -> <31:STR:[221, 252) '{}[], <=semantic-symbols-string'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <32:FN:[255, 260) 'price'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <32:FN:[255, 260) 'price'> -> <33:VAL:[263, 267) '8.95'>
```

**The original JSON pretty-printed for this tree:**

```
[
  {
    "category": "reference",
    "index:": [4, 12, 42],
    "author": "Nigel Rees",
    "title": "[Sayings of the Century]",
    "price": 8.95
  },
  {
    "category": "reference",
    "index": [4, {}, null, {"a": [{}, {}]}],
    "author": "Nigel Rees",
    "title": "{}[], <=semantic-symbols-string",
    "price": 8.95
  }
]
```

Authors:
- Karthikeyan (https://github.com/karthikeyann)

Approvers:
- Michael Wang (https://github.com/isVoid)
- David Wendt (https://github.com/davidwendt)

URL: #11518
This PR adds a parallel Finite-State Transducer (FST) algorithm. The FST is a key component of the nested JSON parser.
Background
An example of a Finite-State Transducer (FST), i.e., the sequential algorithm that the parallel implementation mimics, can be found in the slides from the JSON parser presentation (Slides 11-17).
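For readers without the slides, here is a minimal sequential sketch of FST semantics. It is illustrative only, not this PR's API: real implementations typically map characters to a small number of symbol groups first, and this PR's contribution is computing the same result in parallel.

```cpp
// Illustrative sequential FST: for each input symbol, emit the output
// attached to the taken transition, then advance the state. The parallel
// GPU algorithm reproduces exactly this input/output behavior.
#include <string>
#include <vector>

std::string transduce(
  std::string const& input,
  std::vector<std::vector<int>> const& transition_table,          // [state][symbol] -> next state
  std::vector<std::vector<std::string>> const& translation_table, // [state][symbol] -> output symbols
  int state = 0)
{
  std::string output;
  for (unsigned char symbol : input) {
    output += translation_table[state][symbol];  // may emit zero or more symbols
    state = transition_table[state][symbol];     // take the transition
  }
  return output;
}
```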
Our GPU-based implementation
The GPU-based algorithm builds on the following work:
ParPaRaw: Massively Parallel Parsing of Delimiter-Separated Raw Data
Several sections of that paper are of relevance; in particular, Section 4.5 is referenced below.
How the algorithm works is illustrated in the following presentation:
ParPaRaw @ VLDB'20
Relevant Data Structures
A word about the motivation and need for the Multi-fragment in-register array:

The composition of two state-transition vectors is a key operation in the prefix scan. Basically, for two state-transition vectors `lhs` and `rhs`, both comprising `N` items, it computes `rhs[lhs[i]]` for each item `i`. The relevant part is the indexing into `rhs`: the index is `lhs[i]`, a runtime value that isn't known at compile time. It's important to understand that in CUB's prefix scan both `rhs` and `lhs` are thread-local variables. As such, they either live in the fast register file or in (slow, off-chip) local memory. The register file has a shortcoming: it cannot be indexed dynamically. And here, we are dynamically indexing into `rhs`, so `rhs` would need to be spilled to local memory (backed by device memory) to allow for dynamic indexing. This would usually make the algorithm very slow, which is why we have the Multi-fragment in-register array; for its implementation details, I'd suggest reading Section 4.5 of the ParPaRaw paper. In contrast, an array like `foo` in the sketch below is fine and will be mapped to registers, because the loop can be unrolled and `N` is known at compile time and sufficiently small (at most tens of items).
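A minimal sketch of the two access patterns (hypothetical code, not this PR's implementation; `compose`, `foo`, and `sum_of_squares` are illustrative names):

```cpp
// Contrasting dynamic and static indexing of thread-local arrays in CUDA.
constexpr int N = 16;  // number of items per state-transition vector

// Composition of two state-transition vectors, as in the prefix scan.
// The index lhs[i] is a runtime value, so the compiler cannot keep `rhs`
// in registers; without a structure like the multi-fragment in-register
// array, `rhs` gets spilled to (slow) local memory.
__device__ void compose(int (&out)[N], int const (&lhs)[N], int const (&rhs)[N])
{
  for (int i = 0; i < N; ++i) {
    out[i] = rhs[lhs[i]];  // dynamic index -> would force a spill
  }
}

// In contrast, `foo` is only ever indexed with values known after loop
// unrolling, so it can be mapped to registers.
__device__ int sum_of_squares()
{
  int foo[N];
#pragma unroll
  for (int i = 0; i < N; ++i) {
    foo[i] = i * i;  // static index after unrolling
  }
  int sum = 0;
#pragma unroll
  for (int i = 0; i < N; ++i) {
    sum += foo[i];
  }
  return sum;
}
```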
Style & CUB Integration
The following files may be considered for integration into CUB at a later point, hence their deviation from cuDF's style:
- `in_reg_array.cuh`
- `agent_dfa.cuh`
- `device_dfa.cuh`
- `dispatch_dfa.cuh`