Adds the Finite-State Transducer algorithm #11242

elstehle · 2022-07-12T10:55:35Z

This PR adds a parallel Finite-State Transducer (FST) algorithm. The FST is a key component of the nested JSON parser.

Background

An example of a Finite-State Transducer (FST), i.e., the algorithm that we are trying to mimic:
Slides from the JSON parser presentation, Slides 11-17

Our GPU-based implementation

The GPU-based algorithm builds on the following work:
ParPaRaw: Massively Parallel Parsing of Delimiter-Separated Raw Data

The following sections are of relevance:

Section 3.1
Section 4.5 (i.e., the Multi-fragment in-register array)

How the algorithm works is illustrated in the following presentation:
ParPaRaw @VLDB'20

Relevant Data Structures

A word about the motivation and need for the Multi-fragment in-register array:

The composition of two state-transition vectors is a key operation (in the prefix scan). Basically, this is what it does for two state-transition vectors lhs and rhs, both comprising N items:

// result[i] is the state reached from state i by first applying lhs, then rhs
for (int32_t i = 0; i < N; ++i) {
  result[i] = rhs[lhs[i]];
}
return result;
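To make the role of this composition in the prefix scan more concrete, here is a minimal host-side sketch. It is not the code from this PR: the two-state "inside/outside a quoted string" DFA, the names StateVector, transition_vector, compose and states_before_each_symbol, and the sequential loop standing in for the parallel scan are all assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical toy DFA with N = 2 states:
// state 0 = outside a quoted string, state 1 = inside a quoted string.
constexpr int32_t N = 2;
using StateVector = std::array<int32_t, N>;

// State-transition vector of one symbol: entry i is the state that is
// reached when the symbol is read while the DFA is in state i.
StateVector transition_vector(char symbol)
{
  if (symbol == '"') { return {1, 0}; }  // a quote toggles the state
  return {0, 1};                         // any other symbol keeps the state
}

// The composition from the description: first apply lhs, then rhs.
StateVector compose(StateVector const& lhs, StateVector const& rhs)
{
  StateVector result{};
  for (int32_t i = 0; i < N; ++i) {
    result[i] = rhs[lhs[i]];
  }
  return result;
}

// Sequential stand-in for an exclusive prefix scan over the per-symbol
// vectors: returns the state the DFA is in *before* reading each symbol.
std::vector<int32_t> states_before_each_symbol(std::string const& input, int32_t start_state)
{
  assert(start_state >= 0 && start_state < N);
  std::vector<int32_t> states;
  StateVector prefix{0, 1};  // identity: every state maps to itself
  for (char c : input) {
    states.push_back(prefix[start_state]);
    prefix = compose(prefix, transition_vector(c));
  }
  return states;
}
```

Because compose is associative, the running prefix does not have to be built left to right; that property is what allows a parallel prefix scan to take the place of the loop in states_before_each_symbol. For the input a"b"c and start state 0, the sketch yields the states 0, 0, 1, 1, 0.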

The relevant part is the indexing into rhs: rhs[lhs[i]], i.e., the index is lhs[i], a runtime value that isn't known at compile time. It's important to understand that in CUB's prefix scan both rhs and lhs are thread-local variables. As such, they either live in the fast register file or in (slow off-chip) local memory.
The register file has one shortcoming: it cannot be indexed dynamically. Here, however, we are indexing into rhs dynamically, so rhs would need to be spilled to local memory (backed by device memory) to allow for dynamic indexing. This would usually make the algorithm very slow. That's why we have the Multi-fragment in-register array. For its implementation details, I'd suggest reading Section 4.5.
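As a rough intuition for how dynamic indexing can be avoided, the sketch below packs several small items into a single 32-bit word, so that a lookup with a runtime index becomes a shift and a mask on one scalar value. This is a deliberately simplified, single-word stand-in: the PackedArray type and its get/set interface are hypothetical and are not the interface of in_reg_array.cuh, and the multi-fragment layout described in Section 4.5 is not reproduced here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical single-word packed array: NumItems items of BitsPerItem bits
// each live in one 32-bit value, which a compiler can keep in a register.
template <uint32_t NumItems, uint32_t BitsPerItem>
class PackedArray {
  static_assert(BitsPerItem > 0 && BitsPerItem < 32, "unsupported item width");
  static_assert(NumItems * BitsPerItem <= 32, "all items must fit into one 32-bit word");

 public:
  // Reads the item at a runtime index using only shift and mask.
  uint32_t get(uint32_t index) const
  {
    assert(index < NumItems);
    return (data_ >> (index * BitsPerItem)) & kMask;
  }

  // Overwrites the item at a runtime index using only shift and mask.
  void set(uint32_t index, uint32_t value)
  {
    assert(index < NumItems && value <= kMask);
    uint32_t const shift = index * BitsPerItem;
    data_ = (data_ & ~(kMask << shift)) | (value << shift);
  }

 private:
  static constexpr uint32_t kMask = (1u << BitsPerItem) - 1u;
  uint32_t data_ = 0;
};

// The state-transition vector composition, expressed on packed arrays.
template <uint32_t NumItems, uint32_t BitsPerItem>
PackedArray<NumItems, BitsPerItem> compose(PackedArray<NumItems, BitsPerItem> const& lhs,
                                           PackedArray<NumItems, BitsPerItem> const& rhs)
{
  PackedArray<NumItems, BitsPerItem> result;
  for (uint32_t i = 0; i < NumItems; ++i) {
    result.set(i, rhs.get(lhs.get(i)));  // rhs[lhs[i]] without array indexing
  }
  return result;
}
```

With four states of two bits each, for example, a whole state-transition vector occupies eight bits of one word, and rhs.get(lhs.get(i)) never indexes into an array.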

In contrast, the following example is fine and foo will be mapped to registers, because the loop can be unrolled if N is known at compile time and sufficiently small (at most tens of items).

// this is fine, if N is a compile-time constant
for (int32_t i = 1; i < N; ++i) {
  foo[i] = foo[i-1];
}

Style & CUB Integration

The following may be considered for being integrated into CUB at a later point, hence the deviation in style from cuDF.

in_reg_array.cuh
agent_dfa.cuh
device_dfa.cuh
dispatch_dfa.cuh

codecov · 2022-07-12T13:47:49Z

Codecov Report

Merging #11242 (8a54c72) into branch-22.08 (b2dd1bf) will increase coverage by 0.03%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.08   #11242      +/-   ##
================================================
+ Coverage         86.34%   86.37%   +0.03%     
================================================
  Files               144      144              
  Lines             22826    22826              
================================================
+ Hits              19708    19715       +7     
+ Misses             3118     3111       -7

Impacted Files	Coverage Δ
python/cudf/cudf/core/dataframe.py	`93.57% <0.00%> (+0.04%)`	⬆️
python/cudf/cudf/core/column/string.py	`88.80% <0.00%> (+0.12%)`	⬆️
python/cudf/cudf/core/groupby/groupby.py	`91.02% <0.00%> (+0.21%)`	⬆️
python/cudf/cudf/core/column/numerical.py	`96.19% <0.00%> (+0.29%)`	⬆️
python/cudf/cudf/core/tools/datetimes.py	`84.49% <0.00%> (+0.30%)`	⬆️
python/cudf/cudf/core/column/lists.py	`91.70% <0.00%> (+0.97%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b2dd1bf...8a54c72.

vuule · 2022-07-12T16:56:45Z

Absolute unit!
For now, I have to say that the description is 🔥

upsj · 2022-07-19T12:19:20Z

Actually, let me move the discussion from Slack here:

One suggestion that came to my mind: Every DFA has something like an error state. Do you think it would be possible to integrate that with the transducer? Right now, it would spam the output with "error" symbols. If we extended the output offset prefix sum with a bool has_error and or reduction operation that only sums up if not lhs.has_error, it would output only a single token that can even be used to print a useful error message.

elstehle · 2022-07-19T13:38:59Z

One suggestion that came to my mind: Every DFA has something like an error state. Do you think it would be possible to integrate that with the transducer? Right now, it would spam the output with "error" symbols. If we extended the output offset prefix sum with a bool has_error and or reduction operation that only sums up if has_error is false, it would output only a single token that can even be used to print a useful error message.

I think what you're describing can be achieved quite naturally by the user simply defining an error trap state. I.e., a state that once entered will not be left. So, the FST would emit just one single "error" symbol when that state is being entered. At the same time it allows you to pinpoint where in the input we began seeing that error state. All that without having to worry about it in the FST implementation.

upsj · 2022-07-19T13:45:47Z

Ah thanks, I mixed up output on states vs. transitions - if all transitions into the error state output an error symbol, but transitions inside the error state don't output anything.

elstehle · 2022-07-19T13:50:16Z

Ah thanks, I mixed up output on states vs. transitions - if all transitions into the error state output an error symbol, but transitions inside the error state don't output anything.

Exactly 👍
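The error trap state discussed in this thread can be pictured with a small sequential sketch. Everything in it is an assumption for illustration, not the FST interface added by this PR: the toy language (digits only), the State enum, the 'E' error symbol and the run_fst function are made up. The point is only that output is attached to transitions, so the transition into the trap state emits one error symbol and the self-transitions inside the trap state emit nothing.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical toy FST: copies digits to the output and treats any other
// character as an error.
enum class State { Valid, Error };

struct FstResult {
  std::string output;    // transduced symbols
  int64_t error_offset;  // input offset where the trap state was entered, or -1
};

FstResult run_fst(std::string const& input)
{
  FstResult result{"", -1};
  State state = State::Valid;
  for (std::size_t i = 0; i < input.size(); ++i) {
    bool const is_digit = input[i] >= '0' && input[i] <= '9';
    // Error is a trap state: once entered, it is never left.
    State const next = (state == State::Error || !is_digit) ? State::Error : State::Valid;
    if (state == State::Valid && next == State::Valid) {
      result.output += input[i];  // regular transition: copy the digit
    } else if (state == State::Valid && next == State::Error) {
      result.output += 'E';       // transition *into* the trap state: one error symbol
      result.error_offset = static_cast<int64_t>(i);
    }
    // Transitions inside the trap state emit nothing.
    state = next;
  }
  return result;
}
```

For the input 12x34y, this sketch produces the output 12E and reports offset 2 as the position where the error state was entered; the later y does not add a second error symbol.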

karthikeyann · 2022-07-21T07:59:06Z

rerun tests

cpp/src/io/fst/dispatch_dfa.cuh

cpp/src/io/fst/in_reg_array.cuh

cpp/src/io/fst/lookup_tables.cuh

karthikeyann

Wonderful work! @elstehle Thank you for this. Looks good.
Template heavy & "CUB" style code!
Amazing work by reviewers @vuule @upsj and great suggestions to improve the code.

karthikeyann · 2022-07-22T14:28:50Z

@gpucibot merge

Depends on #11242 Feature/finite state transducer

Benchmark for Finite State Transducer
parse and identify JSON symbols
- [x] FST with output, output index, output str
- [x] FST without output index
- [x] FST without output
- [x] FST without output str

Look into elstehle#1 for files modified only in this PR (i.e., excluding the parent PR it depends on)

Authors:
- Karthikeyan (https://github.com/karthikeyann)
- Elias Stehle (https://github.com/elstehle)

Approvers:
- Yunsong Wang (https://github.com/PointKernel)
- Elias Stehle (https://github.com/elstehle)

URL: #11243

This PR builds on the _Finite-State Transducer_ (_FST_) algorithm and the _Logical Stack_ to implement a tokenizer that demarcates sections from the JSON input and assigns a category to each such section.

**This PR builds on:**
⛓️ #11242
⛓️ #11078

Specifically, the tokenizer comprises the following processing steps:
1. FST to emit sequence of stack operations (i.e., emit push(LIST), push(STRUCT), pop(), read()). This FST transduces each occurrence of an opening semantic bracket or brace to the respective push(LIST) and push(STRUCT) operation, respectively. Each semantic closing bracket or brace is transduced to a pop() operation. All other input is transduced to a read() operation.
2. The sequence of stack operations from (1) is fed into the logical stack that resolves what is on top of the stack before each operation from (1) (i.e., STRUCT, LIST). After this stage, for every input character we know what is on top of the stack: either a STRUCT or LIST or ROOT, if there is no symbol on top of the stack.
3. We use the top-of-stack information from (2) for a second FST. This part can be considered a full pushdown or DVPA (because now, we also have stack context). State transitions are caused by the combination of the input character + the top-of-stack for that character. The output of this stage is the token stream: ({beginning-of, end-of}x{struct, list}, field name, value, etc.).

Authors:
- Elias Stehle (https://github.com/elstehle)
- Karthikeyan (https://github.com/karthikeyann)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- Tobias Ribizel (https://github.com/upsj)
- Karthikeyan (https://github.com/karthikeyann)
- Yunsong Wang (https://github.com/PointKernel)
- Bradley Dice (https://github.com/bdice)

URL: #11264
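Steps 1 and 2 of the tokenizer can be pictured with a sequential host-side sketch. It only illustrates the data flow and is not the implementation from that PR, which uses the parallel FST and the logical stack: the names StackOp, to_stack_ops and top_of_stack, the simplified string/escape handling, and the use of the characters '_', '[' and '{' for ROOT, LIST and STRUCT are assumptions.

```cpp
#include <string>
#include <vector>

enum class StackOp { PushList, PushStruct, Pop, Read };

// Step 1 (sketch): transduce every input character to a stack operation.
// Brackets and braces inside quoted strings are not semantic, so they are
// transduced to Read like any other character.
std::vector<StackOp> to_stack_ops(std::string const& json)
{
  std::vector<StackOp> ops;
  bool in_string = false;
  bool escaped   = false;
  for (char c : json) {
    StackOp op = StackOp::Read;
    if (in_string) {
      if (escaped) {
        escaped = false;  // the escaped character is consumed as-is
      } else if (c == '\\') {
        escaped = true;
      } else if (c == '"') {
        in_string = false;
      }
    } else if (c == '"') {
      in_string = true;
    } else if (c == '[') {
      op = StackOp::PushList;
    } else if (c == '{') {
      op = StackOp::PushStruct;
    } else if (c == ']' || c == '}') {
      op = StackOp::Pop;
    }
    ops.push_back(op);
  }
  return ops;
}

// Step 2 (sketch): resolve what is on top of the stack *before* each
// operation. '_' stands for ROOT, '[' for LIST and '{' for STRUCT.
std::string top_of_stack(std::vector<StackOp> const& ops)
{
  std::string tops;
  std::vector<char> stack;
  for (StackOp op : ops) {
    tops.push_back(stack.empty() ? '_' : stack.back());
    switch (op) {
      case StackOp::PushList: stack.push_back('['); break;
      case StackOp::PushStruct: stack.push_back('{'); break;
      case StackOp::Pop:
        if (!stack.empty()) { stack.pop_back(); }
        break;
      case StackOp::Read: break;
    }
  }
  return tops;
}
```

For the input {"a":[1,"]"]}, the closing bracket inside the string is transduced to a read(), and the resolved top-of-stack sequence is _{{{{{[[[[[[{ (one symbol per input character). The second FST from step 3 would then consume each input character together with its top-of-stack symbol.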

Adds GPU implementation of JSON-token-stream to JSON-tree

Depends on PR [Adds JSON-token-stream to JSON-tree](#11291) #11291

<details>

---

This PR adds the stage of converting a JSON input into a tree representation, where each node represents either a struct, a list, a field name, a string value, a value, or an error node.

The PR is part of a multi-part PR-chain. Specifically, this PR builds on the [JSON tokenizer PR](#11264).

**This PR depends on:**
⛓️ #11264
⛓️ #11242
⛓️ #11078

**Each node has one of the following categories:**
```
/// A node representing a struct
NC_STRUCT,
/// A node representing a list
NC_LIST,
/// A node representing a field name
NC_FN,
/// A node representing a string value
NC_STR,
/// A node representing a numeric or literal value (e.g., true, false, null)
NC_VAL,
/// A node representing a parser error
NC_ERR
```

**For each node, the tree representation stores the following information:**
- node category
- node level
- node range begin (index of the first character from the original JSON input that this node demarcates)
- node range end (index of one past the last character from the original JSON input that this node demarcates)

**An example tree:**
The following is just an example print of the information represented in the tree generated by the algorithm.
- Each line is printing the full path to the next node in the tree.
- For each node along the path we have the following format: `<[NODE_ID]:[NODE_CATEGORY]:[[RANGE_BEGIN],[RANGE_END]) '[STRING_FROM_RANGE]'>`

**The original JSON for this tree:**
```
[{"category": "reference","index:": [4,12,42],"author": "Nigel Rees","title": "[Sayings of the Century]","price": 8.95}, {"category": "reference","index": [4,{},null,{"a":[{ }, {}] } ],"author": "Nigel Rees","title": "{}[], <=semantic-symbols-string","price": 8.95}]
```

**The tree:**
```
<0:LIST:[2, 3) '['>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <2:FN:[5, 13) 'category'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <2:FN:[5, 13) 'category'> -> <3:STR:[17, 26) 'reference'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <6:VAL:[39, 40) '4'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <7:VAL:[41, 43) '12'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <8:VAL:[44, 46) '42'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <9:FN:[49, 55) 'author'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <9:FN:[49, 55) 'author'> -> <10:STR:[59, 69) 'Nigel Rees'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <11:FN:[72, 77) 'title'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <11:FN:[72, 77) 'title'> -> <12:STR:[81, 105) '[Sayings of the Century]'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <13:FN:[108, 113) 'price'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <13:FN:[108, 113) 'price'> -> <14:VAL:[116, 120) '8.95'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <16:FN:[126, 134) 'category'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <16:FN:[126, 134) 'category'> -> <17:STR:[138, 147) 'reference'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <20:VAL:[159, 160) '4'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <21:STRUCT:[161, 162) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <22:VAL:[164, 168) 'null'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['> -> <26:STRUCT:[175, 176) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['> -> <27:STRUCT:[180, 181) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <28:FN:[189, 195) 'author'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <28:FN:[189, 195) 'author'> -> <29:STR:[199, 209) 'Nigel Rees'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <30:FN:[212, 217) 'title'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <30:FN:[212, 217) 'title'> -> <31:STR:[221, 252) '{}[], <=semantic-symbols-string'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <32:FN:[255, 260) 'price'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <32:FN:[255, 260) 'price'> -> <33:VAL:[263, 267) '8.95'>
```

**The original JSON pretty-printed for this tree:**
```
[
  {
    "category": "reference",
    "index:": [
      4,
      12,
      42
    ],
    "author": "Nigel Rees",
    "title": "[Sayings of the Century]",
    "price": 8.95
  },
  {
    "category": "reference",
    "index": [
      4,
      {},
      null,
      {
        "a": [
          {},
          {}
        ]
      }
    ],
    "author": "Nigel Rees",
    "title": "{}[], <=semantic-symbols-string",
    "price": 8.95
  }
]
```

</details>

---

Authors:
- Karthikeyan (https://github.com/karthikeyann)

Approvers:
- Michael Wang (https://github.com/isVoid)
- David Wendt (https://github.com/davidwendt)

URL: #11518

elstehle requested a review from a team as a code owner July 12, 2022 10:55

elstehle requested review from karthikeyann and vuule July 12, 2022 10:55

github-actions bot added CMake CMake build issue libcudf Affects libcudf (C++/CUDA) code. labels Jul 12, 2022

elstehle added feature request New feature or request 3 - Ready for Review Ready for review by team cuIO cuIO issue non-breaking Non-breaking change labels Jul 12, 2022

karthikeyann modified the milestone: Nested JSON reader Jul 12, 2022

karthikeyann mentioned this pull request Jul 12, 2022

FST benchmark #11243

Merged

4 tasks

elstehle changed the title ~~Feature/finite state transducer~~ Adds the Finite-State Transducer algorithm Jul 12, 2022

elstehle added 14 commits July 13, 2022 00:53

squashed with bracket/brace test

0557d41

clean up & addressing review comments

355d1e4

refactored lookup tables

39a6b65

put lookup tables into their own cudf file

239f138

Change interface for FST to not need temp storage

39cff80

removing unused var post-cleanup

e24a133

unified usage of pragma unrolls

caf6195

Adding hostdevice macros to in-reg array

ea79a81

making const vars const

17dcbfd

refactor lut sanity check

6fdd24a

fixes sg-count & uses rmm stream in fst tests

eccf970

minor doxygen fix

9fe8e4b

adopts suggested fst test changes

694a365

adopts device-side test data gen

f656f49

elstehle added 7 commits July 20, 2022 06:28

improves style in fst test

4783aae

adds comments in in_reg array

6203709

adds comments to lookup tables

ad5817a

fixes formatting

dc55653

exchanges loops in favor of copy and fills

378be9f

clarifies documentation in agent dfa

4ba5472

disambiguates transition and translation tables

7980978

PointKernel mentioned this pull request Jul 20, 2022

Add generic type inference for cuIO #11121

Merged

minor style fix

2bce061

karthikeyann reviewed Jul 21, 2022

elstehle added 4 commits July 21, 2022 06:20

if constexprs and doxy on DFA helper

b37f716

minor documentation fix

d42869a

replaces loop for comparing vectors with generic macro

6c889f7

uses new vector comparison for logical stack test

8a54c72

karthikeyann approved these changes Jul 22, 2022

karthikeyann added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Jul 22, 2022

rapids-bot bot merged commit ebcea0f into rapidsai:branch-22.08 Jul 22, 2022

karthikeyann mentioned this pull request Aug 11, 2022

Adds GPU implementation of JSON-token-stream to JSON-tree #11518

Merged

3 tasks

upsj mentioned this pull request Sep 21, 2022

[FEA] read_csv context-passing interface for distributed/segmented parsing #11728

Open
