Adds the Finite-State Transducer algorithm #11242

Merged

Conversation

elstehle
Contributor

This PR adds a parallel Finite-State Transducer (FST) algorithm. The FST is a key component of the nested JSON parser.

Background

An example of a Finite-State Transducer (FST), i.e., the algorithm we are trying to mimic:
Slides from the JSON parser presentation, Slides 11-17

Our GPU-based implementation

The GPU-based algorithm builds on the following work:
ParPaRaw: Massively Parallel Parsing of Delimiter-Separated Raw Data

The following sections are of relevance:

  • Section 3.1
  • Section 4.5 (i.e., the Multi-fragment in-register array)

How the algorithm works is illustrated in the following presentation:
ParPaRaw @VLDB'20

Relevant Data Structures

A word about the motivation and need for the Multi-fragment in-register array:

The composition of two state-transition vectors is a key operation (in the prefix scan). Basically, this is what it does for two state-transition vectors lhs and rhs, both comprising N items:

// compose two state-transition vectors: apply lhs first, then rhs
for (int32_t i = 0; i < N; ++i) {
  result[i] = rhs[lhs[i]];
}
return result;

The relevant part is the indexing into rhs: in rhs[lhs[i]], the index lhs[i] is a runtime value that isn't known at compile time. It's important to understand that in CUB's prefix scan both rhs and lhs are thread-local variables. As such, they either live in the fast register file or in (slow, off-chip) local memory.
The register file has one shortcoming: it cannot be indexed dynamically. Here, however, we are dynamically indexing into rhs, so rhs would need to be spilled to local memory (backed by device memory) to allow for dynamic indexing. This would usually make the algorithm very slow. That's why we have the Multi-fragment in-register array. For its implementation details, I'd suggest reading Section 4.5.
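
To give a rough feel for the idea, here is a minimal host-side sketch, assuming all items fit into a single 32-bit word. It is not the code from in_reg_array.cuh, and `PackedRegArray`, `get`, `set`, and this `compose` are hypothetical names. The point is that a runtime index turns into a shift and a mask on one scalar value, so no dynamically indexed array is needed. The multi-fragment design referenced in Section 4.5 generalizes beyond a single word; that part is not reproduced here.

```cpp
#include <cstdint>

// Hypothetical, simplified single-word variant of an in-register array:
// NUM_ITEMS items of BITS_PER_ITEM bits each are packed into one 32-bit word.
template <int NUM_ITEMS, int BITS_PER_ITEM>
class PackedRegArray {
  static_assert(BITS_PER_ITEM > 0 && BITS_PER_ITEM < 32, "unsupported item width");
  static_assert(NUM_ITEMS * BITS_PER_ITEM <= 32, "items must fit into one 32-bit word");
  static constexpr std::uint32_t ITEM_MASK = (1u << BITS_PER_ITEM) - 1u;
  std::uint32_t data_ = 0;

 public:
  // Read item `index`; `index` may be a runtime value (shift + mask only).
  std::uint32_t get(int index) const
  {
    return (data_ >> (index * BITS_PER_ITEM)) & ITEM_MASK;
  }

  // Overwrite item `index` with the low BITS_PER_ITEM bits of `value`.
  void set(int index, std::uint32_t value)
  {
    const int shift = index * BITS_PER_ITEM;
    data_ = (data_ & ~(ITEM_MASK << shift)) | ((value & ITEM_MASK) << shift);
  }
};

// Composition of two state-transition vectors: result[i] = rhs[lhs[i]].
template <int N, int BITS>
PackedRegArray<N, BITS> compose(const PackedRegArray<N, BITS>& lhs,
                                const PackedRegArray<N, BITS>& rhs)
{
  PackedRegArray<N, BITS> result;
  for (int i = 0; i < N; ++i) {
    result.set(i, rhs.get(static_cast<int>(lhs.get(i))));
  }
  return result;
}
```

With four states and two bits per state, for example, an entire state-transition vector occupies eight bits of a single word.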

In contrast, the following example is fine: foo will be mapped to registers because, if N is known at compile time and sufficiently small (at most tens of items), the loop can be fully unrolled and every index becomes a compile-time constant.

// this is fine, if N is a compile-time constant
for (int32_t i = 1; i < N; ++i) {
  foo[i] = foo[i-1];
}
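
To tie the composition of state-transition vectors back to the prefix scan, here is a small sequential sketch under stated assumptions. The two-state DFA (0 = outside a quoted string, 1 = inside) and the names `TransitionVector`, `transition_for`, and `states_after_each_symbol` are made up for illustration and are not part of this PR. Each input symbol is mapped to a vector whose entry i is the state reached from state i. Because composition is associative, an inclusive scan over these vectors can be evaluated in parallel; `std::partial_sum` merely stands in for that scan here.

```cpp
#include <array>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

constexpr int NUM_STATES = 2;  // hypothetical toy DFA: 0 = outside string, 1 = inside string
using TransitionVector = std::array<int, NUM_STATES>;

// result[i] = rhs[lhs[i]]: apply lhs first, then rhs.
TransitionVector compose(const TransitionVector& lhs, const TransitionVector& rhs)
{
  TransitionVector result{};
  for (int i = 0; i < NUM_STATES; ++i) {
    result[i] = rhs[lhs[i]];
  }
  return result;
}

// Transition vector of a single symbol: a quote toggles the state,
// every other symbol leaves the state unchanged.
TransitionVector transition_for(char symbol)
{
  return symbol == '"' ? TransitionVector{1, 0} : TransitionVector{0, 1};
}

// State reached after each input symbol when starting in `start_state`.
std::vector<int> states_after_each_symbol(const std::string& input, int start_state)
{
  std::vector<TransitionVector> vectors(input.size());
  for (std::size_t i = 0; i < input.size(); ++i) {
    vectors[i] = transition_for(input[i]);
  }
  // Inclusive scan using composition as the (associative) operator.
  std::partial_sum(vectors.begin(), vectors.end(), vectors.begin(), compose);

  std::vector<int> states(input.size());
  for (std::size_t i = 0; i < input.size(); ++i) {
    states[i] = vectors[i][start_state];
  }
  return states;
}
```

Note that the scan result holds the outcome for every possible start state at once, which is what lets independent chunks of the input be processed before their actual start state is known.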

Style & CUB Integration

The following files may be considered for integration into CUB at a later point, hence the deviation in style from cuDF.

  • in_reg_array.cuh
  • agent_dfa.cuh
  • device_dfa.cuh
  • dispatch_dfa.cuh

@elstehle elstehle requested a review from a team as a code owner July 12, 2022 10:55
@elstehle elstehle requested review from karthikeyann and vuule July 12, 2022 10:55
@github-actions github-actions bot added CMake CMake build issue libcudf Affects libcudf (C++/CUDA) code. labels Jul 12, 2022
@elstehle elstehle added feature request New feature or request 3 - Ready for Review Ready for review by team cuIO cuIO issue non-breaking Non-breaking change labels Jul 12, 2022
@karthikeyann karthikeyann modified the milestone: Nested JSON reader Jul 12, 2022
@karthikeyann karthikeyann mentioned this pull request Jul 12, 2022
4 tasks
@codecov

codecov bot commented Jul 12, 2022

Codecov Report

Merging #11242 (8a54c72) into branch-22.08 (b2dd1bf) will increase coverage by 0.03%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.08   #11242      +/-   ##
================================================
+ Coverage         86.34%   86.37%   +0.03%     
================================================
  Files               144      144              
  Lines             22826    22826              
================================================
+ Hits              19708    19715       +7     
+ Misses             3118     3111       -7     
| Impacted Files | Coverage Δ |
|---|---|
| python/cudf/cudf/core/dataframe.py | 93.57% <0.00%> (+0.04%) ⬆️ |
| python/cudf/cudf/core/column/string.py | 88.80% <0.00%> (+0.12%) ⬆️ |
| python/cudf/cudf/core/groupby/groupby.py | 91.02% <0.00%> (+0.21%) ⬆️ |
| python/cudf/cudf/core/column/numerical.py | 96.19% <0.00%> (+0.29%) ⬆️ |
| python/cudf/cudf/core/tools/datetimes.py | 84.49% <0.00%> (+0.30%) ⬆️ |
| python/cudf/cudf/core/column/lists.py | 91.70% <0.00%> (+0.97%) ⬆️ |

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b2dd1bf...8a54c72. Read the comment docs.

@elstehle elstehle changed the title from "Feature/finite state transducer" to "Adds the Finite-State Transducer algorithm" Jul 12, 2022
@vuule
Contributor

vuule commented Jul 12, 2022

Absolute unit!
For now, I have to say that the description is 🔥

@upsj
Contributor

upsj commented Jul 19, 2022

Actually, let me move the discussion from Slack here:

One suggestion that came to my mind: Every DFA has something like an error state. Do you think it would be possible to integrate that with the transducer? Right now, it would spam the output with "error" symbols. If we extended the output offset prefix sum with a bool has_error and an or reduction operation that only sums up if not lhs.has_error, it would output only a single token that can even be used to print a useful error message.

@elstehle
Contributor Author

> One suggestion that came to my mind: Every DFA has something like an error state. Do you think it would be possible to integrate that with the transducer? Right now, it would spam the output with "error" symbols. If we extended the output offset prefix sum with a bool has_error and an or reduction operation that only sums up if has_error is false, it would output only a single token that can even be used to print a useful error message.

I think what you're describing can be achieved quite naturally by the user simply defining an error trap state. I.e., a state that once entered will not be left. So, the FST would emit just one single "error" symbol when that state is being entered. At the same time it allows you to pinpoint where in the input we began seeing that error state. All that without having to worry about it in the FST implementation.

@upsj
Contributor

upsj commented Jul 19, 2022

Ah thanks, I mixed up output on states vs. transitions - if all transitions into the error state output an error symbol, but transitions inside the error state don't output anything.

@elstehle
Contributor Author

> Ah thanks, I mixed up output on states vs. transitions - if all transitions into the error state output an error symbol, but transitions inside the error state don't output anything.

Exactly 👍
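
For readers following this thread, the error trap state can be sketched with a toy transducer. This is a minimal sequential illustration under assumptions, not the FST interface added by this PR: the two states, the rule "digits are valid, everything else is an error", the `'E'` error symbol, and the names `FstResult` and `run_toy_fst` are all hypothetical. Output is attached to transitions: the transition into the trap state emits one error symbol, and transitions that stay inside the trap state emit nothing.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

enum State { ST_OK = 0, ST_ERROR = 1 };  // ST_ERROR is a trap state: once entered, never left

struct FstResult {
  std::string output;           // transduced symbols
  std::ptrdiff_t error_offset;  // input offset at which the trap state was entered, -1 if never
};

FstResult run_toy_fst(const std::string& input)
{
  FstResult result{"", -1};
  State state = ST_OK;
  for (std::size_t i = 0; i < input.size(); ++i) {
    const bool is_digit = std::isdigit(static_cast<unsigned char>(input[i])) != 0;
    const State next    = (state == ST_ERROR || !is_digit) ? ST_ERROR : ST_OK;

    if (state == ST_OK && next == ST_OK) {
      result.output.push_back(input[i]);  // valid symbol: copy it to the output
    } else if (state == ST_OK && next == ST_ERROR) {
      result.output.push_back('E');  // entering the trap state: emit exactly one error symbol
      result.error_offset = static_cast<std::ptrdiff_t>(i);
    }
    // ST_ERROR -> ST_ERROR: the transition emits nothing
    state = next;
  }
  return result;
}
```

For the input `12x4y`, this sketch emits `12E` and reports offset 2 as the position where the error began.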

@karthikeyann
Contributor

rerun tests

Contributor

@karthikeyann karthikeyann left a comment

Wonderful work! @elstehle Thank you for this. Looks good.
Template heavy & "CUB" style code!
Amazing work by reviewers @vuule @upsj and great suggestions to improve the code.

@karthikeyann karthikeyann added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Jul 22, 2022
@karthikeyann
Contributor

@gpucibot merge

@rapids-bot rapids-bot bot merged commit ebcea0f into rapidsai:branch-22.08 Jul 22, 2022
rapids-bot bot pushed a commit that referenced this pull request Jul 26, 2022
Depends on #11242 Feature/finite state transducer 

Benchmark for Finite State Transducer
parse and identify JSON symbols
- [x] FST with output, output index, output str
- [x] FST without output index
- [x] FST without output
- [x] FST without output str

Look into elstehle#1 for files modified only in this PR (i.e., excluding the parent PR that this one depends on)

Authors:
  - Karthikeyan (https://github.com/karthikeyann)
  - Elias Stehle (https://github.com/elstehle)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Elias Stehle (https://github.com/elstehle)

URL: #11243
rapids-bot bot pushed a commit that referenced this pull request Aug 6, 2022
This PR builds on the _Finite-State Transducer_ (_FST_) algorithm and the _Logical Stack_ to implement a tokenizer that demarcates sections from the JSON input and assigns a category to each such section.

**This PR builds on:**
⛓️ #11242
⛓️ #11078

Specifically, the tokenizer comprises the following processing steps:
1. An FST emits a sequence of stack operations (i.e., push(LIST), push(STRUCT), pop(), read()). This FST transduces each occurrence of a semantic opening bracket or brace to a push(LIST) or push(STRUCT) operation, respectively. Each semantic closing bracket or brace is transduced to a pop() operation. All other input is transduced to a read() operation.
2. The sequence of stack operations from (1) is fed into the logical stack, which resolves what is on top of the stack before each operation from (1) (i.e., STRUCT, LIST). After this stage, for every input character we know what is on top of the stack: either a STRUCT or a LIST, or ROOT if there is no symbol on top of the stack.
3. We use the top-of-stack information from (2) for a second FST. This part can be considered a full pushdown automaton or DVPA (because now we also have stack context). State transitions are caused by the combination of the input character and the top-of-stack for that character. The output of this stage is the token stream ({beginning-of, end-of}x{struct, list}, field name, value, etc.).
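
Steps (1) and (2) above can be sketched sequentially as follows. This is a hedged CPU illustration only: the actual stages run as a parallel FST and the logical stack on the GPU, and the names `StackOp`, `to_stack_ops`, `top_of_stack_before_each_op`, and the one-character encoding of ROOT, LIST, and STRUCT are assumptions made for this example. Brackets and braces inside quoted strings are treated as non-semantic and become read() operations.

```cpp
#include <string>
#include <vector>

enum class StackOp { PushList, PushStruct, Pop, Read };

// Hypothetical one-character encoding of what is on top of the stack.
constexpr char TOS_ROOT   = '_';
constexpr char TOS_LIST   = '[';
constexpr char TOS_STRUCT = '{';

// Step (1): transduce every input character to a stack operation.
std::vector<StackOp> to_stack_ops(const std::string& json)
{
  std::vector<StackOp> ops;
  ops.reserve(json.size());
  bool in_string = false;
  bool escaped   = false;
  for (char c : json) {
    StackOp op = StackOp::Read;
    if (in_string) {
      if (escaped) {
        escaped = false;  // character following a backslash is taken literally
      } else if (c == '\\') {
        escaped = true;
      } else if (c == '"') {
        in_string = false;
      }
    } else if (c == '"') {
      in_string = true;
    } else if (c == '[') {
      op = StackOp::PushList;
    } else if (c == '{') {
      op = StackOp::PushStruct;
    } else if (c == ']' || c == '}') {
      op = StackOp::Pop;
    }
    ops.push_back(op);
  }
  return ops;
}

// Step (2): resolve what is on top of the stack before each operation.
std::string top_of_stack_before_each_op(const std::vector<StackOp>& ops)
{
  std::string stack;
  std::string tops;
  tops.reserve(ops.size());
  for (StackOp op : ops) {
    tops.push_back(stack.empty() ? TOS_ROOT : stack.back());
    switch (op) {
      case StackOp::PushList: stack.push_back(TOS_LIST); break;
      case StackOp::PushStruct: stack.push_back(TOS_STRUCT); break;
      case StackOp::Pop:
        if (!stack.empty()) { stack.pop_back(); }
        break;
      case StackOp::Read: break;
    }
  }
  return tops;
}
```

The resulting per-character top-of-stack string is the extra context that the second FST in step (3) would consume alongside each input character.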

Authors:
  - Elias Stehle (https://github.com/elstehle)
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Tobias Ribizel (https://github.com/upsj)
  - Karthikeyan (https://github.com/karthikeyann)
  - Yunsong Wang (https://github.com/PointKernel)
  - Bradley Dice (https://github.com/bdice)

URL: #11264
rapids-bot bot pushed a commit that referenced this pull request Sep 19, 2022
Adds GPU implementation of JSON-token-stream to JSON-tree 
Depends on PR [Adds JSON-token-stream to JSON-tree](#11291)  #11291 




<details>

---
This PR adds the stage of converting a JSON input into a tree representation, where each node represents either a struct, a list, a field name, a string value, a value, or an error node.  

The PR is part of a multi-part PR-chain. Specifically, this PR builds on the [JSON tokenizer PR](#11264).

**This PR depends on:**
⛓️ #11264
⛓️ #11242
⛓️ #11078

**Each node has one of the following categories:**

```
/// A node representing a struct
NC_STRUCT,
/// A node representing a list
NC_LIST,
/// A node representing a field name
NC_FN,
/// A node representing a string value
NC_STR,
/// A node representing a numeric or literal value (e.g., true, false, null)
NC_VAL,
/// A node representing a parser error
NC_ERR
```

**For each node, the tree representation stores the following information:**
- node category
- node level
- node range begin (index of the first character from the original JSON input that this node demarcates)
- node range end (index one past the last character from the original JSON input that this node demarcates)
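
As a small illustration of how this per-node information can be read, the sketch below bundles the four fields into one struct and recovers a node's text from its half-open range `[begin, end)`. The category names are taken from the list above; the `TreeNode` struct, its field names, the `node_text` helper, and the interpretation of the level as nesting depth are assumptions made for this example, and the actual implementation may store these fields differently.

```cpp
#include <cstddef>
#include <string>

// Node categories as listed above.
enum NodeCategory { NC_STRUCT, NC_LIST, NC_FN, NC_STR, NC_VAL, NC_ERR };

// Hypothetical per-node record holding the four pieces of information.
struct TreeNode {
  NodeCategory category;
  int level;                // assumed here to be the nesting depth of the node
  std::size_t range_begin;  // index of the first character the node demarcates
  std::size_t range_end;    // index one past the last character the node demarcates
};

// Extract the characters of the original JSON input covered by a node.
std::string node_text(const std::string& json, const TreeNode& node)
{
  return json.substr(node.range_begin, node.range_end - node.range_begin);
}
```

For an input that starts with two spaces followed by `[{"category": "reference"`, a field-name node with range `[5, 13)` yields `category` and a string node with range `[17, 26)` yields `reference`.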

**An example tree:**
The following is just an example print of the information represented in the tree generated by the algorithm.

- Each line is printing the full path to the next node in the tree. 
- For each node along the path we have the following format: `<[NODE_ID]:[NODE_CATEGORY]:[[RANGE_BEGIN],[RANGE_END]) '[STRING_FROM_RANGE]'>`


**The original JSON for this tree:**
```
  [{"category": "reference","index:": [4,12,42],"author": "Nigel Rees","title": "[Sayings of the Century]","price": 8.95},  {"category": "reference","index": [4,{},null,{"a":[{ }, {}] } ],"author": "Nigel Rees","title": "{}[], <=semantic-symbols-string","price": 8.95}] 
```

**The tree:**
```
<0:LIST:[2, 3) '['>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <2:FN:[5, 13) 'category'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <2:FN:[5, 13) 'category'> -> <3:STR:[17, 26) 'reference'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <6:VAL:[39, 40) '4'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <7:VAL:[41, 43) '12'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <8:VAL:[44, 46) '42'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <9:FN:[49, 55) 'author'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <9:FN:[49, 55) 'author'> -> <10:STR:[59, 69) 'Nigel Rees'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <11:FN:[72, 77) 'title'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <11:FN:[72, 77) 'title'> -> <12:STR:[81, 105) '[Sayings of the Century]'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <13:FN:[108, 113) 'price'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <13:FN:[108, 113) 'price'> -> <14:VAL:[116, 120) '8.95'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <16:FN:[126, 134) 'category'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <16:FN:[126, 134) 'category'> -> <17:STR:[138, 147) 'reference'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <20:VAL:[159, 160) '4'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <21:STRUCT:[161, 162) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <22:VAL:[164, 168) 'null'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['> -> <26:STRUCT:[175, 176) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['> -> <27:STRUCT:[180, 181) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <28:FN:[189, 195) 'author'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <28:FN:[189, 195) 'author'> -> <29:STR:[199, 209) 'Nigel Rees'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <30:FN:[212, 217) 'title'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <30:FN:[212, 217) 'title'> -> <31:STR:[221, 252) '{}[], <=semantic-symbols-string'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <32:FN:[255, 260) 'price'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <32:FN:[255, 260) 'price'> -> <33:VAL:[263, 267) '8.95'>
```

**The original JSON pretty-printed for this tree:**
```
[
    {
        "category": "reference",
        "index:": [
            4,
            12,
            42
        ],
        "author": "Nigel Rees",
        "title": "[Sayings of the Century]",
        "price": 8.95
    },
    {
        "category": "reference",
        "index": [
            4,
            {},
            null,
            {
                "a": [
                    {},
                    {}
                ]
            }
        ],
        "author": "Nigel Rees",
        "title": "{}[], <=semantic-symbols-string",
        "price": 8.95
    }
]
```
</details>

---

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Michael Wang (https://github.com/isVoid)
  - David Wendt (https://github.com/davidwendt)

URL: #11518
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge CMake CMake build issue cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
4 participants