Adds the Logical Stack algorithm #11078

Merged: 22 commits, Jul 11, 2022

@elstehle elstehle commented Jun 8, 2022

Description
This PR adds the Logical Stack, an algorithm required by the JSON parser. The Logical Stack takes a sequence of stack operations (i.e., push(X), pop(), read()) and treats them as if they were applied, in the given order, to a regular stack data structure. For each operation in the sequence, the algorithm resolves the stack state and writes out the item that is on top of the stack before that operation is applied. Because the stack may be empty for some operations, the algorithm uses a user-specified sentinel symbol to represent the "empty stack" (i.e., there is no item on top of the stack).
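
To make the contract concrete, here is a minimal sequential reference on the host. It is only a sketch of the semantics described above, not the parallel GPU implementation from this PR. The names `StackOp`, `top_of_stack_before`, and `classify_json` are hypothetical, and treating a pop on an empty stack as a no-op is an assumption.

```cpp
#include <string>
#include <vector>

// Kinds of stack operations, mirroring push(X), pop(), and read().
enum class StackOp { Push, Pop, Read };

// Hypothetical classifier: JSON brackets and braces push or pop, all else reads.
inline StackOp classify_json(char c)
{
  if (c == '[' || c == '{') { return StackOp::Push; }
  if (c == ']' || c == '}') { return StackOp::Pop; }
  return StackOp::Read;
}

// For every symbol, record what is on top of the stack *before* the
// symbol's operation is applied; an empty stack yields the sentinel.
template <typename Classify>
std::string top_of_stack_before(std::string const& symbols,
                                Classify classify,
                                char empty_stack_symbol)
{
  std::string out;
  out.reserve(symbols.size());
  std::vector<char> stack;
  for (char symbol : symbols) {
    out.push_back(stack.empty() ? empty_stack_symbol : stack.back());
    switch (classify(symbol)) {
      case StackOp::Push: stack.push_back(symbol); break;
      case StackOp::Pop:
        // Assumption: popping an empty stack is silently ignored.
        if (!stack.empty()) { stack.pop_back(); }
        break;
      case StackOp::Read: break;
    }
  }
  return out;
}
```

For example, `top_of_stack_before("[{}x]", classify_json, '_')` yields `_[{[[`: the first symbol sees the empty stack, and every later symbol sees whatever the preceding operations left on top.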

How the Logical Stack is implemented is illustrated in this presentation:
https://docs.google.com/presentation/d/16r-0SlQFd-7fH2R7I06tc_JqsAd_0GrTgh_q20sJ2ak/edit?usp=sharing

The only deviation from the algorithm presented in the slides is an optimisation for sparse sequences of stack operations. That is, in the case of the JSON parser, we only pass the symbols that actually push or pop (i.e., {, [, }, and ]), along with the index at which each operation occurred. Positions that follow a push or pop, up to the next listed operation, are treated as reads and are filled with the top-of-stack symbol inferred once that push or pop has been applied.

Results from intermediate processing steps can be dumped to stdout by setting:

export CUDA_DBG_DUMP=1

For instance:

//            0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
d_input  = "  [  {  }  ,  n  u  l  l  ,  [  1  ,  2  ,  3  ]  ,  [  1  ]  ]  ";

// This is the sparse representation we feed into the logical stack.
// The algorithm's contract is: positions not present in the sparse list only read the top of the stack.
d_symbols  = "  [  {  }  [  ]  [  ]  ]  "
d_indexes  = "  0  1  2  9 15 17 19 20  "

// Function object used for classifying the kind of stack operation a symbol represents
struct ToStackOp {
  __host__ __device__ fst::stack_op_type operator()(
    char const& symbol) const
  {
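    // Both '[' and '{' push and both ']' and '}' pop; otherwise the '{' shown in the output below could never be on top of the stack
    return (symbol == '[' || symbol == '{')   ? fst::stack_op_type::PUSH
           : (symbol == ']' || symbol == '}') ? fst::stack_op_type::POP
                                              : fst::stack_op_type::READ;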
  }
};

// The symbol that we'll put whenever there's nothing on the stack
auto empty_stack_symbol = '_';

// A symbol that neither pushes nor pops (i.e., it only reads the top of the stack)
auto read_symbol = 'x';

// Type sufficiently large to cover [-max_stack_level, max_stack_level]
using stack_level_t = int8_t;
fst::sparse_stack_op_to_top_of_stack<stack_level_t>(
                          d_symbols,
                          d_indexes,
                          ToStackOp{},
                          d_top_of_stack_out,
                          empty_stack_symbol,
                          read_symbol,
                          d_symbols.size(), // input size (num. items in sparse representation)
                          d_input.size(),   // output size (num. items in dense representation)
                          stream);

// The output represents the symbol that was on top of the stack prior to applying the stack operation
d_input             = "  [  {  }  ,  n  u  l  l  ,  [  1  ,  2  ,  3  ]  ,  [  1  ]  ]  "; // <<-- original input
d_top_of_stack_out  = "  _  [  {  [  [  [  [  [  [  [  [  [  [  [  [  [  [  [  [  [  [  "; // <<-- logical stack output
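
The expected output above can be cross-checked on the host with a small sequential sketch of the sparse contract. This is not the GPU algorithm from this PR. `sparse_top_of_stack` is a hypothetical helper, and it assumes that `indexes` is sorted in ascending order and that the sparse list contains only bracket and brace symbols.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Expands a sparse list of push/pop symbols (with their positions) into the
// dense top-of-stack sequence. Positions that are not listed act as reads.
std::string sparse_top_of_stack(std::string const& symbols,
                                std::vector<std::size_t> const& indexes,
                                std::size_t output_size,
                                char empty_stack_symbol)
{
  std::string out(output_size, empty_stack_symbol);
  std::vector<char> stack;
  std::size_t next = 0;  // next entry of the sparse list to consume
  for (std::size_t pos = 0; pos < output_size; ++pos) {
    // Top of the stack before the operation (if any) at this position.
    out[pos] = stack.empty() ? empty_stack_symbol : stack.back();
    if (next < symbols.size() && indexes[next] == pos) {
      char const symbol = symbols[next++];
      if (symbol == '[' || symbol == '{') {
        stack.push_back(symbol);
      } else if ((symbol == ']' || symbol == '}') && !stack.empty()) {
        stack.pop_back();
      }
    }
  }
  return out;
}
```

Calling it with the symbols `[{}[][]]`, the indexes `0 1 2 9 15 17 19 20`, an output size of 21, and `'_'` as the empty-stack symbol reproduces the `d_top_of_stack_out` row shown above.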

@elstehle elstehle added the 3 - Ready for Review Ready for review by team label Jun 8, 2022
@elstehle elstehle requested a review from a team as a code owner June 8, 2022 12:33
@elstehle elstehle requested review from mythrocks and davidwendt June 8, 2022 12:33
@github-actions github-actions bot added CMake CMake build issue libcudf Affects libcudf (C++/CUDA) code. labels Jun 8, 2022
@elstehle elstehle added feature request New feature or request cuIO cuIO issue non-breaking Non-breaking change labels Jun 8, 2022

@PointKernel PointKernel left a comment

LGTM

@elstehle

rerun tests

elstehle commented Jul 5, 2022

rerun tests

@karthikeyann karthikeyann left a comment

There is a bit of unused code. Everything else looks good.

@elstehle

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 8c39130 into rapidsai:branch-22.08 Jul 11, 2022
rapids-bot bot pushed a commit that referenced this pull request Aug 6, 2022
This PR builds on the _Finite-State Transducer_ (_FST_) algorithm and the _Logical Stack_ to implement a tokenizer that demarcates sections from the JSON input and assigns a category to each such section.

**This PR builds on:**
⛓️ #11242
⛓️ #11078

Specifically, the tokenizer comprises the following processing steps:
1. An FST emits the sequence of stack operations (i.e., push(LIST), push(STRUCT), pop(), read()). This FST transduces each occurrence of a semantic opening bracket or brace to push(LIST) or push(STRUCT), respectively. Each semantic closing bracket or brace is transduced to a pop() operation. All other input is transduced to a read() operation.
2. The sequence of stack operations from (1) is fed into the logical stack, which resolves what is on top of the stack before each operation from (1) (i.e., STRUCT, LIST). After this stage, we know for every input character what is on top of the stack: STRUCT, LIST, or ROOT (ROOT if there is no symbol on top of the stack).
3. We use the top-of-stack information from (2) for a second FST. This part can be considered a full pushdown automaton or DVPA (because now we also have stack context). State transitions are caused by the combination of the input character and the top-of-stack symbol for that character. The output of this stage is the token stream: {beginning-of, end-of} x {struct, list}, field name, value, etc.
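
Steps (1) and (2) can be sketched sequentially on the host as follows. This is an illustration under stated assumptions, not the FST-based GPU implementation. The helper names are hypothetical, `'x'` stands in for read(), and `'_'`, `'['`, and `'{'` stand in for ROOT, LIST, and STRUCT. "Semantic" is taken here to mean "outside a double-quoted string", with backslash escapes honoured.

```cpp
#include <string>
#include <vector>

// Step 1 (sketch): map each input character to a stack-operation symbol.
// Brackets and braces inside string literals are not semantic, so they
// become reads ('x') like any other character.
std::string to_stack_ops(std::string const& json)
{
  std::string ops;
  ops.reserve(json.size());
  bool in_string = false;
  bool escaped   = false;
  for (char c : json) {
    char op = 'x';
    if (in_string) {
      if (escaped) {
        escaped = false;  // the escaped character never ends the string
      } else if (c == '\\') {
        escaped = true;
      } else if (c == '"') {
        in_string = false;
      }
    } else if (c == '"') {
      in_string = true;
    } else if (c == '[' || c == '{' || c == ']' || c == '}') {
      op = c;  // semantic bracket or brace: push or pop
    }
    ops.push_back(op);
  }
  return ops;
}

// Step 2 (sketch): resolve what is on top of the stack before each operation.
std::string resolve_stack_context(std::string const& ops)
{
  std::string context;
  context.reserve(ops.size());
  std::vector<char> stack;
  for (char op : ops) {
    context.push_back(stack.empty() ? '_' : stack.back());
    if (op == '[' || op == '{') {
      stack.push_back(op);
    } else if ((op == ']' || op == '}') && !stack.empty()) {
      stack.pop_back();
    }
  }
  return context;
}
```

For the input `{"a":"["}`, `to_stack_ops` returns `{xxxxxxx}` because the `[` sits inside a string, and `resolve_stack_context` then reports ROOT for the first character and STRUCT for the remaining eight. Step (3) would consume each input character together with its context symbol.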

Authors:
  - Elias Stehle (https://github.com/elstehle)
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Tobias Ribizel (https://github.com/upsj)
  - Karthikeyan (https://github.com/karthikeyann)
  - Yunsong Wang (https://github.com/PointKernel)
  - Bradley Dice (https://github.com/bdice)

URL: #11264
rapids-bot bot pushed a commit that referenced this pull request Sep 19, 2022
Adds a GPU implementation of the JSON-token-stream to JSON-tree conversion.
Depends on PR [Adds JSON-token-stream to JSON-tree](#11291).

<details>

---
This PR adds the stage of converting a JSON input into a tree representation, where each node represents either a struct, a list, a field name, a string value, a value, or an error node.  

The PR is part of a multi-part PR-chain. Specifically, this PR builds on the [JSON tokenizer PR](#11264).

**This PR depends on:**
⛓️ #11264
⛓️ #11242
⛓️ #11078

**Each node has one of the following categories:**

```
/// A node representing a struct
NC_STRUCT,
/// A node representing a list
NC_LIST,
/// A node representing a field name
NC_FN,
/// A node representing a string value
NC_STR,
/// A node representing a numeric or literal value (e.g., true, false, null)
NC_VAL,
/// A node representing a parser error
NC_ERR
```

**For each node, the tree representation stores the following information:**
- node category
- node level
- node range begin (index of the first character from the original JSON input that this node demarcates)
- node range end (index one past the last character from the original JSON input that this node demarcates)

**An example tree:**
The following is an example printout of the information represented in the tree that the algorithm generates.

- Each line prints the full path from the root to the next node in the tree.
- Each node along the path is printed in the following format: `<[NODE_ID]:[NODE_CATEGORY]:[[RANGE_BEGIN],[RANGE_END]) '[STRING_FROM_RANGE]'>`
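
The per-node information and the print format can be illustrated with a small host-side sketch. The `Node` struct, `NodeCategory` enum, and both helper functions are hypothetical stand-ins rather than the types used in this PR, and numbering the root as level 0 is an assumption.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class NodeCategory { Struct, List, FieldName, String, Value, Error };

// One record per tree node, holding the four pieces of information listed above.
struct Node {
  NodeCategory category;
  int level;                // depth in the tree (assumption: the root is level 0)
  std::size_t range_begin;  // index of the first character of the node
  std::size_t range_end;    // index one past the last character of the node
};

inline char const* category_name(NodeCategory category)
{
  switch (category) {
    case NodeCategory::Struct: return "STRUCT";
    case NodeCategory::List: return "LIST";
    case NodeCategory::FieldName: return "FN";
    case NodeCategory::String: return "STR";
    case NodeCategory::Value: return "VAL";
    default: return "ERR";
  }
}

// Formats one node as <ID:CATEGORY:[BEGIN, END) 'STRING_FROM_RANGE'>.
std::string format_node(std::size_t id, Node const& node, std::string const& json)
{
  return "<" + std::to_string(id) + ":" + category_name(node.category) + ":[" +
         std::to_string(node.range_begin) + ", " + std::to_string(node.range_end) + ") '" +
         json.substr(node.range_begin, node.range_end - node.range_begin) + "'>";
}

// Joins the nodes along a root-to-node path with " -> ".
std::string format_path(std::vector<std::size_t> const& path_ids,
                        std::vector<Node> const& nodes,
                        std::string const& json)
{
  std::string line;
  for (std::size_t i = 0; i < path_ids.size(); ++i) {
    if (i > 0) { line += " -> "; }
    line += format_node(path_ids[i], nodes[path_ids[i]], json);
  }
  return line;
}
```

Because the range is half-open, `range_end - range_begin` is the length of the substring, which is why the field name `category` at `[5, 13)` prints as eight characters.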


**The original JSON for this tree:**
```
  [{"category": "reference","index:": [4,12,42],"author": "Nigel Rees","title": "[Sayings of the Century]","price": 8.95},  {"category": "reference","index": [4,{},null,{"a":[{ }, {}] } ],"author": "Nigel Rees","title": "{}[], <=semantic-symbols-string","price": 8.95}] 
```

**The tree:**
```
<0:LIST:[2, 3) '['>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <2:FN:[5, 13) 'category'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <2:FN:[5, 13) 'category'> -> <3:STR:[17, 26) 'reference'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <6:VAL:[39, 40) '4'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <7:VAL:[41, 43) '12'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <8:VAL:[44, 46) '42'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <9:FN:[49, 55) 'author'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <9:FN:[49, 55) 'author'> -> <10:STR:[59, 69) 'Nigel Rees'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <11:FN:[72, 77) 'title'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <11:FN:[72, 77) 'title'> -> <12:STR:[81, 105) '[Sayings of the Century]'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <13:FN:[108, 113) 'price'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <13:FN:[108, 113) 'price'> -> <14:VAL:[116, 120) '8.95'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <16:FN:[126, 134) 'category'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <16:FN:[126, 134) 'category'> -> <17:STR:[138, 147) 'reference'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <20:VAL:[159, 160) '4'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <21:STRUCT:[161, 162) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <22:VAL:[164, 168) 'null'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['> -> <26:STRUCT:[175, 176) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['> -> <27:STRUCT:[180, 181) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <28:FN:[189, 195) 'author'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <28:FN:[189, 195) 'author'> -> <29:STR:[199, 209) 'Nigel Rees'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <30:FN:[212, 217) 'title'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <30:FN:[212, 217) 'title'> -> <31:STR:[221, 252) '{}[], <=semantic-symbols-string'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <32:FN:[255, 260) 'price'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <32:FN:[255, 260) 'price'> -> <33:VAL:[263, 267) '8.95'>
```

**The original JSON pretty-printed for this tree:**
```
[
    {
        "category": "reference",
        "index:": [
            4,
            12,
            42
        ],
        "author": "Nigel Rees",
        "title": "[Sayings of the Century]",
        "price": 8.95
    },
    {
        "category": "reference",
        "index": [
            4,
            {},
            null,
            {
                "a": [
                    {},
                    {}
                ]
            }
        ],
        "author": "Nigel Rees",
        "title": "{}[], <=semantic-symbols-string",
        "price": 8.95
    }
]
```
</details>

---

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Michael Wang (https://github.com/isVoid)
  - David Wendt (https://github.com/davidwendt)

URL: #11518