Adds the end-to-end JSON parser implementation #11388

Merged
merged 97 commits into from
Aug 12, 2022
Changes from 95 commits
0557d41
squashed with bracket/brace test
elstehle Apr 11, 2022
355d1e4
clean up & addressing review comments
elstehle Apr 20, 2022
39a6b65
refactored lookup tables
elstehle Apr 25, 2022
239f138
put lookup tables into their own cudf file
elstehle Apr 25, 2022
39cff80
Change interface for FST to not need temp storage
elstehle Apr 27, 2022
e24a133
removing unused var post-cleanup
elstehle May 4, 2022
caf6195
unified usage of pragma unrolls
elstehle May 9, 2022
ea79a81
Adding hostdevice macros to in-reg array
elstehle May 9, 2022
17dcbfd
making const vars const
elstehle May 9, 2022
6fdd24a
refactor lut sanity check
elstehle May 9, 2022
eccf970
fixes sg-count & uses rmm stream in fst tests
elstehle Jun 2, 2022
9fe8e4b
minor doxygen fix
elstehle Jun 14, 2022
694a365
adopts suggested fst test changes
elstehle Jun 15, 2022
f656f49
adopts device-side test data gen
elstehle Jul 7, 2022
485a1c6
adopts c++17 namespaces declarations
elstehle Jul 9, 2022
5f1c4b5
removes state vector-wrapper in favor of vanilla array
elstehle Jul 11, 2022
e6f8def
some west-const remainders & unifies StateIndexT
elstehle Jul 11, 2022
a798852
adds check for state transition narrowing conversion
elstehle Jul 11, 2022
eb24962
fixes logical stack test includes
elstehle Jul 12, 2022
f52e614
replaces enum with typed constexpr
elstehle Jul 14, 2022
3038058
adds explicit error checking
elstehle Jul 14, 2022
d351e5c
addresses style review comments & fixes a todo
elstehle Jul 14, 2022
3f47952
replaces gtest asserts with expects
elstehle Jul 14, 2022
cba1619
fixes style in dispatch dfa
elstehle Jul 14, 2022
bea2a02
replaces vanilla loop with iota
elstehle Jul 15, 2022
8a184e9
rephrases documentation on in-reg array
elstehle Jul 16, 2022
78dd893
Merge remote-tracking branch 'upstream/branch-22.08' into feature/fin…
elstehle Jul 16, 2022
7a19f64
Merge remote-tracking branch 'upstream/branch-22.08' into feature/fin…
elstehle Jul 19, 2022
4783aae
improves style in fst test
elstehle Jul 20, 2022
6203709
adds comments in in_reg array
elstehle Jul 20, 2022
ad5817a
adds comments to lookup tables
elstehle Jul 20, 2022
dc55653
fixes formatting
elstehle Jul 20, 2022
378be9f
exchanges loops in favor of copy and fills
elstehle Jul 20, 2022
4ba5472
clarifies documentation in agent dfa
elstehle Jul 20, 2022
7980978
disambiguates transition and translation tables
elstehle Jul 20, 2022
2bce061
minor style fix
elstehle Jul 21, 2022
b37f716
if constexprs and doxy on DFA helper
elstehle Jul 21, 2022
d42869a
minor documentation fix
elstehle Jul 21, 2022
6c889f7
replaces loop for comparing vectors with generic macro
elstehle Jul 21, 2022
8a54c72
uses new vector comparison for logical stack test
elstehle Jul 21, 2022
cc1e135
Added utility to debug print & instrumented code to use it
elstehle Mar 31, 2022
7dba177
switched to using rmm also inside algorithm
elstehle Mar 31, 2022
ff7144a
renaming key-value store op to stack_op
elstehle Apr 4, 2022
61a76b7
device_span
elstehle Apr 4, 2022
c28e327
minor style changes addressing review comments
elstehle Apr 13, 2022
a2f27ae
squashed with bracket/brace test
elstehle Apr 11, 2022
fe4762d
refactored lookup tables
elstehle Apr 25, 2022
a064bdd
put lookup tables into their own cudf file
elstehle Apr 25, 2022
2c729c0
fixes sg-count & uses rmm stream in fst tests
elstehle Jun 2, 2022
dbefb6c
rebase on latest FST
elstehle May 3, 2022
d54f3e5
fixes breaking changes from dependent-FST-PR
elstehle Jun 2, 2022
5fc3399
fixes for breaking downstream interface changes
elstehle Jul 13, 2022
6f65947
wraps if with stream params into detail ns
elstehle Jul 13, 2022
6ffc7f3
renames enums & moving from device_span to ptr params
elstehle Jul 14, 2022
0a7821e
fixes rebase conflicts
elstehle Jul 21, 2022
7396335
fixes escape sequence inside strings and field names and adds test fo…
elstehle Jul 21, 2022
6252208
adds comments on pda transition table states
elstehle Jul 21, 2022
191d71d
adopts new verification macro in test
elstehle Jul 22, 2022
4e99962
Merge branch 'branch-22.08' of https://github.com/rapidsai/cudf into …
karthikeyann Jul 22, 2022
3b9a1ed
removes superfluous semicolons
elstehle Jul 22, 2022
632be35
rearranges token order in enum and adds documentation
elstehle Jul 23, 2022
3237772
uses namespace alias and switch to rmm stream in test
elstehle Jul 23, 2022
d2832b9
drops the gpu namespace
elstehle Jul 24, 2022
618ed3f
renames header file extension from h to hpp
elstehle Jul 24, 2022
19b37b7
squashed with minimal example
elstehle Jul 27, 2022
f015594
add parse_json_to_columns -> cudf::column
karthikeyann Jul 27, 2022
389e8e8
+ wraps dbg print in macro
elstehle Jul 27, 2022
42f7c4a
disables debug print by default
elstehle Jul 27, 2022
3e9db89
Merge remote-tracking branch 'origin/feature/json-to-columnar' into f…
elstehle Jul 27, 2022
ecf68fc
changing interface of get_json_columns to also take device_span
elstehle Jul 28, 2022
93cbe1a
parsing to table_with_metadata
elstehle Jul 28, 2022
a1d8901
removes debug print examples
elstehle Jul 28, 2022
b9296d6
renames lists child col name to elements
elstehle Jul 28, 2022
5eddcc7
adds validity
elstehle Jul 28, 2022
cfcd7a1
fixes style
elstehle Jul 28, 2022
4c2ea7b
minor cleanup
karthikeyann Jul 28, 2022
8fc3adc
use device_uvector at few places
karthikeyann Jul 28, 2022
c1b9213
fixes metadata to match parquets metadata
elstehle Jul 30, 2022
e5b1ba6
Merge branch 'branch-22.08' of https://github.com/rapidsai/cudf into …
karthikeyann Aug 1, 2022
fd02437
use make_device_uvector_async
karthikeyann Aug 1, 2022
f5810f8
Apply suggestions from code review (nvdbaranec)
karthikeyann Aug 1, 2022
342d3c3
added mr
karthikeyann Aug 1, 2022
f6a531a
add utf fail, pass test cases
karthikeyann Aug 2, 2022
2f22899
Merge branch 'branch-22.10' of https://github.com/rapidsai/cudf into …
karthikeyann Aug 4, 2022
1bfa0ff
fixes cudf column generation for struct cols without child cols
elstehle Aug 6, 2022
c8ae264
Merge remote-tracking branch 'upstream/branch-22.10' into feature/jso…
elstehle Aug 6, 2022
17f44e5
integrates upstream changes
elstehle Aug 6, 2022
25cd354
adds function to compare schema metadata
elstehle Aug 6, 2022
f839bb0
adds more complex test case with lots of nesting and corner cases
elstehle Aug 6, 2022
25d783c
fixes year in copyright notice
elstehle Aug 6, 2022
1860126
uses host-side bitmask_type buffer
elstehle Aug 8, 2022
7711736
addresses review comments
elstehle Aug 11, 2022
9d3c2e0
removes prints from tests
elstehle Aug 11, 2022
da6bc9f
Merge remote-tracking branch 'upstream/branch-22.10' into feature/jso…
elstehle Aug 11, 2022
165f3b7
adds support for streaming
elstehle Aug 12, 2022
bf99646
makes append_row a json_column member
elstehle Aug 12, 2022
9ea6e1c
fixes header includes
elstehle Aug 12, 2022
10 changes: 8 additions & 2 deletions cpp/include/cudf_test/io_metadata_utilities.hpp
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2021, NVIDIA CORPORATION.
+ * Copyright (c) 2021-2022, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -22,4 +22,10 @@ namespace cudf::test {
 void expect_metadata_equal(cudf::io::table_input_metadata in_meta,
                            cudf::io::table_metadata out_meta);

-}
+/**
+ * @brief Ensures that the metadata of two tables matches for the root columns as well as for all
+ * descendents (recursively)
+ */
+void expect_metadata_equal(cudf::io::table_metadata lhs_meta, cudf::io::table_metadata rhs_meta);
+
+}  // namespace cudf::test
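
The new overload is documented as comparing root columns and all descendants recursively. A minimal sketch of that kind of comparison follows, assuming a simplified stand-in type: `name_info`, `column_metadata_equal`, and `table_metadata_equal` are hypothetical names for illustration, not the cudf types or the PR's implementation, and the sketch returns a bool rather than raising test expectations.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for per-column metadata: a column name plus the
// metadata of its child columns. This is not the cudf type.
struct name_info {
  std::string name;
  std::vector<name_info> children;
};

// Returns true if both columns have the same name and all of their
// descendants match pairwise (same order, same names, same shape).
bool column_metadata_equal(name_info const& lhs, name_info const& rhs)
{
  if (lhs.name != rhs.name) { return false; }
  if (lhs.children.size() != rhs.children.size()) { return false; }
  for (std::size_t i = 0; i < lhs.children.size(); ++i) {
    if (!column_metadata_equal(lhs.children[i], rhs.children[i])) { return false; }
  }
  return true;
}

// Compares the root columns of two tables, recursing into every descendant.
bool table_metadata_equal(std::vector<name_info> const& lhs, std::vector<name_info> const& rhs)
{
  if (lhs.size() != rhs.size()) { return false; }
  for (std::size_t i = 0; i < lhs.size(); ++i) {
    if (!column_metadata_equal(lhs[i], rhs[i])) { return false; }
  }
  return true;
}
```

In this sketch a mismatch at any depth, whether in a name or in the number of children, makes the whole comparison fail.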
150 changes: 150 additions & 0 deletions cpp/src/io/json/nested_json.hpp
@@ -16,10 +16,17 @@

#pragma once

#include <cudf/io/types.hpp>
#include <cudf/utilities/default_stream.hpp>

#include <cudf/types.hpp>
#include <cudf/utilities/bit.hpp>
#include <cudf/utilities/span.hpp>

#include <rmm/cuda_stream_view.hpp>

#include <algorithm>
#include <iterator>
#include <limits>
#include <map>
#include <string>
#include <vector>

namespace cudf::io::json {

/// Type used to represent the atomic symbol type used within the finite-state machine
@@ -46,6 +53,135 @@ using PdaSymbolGroupIdT = char;
/// Type being emitted by the pushdown automaton transducer
using PdaTokenT = char;

/// Type used to represent the class of a node (or a node "category") within the tree representation
using NodeT = char;

/// Type used to index into the nodes within the tree of structs, lists, field names, and value
/// nodes
using NodeIndexT = uint32_t;

/// Type large enough to represent tree depth from [0, max-tree-depth); may be an unsigned type
using TreeDepthT = StackLevelT;

/**
* @brief Struct that encapsulates all information of a columnar tree representation.
*/
struct tree_meta_t {
std::vector<NodeT> node_categories;
std::vector<NodeIndexT> parent_node_ids;
std::vector<TreeDepthT> node_levels;
std::vector<SymbolOffsetT> node_range_begin;
std::vector<SymbolOffsetT> node_range_end;
};

constexpr NodeIndexT parent_node_sentinel = std::numeric_limits<NodeIndexT>::max();

/**
* @brief Class of a node (or a node "category") within the tree representation
*/
enum node_t : NodeT {
/// A node representing a struct
NC_STRUCT,
/// A node representing a list
NC_LIST,
/// A node representing a field name
NC_FN,
/// A node representing a string value
NC_STR,
/// A node representing a numeric or literal value (e.g., true, false, null)
NC_VAL,
/// A node representing a parser error
NC_ERR,
/// Total number of node classes
NUM_NODE_CLASSES
};
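
To illustrate how the parallel arrays of a columnar tree relate to each other, here is a minimal sketch that derives node levels from parent node ids. It makes several assumptions that the diff does not confirm: nodes are stored in pre-order (every parent precedes its children), values hang under their field-name node, and `TreeDepthT` is a 32-bit unsigned stand-in for `StackLevelT`. `mini_tree`, `make_example_tree`, and `compute_node_levels` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

using NodeIndexT = std::uint32_t;
using TreeDepthT = std::uint32_t;  // assumption: stand-in for StackLevelT

constexpr NodeIndexT parent_node_sentinel = std::numeric_limits<NodeIndexT>::max();

enum node_t : char { NC_STRUCT, NC_LIST, NC_FN, NC_STR, NC_VAL, NC_ERR, NUM_NODE_CLASSES };

// Hypothetical reduced version of tree_meta_t: one entry per node in each vector.
struct mini_tree {
  std::vector<node_t> node_categories;
  std::vector<NodeIndexT> parent_node_ids;
};

// Hand-built tree for the input {"a":[1,"x"]}, assuming pre-order node
// numbering and that a field's value is a child of its field-name node.
mini_tree make_example_tree()
{
  mini_tree tree;
  tree.node_categories = {NC_STRUCT, NC_FN, NC_LIST, NC_VAL, NC_STR};
  tree.parent_node_ids = {parent_node_sentinel, 0, 1, 2, 2};
  return tree;
}

// Derives each node's depth from its parent's depth. Relies on the pre-order
// assumption: a parent's level is already known when its child is visited.
std::vector<TreeDepthT> compute_node_levels(std::vector<NodeIndexT> const& parent_node_ids)
{
  std::vector<TreeDepthT> levels(parent_node_ids.size(), 0);
  for (std::size_t i = 0; i < parent_node_ids.size(); ++i) {
    auto const parent = parent_node_ids[i];
    if (parent != parent_node_sentinel) { levels[i] = levels[parent] + 1; }
  }
  return levels;
}
```

Under these assumptions the root struct sits at level 0 and both list elements share level 3, which is the kind of information `node_levels` stores directly.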

/**
* @brief A column type
*/
enum class json_col_t : char { ListColumn, StructColumn, StringColumn, Unknown };

/**
* @brief Intermediate representation of data from a nested JSON input
*/
struct json_column {
// Type used to count number of rows
using row_offset_t = size_type;

// The inferred type of this column (list, struct, or value/string column)
json_col_t type = json_col_t::Unknown;

std::vector<row_offset_t> string_offsets;
std::vector<row_offset_t> string_lengths;

// Row offsets
std::vector<row_offset_t> child_offsets;

// Validity bitmap
std::vector<bitmask_type> validity;
row_offset_t valid_count = 0;

// Map of child columns, if applicable.
// A list column has a single child column, keyed by the default name "items".
// A struct column has one child column per field, keyed by the field name.
std::map<std::string, json_column> child_columns;

// Counting the current number of items in this column
row_offset_t current_offset = 0;

json_column() = default;
json_column(json_column&& other) = default;
json_column& operator=(json_column&&) = default;
json_column(const json_column&) = delete;
json_column& operator=(const json_column&) = delete;

/**
* @brief Fills the rows up to the given \p up_to_row_offset with nulls.
*
* @param up_to_row_offset The row offset up to which to fill with nulls.
*/
void null_fill(row_offset_t up_to_row_offset)
{
// Fill all the rows up to up_to_row_offset with "empty"/null rows
validity.resize(word_index(up_to_row_offset) + 1);
std::fill_n(std::back_inserter(string_offsets),
up_to_row_offset - string_offsets.size(),
(string_offsets.size() > 0) ? string_offsets.back() : 0);
std::fill_n(std::back_inserter(string_lengths), up_to_row_offset - string_lengths.size(), 0);
std::fill_n(std::back_inserter(child_offsets),
up_to_row_offset + 1 - child_offsets.size(),
(child_offsets.size() > 0) ? child_offsets.back() : 0);
current_offset = up_to_row_offset;
}

/**
* @brief Recursively iterates through the tree of columns making sure that all child columns of a
* struct column have the same row count, filling missing rows with nulls.
*
* @param min_row_count The minimum number of rows to be filled.
*/
void level_child_cols_recursively(row_offset_t min_row_count)
{
// Fill this column with nulls up to the given row count
null_fill(min_row_count);

// If this is a struct column, we need to level all its child columns
if (type == json_col_t::StructColumn) {
for (auto it = std::begin(child_columns); it != std::end(child_columns); it++) {
it->second.level_child_cols_recursively(min_row_count);
}
}
// If this is a list column, we need to make sure that its child column levels its children
else if (type == json_col_t::ListColumn) {
auto it = std::begin(child_columns);
// Make that child column fill its child columns up to its own row count
if (it != std::end(child_columns)) {
it->second.level_child_cols_recursively(it->second.current_offset);
}
}
}
};
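
The interplay of `null_fill` and `level_child_cols_recursively` can be tried out on the host with a reduced stand-in for `json_column`. The sketch below is not the PR's code: `mini_column` and `col_t` are hypothetical, `row_offset_t` and `bitmask_type` are assumed to be 32-bit stand-ins for the cudf types, `word_index` is assumed to divide by 32, and `resize` replaces the `std::fill_n` with `std::back_inserter` calls (which also avoids an unsigned underflow when a vector is already longer than the target).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using row_offset_t = std::int32_t;   // assumption: stand-in for cudf::size_type
using bitmask_type = std::uint32_t;  // assumption: stand-in for cudf::bitmask_type

enum class col_t : char { List, Struct, String, Unknown };

// Index of the 32-bit validity word that holds the given bit.
constexpr row_offset_t word_index(row_offset_t bit) { return bit / 32; }

// Hypothetical reduced version of json_column.
struct mini_column {
  col_t type = col_t::Unknown;
  std::vector<row_offset_t> string_offsets;
  std::vector<row_offset_t> string_lengths;
  std::vector<row_offset_t> child_offsets;
  std::vector<bitmask_type> validity;
  std::map<std::string, mini_column> child_columns;
  row_offset_t current_offset = 0;

  // Pads the column with null rows up to (excluding) up_to_row_offset.
  void null_fill(row_offset_t up_to_row_offset)
  {
    validity.resize(word_index(up_to_row_offset) + 1);
    auto const rows = static_cast<std::size_t>(up_to_row_offset);
    // Null rows repeat the last offset, so they span zero characters/children.
    row_offset_t const last_str   = string_offsets.empty() ? 0 : string_offsets.back();
    row_offset_t const last_child = child_offsets.empty() ? 0 : child_offsets.back();
    if (string_offsets.size() < rows) { string_offsets.resize(rows, last_str); }
    if (string_lengths.size() < rows) { string_lengths.resize(rows, 0); }
    if (child_offsets.size() < rows + 1) { child_offsets.resize(rows + 1, last_child); }
    current_offset = up_to_row_offset;
  }

  // Makes all children of a struct column as long as the struct itself.
  void level_child_cols_recursively(row_offset_t min_row_count)
  {
    null_fill(min_row_count);
    if (type == col_t::Struct) {
      for (auto& entry : child_columns) {
        entry.second.level_child_cols_recursively(min_row_count);
      }
    } else if (type == col_t::List) {
      // A list's child keeps its own row count but levels its descendants.
      auto it = child_columns.begin();
      if (it != child_columns.end()) {
        it->second.level_child_cols_recursively(it->second.current_offset);
      }
    }
  }
};
```

For example, a struct column whose child `a` has two rows and whose child `b` has none ends up with both children at five rows after `level_child_cols_recursively(5)`. Note that `child_offsets` always holds one more entry than the row count, as is usual for offset arrays.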

/**
* @brief Tokens emitted while parsing a JSON input
*/
@@ -110,6 +246,20 @@ void get_token_stream(device_span<SymbolT const> d_json_in,
SymbolOffsetT* d_tokens_indices,
SymbolOffsetT* d_num_written_tokens,
rmm::cuda_stream_view stream);

/**
* @brief Parses the given JSON string and generates a table from it.
*
* @param input The JSON input
* @param stream The CUDA stream to which kernels are dispatched
* @param mr Optional resource with which to allocate device memory.
* @return The data parsed from the given JSON input
*/
table_with_metadata parse_nested_json(
host_span<SymbolT const> input,
rmm::cuda_stream_view stream = cudf::default_stream_value,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

} // namespace detail

} // namespace cudf::io::json