Adds support for json lines format to the nested JSON reader #11534

Merged
merged 53 commits into from
Aug 25, 2022
Commits
9f2247f
add placeholder experimental JSON reader
vuule Jul 22, 2022
76b2834
doc fix
vuule Jul 22, 2022
f5464f6
copyright year
vuule Jul 22, 2022
fa2c3a6
Merge branch 'branch-22.08' of https://github.com/rapidsai/cudf into …
vuule Jul 25, 2022
2ca0ac0
newline
vuule Jul 25, 2022
7ee4b41
Merge branch 'fea-read_json-experimental' of http://github.com/vuule/…
vuule Jul 25, 2022
3ee7a5a
use span
vuule Jul 25, 2022
fa9ab89
Merge branch 'branch-22.08' of https://github.com/rapidsai/cudf into …
vuule Jul 26, 2022
fcc90c5
options check + decompression
vuule Jul 27, 2022
b481d66
Merge branch 'branch-22.10' of https://github.com/rapidsai/cudf into …
vuule Aug 12, 2022
bdbc111
Merge branch 'branch-22.10' of https://github.com/rapidsai/cudf into …
vuule Aug 13, 2022
22b5a46
adds support for ndjson
elstehle Aug 15, 2022
907a846
Merge remote-tracking branch 'upstream/branch-22.10' into feature/nes…
elstehle Aug 15, 2022
87fce7d
addresses outstanding todo
elstehle Aug 15, 2022
c41f178
Merge branch 'branch-22.10' of https://github.com/rapidsai/cudf into …
vuule Aug 15, 2022
9669c6a
C++ side changes + test
vuule Aug 15, 2022
c9fb5b2
working Python + test
vuule Aug 15, 2022
2de91e1
clean up
vuule Aug 16, 2022
70dd9b1
stop using column_names
vuule Aug 16, 2022
b1afef0
adds documentation for mr parameter
elstehle Aug 16, 2022
8409214
minor documentation fixes
elstehle Aug 16, 2022
d0e0def
fixes parameter order
elstehle Aug 16, 2022
574ac43
fix copy-paste error
vuule Aug 16, 2022
2de80b7
raw string
vuule Aug 16, 2022
81cab67
Merge branch 'exp-read_json-adapter' of http://github.com/vuule/cudf …
vuule Aug 16, 2022
bc14a1d
remove print in Python test
vuule Aug 16, 2022
bca2e83
addressing reviews
vuule Aug 16, 2022
ba28571
Java fix
vuule Aug 16, 2022
a6d5ab7
style
vuule Aug 16, 2022
397e00f
Merge remote-tracking branch 'upstream/pull-request/11364' into featu…
elstehle Aug 17, 2022
a0bd229
integrates upstream interface changes
elstehle Aug 17, 2022
46a3c44
Merge remote-tracking branch 'upstream/branch-22.10' into feature/nes…
elstehle Aug 17, 2022
f3bba9d
enables lines option in the nested reader
elstehle Aug 17, 2022
21b4023
migrates test from details api to reader api
elstehle Aug 17, 2022
cdc4441
improves code comment
elstehle Aug 17, 2022
7174a03
Merge remote-tracking branch 'upstream/branch-22.10' into feature/nes…
elstehle Aug 22, 2022
ea6959f
removes in/out specification on params
elstehle Aug 22, 2022
00be915
removes _gpu suffix from tokens
elstehle Aug 22, 2022
73ff307
better translation table comments thx @upsj
elstehle Aug 22, 2022
39243f3
uses device_scalar and better generator
elstehle Aug 22, 2022
fb5e397
Merge remote-tracking branch 'upstream/branch-22.10' into feature/nes…
elstehle Aug 23, 2022
7220174
removes code comment banner
elstehle Aug 23, 2022
c6f8d0e
fixes code comments
elstehle Aug 24, 2022
e38f3d8
adds more tests for json lines
elstehle Aug 24, 2022
cdb743d
adds json lines test for experimental nested json reader
elstehle Aug 24, 2022
713260f
fixes style
elstehle Aug 24, 2022
14749f7
parametrizes test and uses bytesio
elstehle Aug 24, 2022
94daa4f
adds seek before reads
elstehle Aug 24, 2022
310f519
Merge remote-tracking branch 'upstream/branch-22.10' into feature/nes…
elstehle Aug 25, 2022
c09c4af
prettifies translation table
elstehle Aug 25, 2022
6efecf4
default_stream and more constness
elstehle Aug 25, 2022
272bc16
add TODO for stack ctx interface
elstehle Aug 25, 2022
9822ecb
clarifies treatment of empty lines for ndjson
elstehle Aug 25, 2022
3 changes: 1 addition & 2 deletions cpp/src/io/json/experimental/read_json.cpp
@@ -50,14 +50,13 @@ table_with_metadata read_json(host_span<std::unique_ptr<datasource>> sources,
auto const dtypes_empty =
std::visit([](const auto& dtypes) { return dtypes.empty(); }, reader_opts.get_dtypes());
CUDF_EXPECTS(dtypes_empty, "user specified dtypes are not yet supported");
- CUDF_EXPECTS(not reader_opts.is_enabled_lines(), "JSON Lines format is not yet supported");
CUDF_EXPECTS(reader_opts.get_byte_range_offset() == 0 and reader_opts.get_byte_range_size() == 0,
"specifying a byte range is not yet supported");

auto const buffer = ingest_raw_input(sources, reader_opts.get_compression());
auto data = host_span<char const>(reinterpret_cast<char const*>(buffer.data()), buffer.size());

- return cudf::io::json::detail::parse_nested_json(data, stream, mr);
+ return cudf::io::json::detail::parse_nested_json(data, reader_opts, stream, mr);
}

} // namespace cudf::io::detail::json::experimental
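
With the `is_enabled_lines()` check removed, the experimental reader accepts JSON Lines input, where each line holds one JSON record. As a rough host-side illustration of that format only (not the GPU code path in this PR), the sketch below splits an input into records. `split_json_lines` is a hypothetical helper, and skipping empty or whitespace-only lines is an assumption made for this sketch, not a statement of the reader's exact rule.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper, not part of cuDF: returns one view per JSON Lines record.
// Assumption: lines that are empty or contain only spaces/tabs hold no record
// and are skipped. Valid JSON strings cannot contain a raw newline, so
// splitting on '\n' never cuts a record in half.
std::vector<std::string_view> split_json_lines(std::string_view input)
{
  std::vector<std::string_view> records;
  std::size_t begin = 0;
  while (begin <= input.size()) {
    std::size_t end = input.find('\n', begin);
    if (end == std::string_view::npos) { end = input.size(); }
    std::string_view line = input.substr(begin, end - begin);
    // Drop a trailing '\r' so CRLF-terminated input is handled as well.
    if (!line.empty() && line.back() == '\r') { line.remove_suffix(1); }
    if (line.find_first_not_of(" \t") != std::string_view::npos) { records.push_back(line); }
    begin = end + 1;
  }
  return records;
}
```

The returned views alias the input buffer, so the buffer has to outlive them.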
33 changes: 19 additions & 14 deletions cpp/src/io/json/nested_json.hpp
@@ -16,6 +16,7 @@

#pragma once

+ #include <cudf/io/json.hpp>
#include <cudf/io/types.hpp>
#include <cudf/types.hpp>
#include <cudf/utilities/bit.hpp>
@@ -261,50 +262,54 @@ enum token_t : PdaTokenT {
};

namespace detail {

+ // TODO: return device_uvector instead of passing pre-allocated memory
/**
* @brief Identifies the stack context for each character from a JSON input. Specifically, we
* identify brackets and braces outside of quoted fields (e.g., field names, strings).
* At this stage, we do not perform bracket matching, i.e., we do not verify whether a closing
* bracket would actually pop the corresponding opening brace.
*
- * @param[in] d_json_in The string of input characters
+ * @param[in] json_in The string of input characters
* @param[out] d_top_of_stack Will be populated with what-is-on-top-of-the-stack for any given input
* character of \p d_json_in, where a '{' represents that the corresponding input character is
* within the context of a struct, a '[' represents that it is within the context of an array, and a
* '_' symbol that it is at the root of the JSON.
* @param[in] stream The cuda stream to dispatch GPU kernels to
*/
- void get_stack_context(device_span<SymbolT const> d_json_in,
+ void get_stack_context(device_span<SymbolT const> json_in,
SymbolT* d_top_of_stack,
rmm::cuda_stream_view stream);

/**
* @brief Parses the given JSON string and emits a sequence of tokens that demarcate relevant
* sections from the input.
*
- * @param[in] d_json_in The JSON input
- * @param[out] d_tokens Device memory to which the parsed tokens are written
- * @param[out] d_tokens_indices Device memory to which the indices are written, where each index
- * represents the offset within \p d_json_in that cause the input being written
- * @param[out] d_num_written_tokens The total number of tokens that were parsed
- * @param[in] stream The CUDA stream to which kernels are dispatched
+ * @param json_in The JSON input
+ * @param options Parsing options specifying the parsing behaviour
+ * @param stream The CUDA stream to which kernels are dispatched
+ * @param mr Optional, resource with which to allocate
+ * @return Pair of device vectors, where the first vector represents the token types and the second
+ * vector represents the index within the input corresponding to each token
- void get_token_stream(device_span<SymbolT const> d_json_in,
- PdaTokenT* d_tokens,
- SymbolOffsetT* d_tokens_indices,
- SymbolOffsetT* d_num_written_tokens,
- rmm::cuda_stream_view stream);
+ std::pair<rmm::device_uvector<PdaTokenT>, rmm::device_uvector<SymbolOffsetT>> get_token_stream(
+ device_span<SymbolT const> json_in,
+ cudf::io::json_reader_options const& options,
+ rmm::cuda_stream_view stream,
+ rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Parses the given JSON string and generates table from the given input.
*
* @param input The JSON input
+ * @param options Parsing options specifying the parsing behaviour
* @param stream The CUDA stream to which kernels are dispatched
- * @param mr Optional, resource with which to allocate.
+ * @param mr Optional, resource with which to allocate
* @return The data parsed from the given JSON input
*/
table_with_metadata parse_nested_json(
host_span<SymbolT const> input,
+ cudf::io::json_reader_options const& options,
rmm::cuda_stream_view stream = cudf::default_stream_value,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
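
The doc comment on `get_stack_context` describes the stack context in words: for every input character, what is on top of the stack, with brackets and braces inside quoted fields ignored and no bracket matching performed. A sequential host-side reference may make that easier to picture. This is only a sketch: `stack_context_reference` is a hypothetical name, the real function runs on the GPU and writes to pre-allocated device memory, and recording the context *before* each character takes effect is an assumption of this sketch.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical host-side reference, not the cuDF GPU implementation.
// Returns one symbol per input character: '{' when the character is inside a
// struct, '[' when it is inside an array, '_' when it is at the root.
// Assumption: the symbol reflects the stack before the character is applied.
std::string stack_context_reference(std::string_view json_in)
{
  std::vector<char> stack;
  std::string top_of_stack(json_in.size(), '_');
  bool in_string = false;  // currently inside a quoted field name or string
  bool escaped   = false;  // previous character inside the string was a backslash
  for (std::size_t i = 0; i < json_in.size(); ++i) {
    top_of_stack[i] = stack.empty() ? '_' : stack.back();
    char const c    = json_in[i];
    if (in_string) {
      // Brackets and braces inside quotes never touch the stack.
      if (escaped) {
        escaped = false;
      } else if (c == '\\') {
        escaped = true;
      } else if (c == '"') {
        in_string = false;
      }
      continue;
    }
    if (c == '"') {
      in_string = true;
    } else if (c == '{' || c == '[') {
      stack.push_back(c);
    } else if ((c == '}' || c == ']') && !stack.empty()) {
      // No bracket matching: any closing symbol pops whatever is on top.
      stack.pop_back();
    }
  }
  return top_of_stack;
}
```

For the input `{"a":[1,"]"]}` this yields `_{{{{{[[[[[[{`: the `]` inside the quoted string leaves the stack untouched, so the array context continues until the unquoted `]`.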
