get_json_object() implementation #7286

Merged Mar 31, 2021
40 commits
b654879
Extremely rough draft.
nvdbaranec Feb 2, 2021
36cd4c1
Add support for full set of operators I believe we will need to suppo…
nvdbaranec Feb 10, 2021
adb5724
Optimization: preprocess the json path into a simple command buffer i…
nvdbaranec Feb 11, 2021
0665942
Merge branch 'branch-0.19' into get_json_object
nvdbaranec Feb 11, 2021
ec7ab4a
Fix incorrect interface in detail header.
nvdbaranec Feb 17, 2021
6d94a73
Add benchmarks for get_json_object(). Couple of bug fixes.
nvdbaranec Feb 18, 2021
b3242b1
Merge branch 'branch-0.19' into get_json_object
nvdbaranec Feb 18, 2021
05ad3fc
Make get_json_object() non-recursive.
nvdbaranec Feb 19, 2021
9411f29
Java bindings for get_json_object
razajafri Feb 22, 2021
ff3544c
Make debug readability formatting of output off by default.
nvdbaranec Feb 22, 2021
78d3dd8
Change interface to get_json_object() to take a cudf::string_scalar i…
nvdbaranec Feb 22, 2021
e124cc5
updated test
razajafri Feb 23, 2021
efb767e
updated to scalar
razajafri Feb 23, 2021
6127b7c
changes to match the cudf
razajafri Feb 23, 2021
d6602bd
Strip quotes from singular returned string values. Propagate validit…
nvdbaranec Feb 24, 2021
05121e6
Merge branch 'get_json_object' of github.com:nvdbaranec/cudf into get…
nvdbaranec Feb 24, 2021
f419636
Return null rows for queries with no result instead of just empty str…
nvdbaranec Feb 24, 2021
3f3ec3d
Merge branch 'branch-0.19' into get_json_object
nvdbaranec Mar 9, 2021
dea355c
Merge branch 'branch-0.19' into get_json_object
nvdbaranec Mar 22, 2021
4cd0e2d
get_json_path() cleaned up and ready for review.
nvdbaranec Mar 25, 2021
b1a2b09
Update meta.yaml
nvdbaranec Mar 25, 2021
02e20b7
Additional docs and cleanup
nvdbaranec Mar 25, 2021
fd330fe
Update java/src/main/native/src/ColumnViewJni.cpp
razajafri Mar 25, 2021
5229790
Update java/src/main/native/src/ColumnViewJni.cpp
razajafri Mar 25, 2021
9465864
Fix spelling.
nvdbaranec Mar 25, 2021
4e4865b
Make larger test strings more human readable.
nvdbaranec Mar 26, 2021
3653d0d
PR review changes. Changed get_json_object_kernel() to take a column…
nvdbaranec Mar 26, 2021
9c761b8
Fixed missing newline.
nvdbaranec Mar 26, 2021
e47b088
Handle additional disallowed cases when indexing into child elements.…
nvdbaranec Mar 28, 2021
f898ca6
Distinguish between "no output" (null result) and "empty output" (val…
nvdbaranec Mar 28, 2021
6829f46
Moved get_json_object() declarations out of strings/substring.hpp to …
nvdbaranec Mar 29, 2021
c0743b4
Clang format
nvdbaranec Mar 29, 2021
285ed92
Use string_view instead of json_string struct. Cleanup benchmark CMak…
nvdbaranec Mar 29, 2021
ef03e30
Fix errant whitespace in meta.yaml. Update benchmarks and JNI binding…
nvdbaranec Mar 29, 2021
0d35aaa
Merge branch 'branch-0.19' into get_json_object
nvdbaranec Mar 29, 2021
975ee51
Remove SPARK_BEHAVIORS #define. Use thrust::optional for more kernel…
nvdbaranec Mar 29, 2021
bc649d8
Clean up includes in detail/json.hpp. Change copyright date back to 2020
nvdbaranec Mar 30, 2021
9d9dbf2
Merge branch 'branch-0.19' into get_json_object
nvdbaranec Mar 30, 2021
e69e6bb
Use offset_type when dealing with output offsets view.
nvdbaranec Mar 30, 2021
74a7154
Newline in benchmark CMakeLists.txt. Remove more includes. Remove u…
nvdbaranec Mar 31, 2021
2 changes: 2 additions & 0 deletions conda/recipes/libcudf/meta.yaml
@@ -178,12 +178,14 @@ test:
- test -f $PREFIX/include/cudf/strings/detail/converters.hpp
- test -f $PREFIX/include/cudf/strings/detail/copying.hpp
- test -f $PREFIX/include/cudf/strings/detail/fill.hpp
- test -f $PREFIX/include/cudf/strings/detail/json.hpp
- test -f $PREFIX/include/cudf/strings/detail/replace.hpp
- test -f $PREFIX/include/cudf/strings/detail/utilities.hpp
- test -f $PREFIX/include/cudf/strings/extract.hpp
- test -f $PREFIX/include/cudf/strings/findall.hpp
- test -f $PREFIX/include/cudf/strings/find.hpp
- test -f $PREFIX/include/cudf/strings/find_multiple.hpp
- test -f $PREFIX/include/cudf/strings/json.hpp
- test -f $PREFIX/include/cudf/strings/padding.hpp
- test -f $PREFIX/include/cudf/strings/replace.hpp
- test -f $PREFIX/include/cudf/strings/replace_re.hpp
1 change: 1 addition & 0 deletions cpp/CMakeLists.txt
@@ -346,6 +346,7 @@ add_library(cudf
src/strings/find.cu
src/strings/find_multiple.cu
src/strings/padding.cu
src/strings/json/json_path.cu
src/strings/regex/regcomp.cpp
src/strings/regex/regexec.cu
src/strings/replace/backref_re.cu
5 changes: 5 additions & 0 deletions cpp/benchmarks/CMakeLists.txt
@@ -202,3 +202,8 @@ ConfigureBench(STRINGS_BENCH
string/substring_benchmark.cpp
string/translate_benchmark.cpp
string/url_decode_benchmark.cpp)

###################################################################################################
# - json benchmark -------------------------------------------------------------------
ConfigureBench(JSON_BENCH
string/json_benchmark.cpp)
140 changes: 140 additions & 0 deletions cpp/benchmarks/string/json_benchmark.cpp
@@ -0,0 +1,140 @@
/*
* Copyright (c) 2021, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <benchmark/benchmark.h>
#include <benchmarks/common/generate_benchmark_input.hpp>
#include <benchmarks/fixture/benchmark_fixture.hpp>
#include <benchmarks/synchronization/synchronization.hpp>

#include <cudf_test/base_fixture.hpp>
#include <cudf_test/column_wrapper.hpp>

#include <cudf/strings/json.hpp>
#include <cudf/strings/strings_column_view.hpp>

#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>

#include <algorithm>
#include <cmath>
#include <cstdlib>
#include <string>
#include <vector>

class JsonPath : public cudf::benchmark {
};

float frand() { return static_cast<float>(rand()) / static_cast<float>(RAND_MAX); }

int rand_range(int min, int max) { return min + static_cast<int>(frand() * (max - min)); }

std::vector<std::string> Books{
"{\n\"category\": \"reference\",\n\"author\": \"Nigel Rees\",\n\"title\": \"Sayings of the "
"Century\",\n\"price\": 8.95\n}",
"{\n\"category\": \"fiction\",\n\"author\": \"Evelyn Waugh\",\n\"title\": \"Sword of "
"Honour\",\n\"price\": 12.99\n}",
"{\n\"category\": \"fiction\",\n\"author\": \"Herman Melville\",\n\"title\": \"Moby "
"Dick\",\n\"isbn\": \"0-553-21311-3\",\n\"price\": 8.99\n}",
"{\n\"category\": \"fiction\",\n\"author\": \"J. R. R. Tolkien\",\n\"title\": \"The Lord of the "
"Rings\",\n\"isbn\": \"0-395-19395-8\",\n\"price\": 22.99\n}"};
constexpr int Approx_book_size = 110;
std::vector<std::string> Bicycles{
"{\"color\": \"red\", \"price\": 9.95}",
"{\"color\": \"green\", \"price\": 29.95}",
"{\"color\": \"blue\", \"price\": 399.95}",
"{\"color\": \"yellow\", \"price\": 99.95}",
"{\"color\": \"mauve\", \"price\": 199.95}",
};
constexpr int Approx_bicycle_size = 33;
std::string Misc{"\n\"expensive\": 10\n"};
std::string generate_field(std::vector<std::string> const& values, int num_values)
{
std::string res;
for (int idx = 0; idx < num_values; idx++) {
if (idx > 0) { res += std::string(",\n"); }
int vindex = std::min(static_cast<int>(floor(frand() * values.size())),
static_cast<int>(values.size() - 1));
res += values[vindex];
}
return res;
}

std::string build_row(int desired_bytes)
{
// always have at least 2 books and 2 bikes
int num_books = 2;
int num_bicycles = 2;
int remaining_bytes =
desired_bytes - ((num_books * Approx_book_size) + (num_bicycles * Approx_bicycle_size));

// divide up the remainder between books and bikes
float book_pct = frand();
float bicycle_pct = 1.0f - book_pct;
num_books += (remaining_bytes * book_pct) / Approx_book_size;
num_bicycles += (remaining_bytes * bicycle_pct) / Approx_bicycle_size;

std::string books = "\"book\": [\n" + generate_field(Books, num_books) + "]\n";
std::string bicycles = "\"bicycle\": [\n" + generate_field(Bicycles, num_bicycles) + "]\n";

std::string store = "\"store\": {\n";
if (frand() <= 0.5f) {
store += books + std::string(",\n") + bicycles;
} else {
store += bicycles + std::string(",\n") + books;
}
store += std::string("}\n");

std::string row = std::string("{\n");
if (frand() <= 0.5f) {
row += store + std::string(",\n") + Misc;
} else {
row += Misc + std::string(",\n") + store;
}
row += std::string("}\n");
return row;
}

template <class... QueryArg>
static void BM_case(benchmark::State& state, QueryArg&&... query_arg)
{
srand(5236);
auto iter = thrust::make_transform_iterator(
thrust::make_counting_iterator(0),
[desired_bytes = state.range(1)](int index) { return build_row(desired_bytes); });
int num_rows = state.range(0);
cudf::test::strings_column_wrapper input(iter, iter + num_rows);
cudf::strings_column_view scv(input);
size_t num_chars = scv.chars().size();

std::string json_path(query_arg...);

for (auto _ : state) {
cuda_event_timer raii(state, true, 0);
auto result = cudf::strings::get_json_object(scv, json_path);
cudaStreamSynchronize(0);
}

// this isn't strictly 100% accurate. a given query isn't necessarily
// going to visit every single incoming character. but in spirit it does.
state.SetBytesProcessed(state.iterations() * num_chars);
}

#define JSON_BENCHMARK_DEFINE(name, query) \
BENCHMARK_CAPTURE(BM_case, name, query) \
->ArgsProduct({{100, 1000, 100000, 400000}, {300, 600, 4096}}) \
->UseManualTime() \
->Unit(benchmark::kMillisecond);

JSON_BENCHMARK_DEFINE(query0, "$");
JSON_BENCHMARK_DEFINE(query1, "$.store");
JSON_BENCHMARK_DEFINE(query2, "$.store.book");
JSON_BENCHMARK_DEFINE(query3, "$.store.*");
JSON_BENCHMARK_DEFINE(query4, "$.store.book[*]");
JSON_BENCHMARK_DEFINE(query5, "$.store.book[*].category");
JSON_BENCHMARK_DEFINE(query6, "$.store['bicycle']");
JSON_BENCHMARK_DEFINE(query7, "$.store.book[*]['isbn']");
JSON_BENCHMARK_DEFINE(query8, "$.store.bicycle[1]");
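The row-sizing arithmetic in build_row() can be checked in isolation. The following is a minimal sketch, not code from this PR: `split_counts` is a hypothetical helper name, and `book_pct` is passed in explicitly instead of being drawn from frand() so that the result is deterministic.

```cpp
#include <cassert>
#include <utility>

constexpr int Approx_book_size    = 110;
constexpr int Approx_bicycle_size = 33;

// Hypothetical helper (not in the PR): returns {num_books, num_bicycles}
// for a target row size, mirroring the arithmetic in build_row().
std::pair<int, int> split_counts(int desired_bytes, float book_pct)
{
  assert(book_pct >= 0.0f && book_pct <= 1.0f);

  // start from the 2 books and 2 bikes that build_row() begins with
  int num_books    = 2;
  int num_bicycles = 2;
  int remaining_bytes =
    desired_bytes - ((num_books * Approx_book_size) + (num_bicycles * Approx_bicycle_size));

  // divide up the remainder between books and bikes; the sum is computed
  // in float and truncated to int, as in the compound assignment in build_row()
  float bicycle_pct = 1.0f - book_pct;
  num_books = static_cast<int>(num_books + (remaining_bytes * book_pct) / Approx_book_size);
  num_bicycles =
    static_cast<int>(num_bicycles + (remaining_bytes * bicycle_pct) / Approx_bicycle_size);
  return {num_books, num_bicycles};
}
```

For the largest row size used by the benchmark, `split_counts(4096, 0.5f)` gives 19 books and 59 bicycles; for the smallest, `split_counts(300, 0.5f)` stays at the initial 2 and 2.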
40 changes: 40 additions & 0 deletions cpp/include/cudf/strings/detail/json.hpp
@@ -0,0 +1,40 @@
/*
* Copyright (c) 2021, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#pragma once

#include <cudf/strings/strings_column_view.hpp>

#include <rmm/cuda_stream_view.hpp>

namespace cudf {
namespace strings {
namespace detail {

/**
* @copydoc cudf::strings::get_json_object
*
* @param stream CUDA stream used for device memory operations and kernel launches
*/
std::unique_ptr<cudf::column> get_json_object(
cudf::strings_column_view const& col,
cudf::string_scalar const& json_path,
rmm::cuda_stream_view stream = rmm::cuda_stream_default,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

} // namespace detail
} // namespace strings
} // namespace cudf
50 changes: 50 additions & 0 deletions cpp/include/cudf/strings/json.hpp
@@ -0,0 +1,50 @@
/*
* Copyright (c) 2019-2021, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#pragma once

#include <cudf/strings/strings_column_view.hpp>

namespace cudf {
namespace strings {

/**
* @addtogroup strings_json
* @{
* @file
*/

/**
* @brief Apply a JSONPath string to all rows in an input strings column.
*
* Applies a JSONPath string to an incoming strings column where each row in the column
* is a valid json string. The output is returned by row as a strings column.
*
* https://tools.ietf.org/id/draft-goessner-dispatch-jsonpath-00.html
* Implements only the operators: $ . [] *
*
* @param col The input strings column. Each row must contain a valid json string
* @param json_path The JSONPath string to be applied to each row
* @param mr Resource for allocating device memory.
* @return New strings column containing the retrieved json object strings
*/
std::unique_ptr<cudf::column> get_json_object(
cudf::strings_column_view const& col,
cudf::string_scalar const& json_path,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/** @} */ // end of doxygen group
} // namespace strings
} // namespace cudf
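To make the supported operator set ($ . [] *) concrete, here is a host-side sketch of how a JSONPath string might be split into a flat list of commands before evaluation. This is an illustration under stated assumptions, not the PR's device-side implementation: `path_op`, `path_command` and `parse_json_path` are hypothetical names, and edge cases such as a `]` inside a quoted name are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical command types, one per supported operator form.
enum class path_op { ROOT, CHILD, CHILD_WILDCARD, CHILD_INDEX };

struct path_command {
  path_op op;
  std::string name;  // used by CHILD
  int index;         // used by CHILD_INDEX, -1 otherwise
};

// Splits a JSONPath string such as "$.store.book[*]['isbn']" into commands.
std::vector<path_command> parse_json_path(std::string const& path)
{
  if (path.empty() || path[0] != '$') { throw std::invalid_argument("path must start with $"); }
  std::vector<path_command> out;
  out.push_back({path_op::ROOT, "", -1});
  std::size_t pos = 1;
  while (pos < path.size()) {
    if (path[pos] == '.') {
      // .name or .* : the name runs until the next '.' or '['
      std::size_t const start = ++pos;
      while (pos < path.size() && path[pos] != '.' && path[pos] != '[') { ++pos; }
      std::string const name = path.substr(start, pos - start);
      if (name.empty()) { throw std::invalid_argument("empty name after '.'"); }
      if (name == "*") {
        out.push_back({path_op::CHILD_WILDCARD, "", -1});
      } else {
        out.push_back({path_op::CHILD, name, -1});
      }
    } else if (path[pos] == '[') {
      // [*], ['name'] or [index]
      std::size_t const close = path.find(']', pos);
      if (close == std::string::npos) { throw std::invalid_argument("unterminated '['"); }
      std::string const body = path.substr(pos + 1, close - pos - 1);
      pos                    = close + 1;
      if (body == "*") {
        out.push_back({path_op::CHILD_WILDCARD, "", -1});
      } else if (body.size() >= 2 && body.front() == '\'' && body.back() == '\'') {
        out.push_back({path_op::CHILD, body.substr(1, body.size() - 2), -1});
      } else if (!body.empty() && body.find_first_not_of("0123456789") == std::string::npos) {
        out.push_back({path_op::CHILD_INDEX, "", std::stoi(body)});
      } else {
        throw std::invalid_argument("unsupported subscript");
      }
    } else {
      throw std::invalid_argument("unexpected character in path");
    }
  }
  assert(out.front().op == path_op::ROOT);
  return out;
}
```

With this sketch, the benchmark query `$.store.book[*]['isbn']` becomes five commands: ROOT, CHILD "store", CHILD "book", CHILD_WILDCARD, CHILD "isbn". Parsing the path once on the host keeps the per-row work limited to walking a fixed command list.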
1 change: 1 addition & 0 deletions cpp/include/doxygen_groups.h
@@ -127,6 +127,7 @@
* @defgroup strings_modify Modifying
* @defgroup strings_replace Replacing
* @defgroup strings_split Splitting
* @defgroup strings_json JSON
* @}
* @defgroup dictionary_apis Dictionary
* @{
6 changes: 3 additions & 3 deletions cpp/src/io/csv/csv_gpu.cu
@@ -196,7 +196,7 @@ __global__ void __launch_bounds__(csvparse_block_dim)
} else if (serialized_trie_contains(opts.trie_true, {field_start, field_len}) ||
serialized_trie_contains(opts.trie_false, {field_start, field_len})) {
atomicAdd(&d_columnData[actual_col].bool_count, 1);
- } else if (cudf::io::gpu::is_infinity(field_start, next_delimiter)) {
+ } else if (cudf::io::is_infinity(field_start, next_delimiter)) {
atomicAdd(&d_columnData[actual_col].float_count, 1);
} else {
long countNumber = 0;
@@ -277,15 +277,15 @@ __inline__ __device__ T decode_value(char const *begin,
char const *end,
parse_options_view const &opts)
{
- return cudf::io::gpu::parse_numeric<T, base>(begin, end, opts);
+ return cudf::io::parse_numeric<T, base>(begin, end, opts);
}

template <typename T>
__inline__ __device__ T decode_value(char const *begin,
char const *end,
parse_options_view const &opts)
{
- return cudf::io::gpu::parse_numeric<T>(begin, end, opts);
+ return cudf::io::parse_numeric<T>(begin, end, opts);
}

template <>
4 changes: 2 additions & 2 deletions cpp/src/io/json/json_gpu.cu
@@ -114,7 +114,7 @@ __inline__ __device__ T decode_value(const char *begin,
uint64_t end,
parse_options_view const &opts)
{
- return cudf::io::gpu::parse_numeric<T, base>(begin, end, opts);
+ return cudf::io::parse_numeric<T, base>(begin, end, opts);
}

/**
@@ -131,7 +131,7 @@ __inline__ __device__ T decode_value(const char *begin,
const char *end,
parse_options_view const &opts)
{
- return cudf::io::gpu::parse_numeric<T>(begin, end, opts);
+ return cudf::io::parse_numeric<T>(begin, end, opts);
}

/**