
JSON parser integration #11717

Closed

Conversation

@karthikeyann (Contributor) commented Sep 20, 2022

Description

Integrates the experimental full-GPU JSON parser. This PR checks the integration of the various parts of the nested JSON parser.

All tests pass, with no CUDA memory errors.
Depends on #11610, #11714, and #11746

Benchmark Results

Dated 27-Sept-2022

nested_json_gpu_parser

[0] Quadro GV100

string_size Samples CPU Time Noise GPU Time Noise Elem/s bytes_per_second peak_memory_usage
2^20 = 1048576 2192x 5.784 ms 11.10% 5.777 ms 11.10% 181.518M 181518461 17.149 MiB
2^21 = 2097152 960x 6.158 ms 5.10% 6.150 ms 5.09% 340.980M 340979979 34.106 MiB
2^22 = 4194304 192x 7.798 ms 15.82% 7.791 ms 15.82% 538.363M 538362570 68.020 MiB
2^23 = 8388608 848x 10.036 ms 4.47% 10.029 ms 4.47% 836.432M 836432150 135.848 MiB
2^24 = 16777216 80x 16.555 ms 9.49% 16.547 ms 9.49% 1.014G 1013889148 271.504 MiB
2^25 = 33554432 552x 27.182 ms 7.52% 27.175 ms 7.52% 1.235G 1234774323 542.817 MiB
2^26 = 67108864 80x 48.934 ms 3.19% 48.927 ms 3.19% 1.372G 1371621284 1.060 GiB
2^27 = 134217728 166x 90.397 ms 1.64% 90.390 ms 1.64% 1.485G 1484872039 2.120 GiB
2^28 = 268435456 87x 172.521 ms 0.99% 172.514 ms 0.99% 1.556G 1556019625 4.239 GiB
2^29 = 536870912 43x 351.175 ms 0.77% 351.170 ms 0.77% 1.529G 1528807741 8.479 GiB
2^30 = 1073741824 11x 667.835 ms 0.18% 667.831 ms 0.18% 1.608G 1607803624 16.957 GiB

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

elstehle and others added 30 commits September 4, 2022 08:22
@github-actions bot added the "Python: Affects Python cuDF API" label Sep 26, 2022
@karthikeyann requested a review from a team as a code owner September 26, 2022 20:10
@karthikeyann added the "5 - DO NOT MERGE: Hold off on merging; see PR for details" label Sep 26, 2022
Comment on lines 64 to 67
case token_t::ValueBegin:
return NC_STR; // NC_VAL;
// NV_VAL is removed because type inference and
// reduce_to_column_tree category collapsing takes care of this.
Contributor

This feels like a comment for the reviewer, and not future readers of the code. Perhaps:

Suggested change
case token_t::ValueBegin:
return NC_STR; // NC_VAL;
// NV_VAL is removed because type inference and
// reduce_to_column_tree category collapsing takes care of this.
case token_t::ValueBegin:
return NC_STR; // type inference and reduce_to_column_tree category collapsing will later convert this to a value

WDYT? (I don't know the details of the type inference, so my proposed comment might be wrong)

Contributor @wence- left a comment

Not very confident about many of the details, but I see potential for a few cleanups.

@@ -225,6 +228,7 @@ tree_meta_t get_tree_representation(device_span<PdaTokenT const> tokens,
// TODO: make it own function.
rmm::device_uvector<size_type> parent_token_ids(num_tokens, stream);
rmm::device_uvector<size_type> initial_order(num_tokens, stream);
// TODO re-write the algorithm to work only on nodes, not tokens.
Contributor

Is it worth producing a tracking issue?

@@ -289,7 +289,7 @@ tree_meta_t2 get_tree_representation_cpu(device_span<PdaTokenT const> tokens_gpu
case token_t::StructBegin: return NC_STRUCT;
case token_t::ListBegin: return NC_LIST;
case token_t::StringBegin: return NC_STR;
case token_t::ValueBegin: return NC_VAL;
case token_t::ValueBegin: return NC_STR; // NC_VAL;
Contributor

Suggested change
case token_t::ValueBegin: return NC_STR; // NC_VAL;
case token_t::ValueBegin: return NC_STR;

Maybe add the same comment that was introduced above?

cuio_json::NC_FN, cuio_json::NC_LIST, cuio_json::NC_STRUCT, cuio_json::NC_STRUCT,
cuio_json::NC_FN, cuio_json::NC_STR, cuio_json::NC_FN, cuio_json::NC_STR,
cuio_json::NC_FN, cuio_json::NC_VAL};
cuio_json::NC_FN, cuio_json::NC_STR};
Contributor

If NC_VAL is no longer used at all, does it make sense to remove it from the enums? Or is it still used in the non-experimental engines?

Contributor Author

It's removed for a specific null behaviour (null literals are identified as values now), but it might be needed in the future (null literals are not checked yet; this may change). NC_VAL does not add value now, but it will be required when we add other features of the JSON parser.

bytes_file = BytesIO()
data = {
"c1": [{"f1": "sf11", "f2": "sf21"}, {"f1": "sf12", "f2": "sf22"}],
"c2": [["l11", "l21"], ["l12", "l22"]],
Contributor

These changes came in as part of #11746, so not sure if you need to merge trunk for the diff to disappear?

auto to_int = [](auto v) { return std::to_string(static_cast<int>(v)); };
auto print_vec = [](auto const& cpu, auto const name, auto converter) {
for (auto const& v : cpu)
printf("%3s,", converter(v).c_str());
Contributor

TODO: std::format once we move to C++20 :)

Contributor Author

Yes, definitely! Looking forward to using std::format.
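For reference, a self-contained host-side sketch of the converter/print_vec pattern from the snippet above, with the eventual std::format form shown only as a comment (the free-function shapes and data here are illustrative, not the actual cuDF helpers):

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Mirrors the to_int/print_vec lambdas above as free functions; the
// commented line sketches what a C++20 std::format rewrite could look like.
inline std::string to_int(char v) { return std::to_string(static_cast<int>(v)); }

template <typename Vec, typename Converter>
void print_vec(Vec const& cpu, char const* name, Converter converter)
{
  std::printf("%s: ", name);
  for (auto const& v : cpu) {
    std::printf("%3s,", converter(v).c_str());
    // C++20: out += std::format("{:>3},", converter(v));
  }
  std::printf("\n");
}
```

For example, `print_vec(std::vector<char>{'S', 'L', 'V'}, "node_types", to_int);` prints each element's integer value on one line.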

[] __device__(NodeT type_a, NodeT type_b) -> NodeT {
auto is_a_leaf = (type_a == NC_VAL || type_a == NC_STR);
auto is_b_leaf = (type_b == NC_VAL || type_b == NC_STR);
// (v+v=v, *+*=*, *+v=*, *+#=E, NESTED+VAL=NESTED)
Contributor

Add some upfront comments on the meaning of the symbols here?

Contributor Author

Will add details.
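Spelling the shorthand out for future readers: below is one plausible host-side reading of the `(v+v=v, *+*=*, *+v=*, *+#=E, NESTED+VAL=NESTED)` rule. The enum values and combine logic are my interpretation for illustration, not the actual cuDF device functor:

```cpp
// Guessed symbol meanings: v = NC_VAL, * = NC_STR, # = NC_FN (field name),
// E = error; NESTED = NC_STRUCT or NC_LIST.
enum NodeT { NC_STRUCT, NC_LIST, NC_FN, NC_STR, NC_VAL, NC_ERR };

inline NodeT combine(NodeT a, NodeT b)
{
  bool const is_a_leaf = (a == NC_VAL || a == NC_STR);
  bool const is_b_leaf = (b == NC_VAL || b == NC_STR);
  if (a == b) return a;                       // v+v=v, *+*=*
  if (is_a_leaf && is_b_leaf) return NC_STR;  // *+v=*: VAL and STR collapse to string
  if (is_a_leaf != is_b_leaf) {
    NodeT const nested = is_a_leaf ? b : a;
    NodeT const leaf   = is_a_leaf ? a : b;
    if (leaf == NC_VAL && (nested == NC_STRUCT || nested == NC_LIST))
      return nested;                          // NESTED+VAL=NESTED
  }
  return NC_ERR;                              // everything else, e.g. *+#=E
}
```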

Comment on lines 454 to 457
// restore unique_col_ids order
std::sort(h_range_col_id_it, h_range_col_id_it + num_columns, [](auto const& a, auto const& b) {
return thrust::get<1>(a) < thrust::get<1>(b);
});
Contributor

At the cost of slightly higher host memory pressure, is it easier just to do the original sort into a copy?
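The trade-off this question raises can be sketched on the host with std:: containers (thrust/rmm omitted; `sort_and_restore` and `sorted_copy` are hypothetical names for the two options, not cuDF functions):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// (a) current approach: tag each id with its position, sort for processing,
// then sort a second time by the stored position to restore the input order.
inline std::vector<int> sort_and_restore(std::vector<int> ids)
{
  std::vector<std::pair<int, std::size_t>> tagged;
  for (std::size_t i = 0; i < ids.size(); ++i) tagged.emplace_back(ids[i], i);
  std::sort(tagged.begin(), tagged.end());  // processing happens in this order
  std::sort(tagged.begin(), tagged.end(),   // second sort restores input order
            [](auto const& a, auto const& b) { return a.second < b.second; });
  std::vector<int> out;
  for (auto const& p : tagged) out.push_back(p.first);
  return out;
}

// (b) reviewer's suggestion: sort a copy and leave the original untouched.
inline std::vector<int> sorted_copy(std::vector<int> const& ids)
{
  std::vector<int> copy = ids;  // extra host memory, but one sort fewer
  std::sort(copy.begin(), copy.end());
  return copy;
}
```

Option (b) spends one extra array of host memory but drops the second sort entirely.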

Comment on lines 543 to 545
// restore col_ids, TODO is this required?
// thrust::copy(
// rmm::exec_policy(stream), original_col_ids.begin(), original_col_ids.end(), col_ids.begin());
Contributor

Is it?

Contributor Author

No, it's not required. There are a few optimisations pending here.

col->set_null_mask(rmm::device_buffer{0, stream, mr}, 0);
}

// For string columns return ["offsets", "char"] schema
Contributor

Comment name doesn't match code name.

// gpu tree generation
return get_tree_representation(tokens_gpu, token_indices_gpu, stream);
}(); // IILE used to free memory of token data.
// print_tree(input, gpu_tree, stream);
Contributor

Leave as executable code behind #ifdef NJP_DEBUG_PRINT ?
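A minimal sketch of that suggestion, assuming a printf-style debug helper (`NJP_DBG` and `example` are hypothetical names; only the NJP_DEBUG_PRINT macro name comes from the review):

```cpp
#include <cstdio>

// Debug printing stays as real, compiled code instead of a commented-out
// call; it only produces output when built with -DNJP_DEBUG_PRINT.
#ifdef NJP_DEBUG_PRINT
#define NJP_DBG(...) std::printf(__VA_ARGS__)
#else
#define NJP_DBG(...) ((void)0)  // expands to a no-op in normal builds
#endif

inline void example() { NJP_DBG("tree nodes: %d\n", 42); }
```

Compared to leaving `// print_tree(input, gpu_tree, stream);` commented out, the guarded call keeps the debug path greppable and less likely to rot.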

@github-actions bot removed the "CMake: CMake build issue" label Sep 27, 2022
@karthikeyann (Contributor Author)

Benchmark Results

Dated 21-Sept-2022

nested_json_gpu_parser

[0] Quadro GV100

string_size Samples CPU Time Noise GPU Time Noise Elem/s
2^20 = 1048576 2501x 5.975 ms 7.89% 5.968 ms 7.88% 175.689M
2^21 = 2097152 672x 6.677 ms 4.13% 6.670 ms 4.13% 314.402M
2^22 = 4194304 592x 8.323 ms 4.69% 8.316 ms 4.69% 504.346M
2^23 = 8388608 1008x 12.778 ms 13.18% 12.771 ms 13.18% 656.833M
2^24 = 16777216 750x 19.987 ms 5.99% 19.980 ms 5.99% 839.719M
2^25 = 33554432 80x 37.238 ms 4.29% 37.230 ms 4.29% 901.272M
2^26 = 67108864 220x 68.162 ms 1.98% 68.154 ms 1.98% 984.658M
2^27 = 134217728 80x 131.745 ms 1.09% 131.738 ms 1.09% 1.019G
2^28 = 268435456 11x 260.182 ms 0.35% 260.175 ms 0.35% 1.032G
2^29 = 536870912 11x 520.952 ms 0.19% 520.946 ms 0.19% 1.031G
2^30 = 1073741824 11x 1.057 s 0.27% 1.057 s 0.27% 1.016G


@karthikeyann marked this pull request as draft September 27, 2022 19:36
Labels

  • 5 - DO NOT MERGE: Hold off on merging; see PR for details
  • cuIO: cuIO issue
  • feature request: New feature or request
  • libcudf: Affects libcudf (C++/CUDA) code.
  • non-breaking: Non-breaking change
  • Python: Affects Python cuDF API.