POC for whitespace removal in input JSON data using FST #14931

shrshi · 2024-01-30T01:53:24Z

Description

This PR provides a proof of concept for using an FST (finite-state transducer) to remove unquoted spaces and tabs from JSON strings. This is useful when we want to cast a hierarchical JSON object to a string, and it helps overcome the challenge of processing mixed types in Spark (#14865).
The FST assumes that single quotes in the input data have already been normalized (possibly using normalize_single_quotes).

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

elstehle

Thank you so much for working on this and for putting the FST to use 🙂
I just did an early high-level review of the FST parts. Overall, this already looks good. I left a few minor comments that may help us simplify the logic a bit further.

vuule · 2024-02-03T00:41:51Z

@elstehle would it be feasible to create a single FST that can perform both quote normalization and whitespace removal, and that is also configurable to do only one of these preprocessing steps? I know the JSON parser FST is configurable to some extent, but I don't know how limited this approach is.

elstehle · 2024-02-05T10:35:49Z

> @elstehle would it be feasible to create a single FST that can perform both quote normalization and whitespace removal, and that is also configurable to do only one of these preprocessing steps? I know the JSON parser FST is configurable to some extent, but I don't know how limited this approach is.

I believe it should be possible to have an FST that does both in a single pass. We'd have to see if it makes sense to integrate all three options, i.e., (1) whitespace removal, (2) quote normalization, (3) both, into a single FST instance or whether that would overcomplicate the translation function and make it too branchy. Or whether it'd be better to have three separate FST instances for each of the three options above.

vuule · 2024-02-05T18:07:01Z

> > @elstehle would it be feasible to create a single FST that can perform both quote normalization and whitespace removal, and that is also configurable to do only one of these preprocessing steps? I know the JSON parser FST is configurable to some extent, but I don't know how limited this approach is.
>
> I believe it should be possible to have an FST that does both in a single pass. We'd have to see if it makes sense to integrate all three options, i.e., (1) whitespace removal, (2) quote normalization, (3) both, into a single FST instance or whether that would overcomplicate the translation function and make it too branchy. Or whether it'd be better to have three separate FST instances for each of the three options above.

Thanks. If you're not sure if this is feasible, it probably makes the most sense to start with separate FSTs.
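To make the "too branchy" concern in this exchange concrete, here is a rough host-side C++ sketch of what a single configurable pass might look like. Everything in it is an assumption for illustration: `normalize_options`, `normalize_json`, and the simplified quote rules (a single-quoted string is rewritten with double quotes, inner `"` becomes `\"`, and `\'` becomes `'`) are hypothetical and are not the cuDF FST or its API. The sketch also ignores the newline handling discussed elsewhere in this PR.

```cpp
#include <string>

// Hypothetical option struct: each flag enables one preprocessing step.
struct normalize_options {
  bool strip_whitespace;  // drop spaces/tabs outside of strings
  bool normalize_quotes;  // rewrite 'single-quoted' strings with double quotes
};

// Sketch of a combined single-pass normalizer (not the cuDF implementation).
std::string normalize_json(const std::string& in, normalize_options opts)
{
  enum class state { outside, dq_string, dq_escape, sq_string, sq_escape };
  state s = state::outside;
  std::string out;
  out.reserve(in.size());
  for (char c : in) {
    switch (s) {
      case state::outside:
        if (c == '"') {
          s = state::dq_string;
          out += c;
        } else if (c == '\'' && opts.normalize_quotes) {
          s = state::sq_string;
          out += '"';  // opening single quote becomes a double quote
        } else if ((c == ' ' || c == '\t') && opts.strip_whitespace) {
          // unquoted whitespace is dropped
        } else {
          out += c;
        }
        break;
      case state::dq_string:
        if (c == '"') {
          s = state::outside;
        } else if (c == '\\') {
          s = state::dq_escape;
        }
        out += c;  // double-quoted content passes through unchanged
        break;
      case state::dq_escape:
        s = state::dq_string;
        out += c;
        break;
      case state::sq_string:
        if (c == '\'') {
          s = state::outside;
          out += '"';  // closing single quote becomes a double quote
        } else if (c == '"') {
          out += "\\\"";  // inner double quote must now be escaped
        } else if (c == '\\') {
          s = state::sq_escape;  // hold the backslash until the next character
        } else {
          out += c;
        }
        break;
      case state::sq_escape:
        if (c != '\'') { out += '\\'; }  // keep the backslash unless it escaped a single quote
        out += c;
        s = state::sq_string;
        break;
    }
  }
  return out;
}
```

Under these assumptions, `normalize_json("{ 'a' : 'b c' }", {true, true})` yields `{"a":"b c"}`. Even in this simplified form, every flag adds branches to the per-character logic, which is the trade-off weighed above against keeping separate FSTs.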

elstehle

Just a few minor comments, otherwise looks good to me 👍

bdice

A couple suggestions to improve comments. Otherwise LGTM!

bdice · 2024-02-07T20:41:48Z

cpp/tests/io/json_whitespace_normalization_test.cu

+ *        |   state, whitespaces following escaped double quotes inside strings may be removed.
+ *
+ * NOTE: An important case NOT handled by this FST is that of whitespace following newline
+ * characters within a string. For example, `{"a":"x\n y"}` ---FST--> `{"a":"x\ny"}`


The example makes it sound like this FST does that transformation. Maybe write:

Suggested change

- * characters within a string. For example, `{"a":"x\n y"}` ---FST--> `{"a":"x\ny"}`
+ * characters within a string. For example, `{"a":"x\n y"}` is unchanged by this FST. It
+ * does not become `{"a":"x\ny"}`.

shrshi

With the current FST, we would get the transformation described in the comment, but that is not the expected behaviour, i.e., we should not remove whitespace characters within quotes. I think the following would make it clearer:

Suggested change

- * characters within a string. For example, `{"a":"x\n y"}` ---FST--> `{"a":"x\ny"}`
+ * characters within a string. Consider the following example
+ * Input:           {"a":"x\n y"}
+ * FST output:      {"a":"x\ny"}
+ * Expected output: {"a":"x\n y"}

bdice

I'm confused. Are we documenting a known bug in the current implementation? Do we intend to fix this before merging?

shrshi

For compatibility with Spark, we don't need to treat newlines within strings as part of the string. When reading JSON lines with the option set to recover from invalid lines, I think a newline character that appears before the end of the record (as in the example {"a":"x\n y"}) will result in the parser treating it as an invalid line.
I have added the note for the sake of completeness and to clarify the scope of the FST.

karthikeyann · 2024-02-08T13:06:39Z

Quote normalization is applied to the entire JSON input. Whitespace removal is required only for downstream processing of mixed types (#14865), which should involve much less data than the entire JSON input. So this may be a reason to keep the FSTs separate. A per-string FST for whitespace removal could be useful (but only if it does not hurt performance).

karthikeyann

Nice and clean FST state table! Great work.

karthikeyann · 2024-02-08T13:17:31Z

cpp/tests/io/json_whitespace_normalization_test.cu

+  {/* IN_STATE      "       \       \n    <SPC>   OTHER  */
+   /* TT_OOS */ {{TT_DQS, TT_OOS, TT_OOS, TT_OOS, TT_OOS}},
+   /* TT_DQS */ {{TT_OOS, TT_DEC, TT_OOS, TT_DQS, TT_DQS}},
+   /* TT_DEC */ {{TT_DQS, TT_DQS, TT_DQS, TT_DQS, TT_DQS}}}};


Is an error state not expected to occur, given that we don't have a state for errors?

shrshi

Neither the quote normalization FST nor the whitespace normalization FST has an error state. For invalid JSON inputs (such as the GroundTruth_InvalidInput test case), the FST processes them anyway and leaves error handling and recovery to the subsequent parsing FST.
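For readers who want to trace the table quoted in this thread by hand, the following host-side C++ sketch simulates it character by character. The transition table is copied from the snippet above. The symbol-group names, the `classify` helper, the function name `remove_unquoted_whitespace`, the assumption that tabs share the `<SPC>` column, and the translation rule (drop spaces and tabs only while in `TT_OOS`) are assumptions made for illustration, not the device-side FST in the PR.

```cpp
#include <array>
#include <string>

// States, named after the rows of the transition table quoted above.
enum state : int { TT_OOS = 0, TT_DQS = 1, TT_DEC = 2 };

// Symbol groups in the table's column order: "  \  \n  <SPC>  OTHER (names are hypothetical).
enum symbol_group : int { SG_QUOTE = 0, SG_BACKSLASH, SG_NEWLINE, SG_SPACE, SG_OTHER };

const std::array<std::array<state, 5>, 3> transition_table{{
  /* TT_OOS */ {{TT_DQS, TT_OOS, TT_OOS, TT_OOS, TT_OOS}},
  /* TT_DQS */ {{TT_OOS, TT_DEC, TT_OOS, TT_DQS, TT_DQS}},
  /* TT_DEC */ {{TT_DQS, TT_DQS, TT_DQS, TT_DQS, TT_DQS}}}};

// Maps a character to its symbol group.
inline symbol_group classify(char c)
{
  switch (c) {
    case '"': return SG_QUOTE;
    case '\\': return SG_BACKSLASH;
    case '\n': return SG_NEWLINE;
    case ' ':
    case '\t': return SG_SPACE;  // assumption: tabs belong to the <SPC> group
    default: return SG_OTHER;
  }
}

// Host-side simulation: emits every character except whitespace seen outside a string.
std::string remove_unquoted_whitespace(const std::string& input)
{
  std::string output;
  output.reserve(input.size());
  state current = TT_OOS;
  for (char c : input) {
    symbol_group const sg = classify(c);
    // Assumed translation rule: drop spaces/tabs only in the outside-of-string state.
    if (!(current == TT_OOS && sg == SG_SPACE)) { output.push_back(c); }
    current = transition_table[current][sg];
  }
  return output;
}
```

Under these assumptions, the sketch reproduces the GroundTruth_InvalidInput expectation (the unterminated string is passed through without any error state) and also the documented limitation: because a newline inside a string moves the machine back to `TT_OOS`, the space after the newline in `{"a":"x\n y"}` is dropped.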

karthikeyann · 2024-02-08T13:26:22Z

cpp/tests/io/json_whitespace_normalization_test.cu

+void run_test(const std::string& input, const std::string& output)
+{
+  // Prepare cuda stream for data transfers & kernels
+  rmm::cuda_stream stream{};


Should this be cudf::get_default_stream() for tests?

shrshi

Great idea! It's better to call cudf::test::get_default_stream() here instead of creating a new stream. Fixed.

karthikeyann · 2024-02-08T13:35:26Z

cpp/tests/io/json_whitespace_normalization_test.cu

+TEST_F(JsonWSNormalizationTest, GroundTruth_InvalidInput)
+{
+  std::string input  = "{\"a\" : \"b }\n{ \"c\" :\t\"d\"}";
+  std::string output = "{\"a\":\"b }\n{\"c\":\"d\"}";


Question (not a change suggestion):
Why do some test cases use raw string literals while others use escaped strings?

shrshi

With raw strings, it's hard to see the positions of spaces and tabs when they are next to each other, especially when editors map tabs to different numbers of spaces. With escaped strings, I think we have more control.

bdice

That's a really good answer! I hadn't considered that.
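A small C++ sketch of the point made in this exchange: the two spellings produce identical strings, but only the escaped form makes a tab visible in the source. The variable names and the `count_tabs` helper are hypothetical and are not taken from the test file.

```cpp
#include <algorithm>
#include <string>

// Escaped and raw spellings of the same record, containing only single spaces.
const std::string escaped_spaces = "{\"a\" : \"b\"}";
const std::string raw_spaces     = R"({"a" : "b"})";

// With escapes, a tab is an explicit, visible \t. In a raw literal it would have to be
// a literal tab character, which many editors render like a run of spaces.
const std::string escaped_tab = "{\"c\" :\t\"d\"}";

// Hypothetical helper: counts the tab characters in a string.
inline long count_tabs(const std::string& s)
{
  return static_cast<long>(std::count(s.begin(), s.end(), '\t'));
}
```

Here `escaped_spaces == raw_spaces` holds, and `count_tabs(escaped_tab)` is 1, which a reader can confirm from the source text alone without knowing how their editor renders tabs.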

karthikeyann

Looks good.

shrshi · 2024-02-08T18:03:10Z

/merge

This work is a follow-up to PR #14931, which provided a proof of concept for using an FST to normalize unquoted whitespace. This PR implements the pre-processing FST in cuIO and adds a JSON reader option that must be set to true to invoke the normalizer. Addresses feature request #14865.

Authors:
- Shruti Shivakumar (https://github.com/shrshi)
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- Robert (Bobby) Evans (https://github.com/revans2)
- Vukasin Milovanovic (https://github.com/vuule)
- Robert Maynard (https://github.com/robertmaynard)
- Bradley Dice (https://github.com/bdice)

URL: #15033

tests to verify fst correctness

5f47b0b

github-actions bot added the libcudf and CMake labels Jan 30, 2024

shrshi added the feature request, non-breaking, and 2 - In Progress labels and removed the CMake label Jan 30, 2024

formatting fixes

e3871de

github-actions bot added the CMake label Jan 30, 2024

shrshi added 3 commits January 30, 2024 10:49

Merge branch 'branch-24.04' into json-whitespace-fst

302ec76

aded more tests

5c0ab9b

formatting fixes

690a61a

elstehle reviewed Jan 30, 2024

shrshi added 4 commits January 30, 2024 19:15

Merge branch 'json-whitespace-fst' of github.com:shrshi/cudf into json-whitespace-fst

8e5256d

addressing initial PR reviews

529776d

formatting fix

4aaa610

minor comment fix

58f697d

shrshi marked this pull request as ready for review January 30, 2024 22:16

shrshi requested a review from a team as a code owner January 30, 2024 22:16

shrshi requested review from robertmaynard and davidwendt January 30, 2024 22:16

GregoryKimball mentioned this pull request Jan 31, 2024

[FEA] JSON reader improvements for Spark-RAPIDS #13525

Closed

bdice reviewed Feb 1, 2024

vuule reviewed Feb 3, 2024

vuule reviewed Feb 5, 2024

shrshi added 2 commits February 6, 2024 17:46

modified transition table; PR reviews

5957a71

Merge branch 'branch-24.04' into json-whitespace-fst

9c51a21

shrshi requested a review from elstehle February 6, 2024 19:41

elstehle reviewed Feb 6, 2024

shrshi added 2 commits February 7, 2024 20:12

reducing state space; addressing feedback

78e47c0

Merge branch 'json-whitespace-fst' of github.com:shrshi/cudf into json-whitespace-fst

dc0c75d

vuule approved these changes Feb 7, 2024

elstehle approved these changes Feb 7, 2024

bdice approved these changes Feb 7, 2024

Merge branch 'branch-24.04' into json-whitespace-fst

081c282

karthikeyann reviewed Feb 8, 2024

shrshi added 3 commits February 8, 2024 16:12

more fixes based on reviews

de4bd03

Merge branch 'json-whitespace-fst' of github.com:shrshi/cudf into json-whitespace-fst

183e66b

Merge branch 'branch-24.04' into json-whitespace-fst

55d3bc5

karthikeyann approved these changes Feb 8, 2024

rapids-bot bot merged commit 3f8cb74 into rapidsai:branch-24.04 Feb 8, 2024
68 checks passed

shrshi mentioned this pull request Feb 13, 2024

API for JSON unquoted whitespace normalization #15033

Merged

3 tasks
