Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support casting of Map type to string in JSON reader #14936

Merged
merged 28 commits into from
Mar 11, 2024

Conversation

karthikeyann
Copy link
Contributor

@karthikeyann karthikeyann commented Jan 30, 2024

Description

Addresses part of #14288
Depends on #14939 (mixed type ignore nulls fix)

In the input schema, if a struct column is given as STRING type, it's forced to be a STRING column.
This could be used to support map type in spark JSON reader. (Force a map type to be a STRING, and use different parser to extract this string column as key, value columns)
To enable this forcing, mixed type as string should be enabled in json_reader_options.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@karthikeyann karthikeyann added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue Spark Functionality that helps Spark RAPIDS 4 - Needs cuDF (Java) Reviewer non-breaking Non-breaking change labels Jan 30, 2024
@karthikeyann karthikeyann changed the title Map type as string in JSON reader (prototype) Map type as string in JSON reader Jan 31, 2024
@github-actions github-actions bot added the CMake CMake build issue label Feb 6, 2024
@karthikeyann
Copy link
Contributor Author

Here is an JSONL file with column "C" with 1000 rows, each with 100 keys, 1000 types of keys.
This does not include the time of parsing the column "C" as map.

# Colum "C" read as string
In [6]: %time df = cudf.read_json(open("output.json"), orient="records",  lines=True, mixed_types_as_string=True, dtype={"C": str})
using gpu engine: cudf
CPU times: user 6.15 ms, sys: 24.4 ms, total: 30.5 ms
Wall time: 29.3 ms

# Column "C" read as struct (inferred automatically)
In [7]: %time df2 = cudf.read_json(open("output.json"), orient="records",  lines=True, mixed_types_as_string=True)
using gpu engine: cudf
CPU times: user 367 ms, sys: 7.8 ms, total: 375 ms
Wall time: 373 ms

@karthikeyann karthikeyann added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Feb 7, 2024
@karthikeyann karthikeyann marked this pull request as ready for review February 7, 2024 13:34
@ttnghia
Copy link
Contributor

ttnghia commented Feb 22, 2024

/ok to test

Copy link
Contributor

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For all the examples in #14239 (comment), I see the correct results with this PR.

Thank you @karthikeyann

@vyasr vyasr added 4 - Needs Review Waiting for reviewer to review or respond and removed 4 - Needs cuDF (Java) Reviewer labels Feb 23, 2024
Copy link
Contributor

@shrshi shrshi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Just a few minor questions -

@@ -540,6 +577,12 @@ void make_device_json_column(device_span<SymbolT const> input,
col.column_order.clear();
};

path_from_tree tree_path{column_categories,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we need to know the tree path only for mixed types, can we create the object only when the option is enabled?

Copy link
Contributor Author

@karthikeyann karthikeyann Mar 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The object is light weight. It holds span and couple of primitives. So, it may not matter much if the suggestion is for reducing runtime or memory.
I added the struct because in future it can have a memoizer.

* @param options json reader options which holds schema
* @return data type of the column if present
*/
std::optional<data_type> get_path_data_type(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have this as a member of the path_from_tree struct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

path_from_tree struct functions on column tree. This function checks if a column path is present in the json options input schema. They need not be combined as get_path_data_type doesn't use any data from path_from_tree struct.

cpp/src/io/json/parser_features.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

@hyperbolic2346 hyperbolic2346 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some suggestions and comments.

cpp/src/io/json/parser_features.cpp Outdated Show resolved Hide resolved
cpp/src/io/json/json_column.cu Outdated Show resolved Hide resolved
cpp/src/io/json/parser_features.cpp Outdated Show resolved Hide resolved
cpp/src/io/json/parser_features.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

@shrshi shrshi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thank you!

@GregoryKimball GregoryKimball changed the title Map type as string in JSON reader Support casting of Map type to string in JSON reader Mar 7, 2024
// Testing function for mixed types in JSON (for spark json reader)
auto test_fn = [](std::string_view json_string, bool lines, std::vector<type_id> types) {
std::map<std::string, cudf::io::schema_element> dtype_schema{
{"foo1", {data_type{type_id::STRING}}}, // list won't be a string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I understand why a list will not be returned as a string? I know that this might not be in the requirements, but if I ask for a nested type to be a string, I really would like for it to be returned as a string no matter what. I'm not 100% sure how difficult that is to pull off though.

Copy link
Contributor Author

@karthikeyann karthikeyann Mar 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I restricted forced string by input schema to Struct type alone for 2 reasons.

  1. Map type is enclosed with {}, which is interpreted as struct in nested JSON reader. so, as per requirement, only struct need to be forced as string.
  2. It reduces the search space. Right now, I check for only struct column type, if its path is present in input schema, which reduces some search space. If list also need to be checked, then the path needs to be built for list types too, which is additional work, which does not help the map type support requirement.

It's easy to implement; in cpp/src/io/json/json_column.cu:715 , we need add LIST type also in the condition. Would you prefer to add LIST too?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a missed requirement. Sorry about that. In Spark if I ask for a string and it sees a list or a struct, or really just about any type, it is returned as a string in a very similar way to how the mixed type support works. So yes I would love it if we could get this to work for both list and struct types.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I will add that if you want to do it as a separate issue, because it is a missed requirement on our part I am fine with that. It is about what ever is simpler for you to do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries @revans2 . We could do it as a separate PR to test it extensively.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I filed #15278 for this

@karthikeyann karthikeyann requested a review from revans2 March 8, 2024 18:19
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving CMake.

@karthikeyann
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 6a03827 into rapidsai:branch-24.04 Mar 11, 2024
73 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond CMake CMake build issue cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants