Support casting of Map type to string in JSON reader #14936
Here is a JSONL file with a column "C" and 1000 rows; each row has 100 keys, drawn from 1000 distinct key names.
# Column "C" read as string
In [6]: %time df = cudf.read_json(open("output.json"), orient="records", lines=True, mixed_types_as_string=True, dtype={"C": str})
using gpu engine: cudf
CPU times: user 6.15 ms, sys: 24.4 ms, total: 30.5 ms
Wall time: 29.3 ms
# Column "C" read as struct (inferred automatically)
In [7]: %time df2 = cudf.read_json(open("output.json"), orient="records", lines=True, mixed_types_as_string=True)
using gpu engine: cudf
CPU times: user 367 ms, sys: 7.8 ms, total: 375 ms
Wall time: 373 ms
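For anyone who wants to reproduce a file of this shape, the following is a minimal sketch using only the Python standard library. The helper names `make_rows` and `to_jsonl`, the key naming scheme (`k0`, `k1`, ...), and the integer values are assumptions for illustration; the original `output.json` was not published in this thread and may differ.

```python
import json
import random


def make_rows(num_rows=1000, keys_per_row=100, num_key_types=1000, seed=0):
    """Build rows whose column "C" is a map with `keys_per_row` keys
    sampled from a pool of `num_key_types` distinct key names."""
    rng = random.Random(seed)
    # Hypothetical key names; the real benchmark keys are unknown.
    key_pool = [f"k{i}" for i in range(num_key_types)]
    rows = []
    for _ in range(num_rows):
        keys = rng.sample(key_pool, keys_per_row)
        rows.append({"C": {k: rng.randint(0, 9) for k in keys}})
    return rows


def to_jsonl(rows):
    """Serialize rows as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(row) + "\n" for row in rows)


if __name__ == "__main__":
    with open("output.json", "w") as f:
        f.write(to_jsonl(make_rows()))
```

Because each row uses a different subset of keys, inferring "C" as a struct has to account for up to 1000 child columns, while reading it as a string does not.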
/ok to test
For all the examples in #14239 (comment), I see the correct results with this PR.
Thank you @karthikeyann
Looks great! Just a few minor questions -
@@ -540,6 +577,12 @@ void make_device_json_column(device_span<SymbolT const> input,
    col.column_order.clear();
  };

  path_from_tree tree_path{column_categories,
If we need to know the tree path only for mixed types, can we create the object only when the option is enabled?
The object is lightweight: it holds a span and a couple of primitives. So creating it unconditionally should not matter much, if the suggestion is aimed at reducing runtime or memory.
I added the struct because in the future it can hold a memoizer.
 * @param options json reader options which holds schema
 * @return data type of the column if present
 */
std::optional<data_type> get_path_data_type(
Can we have this as a member of the path_from_tree struct?
The path_from_tree struct operates on the column tree, whereas this function checks whether a column path is present in the input schema of the JSON reader options. They need not be combined, as get_path_data_type doesn't use any data from the path_from_tree struct.
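To make the separation concrete, here is a rough Python model of a schema lookup by column path. It is not the C++ code from this PR: the nested-dict schema layout, the type names as plain strings, and the function name `get_path_dtype` are all assumptions chosen for illustration. The point it shows is that the lookup needs only the path and the user schema, and nothing from the column tree.

```python
from typing import Optional

# Hypothetical schema layout: each node has a "dtype" and optional "children".
SCHEMA = {
    "foo1": {"dtype": "STRING"},
    "a": {"dtype": "STRUCT", "children": {"b": {"dtype": "STRING"}}},
}


def get_path_dtype(path, schema) -> Optional[str]:
    """Return the dtype requested for `path` (a list of column names from
    the root), or None if the path is not present in the schema."""
    node = None
    children = schema
    for name in path:
        # Stop as soon as the path leaves the user-provided schema.
        if children is None or name not in children:
            return None
        node = children[name]
        children = node.get("children")
    return node["dtype"] if node is not None else None
```

For example, `get_path_dtype(["a", "b"], SCHEMA)` finds the nested entry, while a path that is absent from the schema yields `None`, which a caller would treat as "no type was requested for this column".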
Some suggestions and comments.
Looks good to me, thank you!
// Testing function for mixed types in JSON (for spark json reader)
auto test_fn = [](std::string_view json_string, bool lines, std::vector<type_id> types) {
  std::map<std::string, cudf::io::schema_element> dtype_schema{
    {"foo1", {data_type{type_id::STRING}}},  // list won't be a string
Can I understand why a list will not be returned as a string? I know that this might not be in the requirements, but if I ask for a nested type to be a string, I really would like for it to be returned as a string no matter what. I'm not 100% sure how difficult that is to pull off though.
I restricted forcing to string via the input schema to the struct type alone, for 2 reasons.
- Map type is enclosed in {}, which the nested JSON reader interprets as a struct. So, as per the requirement, only structs need to be forced to string.
- It reduces the search space. Right now, I check only struct columns for whether their path is present in the input schema. If lists also need to be checked, then the path needs to be built for list types too, which is additional work that does not help the map type support requirement.
It's easy to implement; in cpp/src/io/json/json_column.cu:715, we need to add the LIST type to the condition as well. Would you prefer to add LIST too?
I think this is a missed requirement. Sorry about that. In Spark if I ask for a string and it sees a list or a struct, or really just about any type, it is returned as a string in a very similar way to how the mixed type support works. So yes I would love it if we could get this to work for both list and struct types.
But I will add that if you want to do it as a separate issue, I am fine with that, because it is a missed requirement on our part. It is about whatever is simpler for you to do.
No worries @revans2. We could do it as a separate PR to test it extensively.
I filed #15278 for this
Approving CMake.
/merge
Description
Addresses part of #14288
Depends on #14939 (mixed type ignore nulls fix)
In the input schema, if a struct column is given as STRING type, it is forced to be a STRING column.
This could be used to support the map type in the Spark JSON reader: force a map type to be a STRING, then use a different parser to extract this string column as key and value columns.
To enable this forcing, the mixed types as string option must be enabled in json_reader_options.
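The second step described above (turning the forced string column into key and value columns) can be pictured with a small CPU-side sketch. This is only an illustration built on Python's standard `json` module, not the parser Spark or cuDF would use; the function name `map_strings_to_key_values` and the choice to render non-string values back to JSON text are assumptions.

```python
import json


def map_strings_to_key_values(column):
    """Split a column of JSON-object strings into parallel key and value
    list columns. Null cells stay null; values are kept as strings."""
    keys, values = [], []
    for cell in column:
        if cell is None:
            keys.append(None)
            values.append(None)
            continue
        obj = json.loads(cell)
        keys.append(list(obj.keys()))
        # Keep string values as-is; re-serialize everything else to JSON text.
        values.append(
            [v if isinstance(v, str) else json.dumps(v) for v in obj.values()]
        )
    return keys, values
```

Each input row produces one list of keys and one list of values of equal length, which mirrors how a map column is commonly laid out as a list of key/value pairs.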
Checklist