Add JSON option to prune columns #14996

karthikeyann · 2024-02-07T17:37:29Z

Description

Resolves #14951
This adds an option, prune_columns, to json_reader_options (default: False).
When set to True, the dtypes option is used as a column filter instead of as a type-inference suggestion. If dtypes (a vector of dtypes, a map of dtypes, or a nested schema) is not specified, the output is an empty dataframe.
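
The behaviour described above can be sketched in plain Python. This is only an illustration of the stated semantics, not libcudf's implementation: read_json_lines and its parameters are hypothetical stand-ins that work on a list of JSON-lines strings, and the dtypes map here simply holds conversion callables.

```python
import json


def read_json_lines(lines, dtypes=None, prune_columns=False):
    """Hypothetical stand-in for a JSON-lines reader (not the cuDF API).

    dtypes maps column name -> callable used to convert each value.
    Returns a dict of column name -> list of values (None where a key is missing).
    """
    rows = [json.loads(line) for line in lines if line.strip()]
    dtypes = dtypes or {}

    if prune_columns:
        # dtypes acts as a filter: only the listed columns are returned.
        # With no dtypes given, the result is empty.
        names = list(dtypes)
    else:
        # dtypes is only a type suggestion: every column found is returned.
        names = []
        for row in rows:
            for key in row:
                if key not in names:
                    names.append(key)

    columns = {}
    for name in names:
        convert = dtypes.get(name, lambda value: value)
        columns[name] = [
            convert(row[name]) if row.get(name) is not None else None
            for row in rows
        ]
    return columns


lines = ['{"a": 1, "b": "x"}', '{"a": 2, "c": 3.5}']
all_cols = read_json_lines(lines, dtypes={"a": float})
pruned = read_json_lines(lines, dtypes={"a": float}, prune_columns=True)
empty = read_json_lines(lines, prune_columns=True)
```

Here all_cols keeps columns a, b and c, pruned keeps only a, and empty has no columns because no dtypes were supplied.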

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

karthikeyann · 2024-02-07T17:44:32Z

Profiled on a GV100 machine.
Reading JSON with 512 columns, 10k rows, without a filter

Reading 1 column out of JSON with 512 columns, 10k rows. ~~(with filter 1 row)~~ (with filter 1 column)

Unnecessary parse_data() calls are eliminated.
It's possible to eliminate the initialize_json_columns() calls as well (the runtime impact would be smaller, but memory usage would drop; this depends on the map type PR #14936).

revans2

This looks great. I tried it out again and the time for pulling one item out of 512 went from 17 seconds to 9 seconds. I have not done traces on it yet to see what the next steps would be, but it is a lot better.

GregoryKimball · 2024-02-07T21:24:42Z

Thank you @karthikeyann, this is a great demonstration! When you mention:

Reading 1 columns out of JSON with 512 columns, 10k rows. (with filter 1 row)

What do you mean by "filter 1 row"?

karthikeyann · 2024-02-08T06:20:16Z

What do you mean by "filter 1 row"?

Sorry. I meant to type "filter 1 column".

keys.json content in each line:
{"key_109": "value0", "key_200": "value0", "key_342": "value0", ... } (500 keys out of 512 columns in each row)

import cudf
import nvtx
# read all 512 columns
with nvtx.annotate("read_json", color="purple"):
    df = cudf.read_json(open("keys.json"), engine="cudf", lines=True)
# read only 1 column
with nvtx.annotate("read_json", color="purple"):
    df = cudf.read_json(open("keys.json"), engine="cudf", lines=True, dtype={"key_10": str}, use_dtypes_as_filter=True)
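
For anyone wanting to reproduce a similar input, a file shaped like keys.json could be generated roughly as follows. This is a sketch under assumptions: the helper names make_rows and write_keys_file, the fixed seed, and the choice of which 500 keys appear in each row are all illustrative and not taken from the PR.

```python
import json
import random


def make_rows(num_rows=10_000, num_columns=512, keys_per_row=500, seed=0):
    """Yield JSON-lines strings, each holding a random subset of the keys."""
    rng = random.Random(seed)
    for _ in range(num_rows):
        # Pick which column indices appear in this row, kept in index order.
        chosen = sorted(rng.sample(range(num_columns), keys_per_row))
        row = {f"key_{i}": "value0" for i in chosen}
        yield json.dumps(row) + "\n"


def write_keys_file(path, **kwargs):
    """Write the generated rows to path (hypothetical helper)."""
    with open(path, "w") as f:
        f.writelines(make_rows(**kwargs))


# Small sample to inspect the row shape without writing the full file.
sample = list(make_rows(num_rows=3))
```

Calling write_keys_file("keys.json") would then produce the full 10k-row file for the two read_json calls above.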

copy-pr-bot · 2024-04-08T20:31:33Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

karthikeyann · 2024-04-08T20:35:49Z

/ok to test

karthikeyann · 2024-04-10T18:11:09Z

/ok to test

karthikeyann · 2024-04-11T03:10:59Z

/ok to test

mythrocks

Some minor nitpicks, but this LGTM.

It's a little late now to suggest this, but one wonders if "column pruning" might have been an acceptable replacement for "column filter", to avoid potential confusion.

I've learnt a couple of things from reviewing this PR, as per usual with @karthikeyann's PRs.

cpp/src/io/json/parser_features.cpp

cpp/src/io/json/json_column.cu

cpp/include/cudf/io/json.hpp

cpp/src/io/json/json_column.cu

cpp/tests/io/json_test.cpp

mythrocks · 2024-04-29T18:23:56Z

cpp/tests/io/json_test.cpp

+ {std::map<std::string, cudf::io::schema_element> dtype_schema{
+   {"a", {dtype<int32_t>()}},
+ };
+in_options.set_dtypes(dtype_schema);
+cudf::io::table_with_metadata result = cudf::io::read_json(in_options);
+// Make sure we have column "a"
+ASSERT_EQ(result.tbl->num_columns(), 1);
+ASSERT_EQ(result.metadata.schema_info.size(), 1);
+EXPECT_EQ(result.metadata.schema_info[0].name, "a");


Is the formatting here a little off?

karthikeyann · 2024-04-30

// include only one column {// schema {std::map<std::string, cudf::io::schema_element> dtype_schema{

This part of the code throws the formatting off.
Consecutive { (even with a comment between them) make clang-format treat them as consecutive uniform-initialization braces {{.
I removed the extra { }.

karthikeyann · 2024-04-30T16:37:41Z

/ok to test

karthikeyann · 2024-04-30T16:38:23Z

/ok to test

hyperbolic2346

Filter to prune changes. This looks good to me.

cpp/src/io/json/json_column.cu

cpp/tests/io/json_test.cpp

python/cudf/cudf/_lib/cpp/io/json.pxd

python/cudf/cudf/_lib/json.pyx

python/cudf/cudf/io/json.py

Co-authored-by: Mike Wilson

karthikeyann · 2024-04-30T17:51:17Z

/ok to test

hyperbolic2346

Thank you for entertaining my nits. I always love a good stoptimization and this is a perfect example!

shrshi

Looks good to me!
One question - do we need to include this option in the java code as well?

karthikeyann · 2024-05-01T02:29:56Z

do we need to include this option in the java code as well?

Yes. @revans2 Should I include the java code changes as well in this PR?

karthikeyann · 2024-05-01T22:13:20Z

/merge

wence-

Could we please have a docstring addition for the python read_json? Additionally, perhaps I am dumb, I couldn't understand the C++ docstring for the prune_columns option.

python/cudf/cudf/io/json.py

cpp/include/cudf/io/json.hpp

karthikeyann · 2024-05-02T17:57:56Z

/ok to test

karthikeyann added 2 commits February 7, 2024 23:03

add use_dtypes_as_filter in json_reader_options

6f8a5bc

add use_dtypes_as_filter in python

f827ae2

karthikeyann added the feature request, 2 - In Progress, cuIO, Java, 4 - Needs cuDF (Java) Reviewer, and non-breaking labels Feb 7, 2024

github-actions bot added the libcudf and Python labels and removed the Java label Feb 7, 2024

revans2 approved these changes Feb 7, 2024

vyasr added the 4 - Needs Review label and removed the 4 - Needs cuDF (Java) Reviewer label Feb 23, 2024

revans2 mentioned this pull request Mar 4, 2024

[FEA] Options to validate JSON fields #15222

Open

GregoryKimball assigned karthikeyann Mar 6, 2024

GregoryKimball mentioned this pull request Mar 12, 2024

[FEA] JSON reader improvements for Spark-RAPIDS #13525

Open

karthikeyann changed the base branch from branch-24.04 to branch-24.06 April 8, 2024 20:31

Merge branch 'branch-24.06' into fea-json_filter_columns

1dba438

skip copying string offsets for parsing for skipped dtypes columns

3054949

karthikeyann and others added 2 commits April 11, 2024 03:09

style fix, reduce get_path call

dceab5f

Merge branch 'branch-24.06' into fea-json_filter_columns

d35ee03

Merge branch 'branch-24.06' into fea-json_filter_columns

de1506e

karthikeyann requested a review from davidwendt April 24, 2024 04:14

GregoryKimball requested review from shrshi and elstehle April 29, 2024 17:46

mythrocks approved these changes Apr 29, 2024

karthikeyann added 2 commits April 30, 2024 15:40

address review comments

88e8d33

fix format for unit tests

7a34759

karthikeyann requested a review from hyperbolic2346 April 30, 2024 16:37

Merge branch 'branch-24.06' into fea-json_filter_columns

f235492

hyperbolic2346 requested changes Apr 30, 2024

karthikeyann and others added 2 commits April 30, 2024 12:42

Apply suggestions from code review

aa49ef7

Co-authored-by: Mike Wilson

rename python api names

8e0562c

karthikeyann changed the title ~~Add JSON option to use dtypes as Filter~~ Add JSON option to prune columns Apr 30, 2024

karthikeyann requested a review from hyperbolic2346 April 30, 2024 17:58

hyperbolic2346 approved these changes Apr 30, 2024

shrshi approved these changes Apr 30, 2024

karthikeyann added the 3 - Ready for Review label and removed the 2 - In Progress label May 1, 2024

wence- requested changes May 2, 2024

python/cudf/cudf/io/json.py (resolved)

cpp/include/cudf/io/json.hpp (outdated, resolved)

karthikeyann and others added 2 commits May 2, 2024 17:55

Update documentation

8def9db

Merge branch 'branch-24.06' into fea-json_filter_columns

e4fd7b7

karthikeyann requested a review from wence- May 2, 2024 17:57

wence- approved these changes May 2, 2024

rapids-bot bot merged commit 2fccbc0 into rapidsai:branch-24.06 May 2, 2024
69 checks passed

karthikeyann mentioned this pull request Nov 12, 2024

[BUG] JSON reader has no option to return the columns only for the requested schema #13473

Closed
