
Fix missing trailing comma in json writer #12688

Merged: 4 commits merged into rapidsai:branch-23.04 on Feb 13, 2023

Conversation

karthikeyann (Contributor)

Description

Fixes the missing trailing comma in the JSON writer for the non-lines JSON format.
Also updates the default rows_per_chunk, because the previous default of 8 is too small and makes the writer slow.
closes #12687
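
As a point of reference for the fix, here is a minimal host-side sketch (in no way the libcudf implementation) of the failure mode: when lines=False output is written in chunks of rows_per_chunk records, a comma must be emitted between chunks as well as between the records inside a chunk.

```cpp
// Minimal host-side sketch (not the libcudf code) of the bug class fixed here:
// with lines=false, the writer emits the table as one JSON array in chunks of
// rows_per_chunk records, and a comma is also required *between* chunks.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

std::string write_json_array(std::vector<std::string> const& records,
                             std::size_t rows_per_chunk)
{
  std::string out = "[";
  for (std::size_t start = 0; start < records.size(); start += rows_per_chunk) {
    if (start > 0) out += ",";  // the inter-chunk separator this PR restores
    auto const end = std::min(start + rows_per_chunk, records.size());
    for (auto i = start; i < end; ++i) {
      if (i > start) out += ",";
      out += records[i];
    }
  }
  return out + "]";
}

int main()
{
  // With the old default rows_per_chunk of 8, omitting the inter-chunk comma
  // yielded ...{"a":7}{"a":8}... after every 8th record, as reported in #12687.
  std::vector<std::string> records;
  for (int i = 0; i < 20; ++i) { records.push_back("{\"a\":" + std::to_string(i) + "}"); }
  std::cout << write_json_array(records, 8) << "\n";
  return 0;
}
```

Dropping the `if (start > 0)` branch reproduces the invalid output reported in #12687.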

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@karthikeyann added labels on Feb 3, 2023: bug (Something isn't working), 2 - In Progress (Currently a work in progress), libcudf (Affects libcudf (C++/CUDA) code), Python (Affects Python cuDF API), 4 - Needs cuDF (Python) Reviewer, cuIO (cuIO issue), Performance (Performance related issue), non-breaking (Non-breaking change)
@karthikeyann karthikeyann requested review from a team as code owners February 3, 2023 08:26
codecov bot commented on Feb 3, 2023

Codecov Report

❗ No coverage uploaded for pull request base (branch-23.04@bad94b9).
Patch has no changes to coverable lines.

❗ Current head d16da78 differs from pull request most recent head 28ede56. Consider uploading reports for the commit 28ede56 to get more accurate results

Additional details and impacted files
```
@@               Coverage Diff               @@
##             branch-23.04   #12688   +/-   ##
===============================================
  Coverage                ?   85.81%
===============================================
  Files                   ?      158
  Lines                   ?    25153
  Branches                ?        0
===============================================
  Hits                    ?    21586
  Misses                  ?     3567
  Partials                ?        0
```


☔ View full report at Codecov.

@karthikeyann added the 3 - Ready for Review (Ready for review by team) label and removed the 2 - In Progress (Currently a work in progress) label on Feb 3, 2023
bdice (Contributor) left a comment:

Looks good. Do you have performance data for different chunk sizes?

karthikeyann (Contributor, Author)

[Screenshot: benchmark results comparing JSON writer performance across chunk sizes]

bdice (Contributor) commented on Feb 3, 2023

@karthikeyann Is there a reason we don't use a larger chunk size? It appears that chunk size 512k performance is better than chunk size 256k. Are there other constraints like memory usage that balance against runtime performance? (Are "chunks" needed at all?)

karthikeyann (Contributor, Author) commented on Feb 3, 2023

@bdice Yes, memory usage is the major concern. Chunking is not strictly needed: if we don't pass a chunk size, the writer converts the entire table into a single string, which incurs high memory usage (and may fail). I used 256K just to be safe.
To use rows_per_chunk effectively, we would need to measure free memory and estimate the chunk size from it and from the number of columns (child and non-child). Even then it is only an approximation; a sketch of that estimation follows the screenshot below.

For example, in these perf measurements I used a 2-column dataframe:

  • Without a chunk size, writes fail above 128M rows (>2 GB of JSON).
  • With a 256K chunk size, writes fail above 512M rows (>8 GB of JSON).

Using 512K would be fine too, but I am keeping rows_per_chunk small so that tables with many columns do not exceed the memory limit.

[Screenshot: benchmark results for writing a 2-column dataframe at various rows_per_chunk values]
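
A hedged sketch of the estimation described above, assuming the standard cudaMemGetInfo query for free device memory; the bytes-per-cell figure and the headroom fraction are illustrative assumptions, not values from this PR.

```cpp
// Hedged sketch of estimating rows_per_chunk from free device memory and the
// column count (child and non-child), as described above. The bytes-per-cell
// figure and the headroom fraction are illustrative assumptions.
#include <algorithm>
#include <cstddef>
#include <cuda_runtime_api.h>

std::size_t estimate_rows_per_chunk(std::size_t num_columns,
                                    std::size_t bytes_per_cell = 100,  // assumed avg serialized size
                                    double headroom = 0.25)            // leave memory for other buffers
{
  if (num_columns == 0) return 1;
  std::size_t free_bytes  = 0;
  std::size_t total_bytes = 0;
  cudaMemGetInfo(&free_bytes, &total_bytes);
  // Each chunk is materialized as an intermediate strings column before being
  // written out, so its footprint scales with rows * columns.
  auto const budget = static_cast<std::size_t>(free_bytes * headroom);
  return std::max<std::size_t>(1, budget / (num_columns * bytes_per_cell));
}
```

As the comment says, any such estimate remains an approximation; the actual per-row size depends on the data.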

bdice (Contributor) commented on Feb 3, 2023

Interesting! Let's make sure that there is guidance in the docs about this: rows_per_chunk should be decreased if writes fail due to running out of memory.
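
One hedged way a caller could apply that guidance, sketched below; `write_chunked` is a hypothetical stand-in for the real writer entry point, and the catch clause assumes the allocator's out-of-memory error derives from std::bad_alloc (as RMM's does).

```cpp
// Hedged sketch of the guidance above: halve rows_per_chunk and retry when the
// write runs out of device memory. `write_chunked` is a hypothetical stand-in
// for the real writer call.
#include <cstddef>
#include <functional>
#include <new>
#include <stdexcept>

void write_with_backoff(std::function<void(std::size_t)> const& write_chunked,
                        std::size_t rows_per_chunk = 262'144)
{
  while (rows_per_chunk >= 1) {
    try {
      write_chunked(rows_per_chunk);
      return;
    } catch (std::bad_alloc const&) {
      rows_per_chunk /= 2;  // smaller chunks -> smaller intermediate buffers
    }
  }
  throw std::runtime_error("JSON write failed even with rows_per_chunk == 1");
}
```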

bdice (Contributor) commented on Feb 3, 2023

See also this related request for CSV writing: #12690 (comment)

```cpp
json_writer_options const& options,
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr)
{
  CUDF_EXPECTS(str_column_view.size() > 0, "Unexpected empty strings column.");

  string_scalar d_line_terminator{line_terminator};
```
A reviewer (Contributor) commented on this excerpt:

So this was unused before?

karthikeyann (Contributor, Author) replied:

It was used before too; instead of passing a std::string as the argument, a string_scalar is now passed.
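
For readers unfamiliar with the type, a brief sketch of the distinction: cudf::string_scalar copies its contents into device memory at construction, so building it once per write lets every chunk reuse the same device-side separator.

```cpp
// Sketch of the distinction under discussion: a cudf::string_scalar owns a
// device-side copy of its contents, so constructing it once per write lets
// every chunk reuse the same device buffer instead of re-transferring a host
// std::string for each chunk.
#include <cudf/scalar/scalar.hpp>

void example()
{
  cudf::string_scalar d_line_terminator{"\n"};  // host literal copied to device once
  // ... reused by the concatenation step for every chunk of the write ...
}
```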

@karthikeyann karthikeyann requested a review from ttnghia February 13, 2023 06:58
karthikeyann (Contributor, Author) commented:

/merge

@rapids-bot rapids-bot bot merged commit 53183cd into rapidsai:branch-23.04 Feb 13, 2023
@vyasr vyasr added 4 - Needs Review Waiting for reviewer to review or respond and removed 4 - Needs cuDF (Python) Reviewer labels Feb 23, 2024
Labels
  • 3 - Ready for Review: Ready for review by team
  • 4 - Needs Review: Waiting for reviewer to review or respond
  • bug: Something isn't working
  • cuIO: cuIO issue
  • libcudf: Affects libcudf (C++/CUDA) code
  • non-breaking: Non-breaking change
  • Performance: Performance related issue
  • Python: Affects Python cuDF API
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Write JSON misses a comma every 8 records with lines=False
4 participants