
Fix missing trailing comma in json writer #12688

Merged: 4 commits merged into rapidsai:branch-23.04 on Feb 13, 2023

Conversation

karthikeyann (Contributor)

Description

Fixes the missing trailing comma in the JSON writer for the non-lines JSON format.
Also updates the default rows_per_chunk, because the previous default of 8 is too small and makes the writer slow.
closes #12687
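
As a point of reference for the fix, here is a minimal host-side sketch (in no way the libcudf implementation) of the failure mode: when lines=False output is written in chunks of rows_per_chunk records, a comma must be emitted between chunks as well as between the records inside a chunk.

```cpp
// Minimal host-side sketch (not the libcudf code) of the bug class fixed here:
// with lines=false, the writer emits the table as one JSON array in chunks of
// rows_per_chunk records, and a comma is also required *between* chunks.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

std::string write_json_array(std::vector<std::string> const& records,
                             std::size_t rows_per_chunk)
{
  std::string out = "[";
  for (std::size_t start = 0; start < records.size(); start += rows_per_chunk) {
    if (start > 0) out += ",";  // the inter-chunk separator this PR restores
    auto const end = std::min(start + rows_per_chunk, records.size());
    for (auto i = start; i < end; ++i) {
      if (i > start) out += ",";
      out += records[i];
    }
  }
  return out + "]";
}

int main()
{
  // With the old default rows_per_chunk of 8, omitting the inter-chunk comma
  // yielded ...{"a":7}{"a":8}... after every 8th record, as reported in #12687.
  std::vector<std::string> records;
  for (int i = 0; i < 20; ++i) { records.push_back("{\"a\":" + std::to_string(i) + "}"); }
  std::cout << write_json_array(records, 8) << "\n";
  return 0;
}
```

Dropping the `if (start > 0)` branch reproduces the invalid output reported in #12687.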

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@karthikeyann added labels on Feb 3, 2023: bug (Something isn't working), 2 - In Progress (Currently a work in progress), libcudf (Affects libcudf (C++/CUDA) code), Python (Affects Python cuDF API), 4 - Needs cuDF (Python) Reviewer, cuIO (cuIO issue), Performance (Performance related issue), non-breaking (Non-breaking change)
@karthikeyann karthikeyann requested review from a team as code owners February 3, 2023 08:26
codecov bot commented on Feb 3, 2023

Codecov Report

❗ No coverage uploaded for pull request base (branch-23.04@bad94b9).
Patch has no changes to coverable lines.

❗ Current head d16da78 differs from pull request most recent head 28ede56. Consider uploading reports for the commit 28ede56 to get more accurate results

Additional details and impacted files
```
@@               Coverage Diff               @@
##             branch-23.04   #12688   +/-   ##
===============================================
  Coverage                ?   85.81%
===============================================
  Files                   ?      158
  Lines                   ?    25153
  Branches                ?        0
===============================================
  Hits                    ?    21586
  Misses                  ?     3567
  Partials                ?        0
```


☔ View full report at Codecov.

@karthikeyann added the 3 - Ready for Review (Ready for review by team) label and removed the 2 - In Progress (Currently a work in progress) label on Feb 3, 2023
bdice (Contributor) left a comment:

Looks good. Do you have performance data for different chunk sizes?

karthikeyann (Contributor, Author)

[Screenshot: benchmark results comparing JSON writer performance across chunk sizes]

bdice (Contributor) commented on Feb 3, 2023

@karthikeyann Is there a reason we don't use a larger chunk size? It appears that chunk size 512k performance is better than chunk size 256k. Are there other constraints like memory usage that balance against runtime performance? (Are "chunks" needed at all?)

karthikeyann (Contributor, Author) commented on Feb 3, 2023

@bdice Yes, memory usage is the major concern. Chunking is not strictly needed: if we don't pass a chunk size, the writer converts the entire table into a single string, which incurs high memory usage (and may fail). I used 256K just to be safe.
To use rows_per_chunk effectively, we would need to measure free memory and estimate the chunk size from it and from the number of columns (child and non-child). Even then it is only an approximation; a sketch of that estimation follows the screenshot below.

For example, in these perf measurements I used a 2-column dataframe:

  • Without a chunk size, writes fail above 128M rows (>2 GB of JSON).
  • With a 256K chunk size, writes fail above 512M rows (>8 GB of JSON).

Using 512K would be fine too, but I am keeping rows_per_chunk small so that tables with many columns do not exceed the memory limit.

[Screenshot: benchmark results for writing a 2-column dataframe at various rows_per_chunk values]
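
A hedged sketch of the estimation described above, assuming the standard cudaMemGetInfo query for free device memory; the bytes-per-cell figure and the headroom fraction are illustrative assumptions, not values from this PR.

```cpp
// Hedged sketch of estimating rows_per_chunk from free device memory and the
// column count (child and non-child), as described above. The bytes-per-cell
// figure and the headroom fraction are illustrative assumptions.
#include <algorithm>
#include <cstddef>
#include <cuda_runtime_api.h>

std::size_t estimate_rows_per_chunk(std::size_t num_columns,
                                    std::size_t bytes_per_cell = 100,  // assumed avg serialized size
                                    double headroom = 0.25)            // leave memory for other buffers
{
  if (num_columns == 0) return 1;
  std::size_t free_bytes  = 0;
  std::size_t total_bytes = 0;
  cudaMemGetInfo(&free_bytes, &total_bytes);
  // Each chunk is materialized as an intermediate strings column before being
  // written out, so its footprint scales with rows * columns.
  auto const budget = static_cast<std::size_t>(free_bytes * headroom);
  return std::max<std::size_t>(1, budget / (num_columns * bytes_per_cell));
}
```

As the comment says, any such estimate remains an approximation; the actual per-row size depends on the data.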

bdice (Contributor) commented on Feb 3, 2023

Interesting! Let's make sure that there is guidance in the docs about this: rows_per_chunk should be decreased if writes fail due to running out of memory.
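
One hedged way a caller could apply that guidance, sketched below; `write_chunked` is a hypothetical stand-in for the real writer entry point, and the catch clause assumes the allocator's out-of-memory error derives from std::bad_alloc (as RMM's does).

```cpp
// Hedged sketch of the guidance above: halve rows_per_chunk and retry when the
// write runs out of device memory. `write_chunked` is a hypothetical stand-in
// for the real writer call.
#include <cstddef>
#include <functional>
#include <new>
#include <stdexcept>

void write_with_backoff(std::function<void(std::size_t)> const& write_chunked,
                        std::size_t rows_per_chunk = 262'144)
{
  while (rows_per_chunk >= 1) {
    try {
      write_chunked(rows_per_chunk);
      return;
    } catch (std::bad_alloc const&) {
      rows_per_chunk /= 2;  // smaller chunks -> smaller intermediate buffers
    }
  }
  throw std::runtime_error("JSON write failed even with rows_per_chunk == 1");
}
```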

bdice (Contributor) commented on Feb 3, 2023

See also this related request for CSV writing: #12690 (comment)

```cpp
json_writer_options const& options,
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr)
{
  CUDF_EXPECTS(str_column_view.size() > 0, "Unexpected empty strings column.");

  string_scalar d_line_terminator{line_terminator};
```
A reviewer (Contributor) commented on this excerpt:

So this was unused before?

karthikeyann (Contributor, Author) replied:

It was used before too; instead of passing a std::string as the argument, a string_scalar is now passed.
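
For readers unfamiliar with the type, a brief sketch of the distinction: cudf::string_scalar copies its contents into device memory at construction, so building it once per write lets every chunk reuse the same device-side separator.

```cpp
// Sketch of the distinction under discussion: a cudf::string_scalar owns a
// device-side copy of its contents, so constructing it once per write lets
// every chunk reuse the same device buffer instead of re-transferring a host
// std::string for each chunk.
#include <cudf/scalar/scalar.hpp>

void example()
{
  cudf::string_scalar d_line_terminator{"\n"};  // host literal copied to device once
  // ... reused by the concatenation step for every chunk of the write ...
}
```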

@karthikeyann karthikeyann requested a review from ttnghia February 13, 2023 06:58
karthikeyann (Contributor, Author) commented:

/merge

@rapids-bot rapids-bot bot merged commit 53183cd into rapidsai:branch-23.04 Feb 13, 2023
@vyasr vyasr added 4 - Needs Review Waiting for reviewer to review or respond and removed 4 - Needs cuDF (Python) Reviewer labels Feb 23, 2024
Labels
  • 3 - Ready for Review: Ready for review by team
  • 4 - Needs Review: Waiting for reviewer to review or respond
  • bug: Something isn't working
  • cuIO: cuIO issue
  • libcudf: Affects libcudf (C++/CUDA) code
  • non-breaking: Non-breaking change
  • Performance: Performance related issue
  • Python: Affects Python cuDF API
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Write JSON misses a comma every 8 records with lines=False
4 participants