Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Add write_json method to cudf/io/json.hpp #11165

Closed
dagardner-nv opened this issue Jun 28, 2022 · 6 comments · Fixed by #12474
Closed

[FEA] Add write_json method to cudf/io/json.hpp #11165

dagardner-nv opened this issue Jun 28, 2022 · 6 comments · Fixed by #12474
Assignees
Labels
0 - Backlog In queue waiting for assignment cuIO cuIO issue feature request New feature or request

Comments

@dagardner-nv
Copy link
Contributor

Is your feature request related to a problem? Please describe.
Currently the only method for serializing a cudf dataframe to json requires acquiring the Python GIL.

cudf/io/json.hpp contains a read_json method, and the corresponding csv.hpp header contains both read_csv and write_csv methods.

Describe the solution you'd like
Add a write_json method which is callable from C++ and does not require the Python GIL.

Describe alternatives you've considered
Currently we have to use the Python API from C++:
https://github.com/nv-morpheus/Morpheus/blob/branch-22.08/morpheus/_lib/src/io/serializers.cpp#L109

Additional context
This would be a big performance win for us to not require the GIL in our serialize stage.

@dagardner-nv dagardner-nv added Needs Triage Need team to review and classify feature request New feature or request labels Jun 28, 2022
@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@GregoryKimball
Copy link
Contributor

GregoryKimball commented Oct 21, 2022

Thank you @dagardner-nv for sharing this request. Currently our main JSON focus is on reading arbitrarily-nested data types (Nested JSON reader).

Writing JSON is complex because there are several common ways to represent tabular data in JSON. Would you please review the orient argument in pandas.read_json and share which convention would match your use case?

EDIT: I see from your link that you are writing JSON Lines as records. Thank you!

@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment cuIO cuIO issue and removed Needs Triage Need team to review and classify inactive-30d labels Oct 21, 2022
@dagardner-nv
Copy link
Contributor Author

@GregoryKimball thanks, and yes for the purposes of Morpheus most of the time by the time we get around to writing it the data would be flat.

I wonder if it could be possible to do something like:

if is_simple(df):
    use_cpp_impl(df)
else:
    # Use current impl

@GregoryKimball
Copy link
Contributor

Hello @chinmaychandak, would this to_json feature be useful for your work in Kratos?

@chinmaychandak
Copy link
Contributor

chinmaychandak commented Dec 2, 2022

Thanks for pointing me to this @GregoryKimball!

Yes, to_json without acquiring the Python GIL would be super amazing for both overall performance like David mentioned as well as for streaming computations involving multiple layers of transformations and while writing serialized data (JSON Lines, mostly, in our case) back to sinks like Kafka/S3 from cuStreamz pipelines.

@jsmaupin - anything to add?

@jsmaupin
Copy link

jsmaupin commented Dec 2, 2022

Nothing to add from me. I agree with @chinmaychandak 's comments.

@GregoryKimball GregoryKimball moved this to 23.02 in libcudf Jan 6, 2023
@GregoryKimball GregoryKimball moved this from 23.02 to 23.04 in libcudf Jan 6, 2023
@GregoryKimball GregoryKimball removed the status in libcudf Jan 9, 2023
@GregoryKimball GregoryKimball moved this to Burndown PRs in libcudf Jan 9, 2023
@karthikeyann karthikeyann mentioned this issue Jan 25, 2023
3 tasks
rapids-bot bot pushed a commit that referenced this issue Jan 28, 2023
Adds JSON writer with nested support.
It supports numeric, datetime, duration, strings,  nested types such as struct and list types.
`orient='records'` is only supported now, with `lines=True/False`.
Usage: `df.to_json(engine='cudf')`

closes #11165

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - David Wendt (https://github.com/davidwendt)
  - Michael Wang (https://github.com/isVoid)
  - Robert Maynard (https://github.com/robertmaynard)

URL: #12474
@GregoryKimball GregoryKimball removed the status in libcudf Apr 4, 2023
@GregoryKimball GregoryKimball removed this from libcudf Jun 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Backlog In queue waiting for assignment cuIO cuIO issue feature request New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants