add option to nullify empty lines #17028
struct TransduceToken {
  bool nullify_empty_lines;
Does this imply any performance hit? Please run the benchmarks with this change. If there is any slowdown, we probably need to make this a template argument (at the cost of compile time) so that the code can be optimized out when it is false.
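To make the suggestion concrete, here is a minimal host-side sketch of the two alternatives: a runtime member flag versus a `bool` template argument resolved with `if constexpr`. The names `RuntimeFlagCounter` and `TemplateFlagCounter` are hypothetical and the row-counting logic is a stand-in; this is not the `TransduceToken` code from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for the functor under review (not the cuDF code).
// Runtime variant: the flag is a data member and is checked on every call.
struct RuntimeFlagCounter {
  bool nullify_empty_lines;

  // Returns how many rows a single line contributes.
  std::size_t operator()(std::string_view line) const
  {
    if (line.empty()) { return nullify_empty_lines ? 1 : 0; }
    return 1;
  }
};

// Compile-time variant: the flag is a template argument, so the branch that
// is not selected is discarded when the template is instantiated.
template <bool NullifyEmptyLines>
struct TemplateFlagCounter {
  std::size_t operator()(std::string_view line) const
  {
    if constexpr (NullifyEmptyLines) {
      return 1;  // every line, empty or not, contributes one row
    } else {
      return line.empty() ? 0 : 1;  // empty lines contribute nothing
    }
  }
};
```

The trade-off is the one named in the comment: each instantiation is compiled separately, which costs compile time, and the caller has to dispatch on the runtime option once to pick an instantiation.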
@@ -73,5 +74,9 @@ table_with_metadata read_json(host_span<std::unique_ptr<datasource>> sources,
rmm::cuda_stream_view stream,
rmm::device_async_resource_ref mr);

std::tuple<rmm::device_buffer, char> preprocess(cudf::strings_column_view const& input,
I see this function is called only in tests. Do we ever need it anywhere else in the source code? If not, can we generate the test string directly without it?
Yes, you can test without this function. The idea is that each string row is followed by one delimiter character that is not present in any of the strings. This function was provided by @shrshi so that you can easily convert a strings column into an rmm buffer plus its delimiter.
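As an illustration of that idea, the following host-side sketch joins rows into one buffer, following each row with a delimiter byte that occurs in none of them. It is an analogue only: `join_with_unused_delimiter` is a hypothetical name, it works on `std::string` instead of a `cudf::strings_column_view` and `rmm::device_buffer`, and the rule for picking the delimiter (prefer `'\n'`, otherwise the first unused non-zero byte) is an assumption, not necessarily what `preprocess` does.

```cpp
#include <array>
#include <cassert>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical host-side analogue of the helper discussed above.
// Returns the joined buffer and the delimiter, or std::nullopt if every
// non-zero byte value already occurs somewhere in the rows.
std::optional<std::pair<std::string, char>> join_with_unused_delimiter(
  std::vector<std::string> const& rows)
{
  // Record which byte values appear in any row.
  std::array<bool, 256> used{};
  for (auto const& row : rows) {
    for (unsigned char c : row) { used[c] = true; }
  }

  // Assumed selection rule: prefer '\n', else the first unused non-zero byte.
  int delimiter = -1;
  if (!used[static_cast<unsigned char>('\n')]) {
    delimiter = '\n';
  } else {
    for (int c = 1; c < 256; ++c) {
      if (!used[c]) {
        delimiter = c;
        break;
      }
    }
  }
  if (delimiter < 0) { return std::nullopt; }

  // Append each row followed by exactly one delimiter.
  std::string buffer;
  for (auto const& row : rows) {
    buffer += row;
    buffer += static_cast<char>(delimiter);
  }
  return std::make_pair(std::move(buffer), static_cast<char>(delimiter));
}
```

With this layout an empty row shows up as two consecutive delimiters, which is the case the new option is meant to handle.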
…n/cudf into enh-json_nullify_empty_lines
This PR is waiting for #17178 to be resolved.
Closed since the performance improvement of
Description
This PR adds an option to nullify empty lines. In the pandas JSON reader, empty lines are ignored, but for Spark each empty line still needs to produce a null row. This option enables that behavior only when the recovery mode RECOVER_WITH_NULL is used.
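The difference between the two behaviors can be sketched on the host with plain standard-library types. The function `split_rows` below is hypothetical and only models row boundaries; it does not parse JSON, it treats only zero-length lines as empty (whitespace-only lines are an open question here), and it is not the cuDF reader.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical model of the row semantics described above.
// nullify_empty_lines == false: empty lines are skipped (pandas-like).
// nullify_empty_lines == true:  empty lines become null rows (Spark-like).
std::vector<std::optional<std::string>> split_rows(std::string_view text,
                                                   bool nullify_empty_lines)
{
  std::vector<std::optional<std::string>> rows;
  std::size_t begin = 0;
  while (begin < text.size()) {
    std::size_t end = text.find('\n', begin);
    if (end == std::string_view::npos) { end = text.size(); }
    std::string_view line = text.substr(begin, end - begin);
    if (!line.empty()) {
      rows.emplace_back(std::string(line));  // regular row
    } else if (nullify_empty_lines) {
      rows.emplace_back(std::nullopt);  // empty line kept as a null row
    }
    begin = end + 1;  // continue after the newline
  }
  return rows;
}
```

For the input `{"a":1}`, an empty line, then `{"a":2}`, the sketch yields two rows with the flag off and three rows (the middle one null) with the flag on.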
TODO: unit tests.
Checklist