Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup common parsing code in JSON, CSV reader #12022

Merged

Conversation

karthikeyann
Copy link
Contributor

@karthikeyann karthikeyann commented Oct 28, 2022

Description

This PR will cleanup nested json reader and csv reader's common parsing code.

  • Uses std::optional for indicating parsing failure in parse_numeric
  • Cleanup
    • Removed decode_value as it only gives only specialization for timestamp and duration types, rest of types are passthrough.
    • Unified decode_digit

Depends on #11898 and #12021

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@karthikeyann karthikeyann added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. 4 - Needs Review Waiting for reviewer to review or respond non-breaking Non-breaking change labels Oct 28, 2022
@karthikeyann karthikeyann self-assigned this Oct 28, 2022
@github-actions github-actions bot added the Python Affects Python cuDF API. label Oct 28, 2022
@karthikeyann karthikeyann requested a review from vuule November 9, 2022 14:36
}
if (exponent != 0) { value *= exp10(double(exponent * exponent_sign)); }
}
}
if (!all_digits_valid) { return error_result; }
if (!all_digits_valid) { return {}; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's be more explicit here. Current code could be mistaken for returning a zero.

Suggested change
if (!all_digits_valid) { return {}; }
if (!all_digits_valid) { return std::nullopt; }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compiles locally, but CI build gives this error
04:37:13 $SRC_DIR/cpp/src/io/utilities/parsing_utils.cuh(240): error #20014-D: calling a __host__ function from a __host__ __device__ function is not allowed

The constructor from nullopt_t arg might be a host function. not constexpr.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karthikeyann Hi,I also encountered this problem, did you solve this problem?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. By callling default constructor of std::optional
if (!all_digits_valid) { return std::optional<T>{}; }

@vuule
Copy link
Contributor

vuule commented Nov 9, 2022

CC @galipremsagar for viz and Python perspective on this behavior change

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few small suggestions.
It's not obvious at first glance what the desired behavior is, especially since Pandas uses NaN (for it's own reasons). The main benefit of using null here is that we used to return zeroes for invalid integers :( Not a great stand-in for NaN.

cpp/src/io/utilities/parsing_utils.cuh Outdated Show resolved Hide resolved
cpp/tests/io/csv_test.cpp Outdated Show resolved Hide resolved
cpp/tests/io/json_test.cpp Outdated Show resolved Hide resolved
cpp/tests/io/json_test.cpp Outdated Show resolved Hide resolved
@github-actions github-actions bot removed the Python Affects Python cuDF API. label Nov 14, 2022
@karthikeyann
Copy link
Contributor Author

0e45d13
This commit reverts change of behaviour of nested json reader to return null element on parsing error of numeric types.
Breaking behaviour in read_csv, and read_json:

  • old behaviour: parsing error returns quiet NAN for that type as result for that row.
  • new behaviour: parsing error returns null for that row.

pushing this breaking change to next release.

@karthikeyann karthikeyann changed the title Null element for parsing error in numeric types in JSON, CSV reader Cleanup common parsing code in JSON, CSV reader Nov 15, 2022
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice improvement.

@@ -670,6 +597,7 @@ struct ConvertFunctor {
parse_options_view const& opts,
bool as_hex)
{
// TODO what's invalid input
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please make these TODOs clearer if we're planning to merge them?
I looked into the code and don't see that invalid input is handled in a robust way in parse_decimal and to_timestamp. Is that what this TODO needs to remind us of?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. this TODO is a reminder for future and also to remind reviewers to discuss and decide what is invalid input for this type.

@karthikeyann karthikeyann added non-breaking Non-breaking change and removed 3 - Ready for Review Ready for review by team 4 - Needs cuDF (Python) Reviewer 4 - Needs Review Waiting for reviewer to review or respond breaking Breaking change labels Nov 15, 2022
@karthikeyann
Copy link
Contributor Author

@gpucibot merge

1 similar comment
@karthikeyann
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit fd488cd into rapidsai:branch-22.12 Nov 15, 2022
rapids-bot bot pushed a commit that referenced this pull request Dec 5, 2022
…12272)

This PR changes behaviour of nested json reader to return null element on parsing error of numeric types.
Breaking behaviour in read_csv, and read_json:

- old behaviour: parsing error returns quiet NAN for that type as result for that row.
- new behaviour: parsing error returns null for that row.

Moved commit 0e45d13  #12022 (comment) to 23.02

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #12272
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants