Expand CSV and JSON reader APIs to accept dtypes as a vector or map of data_type objects #8856

Merged (30 commits) on Aug 4, 2021

@vuule vuule commented Jul 26, 2021

The goal of this PR is to enable the CSV reader to read columns as decimal, and to replace the string-based dtype part of the API. A data_type-based API is needed because decimal columns require a scale, and specifying it via a string that describes the type is 💩

Changes in the PR:

  • Added overloads to dtype related getters/setters to also take a vector or a map of data_type objects.
    In the case of CSV, a vector of data_types was already supported. Reworked the implementation to support the use cases that the "dtype-as-string" code path supports.
  • Fixed naming of compression option setter.
  • Added parse_hex option to make up for the special strings that CSV supported to denote that a column needs to be parsed as hexadecimal (the option to pass strings is to be removed).
  • Changed naming of infer_date option to parse_dates.
  • Updated all CSV and JSON tests to use the new APIs.
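
To make the vector-or-map idea concrete, here is a minimal sketch, not the libcudf code. `type_id`, `data_type`, `dtype_spec`, `overloaded` and `resolve_types` are hypothetical stand-ins defined only for illustration, and the fallback used for columns missing from the map is an assumption of the sketch. It shows why a data_type object works where a type string struggles: the decimal scale travels with the type.

```cpp
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-ins for illustration; these are NOT the libcudf definitions.
enum class type_id { INT32, FLOAT64, STRING, DECIMAL64 };

struct data_type {
  type_id id;
  int32_t scale = 0;  // only meaningful for decimal types
};

// Column types given either by position (vector) or by column name (map).
using dtype_spec = std::variant<std::vector<data_type>, std::map<std::string, data_type>>;

// Helper that merges several lambdas into a single visitor for std::visit.
template <typename... Ts>
struct overloaded : Ts... {
  using Ts::operator()...;
};
template <typename... Ts>
overloaded(Ts...) -> overloaded<Ts...>;

// Resolve one type per column. Columns missing from a map get `fallback`
// (an assumption of this sketch, not a statement about the real reader).
std::vector<data_type> resolve_types(dtype_spec const& spec,
                                     std::vector<std::string> const& column_names,
                                     data_type fallback)
{
  return std::visit(
    overloaded{[&](std::vector<data_type> const& by_index) {
                 if (by_index.size() != column_names.size()) {
                   throw std::invalid_argument("expected one data_type per column");
                 }
                 return by_index;
               },
               [&](std::map<std::string, data_type> const& by_name) {
                 std::vector<data_type> out;
                 out.reserve(column_names.size());
                 for (auto const& name : column_names) {
                   auto const it = by_name.find(name);
                   out.push_back(it != by_name.end() ? it->second : fallback);
                 }
                 return out;
               }},
    spec);
}
```

With columns `id` and `price`, passing a map that contains only `price` as `data_type{type_id::DECIMAL64, -2}` resolves `id` to the fallback type and `price` to a decimal with scale -2.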

Breaking because the API to specify date columns has been renamed from infer_date to parse_dates, to match the new parse_hex API.
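
For illustration only, a per-column hex flag could be modeled roughly as below. `parse_flags` and `parse_integer_field` are hypothetical names, not the libcudf API, and the sketch parses on the host with `std::stoll` instead of on the GPU. The idea is that "parse as hex" becomes a separate per-column option rather than a special dtype string.

```cpp
#include <cstdint>
#include <string>
#include <unordered_set>

// Hypothetical model of per-column parse options; NOT the libcudf API.
struct parse_flags {
  std::unordered_set<std::string> hex_columns;  // columns to read as hexadecimal
};

// Parse one integer field, using base 16 when the column is flagged as hex.
// std::stoll throws std::invalid_argument if no digits can be parsed.
int64_t parse_integer_field(parse_flags const& flags,
                            std::string const& column,
                            std::string const& field)
{
  int const base = flags.hex_columns.count(column) != 0 ? 16 : 10;
  return std::stoll(field, nullptr, base);
}
```

The column's data_type stays a plain integer type either way; only the text-to-value conversion changes.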

Depends on #8843

@vuule vuule added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue labels Jul 26, 2021
@vuule vuule self-assigned this Jul 26, 2021
@github-actions github-actions bot added the Python Affects Python cuDF API. label Jul 26, 2021
@vuule vuule added non-breaking Non-breaking change breaking Breaking change and removed non-breaking Non-breaking change labels Jul 26, 2021
@vuule vuule added this to the IO Data Type Expansion milestone Jul 26, 2021
@@ -464,47 +475,71 @@ void reader::impl::set_column_names(device_span<uint64_t const> rec_starts,
}
}

-void reader::impl::set_data_types(device_span<uint64_t const> rec_starts,
-                                  rmm::cuda_stream_view stream)
+std::vector<data_type> reader::impl::parse_data_types(

The diff is very confusing here. I did not rename the function; I added parse_data_types to handle the deprecated API that takes strings. This will make the future removal cleaner.
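
A rough sketch of what keeping the string path isolated might look like. The body below is hypothetical: the accepted type names and the stand-in `type_id` and `data_type` types are assumptions, not the actual reader code. The point is only that the string-to-data_type conversion lives in one function that can be deleted together with the deprecated API.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for illustration; these are NOT the libcudf definitions.
enum class type_id { INT32, INT64, FLOAT64, STRING };

struct data_type {
  type_id id;
  int32_t scale = 0;  // only meaningful for decimal types
};

// Legacy path only: convert type names into data_type objects.
// A plain type name has no room for a decimal scale, so the table
// below deliberately has no decimal entries.
std::vector<data_type> parse_data_types(std::vector<std::string> const& type_names)
{
  // The accepted names are made up for this sketch.
  static std::unordered_map<std::string, type_id> const known{
    {"int32", type_id::INT32},
    {"int64", type_id::INT64},
    {"float64", type_id::FLOAT64},
    {"str", type_id::STRING}};

  std::vector<data_type> out;
  out.reserve(type_names.size());
  for (auto const& name : type_names) {
    auto const it = known.find(name);
    if (it == known.end()) {
      throw std::invalid_argument("unsupported dtype string: " + name);
    }
    out.push_back(data_type{it->second});
  }
  return out;
}
```

Everything downstream of this function would then work on data_type objects only, regardless of which API the caller used.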

@vuule vuule requested a review from rgsl888prabhu August 2, 2021 19:10
@vuule vuule requested a review from elstehle August 3, 2021 05:28
@elstehle elstehle left a comment

First high-level round of review. Overall looks good. Just found a few minor issues so far.

@vuule vuule requested a review from elstehle August 3, 2021 18:55
@github-actions github-actions bot added the conda label Aug 3, 2021
@rgsl888prabhu rgsl888prabhu left a comment

The rest looks good; just one small suggestion.

@vuule vuule requested a review from a team as a code owner August 4, 2021 19:03
@elstehle elstehle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

libcudf changes LGTM 👍

@vuule vuule added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 4 - Needs cuIO Reviewer labels Aug 4, 2021
vuule commented Aug 4, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit cc2f192 into rapidsai:branch-21.10 Aug 4, 2021
@vuule vuule deleted the fea-csv-dtypes-api branch August 4, 2021 22:38
rapids-bot bot pushed a commit that referenced this pull request Aug 6, 2021
Found a compile error in `csv_test.cpp` that only occurs with a debug build because it is inside an `assert()` statement.
Looks like this was introduced in PR #8856

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #8981