Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support nanosecond timestamps in parquet #10063

Merged

Conversation

PointKernel
Copy link
Member

@PointKernel PointKernel commented Jan 17, 2022

Closes #9393

This PR was intended to support nanoseconds for both duration and timestamp types in parquet. It introduces LogicalType-handling on both reader and writer sides. This PR also includes code cleanups like moving CompactProtocolReader to its own file. Finally, nanosecond durations remain unchanged since it's not fully supported by pyarrow i.e. nanosecond durations are truncated to microseconds (see here).

@PointKernel PointKernel self-assigned this Jan 17, 2022
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Jan 17, 2022
@PointKernel PointKernel added non-breaking Non-breaking change 2 - In Progress Currently a work in progress cuIO cuIO issue feature request New feature or request labels Jan 17, 2022
@github-actions
Copy link

github-actions bot commented Mar 4, 2022

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

@PointKernel PointKernel force-pushed the parquet-nanosecond-support branch from 6af6cc7 to 2467672 Compare March 8, 2022 19:52
@PointKernel PointKernel force-pushed the parquet-nanosecond-support branch from 98d241b to 3113ede Compare March 8, 2022 19:55
@github-actions github-actions bot added the Python Affects Python cuDF API. label Mar 8, 2022
@github-actions github-actions bot removed the Python Affects Python cuDF API. label Mar 8, 2022
@PointKernel PointKernel requested review from a team as code owners March 12, 2022 00:43
@PointKernel PointKernel added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Mar 12, 2022
Copy link
Contributor

@devavret devavret left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Only one blocking concern about statistics.

cpp/src/io/parquet/writer_impl.cu Show resolved Hide resolved
cpp/src/io/parquet/reader_impl.cu Outdated Show resolved Hide resolved
Copy link
Contributor

@robertmaynard robertmaynard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CMake changes LGTM

@PointKernel PointKernel requested review from devavret and vuule March 16, 2022 16:32
@PointKernel
Copy link
Member Author

@vuule requesting your review as well.
I've touched the common statistics calculation by adding an int96_timestamps argument. It's used to distinguish the int96 and int64 conversion map where int96 nanoseconds are converted to microseconds but not int64.

@codecov
Copy link

codecov bot commented Mar 16, 2022

Codecov Report

Merging #10063 (dc3cdb8) into branch-22.04 (4596244) will decrease coverage by 0.12%.
The diff coverage is 100.00%.

@@               Coverage Diff                @@
##           branch-22.04   #10063      +/-   ##
================================================
- Coverage         86.13%   86.01%   -0.13%     
================================================
  Files               139      139              
  Lines             22438    22435       -3     
================================================
- Hits              19328    19298      -30     
- Misses             3110     3137      +27     
Impacted Files Coverage Δ
python/cudf/cudf/core/tools/numeric.py 89.24% <100.00%> (+0.11%) ⬆️
python/dask_cudf/dask_cudf/backends.py 86.44% <100.00%> (+1.47%) ⬆️
...ython/dask_cudf/dask_cudf/io/tests/test_parquet.py 100.00% <100.00%> (ø)
...thon/dask_cudf/dask_cudf/tests/test_distributed.py 18.86% <0.00%> (-67.93%) ⬇️
python/cudf/cudf/core/column/decimal.py 91.30% <0.00%> (-1.01%) ⬇️
python/cudf/cudf/core/column/timedelta.py 90.49% <0.00%> (-0.90%) ⬇️
python/cudf/cudf/core/column/datetime.py 88.55% <0.00%> (-0.52%) ⬇️
python/cudf/cudf/core/column/categorical.py 89.63% <0.00%> (-0.29%) ⬇️
python/cudf/cudf/core/_base_index.py 86.21% <0.00%> (-0.22%) ⬇️
python/cudf/cudf/core/dataframe.py 93.57% <0.00%> (ø)
... and 11 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d4ce5d5...dc3cdb8. Read the comment docs.

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, especially changes to compact protocol RW.
Couple of nitpicks and a suspected bug:

cpp/src/io/statistics/column_statistics.cuh Show resolved Hide resolved
cpp/src/io/statistics/parquet_column_statistics.cu Outdated Show resolved Hide resolved
cpp/src/io/statistics/orc_column_statistics.cu Outdated Show resolved Hide resolved
cpp/src/io/statistics/statistics_type_identification.cuh Outdated Show resolved Hide resolved
cpp/src/io/statistics/statistics.cuh Show resolved Hide resolved
cpp/src/io/parquet/reader_impl.cu Outdated Show resolved Hide resolved
@PointKernel PointKernel requested a review from vuule March 18, 2022 19:51
@PointKernel
Copy link
Member Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 40baeb4 into rapidsai:branch-22.04 Mar 21, 2022
rapids-bot bot pushed a commit that referenced this pull request Mar 22, 2022
…0483)

Includes `<cstddef>` for `ptrdiff_t` in `parquet/compact_protocol_reader.hpp`. Compilation fails on GCC 11 without this include. Targeting 22.04 since this was broken yesterday in #10063.

Error output:
```
cudf/cpp/src/io/parquet/compact_protocol_reader.hpp:51:17: error: 'ptrdiff_t' does not name a type
   51 |   [[nodiscard]] ptrdiff_t bytecount() const noexcept { return m_cur - m_base; }
      |
cudf/cpp/src/io/parquet/compact_protocol_reader.hpp:22:1: note: 'ptrdiff_t' is defined in header '<cstddef>'; did you forget to '#include <cstddef>'?
```

Also includes `<optional>` in `cpp/include/cudf/table/experimental/row_operators.cuh`, which was broken by #10164.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Conor Hoekstra (https://github.com/codereport)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #10483
@PointKernel PointKernel deleted the parquet-nanosecond-support branch May 26, 2022 17:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Parquet reader produces incorrect timestamps
5 participants