Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix nullmask offset handling in parquet and orc writer #6889

Merged
merged 15 commits into from
Dec 13, 2020
Merged

Fix nullmask offset handling in parquet and orc writer #6889

merged 15 commits into from
Dec 13, 2020

Conversation

kaatish
Copy link
Contributor

@kaatish kaatish commented Dec 3, 2020

Fixes #6642

@kaatish kaatish requested a review from a team as a code owner December 3, 2020 18:41
@kaatish kaatish requested review from harrism and codereport December 3, 2020 18:41
@kaatish kaatish added cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change labels Dec 3, 2020
@kaatish kaatish requested review from vuule and devavret and removed request for harrism December 3, 2020 18:43
@kaatish kaatish added the bug Something isn't working label Dec 3, 2020
@vuule vuule requested a review from rgsl888prabhu December 3, 2020 21:55
}
bitmask_type mask =
__funnelshift_r(current_mask_word, next_mask_word, current_valid_offset);
valid = 0xff & mask;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to use existing function

__device__ bitmask_type get_mask_offset_word(bitmask_type const *__restrict__ source,
size_type destination_word_index,
size_type source_begin_bit,
size_type source_end_bit)
{
size_type source_word_index = destination_word_index + word_index(source_begin_bit);
bitmask_type curr_word = source[source_word_index];
bitmask_type next_word = 0;
if (word_index(source_end_bit) >
word_index(source_begin_bit +
destination_word_index * detail::size_in_bits<bitmask_type>())) {
next_word = source[source_word_index + 1];
}
return __funnelshift_r(curr_word, next_word, source_begin_bit);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be reused but it should be changed to an inline function and moved to a cuh file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it doesn't affect performance we can use reuse it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved the get_mask_offset_word function to column_device_view.cuh and inlined it so that it can be reused here.

cpp/src/io/orc/dict_enc.cu Outdated Show resolved Hide resolved
cpp/src/io/orc/dict_enc.cu Outdated Show resolved Hide resolved
cpp/src/io/orc/writer_impl.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/page_enc.cu Outdated Show resolved Hide resolved
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic of the change looks good to me, for the most part. Some minor suggestions, mostly related to the clean up of the existing code.

cpp/src/io/orc/dict_enc.cu Show resolved Hide resolved
cpp/src/io/orc/dict_enc.cu Show resolved Hide resolved
cpp/src/io/orc/stripe_enc.cu Outdated Show resolved Hide resolved
cpp/src/io/orc/writer_impl.cu Show resolved Hide resolved
cpp/src/io/orc/writer_impl.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/page_enc.cu Outdated Show resolved Hide resolved
cpp/src/io/statistics/column_stats.cu Outdated Show resolved Hide resolved
cpp/tests/io/orc_test.cpp Outdated Show resolved Hide resolved
cpp/tests/io/parquet_test.cpp Outdated Show resolved Hide resolved
cpp/tests/io/parquet_test.cpp Outdated Show resolved Hide resolved
(row + 32 < s->chunk.start_row + s->chunk.num_rows) ? valid_map[(row >> 5) + 1] : 0;
uint32_t v1 = (row + 32 < s->chunk.start_row + s->chunk.num_rows)
? valid_map[((row + s->chunk.column_offset) >> 5) + 1]
: 0;
v = __funnelshift_r(v, v1, row & 0x1f);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@devavret Shouldn't this be changed to
v = __funnelshift_r(v, v1, (row + s->chunk.column_offset) & 0x1f);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears so. Also it seems the wrapping (& 0x1f) is redundant?

@codecov
Copy link

codecov bot commented Dec 4, 2020

Codecov Report

Merging #6889 (d07f581) into branch-0.18 (df5d452) will increase coverage by 0.87%.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.18    #6889      +/-   ##
===============================================
+ Coverage        81.58%   82.46%   +0.87%     
===============================================
  Files               96       96              
  Lines            15920    16969    +1049     
===============================================
+ Hits             12989    13994    +1005     
- Misses            2931     2975      +44     
Impacted Files Coverage Δ
python/cudf/cudf/io/feather.py 100.00% <0.00%> (ø)
python/cudf/cudf/comm/serialize.py 0.00% <0.00%> (ø)
python/cudf/cudf/_fuzz_testing/io.py 0.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/_version.py 0.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/io/tests/test_csv.py 100.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/io/tests/test_orc.py 100.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/io/tests/test_json.py 100.00% <0.00%> (ø)
...ython/dask_cudf/dask_cudf/io/tests/test_parquet.py 100.00% <0.00%> (ø)
python/cudf/cudf/utils/applyutils.py 98.74% <0.00%> (+0.02%) ⬆️
python/cudf/cudf/core/join/join.py 92.44% <0.00%> (+0.03%) ⬆️
... and 35 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update df5d452...d07f581. Read the comment docs.

skirui-source and others added 2 commits December 4, 2020 20:22
Closes #5247

Adds `agg` function for DataFrame

Authors:
  - Sheilah Kirui <[email protected]>
  - Sheilah Kirui <[email protected]>
  - Michael Wang <[email protected]>
  - skirui-source <[email protected]>
  - galipremsagar <[email protected]>
  - GALI PREM SAGAR <[email protected]>
  - Keith Kraus <[email protected]>
  - Ashwin Srinath <[email protected]>

Approvers:
  - Michael Wang
  - Michael Wang
  - Keith Kraus

URL: #6483
@GPUtester
Copy link
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@kaatish kaatish requested review from a team as code owners December 4, 2020 21:12
@kaatish kaatish changed the title [WIP] Fix nullmask offset handling in parquet and orc writer [REVIEW] Fix nullmask offset handling in parquet and orc writer Dec 7, 2020
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like something went wrong with the merge, dataframe.py and test_dataframe.py come from a another PR

CHANGELOG.md Outdated Show resolved Hide resolved
cpp/src/io/statistics/column_stats.cu Outdated Show resolved Hide resolved
@vuule
Copy link
Contributor

vuule commented Dec 10, 2020

rerun tests

CHANGELOG.md Outdated Show resolved Hide resolved
@vuule vuule self-requested a review December 10, 2020 20:35
CHANGELOG.md Outdated Show resolved Hide resolved
}
bitmask_type mask =
__funnelshift_r(current_mask_word, next_mask_word, current_valid_offset);
valid = 0xff & mask;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it doesn't affect performance we can use reuse it.

@rgsl888prabhu rgsl888prabhu changed the title [REVIEW] Fix nullmask offset handling in parquet and orc writer Fix nullmask offset handling in parquet and orc writer Dec 10, 2020
@codereport codereport removed their request for review December 11, 2020 03:00
Copy link
Contributor

@rgsl888prabhu rgsl888prabhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest looks good to me, just had q small query

@harrism harrism removed the request for review from a team December 13, 2020 22:35
@rapids-bot rapids-bot bot merged commit 8dbaa2f into rapidsai:branch-0.18 Dec 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Parquet writer does not apply offset to nullmask
8 participants