Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ORC writer output corruption with string columns #7565

Merged
merged 4 commits into from
Mar 12, 2021

Conversation

vuule
Copy link
Contributor

@vuule vuule commented Mar 11, 2021

Closes: #7346

Fixes an issue in ORC writer where null counting would not read the mask for every row.
The issue occurs when the column offset is not divisible by 32 so that two words are always read to get 32bits of mask (each read is effectively offset by the columns offset, so when reading the mask for 32 rows, we need to get two words to account for the offset). Namely, the second word is not read when the row is closer than 32 to the end of the chunk. This condition is incorrect for most column offsets, as the current row is not really the first bit of the mask word.
The fix is to adjust the condition when the second mask word is read (assuming that mask in padded to multiple of 32).

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Mar 11, 2021
@codecov
Copy link

codecov bot commented Mar 11, 2021

Codecov Report

Merging #7565 (076b7d9) into branch-0.19 (7871e7a) will increase coverage by 0.51%.
The diff coverage is 92.85%.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.19    #7565      +/-   ##
===============================================
+ Coverage        81.86%   82.38%   +0.51%     
===============================================
  Files              101      101              
  Lines            16884    17339     +455     
===============================================
+ Hits             13822    14284     +462     
+ Misses            3062     3055       -7     
Impacted Files Coverage Δ
python/cudf/cudf/core/index.py 93.34% <ø> (+0.48%) ⬆️
python/cudf/cudf/core/column/column.py 87.80% <75.00%> (+0.04%) ⬆️
python/cudf/cudf/core/column/numerical.py 94.85% <85.71%> (-0.17%) ⬇️
python/cudf/cudf/core/frame.py 89.12% <89.47%> (+0.10%) ⬆️
python/cudf/cudf/core/column/decimal.py 93.33% <90.47%> (-1.54%) ⬇️
python/cudf/cudf/core/dataframe.py 90.58% <95.00%> (+0.11%) ⬆️
python/cudf/cudf/core/series.py 91.57% <95.55%> (+0.78%) ⬆️
python/cudf/cudf/core/column/string.py 86.76% <100.00%> (+0.26%) ⬆️
python/cudf/cudf/core/indexing.py 96.29% <100.00%> (+0.23%) ⬆️
python/cudf/cudf/utils/gpu_utils.py 53.65% <0.00%> (-4.88%) ⬇️
... and 50 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f41c10c...15c354d. Read the comment docs.

Copy link
Contributor

@nvdbaranec nvdbaranec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice find.

cpp/src/io/orc/dict_enc.cu Outdated Show resolved Hide resolved
@vuule vuule added bug Something isn't working non-breaking Non-breaking change cuIO cuIO issue labels Mar 11, 2021
@vuule vuule self-assigned this Mar 11, 2021
@vuule vuule marked this pull request as ready for review March 11, 2021 23:02
@vuule vuule requested a review from a team as a code owner March 11, 2021 23:02
@vuule vuule removed request for trxcllnt and jrhemstad March 12, 2021 01:33
Copy link
Contributor

@hyperbolic2346 hyperbolic2346 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool bug in a bad way. Glad this one was found and fixed. Thanks!

Copy link
Contributor

@devavret devavret left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not requesting changes because I believe this can be better addressed when we start using column_device_view for orc instead of the current column_data_base and valid_map_base pair of pointers.

}
s->scratch_red[t] = v;
s->scratch_red[t] = valid;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole block looks to be over optimized. It's pre-loading an offseted mask word into scratch_red only to have individual threads treating this shared memory as the new mask, and then doing an exclusive sum over the validity values.

This could be achieved using an iterator for validity, directly accessing the mask itself. and that iterator could be given an appropriate offset, which I believe would be the column offset + this dictionary chunk's start row.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I also had a few ideas on what can be simplified around this code, but wanted to keep such changes separate from the actual fix.

@vuule
Copy link
Contributor Author

vuule commented Mar 12, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit ff0c537 into rapidsai:branch-0.19 Mar 12, 2021
@vuule vuule deleted the bug-orc-valid-mask branch March 12, 2021 21:08
hyperbolic2346 pushed a commit to hyperbolic2346/cudf that referenced this pull request Mar 25, 2021
Closes: rapidsai#7346

Fixes an issue in ORC writer where null counting would not read the mask for every row. 
The issue occurs when the column offset is not divisible by 32 so that two words are always read to get 32bits of mask (each read is effectively offset by the columns offset, so when reading the mask for 32 rows, we need to get two words to account for the offset). Namely, the second word is not read when the row is closer than 32 to the end of the chunk. This condition is incorrect for most column offsets, as the current row is not really the first bit of the mask word.
The fix is to adjust the condition when the second mask word is read (assuming that mask in padded to multiple of 32).

Authors:
  - Vukasin Milovanovic (@vuule)

Approvers:
  - @nvdbaranec
  - Mike Wilson (@hyperbolic2346)
  - Devavret Makkar (@devavret)

URL: rapidsai#7565
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] orc writing can produce invalid orc file
4 participants