Fix out-of-bounds access in ORC writer #7902

Merged
merged 8 commits into from
Apr 9, 2021

Conversation

vuule
Contributor

@vuule vuule commented Apr 7, 2021

Avoid out-of-bounds access due to streams holding column ids as index + 1, and the first index stream using zero for its column id. In a few places the corresponding column is accessed as [column_id - 1], even when the id is zero.

Other changes:
Small refactoring of ORC stream creation and stream offset computation.
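The hazard described above can be sketched in a few lines. This is only an illustration under stated assumptions: `column_info` and `column_for_stream` are hypothetical names, not libcudf types or functions. With an unsigned id, `column_id - 1` wraps to a very large value when the id is zero, so indexing with it reads out of bounds. The sketch checks the id before subtracting.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for a writer column; not the libcudf type.
struct column_info {
  int kind;
};

// Hypothetical helper: maps a stream's ORC column id to a table column.
// Streams store ids as (column index + 1); id 0 belongs to the first index
// stream, which has no table column. The id is validated before the
// subtraction, so columns[0 - 1] is never evaluated.
inline std::optional<column_info> column_for_stream(
  std::vector<column_info> const& columns, std::uint32_t column_id)
{
  if (column_id == 0 || column_id > columns.size()) { return std::nullopt; }
  return columns[column_id - 1];
}
```

A caller would test the returned optional before using the column, which makes the "no corresponding column" case explicit at every access site. The sketch needs C++17 for `std::optional`.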

@vuule vuule added the bug, libcudf, cuIO, and non-breaking labels Apr 7, 2021
@vuule vuule self-assigned this Apr 7, 2021
@codecov

codecov bot commented Apr 8, 2021

Codecov Report

Merging #7902 (7cd71a0) into branch-0.20 (599f62d) will increase coverage by 0.42%.
The diff coverage is 88.73%.

❗ Current head 7cd71a0 differs from pull request most recent head 6da64e9. Consider uploading reports for the commit 6da64e9 to get more accurate results

@@               Coverage Diff               @@
##           branch-0.20    #7902      +/-   ##
===============================================
+ Coverage        82.30%   82.72%   +0.42%     
===============================================
  Files              101      103       +2     
  Lines            17053    17705     +652     
===============================================
+ Hits             14035    14647     +612     
- Misses            3018     3058      +40     
Impacted Files Coverage Δ
python/cudf/cudf/utils/utils.py 83.25% <ø> (-1.81%) ⬇️
python/cudf/cudf/utils/dtypes.py 83.44% <46.66%> (-6.45%) ⬇️
python/cudf/cudf/core/groupby/groupby.py 92.41% <78.57%> (-1.04%) ⬇️
python/cudf/cudf/core/column/lists.py 87.41% <80.00%> (+0.19%) ⬆️
python/dask_cudf/dask_cudf/backends.py 89.58% <85.71%> (-0.05%) ⬇️
python/cudf/cudf/core/column/struct.py 96.29% <86.66%> (-3.71%) ⬇️
python/cudf/cudf/core/index.py 93.04% <88.09%> (+0.01%) ⬆️
python/cudf/cudf/core/column/decimal.py 92.92% <91.48%> (-0.92%) ⬇️
python/cudf/cudf/core/column/interval.py 91.11% <92.30%> (+0.48%) ⬆️
python/cudf/cudf/core/column/column.py 87.99% <92.59%> (+0.56%) ⬆️
... and 65 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0943487...6da64e9.

@vuule vuule marked this pull request as ready for review April 8, 2021 05:59
@vuule vuule requested a review from a team as a code owner April 8, 2021 05:59
@vuule vuule requested review from cwharris and nvdbaranec April 8, 2021 05:59
Contributor

@nvdbaranec nvdbaranec left a comment

Couple of suggestions to secure things a bit tighter.

cpp/src/io/orc/writer_impl.cu (outdated, resolved)
cpp/src/io/orc/writer_impl.cu (outdated, resolved)
@vuule vuule requested a review from nvdbaranec April 8, 2021 19:48
cpp/src/io/orc/orc.h (resolved)
private:
// ORC column id (different from column index in the table!)
// Zero means no corresponding column in the table
uint32_t _column_id = 0;
Contributor

Does this break any logic below that may have been expecting ~0? This stuff:

if (stream.column >= orc2gdf.size()) {

Contributor Author

Modified the logic to separate the "0" column id from no column id; this should be consistent with the original logic.
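One way to keep "column id 0" distinct from "no column id" is to store the id in an optional instead of reserving a sentinel value. The sketch below is an assumption about how such a separation could look, not the code merged in this PR; `stream_desc` and its member names are hypothetical.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stream descriptor; names are illustrative, not libcudf's.
class stream_desc {
 public:
  stream_desc() = default;  // stream with no column id at all
  explicit stream_desc(std::uint32_t id) : _column_id{id} {}

  // True when any id was assigned, including the root id 0.
  bool has_column_id() const { return _column_id.has_value(); }

  // True only when the id maps to a table column (such ids start at 1).
  bool has_table_column() const { return _column_id.has_value() && *_column_id > 0; }

  // Zero-based table column index; call only when has_table_column() is true.
  std::uint32_t column_index() const { return *_column_id - 1; }

 private:
  std::optional<std::uint32_t> _column_id;
};
```

With this shape, a reader-side check that previously relied on an out-of-range sentinel can ask `has_column_id()` directly, and the subtraction is confined to one accessor with a documented precondition.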

cpp/src/io/orc/writer_impl.cu (outdated, resolved)
ff.types[0].subtypes[column.id()] = 1 + column.id();
ff.types[0].fieldNames[column.id()] = column.orc_name();
ff.types[column.id()].kind = column.orc_kind();
ff.types[0].subtypes[column.index()] = column.id();
Contributor Author

This change looks confusing, because id() has been renamed to index(), and id() has then been added (returns index+1).
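The relationship between the two accessors can be illustrated with a small sketch. `orc_column_sketch` and `root_subtypes` are hypothetical names used only for this example; the one detail taken from the comment above is that `id()` returns `index() + 1`.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical column wrapper showing the two accessors side by side.
class orc_column_sketch {
 public:
  explicit orc_column_sketch(std::uint32_t index) : _index{index} {}

  // Zero-based position of the column in the table.
  std::uint32_t index() const { return _index; }

  // ORC column id: index + 1, because id 0 is taken by the root.
  std::uint32_t id() const { return _index + 1; }

 private:
  std::uint32_t _index;
};

// Builds the root's subtype list: the slot is chosen by index(), the stored
// value is id(). Assumes every column's index() is smaller than columns.size().
inline std::vector<std::uint32_t> root_subtypes(
  std::vector<orc_column_sketch> const& columns)
{
  std::vector<std::uint32_t> subtypes(columns.size());
  for (auto const& column : columns) {
    subtypes[column.index()] = column.id();
  }
  return subtypes;
}
```

Keeping the "+ 1" inside `id()` means call sites no longer add or subtract one themselves, which is where the off-by-one accesses tend to appear.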

@vuule vuule requested a review from nvdbaranec April 9, 2021 06:48
@vuule
Contributor Author

vuule commented Apr 9, 2021

@nvdbaranec thank you for the quick reviews!

@vuule vuule requested a review from rgsl888prabhu April 9, 2021 06:50
@vuule
Contributor Author

vuule commented Apr 9, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 348ad4d into rapidsai:branch-0.20 Apr 9, 2021
@vuule vuule deleted the bug-orc-writer-oob branch April 9, 2021 16:59