Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove unused variable and fix memory issue in ORC writer #12984

Merged
merged 6 commits into from
Mar 23, 2023

Conversation

ttnghia
Copy link
Contributor

@ttnghia ttnghia commented Mar 21, 2023

This removes unused variable in ORC writer, and also fix a memory issue with dereferencing dangling pointers due to a device buffer being destroyed early while it should be kept alive.

@ttnghia ttnghia added bug Something isn't working 3 - Ready for Review Ready for review by team libcudf blocker libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change labels Mar 21, 2023
@ttnghia ttnghia requested a review from a team as a code owner March 21, 2023 21:42
@ttnghia ttnghia self-assigned this Mar 21, 2023
@ttnghia ttnghia requested review from bdice, davidwendt and vuule March 21, 2023 21:42
segmentation,
stripes,
orc_table,
pd_masks, /* unused, but needs to be kept alive */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you help me understand why the lifetime matters if it's unused? I don't see what this is holding, or what is holding this data, that makes its lifetime relevant.

Copy link
Contributor Author

@ttnghia ttnghia Mar 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That buffer contains intermediate null mask for the intermediate data (orc_table) during null pushdown. Thus, we need to keep it alive in order to access the null mask later in the write_ function. Otherwise, the null mask pointer in orc_table will be dangling pointer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is easy to overlook and ignore such variable because it is unused, leading to hidden invalid memory access bugs when the buffer is destroyed earlier than needed, similar to this situation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it needed after convert_table_to_orc_data? statistics computation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not used, but it stores null mask for orc_table to be valid. So as long as we still need orc_table we need to keep it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So as long as we still need orc_table we need to keep it.

Not exactly. We only need it as long as we need the null masks in the orc_table. I can't find where that is used convert_table_to_orc_data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is used in calculate_aligned_rowgroup_bounds which is called in encode_columns.

bdice
bdice previously requested changes Mar 21, 2023
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The memory lifetime is extremely non-obvious and hidden behind several layers of functions. We have a few issues I'd like to improve here:

  1. The line auto pd_masks = init_pushdown_null_masks(orc_table, stream); is modifying orc_table in-place with no docstrings or comments indicating that the input orc_table is modified (!) or that the lifetime of pd_masks is relevant. I would recommend that we avoid in-place modifications of input parameters and make the return values meaningful -- rather than just their lifetimes.
  2. The struct type pushdown_null_masks is never really used. It has members data and masks but it's just storing things. It could be a std::tuple or pair instead.
  3. Should the lifetime of the pushed-down null masks be tied to the lifetime of the orc_table_view? Are there other similar objects? Could a single object be used to tie their lifetimes together?

I don't want to block this PR indefinitely. I'd be happy with a partial solution to any of these issues, or at least a discussion of how we can improve and an issue with next steps.

@vuule
Copy link
Contributor

vuule commented Mar 21, 2023

@bdice Maybe pushdown_null_masks should be a part of orc_table?
IMO using a tuple here just takes away a bit of information (name+docs). What is the benefit?

@bdice
Copy link
Contributor

bdice commented Mar 21, 2023

@bdice Maybe pushdown_null_masks should be a part of orc_table?

Maybe so. That would fix a lot of the lifetime questions. I don't know if there is precedent in the I/O codebase for this, or if it's confusing to have a view type (orc_table_view) that actually owns data, but I could support such a refactor.

IMO using a tuple here just takes away a bit of information (name+docs). What is the benefit?

It's not a big deal either way, but I assumed the struct would be using accessors like pd_masks.masks or pd_masks.data and never saw them. If the objects are just needing to stay alive, why should we name them and define a struct? Their names are never used, so a tuple/pair can anonymously hold them. Also, I'm not even sure if the lifetimes of the pointers matter except for the fact these pointers are assigned asynchronously on a stream (which is subtle and also deserves a comment).

@ttnghia ttnghia added the 5 - DO NOT MERGE Hold off on merging; see PR for details label Mar 21, 2023
@ttnghia
Copy link
Contributor Author

ttnghia commented Mar 21, 2023

Taking this into draft as can't verify/reproduce the issue thus I'm not sure if my fix is correct. I also found something else that can be the source of problem. Will report later.

@ttnghia ttnghia marked this pull request as draft March 21, 2023 23:23
@vuule
Copy link
Contributor

vuule commented Mar 22, 2023

@bdice Maybe pushdown_null_masks should be a part of orc_table?

Maybe so. That would fix a lot of the lifetime questions. I don't know if there is precedent in the I/O codebase for this, or if it's confusing to have a view type (orc_table_view) that actually owns data, but I could support such a refactor.

I thought that we had both orc_table and orc_table_view. But we only have the view, and, yes, it should not own data. I'm going through the writer to make sure pushdown_null_masks are actually needed, since, as Nghia said, there are other suspected issues.

@vuule
Copy link
Contributor

vuule commented Mar 22, 2023

EOD update: still can't find where pushdown null masks are used after the encode. I'm really confused about the reported fix.
Since columns in orc_table_view point to pushdown masks and column_device_views that are owned by other objects, perhaps the best solution is to include them in the class and rename orc_table_view to something that does not imply lack of ownership (e.g. orc_table_info). It will still reference a cudf::table through non-owning views, but we can at least hold on to the internal buffers more reliably.

@github-actions github-actions bot added the CMake CMake build issue label Mar 22, 2023
@github-actions github-actions bot removed the CMake CMake build issue label Mar 22, 2023
@ttnghia ttnghia marked this pull request as ready for review March 22, 2023 18:52
@ttnghia ttnghia removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Mar 22, 2023
@ttnghia
Copy link
Contributor Author

ttnghia commented Mar 22, 2023

Alright, with the help of GCC address sanitizer (ASAN), I caught the bug at this line, which deref host_stripe_dict that is dangling. The related buffer is stripe_dict which should be returned and kept alive. After fixing that, both GCC ASAN and CUDA compute-sanitizer do not detect anything else.

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. We can rework the code to eliminate the need to store the dictionaries separately in a later PR.

@vuule vuule requested a review from bdice March 22, 2023 20:07
@bdice bdice dismissed their stale review March 23, 2023 02:38

Outdated

@ttnghia
Copy link
Contributor Author

ttnghia commented Mar 23, 2023

/merge

@rapids-bot rapids-bot bot merged commit 9753ace into rapidsai:branch-23.04 Mar 23, 2023
@ttnghia ttnghia deleted the fix_orc_writer branch March 23, 2023 02:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants