Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Propagate errors from Parquet reader kernels back to host #14167

Merged
merged 18 commits into from
Sep 28, 2023

Conversation

vuule
Copy link
Contributor

@vuule vuule commented Sep 21, 2023

Description

Pass the error code to the host when a kernel detects invalid input.
If multiple errors types are detected, they are combined using a bitwise OR so that caller gets the aggregate error code that includes all types of errors that occurred.

Does not change the kernel side checks.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vuule vuule added feature request New feature or request cuIO cuIO issue non-breaking Non-breaking change labels Sep 21, 2023
@vuule vuule self-assigned this Sep 21, 2023
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Sep 21, 2023
@vuule vuule marked this pull request as ready for review September 23, 2023 01:38
@vuule vuule requested a review from a team as a code owner September 23, 2023 01:38
@vuule vuule requested review from bdice and divyegala September 23, 2023 01:38
@vuule
Copy link
Contributor Author

vuule commented Sep 25, 2023

CC @nvdbaranec @etseidl
Please let me know if you see any issues with this approach to error reporting in the reader.

Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't think of a better way to do this. Do we want to define some constants for the error codes?

cpp/src/io/parquet/page_data.cu Outdated Show resolved Hide resolved
@vuule
Copy link
Contributor Author

vuule commented Sep 25, 2023

Do we want to define some constants for the error codes?

Definitely, just not sure if it should be in this PR. Related - should we return the error code as a bitmask? Would returning multiple errors even be useful?

@etseidl
Copy link
Contributor

etseidl commented Sep 25, 2023

Definitely, just not sure if it should be in this PR. Related - should we return the error code as a bitmask? Would returning multiple errors even be useful?

I think a bitmask might be a bit much, and limits us to 32 errors. There will probably be more ways to fail than that, esp if we also return errors from the preprocessing kernels.

cpp/src/io/parquet/page_data.cu Outdated Show resolved Hide resolved
Comment on lines 75 to 76
cuda::atomic_ref<int32_t, cuda::thread_scope_block> ref{const_cast<int&>(error)};
ref.store(err, cuda::std::memory_order_relaxed);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this atomic necessary? I didn't see any places where anything other than thread 0 (of the block) sets the error code. I suppose that may not be the case in the future. Based on how this is called, I wonder if an atomic OR is better here so we can stash multiple error types as individual bits.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made it atomic since we probably don't need to worry about performance when failing. This seemed like a safe option for future checks as well.

About the error code as mask - Ed is concerned about the limit on the number of errors that this would impose. I could be convinced to go either way, don't expect the trade-off to be relevant in practice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH the most common error condition is going to be a buffer overrun detected somewhere. We could probably get away without codes at all and have a single error bit. The host code calling the kernel can report which kernel failed. It just comes down to how fine grained you want the error reporting to be.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could see it either way. It's so hard to even know what thread failed and the context of why (possibly because some other thread did something wrong) having a set of bits could act as bread-crumbs to lead you to where things really went wrong. But on the other hand, you're a lot more limited on what you can report. I'm fine either way. Parallel error reporting is amusing in any case.

@vuule
Copy link
Contributor Author

vuule commented Sep 26, 2023

Looks like we're leaning towards a mask to aggregate errors. I'll make the changes.

Copy link
Contributor

@nvdbaranec nvdbaranec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I like this mechanism. The explicit names also remove some of the mystery when reading the code itself too.

Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, just a few naming nits :D

cpp/src/io/parquet/page_decode.cuh Outdated Show resolved Hide resolved
cpp/src/io/parquet/page_decode.cuh Outdated Show resolved Hide resolved
@vuule vuule requested a review from PointKernel September 27, 2023 21:25
@vuule vuule added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Sep 27, 2023
Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@vuule
Copy link
Contributor Author

vuule commented Sep 28, 2023

/merge

@rapids-bot rapids-bot bot merged commit 2c19bf3 into rapidsai:branch-23.10 Sep 28, 2023
54 of 55 checks passed
rapids-bot bot pushed a commit that referenced this pull request Oct 20, 2023
Fixes #13656.  Uses the error reporting introduced in #14167 to report errors in header parsing.

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Bradley Dice (https://github.com/bdice)

URL: #14237
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants