Fix errors in chunked ORC writer when no tables were (successfully) written #15393
if (_state != writer_state::DATA_WRITTEN) {
  // writer is either closed or no data has been written
  _state = writer_state::CLOSED;
  return;
So no more exception throwing, right?
This fixes the exception (i.e. terminate) in close when the footer is empty. This happens when no write calls are made with a chunked writer, and when all write calls threw.
Not sure if this answers the question.
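To illustrate the guard discussed above, here is a minimal sketch of a close that does nothing unless data was written. It is not the cuDF implementation: `toy_writer`, its string sink, and the `[footer]` marker are hypothetical stand-ins, and only the enum value names come from the diff.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the writer state enum; only the value names
// NO_DATA_WRITTEN, DATA_WRITTEN and CLOSED appear in the diff.
enum class writer_state { NO_DATA_WRITTEN, DATA_WRITTEN, CLOSED };

// Toy writer: `sink` is a string standing in for the output file.
struct toy_writer {
  writer_state _state = writer_state::NO_DATA_WRITTEN;
  std::string sink;

  void write(std::string const& table)
  {
    sink += table;
    _state = writer_state::DATA_WRITTEN;
  }

  void close()
  {
    if (_state != writer_state::DATA_WRITTEN) {
      // writer is either closed or no data has been written
      _state = writer_state::CLOSED;
      return;
    }
    sink += "[footer]";  // placeholder for the real footer
    _state = writer_state::CLOSED;
  }
};

// Closes a writer that never received a write and returns the sink contents.
inline std::string close_without_writes()
{
  toy_writer w;
  w.close();
  w.close();  // a repeated close is also a no-op
  return w.sink;
}
```

In this sketch the early return covers both the "never written" and the "already closed" cases, so calling close twice cannot emit a second footer.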
NO_DATA_WRITTEN,  // No table data has been written to the sink; if the writer is closed or
                  // destroyed in this state, it should not write the footer.
But does it write the magic? Please clarify.
We can't guarantee that the magic header is not written if the writer throws after the encode, e.g. while writing to the sink. In this case both the writer and the file are in invalid states (and thus we should not try to write the footer). Your strong exception guarantee only covers the encode.
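The ordering described here can be sketched as follows. This is a toy model under stated assumptions, not the cuDF code: `toy_sink`, `toy_writer`, the `encode_fails` and `fail_on_data` switches, and the `[footer]` marker are all hypothetical. The sketch encodes first, writes the `ORC` magic only on the first write, and marks the state as `DATA_WRITTEN` only after the sink write succeeds.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

enum class writer_state { NO_DATA_WRITTEN, DATA_WRITTEN, CLOSED };

// Hypothetical sink that can simulate an I/O failure on data writes.
struct toy_sink {
  std::string bytes;
  bool fail_on_data = false;

  void write(std::string const& s, bool is_data)
  {
    if (is_data && fail_on_data) throw std::runtime_error("sink failure");
    bytes += s;
  }
};

struct toy_writer {
  writer_state _state = writer_state::NO_DATA_WRITTEN;
  toy_sink& sink;

  explicit toy_writer(toy_sink& s) : sink(s) {}

  void write(std::string const& table, bool encode_fails = false)
  {
    // 1. Encode first; a failure here leaves the sink untouched.
    if (encode_fails) throw std::runtime_error("encode failure");
    std::string const encoded = "<" + table + ">";
    // 2. Write the magic header only before the first successful write.
    if (_state == writer_state::NO_DATA_WRITTEN) sink.write("ORC", false);
    // 3. Write the data; this may throw after the header is already out.
    sink.write(encoded, true);
    _state = writer_state::DATA_WRITTEN;
  }

  void close()
  {
    if (_state != writer_state::DATA_WRITTEN) {
      _state = writer_state::CLOSED;
      return;  // no valid file to write a footer for
    }
    sink.write("[footer]", false);
    _state = writer_state::CLOSED;
  }
};
```

In the sink-failure case the header has reached the sink but the state is still `NO_DATA_WRITTEN`, so close skips the footer. The toy does not handle a retry after such a failure; a later successful write would repeat the header.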
Can you clarify the cases where we have both successful and failed writes, please? Consider an example: we are writing tables 1, 2, 3. What if only the writes of tables 1 and 3 were successful? If we flush, we may have data corruption. Should we just write partial data that way, or should we consider it the same as nothing written at all? If it is difficult to decide, we can expand the enum to cover such situations. In addition, we should return the write state (upon closing, or the user can query it) so they can know what happened.
The user knows that the write of table 2 failed, because an exception was thrown. They can choose what to do. We do write the footer in this situation regardless of what the user does with the exception (close is done in the writer destructor), but we should end up with a valid file if the encode failed in table 2. This is not a silent corruption situation. We could add an option for the user to give up on the chunked writer (i.e. skip close), but this would be a separate PR IMO.
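The tables 1, 2, 3 scenario can be sketched with a toy writer whose destructor calls close. Everything here except the enum value names is a hypothetical illustration (the string sink, the `encode_fails` switch, the `[footer]` marker), assuming the failure in table 2 happens during the encode, before anything reaches the sink.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

enum class writer_state { NO_DATA_WRITTEN, DATA_WRITTEN, CLOSED };

// Toy chunked writer that appends to a caller-owned string "file".
struct toy_writer {
  writer_state _state = writer_state::NO_DATA_WRITTEN;
  std::string& sink;

  explicit toy_writer(std::string& s) : sink(s) {}
  ~toy_writer() { close(); }  // close is done in the destructor

  void write(std::string const& table, bool encode_fails = false)
  {
    // Encode failure: nothing is written and the state is unchanged.
    if (encode_fails) throw std::runtime_error("encode failure");
    if (_state == writer_state::NO_DATA_WRITTEN) sink += "ORC";
    sink += "<" + table + ">";
    _state = writer_state::DATA_WRITTEN;
  }

  void close()
  {
    if (_state != writer_state::DATA_WRITTEN) {
      _state = writer_state::CLOSED;
      return;
    }
    sink += "[footer]";
    _state = writer_state::CLOSED;
  }
};

// Writes tables t1..t3 where the encode of t2 fails; returns the "file".
inline std::string write_three_tables()
{
  std::string sink;
  {
    toy_writer w{sink};
    w.write("t1");
    try {
      w.write("t2", /*encode_fails=*/true);
    } catch (std::runtime_error const&) {
      // The caller sees the failure here and decides how to proceed.
    }
    w.write("t3");
  }  // destructor closes the writer and writes the footer
  return sink;
}
```

In this sketch the resulting file holds tables 1 and 3 plus a footer, and the caller learns about table 2 through the exception rather than through the writer state.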
  size_t bytes_written() override { return 0; }
};

auto sequence = thrust::make_counting_iterator(0);
- auto sequence = thrust::make_counting_iterator(0);
+ auto const sequence = thrust::make_counting_iterator(0);
/merge
Description
Closes #15386, #15387
The fixes for the two issues overlap, so I included both in a single PR.
Expanded the `_closed` flag to an enum that tracks if the operations in `close()` should be performed (one or more tables were written to the sink). This way, we don't perform the steps in close when there is no valid file to write the footer for. This includes:
- No `write` calls;
- All `write` calls failed;

The new enum replaces `skip_close()` that used to fix this issue for a smaller subset of cases.
Additionally, writing of the ORC header has been moved after the encode and uses the new state to only write the header in the first `write` call. This way we don't write anything to the sink if there were no `write` calls with the writer, and if the encode failed in the `write`s.
Checklist