Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix errors in chunked ORC writer when no tables were (successfully) written #15393

Merged
merged 8 commits into from
Apr 2, 2024

Conversation

vuule
Copy link
Contributor

@vuule vuule commented Mar 26, 2024

Description

Closes #15386, #15387

The fixes for the two issues overlap, so I included both in a single PR.

Expanded the _closed flag to an enum that tracks if the operations in close() should be performed (one or more tables were written to the sink). This way, we don't perform the steps in close when there is no valid file to write the footer for.
This includes:

  • No write calls;
  • All write calls failed;

The new enum replaces skip_close() that used to fix this issue for a smaller subset of cases.

Additionally, writing of the ORC header has been moved after the encode and uses the new state to only write the header in the first write call. This way we don't write anything to the sink if there were no write calls with the writer, and if the encode failed in the writes.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vuule vuule added bug Something isn't working cuIO cuIO issue non-breaking Non-breaking change labels Mar 26, 2024
@vuule vuule self-assigned this Mar 26, 2024
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Mar 26, 2024
@vuule vuule changed the title Improve the behavior of chunked ORC writer when no tables were (successfully) written Fix errors in chunked ORC writer when no tables were (successfully) written Mar 26, 2024
@vuule vuule marked this pull request as ready for review March 26, 2024 21:28
@vuule vuule requested a review from a team as a code owner March 26, 2024 21:28
@vuule vuule requested review from harrism and davidwendt March 26, 2024 21:28
Comment on lines +2685 to +2688
if (_state != writer_state::DATA_WRITTEN) {
// writer is either closed or no data has been written
_state = writer_state::CLOSED;
return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So no more exception throwing right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this fixes the exception (i.e. terminate) in close when footer is empty. this happens when no write calls are made with a chunked writer, and when all write calls threw.
Not sure if this answers the question.

Comment on lines +231 to +232
NO_DATA_WRITTEN, // No table data has been written to the sink; if the writer is closed or
// destroyed in this state, it should not write the footer.
Copy link
Contributor

@ttnghia ttnghia Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But does it write the magic? Please clarify.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can't guarantee that the magic header is not written if the writer throws after the encode, e.g. while writing to the sink. In this case both the writer and the file are in invalid states (and thus we should not try to write the footer). Your strong exception guarantee only covers the encode.

@ttnghia
Copy link
Contributor

ttnghia commented Mar 27, 2024

Can you clarify the cases if we have both successful and failed writes, please?

Consider an example: We are writing tables 1, 2, 3. What if only writes of tables 1 and 3 were successful? If we flush, we may have data corruption. Should we just write partial data that way, or should we consider it the same as nothing written at all?

If it is difficult to decide, we can expand the enum to cover such situations. In addition, we should return the write state (upon closing, or the user can query it) so they can know what happened.

@vuule
Copy link
Contributor Author

vuule commented Mar 27, 2024

Can you clarify the cases if we have both successful and failed writes, please?

Consider an example: We are writing tables 1, 2, 3. What if only writes of tables 1 and 3 were successful? If we flush, we may have data corruption. Should we just write partial data that way, or should we consider it the same as nothing written at all?

If it is difficult to decide, we can expand the enum to cover such situations. In addition, we should return the write state (upon closing, or the user can query it) so they can know what happened.

The user has information about table 2 write failing; an exception was thrown. They can choose what to do. We do write the footer in this situation regardless of what user does with the exception (close is done in writer destructor), but should end up with a valid file if the encode failed in table 2. This is not a silent corruption situation.

We could add an option for user to give up on the chunked writer (i.e. skip close), but this would be a separate PR IMO.

size_t bytes_written() override { return 0; }
};

auto sequence = thrust::make_counting_iterator(0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto sequence = thrust::make_counting_iterator(0);
auto const sequence = thrust::make_counting_iterator(0);

cpp/tests/io/orc_test.cpp Outdated Show resolved Hide resolved
@vuule
Copy link
Contributor Author

vuule commented Apr 2, 2024

/merge

@rapids-bot rapids-bot bot merged commit 08d86c9 into rapidsai:branch-24.06 Apr 2, 2024
68 checks passed
@vuule vuule deleted the bug-chunked-orc-no-writes branch April 2, 2024 19:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Chunked ORC writer throws in close() if not tables have been written successfully
3 participants