-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-43833: [Integration] validate that arrow IPC files include a valid embedded IPC stream #43834
base: main
Are you sure you want to change the base?
Conversation
ARROW_ASSIGN_OR_RAISE(auto footer_cookie, arrow_file->ReadAt(file_size - 10, 10)); | ||
auto footer_size = | ||
bit_util::FromLittleEndian(util::SafeLoadAs<int32_t>(footer_cookie->data())); | ||
int64_t footer_offset = 8 + file_size - footer_size - 10; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 8 +
is a bit cryptic. I realize it accounts for the 8-byte padded magic at the start of file, but a comment would make this easier to understand.
Or perhaps you can change the checks to something like:
// The footer is followed by the 10-byte footer length + magic
const int64_t footer_offset = file_size - footer_size - 10;
// Make sure the stream ends before the footer
ARROW_ASSIGN_OR_RAISE(const int64_t stream_size, stream->Tell());
const int64_t stream_end_offset = stream_size + 8;
if (stream_end_offset > footer_offset) {
} | ||
ARROW_ASSIGN_OR_RAISE(int64_t stream_size, stream->Tell()); | ||
|
||
if (footer_offset <= stream_size) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, why not footer_offset < stream_size
? It should be ok if the stream ends just before the footer.
Rationale for this change
IPC files should contain a valid IPC stream with EOS. Some arrow producers don't guarantee that the EOS is written. However there might be readers depending on simply reading the embedded IPC stream (and depending on the EOS to avoid reading the Footer as a corrupted message) so we should validate stream reading from IPC files as well as file reading.
What changes are included in this PR?
arrow-json-integration-test will now also try to read a file as a stream, which will fail if the file doesn't include an EOS or otherwise isn't a valid stream.
Are these changes tested?
yes
Are there any user-facing changes?
no