Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(arrow-ipc) Add: Support FileReader and StreamReader skip array data validation #6938

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

totoroyyb
Copy link

@totoroyyb totoroyyb commented Jan 4, 2025

Which issue does this PR close?

Rationale for this change

Beforehand, array data validation is performed for each array creation when reading an IPC file (or stream), which comes with a significant overhead. In some cases, this overhead is unwanted or the file content is trusted.

There are existing functions defined in the codebase to avoid data validation but this is not exposed to the upper level APIs. This PR brings options for both FileReader and StreamReader to disable it.

What changes are included in this PR?

  • Add skip_validations, as an argument, to multiple functions signatures.
  • Provide a new constructor try_new_unvalidated to FileReader
  • Provide a new function with_skip_validations to StreamReader
  • Various related minor changes.

Are there any user-facing changes?

No, I don't think so. There are no API-breaking changes, and essentially, two new APIs are introduced. Other changes are on the internal codes I believe.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jan 4, 2025
@totoroyyb
Copy link
Author

totoroyyb commented Jan 4, 2025

FYI, clippy does complain about too many arguments on one or two function signatures atm. I just want to give this PR a go for now.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately as implemented this is unsound, in that it permits UB via safe functions, this probably needs a bit more thought

@@ -79,6 +79,7 @@ fn create_array(
field: &Field,
variadic_counts: &mut VecDeque<i64>,
require_alignment: bool,
skip_validations: bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically all of these functions are now unsound and must be marked unsafe. This is not ideal, and probably needs a bit more thought into handling this better

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One possibility might be to create a new function like:

fn unsafe create_array_unchecked(
    reader: &mut ArrayReader,
    field: &Field,
    variadic_counts: &mut VecDeque<i64>,
    require_alignment: bool,
    skip_validations: bool,
) -> Result<ArrayRef, ArrowError> {

And change the existing function to call it:

fn create_array(
    reader: &mut ArrayReader,
    field: &Field,
    variadic_counts: &mut VecDeque<i64>,
    require_alignment: bool,
) -> Result<ArrayRef, ArrowError> {
  
    let skip_validations = false;
    // safety: enable validatiions when checking
    unsafe { create_array_unchecked(reader, field, variadic_counts, require_aligment, skip_validations) };
}

This is kind of messy though -- finding some way to encapsulate the settings into a struct would be better

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pls check the latest commit, which separating unsafe codes.

Each function like create_array, now has a unsafe version that skips validation. Codebase is not idea, but currently I don't have much better idea to encapusulate it.

arrow-ipc/src/reader.rs Show resolved Hide resolved
@totoroyyb
Copy link
Author

totoroyyb commented Jan 5, 2025

Refactor all unsafe codes into separate functions. It is a bit messy as mentioned @alamb mentioned. From bottom-to-up, there are many code movings and renames.

Now, FileReader has a new wrapper struct:

/// An iterator over the record batches (without validation) in an Arrow file
pub struct UnvalidatedFileReader<R: Read + Seek> {
    reader: FileReader<R>,
}

impl<R: Read + Seek> Iterator for UnvalidatedFileReader<R> {
    type Item = Result<RecordBatch, ArrowError>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.reader.current_block < self.reader.total_blocks {
            // Use the unsafe `maybe_next_unvalidated` function
            unsafe {
                match self.reader.maybe_next_unvalidated() {
                    Ok(Some(batch)) => Some(Ok(batch)),
                    Ok(None) => None, // End of the file
                    Err(e) => Some(Err(e)),
                }
            }
        } else {
            None
        }
    }
}

This still unsound iteractor implementation, but I am not sure how to express it so that we can use iterator with validation disabled.

Any thoughts or suggestions?

@alamb
Copy link
Contributor

alamb commented Jan 10, 2025

Thanks @totoroyyb - I will try and review this PR in the next day or two.

@alamb
Copy link
Contributor

alamb commented Jan 10, 2025

I am starting to review this PR now -- thank you for your patience

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, thank you @totoroyyb -- I think this is a very nice feature and it would be great to get in. The current code is quite nicely docmented

At a minimum I think the PR needs

  1. Tests: There should be tests that demonstrate that reading / writing data while skipping validation still works functionally. This includes at a minimum testing any new pub apis
  2. Benchmarks: given the goal of this PR is to improve performance I think we need a benchmark to show it is actually improving the speed. Unfortunately, I wasn't able to find any existing arrow-ipc benchmarks, but can help draft one shortly

I would like to suggest we break it down a little differently to minimize API changes

  1. Avoid adding ArrayDataBuilder::build_aligned_unchecked and instead simply add a flag on the builder. I will try and add a PR on this shortly
  2. Consider adding flags to StreamReader and FilerReader rather than new structs like UnvalidatedStreamReader

arrow-ipc/src/reader.rs Outdated Show resolved Hide resolved
/// This is useful when the file is known to be valid and the user wants to skip validations.
/// This might be useful when the content is trusted and the user wants to avoid the overhead of
/// validating the content.
pub fn try_new_unvalidated(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it doesn't set the unvalidated flag. It seems like maybe we can direct the users to the `FileReaderBuilder API if they want to avoid validation

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one was left here due to the changes I made in the last commit.

@@ -1148,6 +1528,30 @@ impl<R: Read + Seek> RecordBatchReader for FileReader<R> {
}
}

/// An iterator over the record batches (without validation) in an Arrow file
pub struct UnvalidatedFileReader<R: Read + Seek> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the need for a whole different struct -- wouldn't a flag on FileReader (to mirror the flag on StreamReader) be more consistent?

@alamb
Copy link
Contributor

alamb commented Jan 10, 2025

TLDR is I think this is an important PR / feature @totoroyyb and I would like to help make it happen. I will make some initial PRs that hopefully move us forwards

Would you be willing to help write tests and benchmarks?

@totoroyyb
Copy link
Author

@alamb Thanks for reviewing this. Yes, I'd be happy to help with benchmarks and test cases.

Co-authored-by: Andrew Lamb <[email protected]>
@totoroyyb
Copy link
Author

totoroyyb commented Jan 11, 2025

@alamb Regarding adding a flag in the builder, which I did that way in the first commit. tustvold has suggested that we may need to mark unsafe for all functions that skip validation (here).

Seems like with a flag, there is no good way to mark it as unsafe for related functions. Let me know if I misunderstand something. Thanks!

@alamb
Copy link
Contributor

alamb commented Jan 12, 2025

@alamb Regarding adding a flag in the builder, which I did that way in the first commit. tustvold has suggested that we may need to mark unsafe for all functions that skip validation (here).

Seems like with a flag, there is no good way to mark it as unsafe for related functions. Let me know if I misunderstand something. Thanks!

Yeah, sorry this is a bit subtle -- a new bool parameter on a bunch of functions is problematic due to the issue @tustvold mentions.

However I think a bool field on a struct, with an unsafe API on any method that disables validation is reasonable. This is the approach I took in

However, given that the code in arrow-ipc/src/reader.rs is currently mostly structured as free functions, rather than methods on a struct, there is no place to put such a field

So what I was suggesting was to restructure the code in arrow-ipc/src/reader.rs , so instead of

fn create_array(
    reader: &mut ArrayReader,
    field: &Field,
    variadic_counts: &mut VecDeque<i64>,
    require_alignment: bool,
) -> Result<ArrayRef, ArrowError> {

We turn this into a method on ArrayReader:

impl ArrayReader { 
  fn create_array(
      &mut self,
      field: &Field,
      variadic_counts: &mut VecDeque<i64>,
      require_alignment: bool,
  ) -> Result<ArrayRef, ArrowError> {

And then we can add a field on the ArrayReader for skipping validation

Note the methods might not belong on ArrayReader(maybe they need their own struct or something) -- I am not familiar enough with the code to know for sure

@alamb
Copy link
Contributor

alamb commented Jan 12, 2025

@alamb Thanks for reviewing this. Yes, I'd be happy to help with benchmarks and test cases.

Thank you @totoroyyb

Thank you so much.For benchmarks, I wrote up a description here:

For tests, I think we could leverage the existing tests. For example in these locations:

https://github.com/apache/arrow-rs/blob/955180b5cd451c83343e44bc2c510bf377ef2c80/arrow-ipc/src/writer.rs#L1778-L1777

fn deserialize_stream(bytes: Vec<u8>) -> RecordBatch {
let mut stream_reader = StreamReader::try_new(Cursor::new(bytes), None).unwrap();
stream_reader.next().unwrap().unwrap()

We could add a few lines of code that deserialized the Vec<u8> a second time with validation disabled and verify the returned RecordBatch was the same as what was deserialized with validation.

@alamb
Copy link
Contributor

alamb commented Jan 12, 2025

In terms of next steps for this PR I recommend:

  1. Simplify Validation/Alignment APIs of ArrayDataBuilder: validate and align #6966 (I'll merge this in the next day or two after anyone else who is interested has a chance to review)
  2. Try and refactor the reader code into a struct, per feat(arrow-ipc) Add: Support FileReader and StreamReader skip array data validation #6938 (comment)
  3. benchmarks: Add arrow-ipc benchmarks for the IPC reader and writer #6968

Once those three PRs are done and merged, then we can rebase / rework this PR to add the skip_validation flag to the FileReader and StreamReader and add some tests per #6938 (comment)

Let me know what you think @totoroyyb . If you have time, the refactoring or benchmarks might be a good next step. Depending on if I can find time, I may also be able to help (maybe try to refactor the code into a struct)

@totoroyyb
Copy link
Author

Thanks @alamb for detailed description and explanation on everything.

I might not be readily available in the following a few days, so if anyone wants to jump in, please feel free to do so. I will take another detail look into this asap.

I agree that there is no good way to adhere Rust unsafe semantics here and I feel like put it as a field is the best we can do.

When I got time, I will take on benchmark and refactor and let you know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
3 participants