-
Notifications
You must be signed in to change notification settings - Fork 855
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add CSV build_buffered (#3338) #3368
Conversation
/// Explicit schema for the CSV file | ||
schema: SchemaRef, | ||
/// Optional projection for which columns to load (zero-based column indices) | ||
projection: Option<Vec<usize>>, | ||
/// File reader | ||
reader: RecordReader<BufReader<R>>, | ||
reader: RecordReader<R>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, this looks like the same as before? Previously BufReader
is also there. So reader
is either RecordReader<BufReader<R>>
before and now. Wondering why it causes the difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because BufRead != BufReader
, so previously it would use RecordReader<BufReader<Cursor<Vec<u8>>>>
it will now use RecordReader<Cursor<Vec<u8>>>
exploiting the fact that Cursor<Vec<u8>>: BufRead
@@ -1081,11 +1089,8 @@ impl ReaderBuilder { | |||
if let Some(t) = self.terminator { | |||
reader_builder.terminator(csv_core::Terminator::Any(t)); | |||
} | |||
let reader = RecordReader::new( | |||
BufReader::new(reader), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is where the "magic" occurs, this BufReader
has been moved to only be added to Read
passed to build
and crucially not to types passed to build_buffered
mut self, | ||
mut reader: R, | ||
) -> Result<Reader<R>, ArrowError> { | ||
) -> Result<BufReader<R>, ArrowError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see. I didn't see build_buffered
returns BufReader<R>
instead of Reader<R>
. For R
is a BufRead
case, there is no more std::io::BufReader
under RecordReader
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I think this is more flexible one. 👍
/// Create a new `Reader` from a non-buffered reader | ||
/// | ||
/// If `R: BufRead` consider using [`Self::build_buffered`] to avoid unnecessary additional | ||
/// buffering, as internally this method wraps `reader` in [`std::io::BufReader`] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update.
Benchmark runs are scheduled for baseline = e664208 and contender = 8cab7a2. 8cab7a2 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Closes #.
Rationale for this change
Yields a roughly 5% performance uplift when reading from memory, as is common when streaming from object storage. Whilst fairly minor as things go, it is effectively free so seemed harmless enough
What changes are included in this PR?
Are there any user-facing changes?
No