Add CSV build_buffered (#3338) #3368

Merged
merged 2 commits into from
Dec 19, 2022

Conversation

@tustvold tustvold commented Dec 19, 2022

Which issue does this PR close?

Closes #.

Rationale for this change

Yields a roughly 5% performance uplift when reading from memory, as is common when streaming from object storage. Whilst fairly minor as things go, it is effectively free, so it seemed harmless enough.

What changes are included in this PR?

Are there any user-facing changes?

No

@github-actions github-actions bot added the arrow Changes to the arrow crate label Dec 19, 2022
@tustvold tustvold requested a review from viirya December 19, 2022 17:28
/// Explicit schema for the CSV file
schema: SchemaRef,
/// Optional projection for which columns to load (zero-based column indices)
projection: Option<Vec<usize>>,
/// File reader
-reader: RecordReader<BufReader<R>>,
+reader: RecordReader<R>,
Member

Interesting, this looks the same as before? Previously BufReader was also there, so reader is RecordReader<BufReader<R>> both before and now. I'm wondering why it makes a difference.

Contributor Author

Because BufRead != BufReader: previously it would use RecordReader<BufReader<Cursor<Vec<u8>>>>, whereas it will now use RecordReader<Cursor<Vec<u8>>>, exploiting the fact that Cursor<Vec<u8>>: BufRead.
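
To illustrate the distinction, here is a minimal sketch using only the standard library (it is not the arrow-csv code). `BufRead` is a trait, `BufReader` is one struct implementing it, and `Cursor<Vec<u8>>` implements the trait directly. The helper `count_lines` is a hypothetical name introduced for this example.

```rust
use std::io::{BufRead, BufReader, Cursor};

/// Hypothetical helper: counts newline bytes using only the `BufRead` API.
fn count_lines<R: BufRead>(mut reader: R) -> std::io::Result<usize> {
    let mut lines = 0;
    loop {
        // `fill_buf` exposes the reader's internal buffer as a slice.
        let buf = reader.fill_buf()?;
        if buf.is_empty() {
            return Ok(lines);
        }
        lines += buf.iter().filter(|b| **b == b'\n').count();
        let len = buf.len();
        reader.consume(len);
    }
}

fn main() -> std::io::Result<()> {
    let data = b"a,b\n1,2\n3,4\n".to_vec();

    // `Cursor<Vec<u8>>: BufRead`, so it can be passed as-is; `fill_buf`
    // returns a slice of the underlying `Vec`.
    let direct = count_lines(Cursor::new(data.clone()))?;

    // Wrapping in `BufReader` also works, but the bytes are first copied
    // into `BufReader`'s own buffer before being handed out.
    let wrapped = count_lines(BufReader::new(Cursor::new(data)))?;

    assert_eq!(direct, 3);
    assert_eq!(wrapped, 3);
    Ok(())
}
```

Both calls produce the same count; the difference is only the extra buffering layer in the second.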

@@ -1081,11 +1089,8 @@ impl ReaderBuilder {
if let Some(t) = self.terminator {
reader_builder.terminator(csv_core::Terminator::Any(t));
}
let reader = RecordReader::new(
BufReader::new(reader),
Contributor Author

This is where the "magic" occurs: the BufReader wrapping has moved so that it is only applied to the Read passed to build and, crucially, not to types passed to build_buffered.
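
The shape of this split can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: `Builder` and `LineReader` are hypothetical stand-ins, and having `build` delegate to `build_buffered` is one plausible arrangement rather than a confirmed detail.

```rust
use std::io::{BufRead, BufReader, Cursor, Read};

/// Hypothetical stand-in for a reader that requires buffered input.
struct LineReader<R> {
    inner: R,
}

impl<R: BufRead> LineReader<R> {
    /// Reads the next line, returning `None` at end of input.
    fn next_line(&mut self) -> std::io::Result<Option<String>> {
        let mut line = String::new();
        let n = self.inner.read_line(&mut line)?;
        Ok((n > 0).then(|| line.trim_end().to_string()))
    }
}

/// Hypothetical builder mirroring the `build` / `build_buffered` split.
struct Builder;

impl Builder {
    /// For any `Read`: adds a `BufReader` layer, then delegates.
    fn build<R: Read>(self, reader: R) -> LineReader<BufReader<R>> {
        self.build_buffered(BufReader::new(reader))
    }

    /// For inputs that are already `BufRead`: no extra layer is added.
    fn build_buffered<R: BufRead>(self, reader: R) -> LineReader<R> {
        LineReader { inner: reader }
    }
}

fn main() -> std::io::Result<()> {
    let data = b"a,b\n1,2\n".to_vec();

    // Resulting type: LineReader<BufReader<Cursor<Vec<u8>>>>
    let mut via_build = Builder.build(Cursor::new(data.clone()));
    // Resulting type: LineReader<Cursor<Vec<u8>>>
    let mut via_buffered = Builder.build_buffered(Cursor::new(data));

    assert_eq!(via_build.next_line()?, Some("a,b".to_string()));
    assert_eq!(via_buffered.next_line()?, Some("a,b".to_string()));
    Ok(())
}
```

The two entry points behave identically from the caller's side; only the concrete reader type differs, which is what lets an in-memory source skip the second buffer.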

mut self,
mut reader: R,
) -> Result<Reader<R>, ArrowError> {
) -> Result<BufReader<R>, ArrowError> {
Member

Oh, I see. I hadn't noticed that build_buffered returns BufReader<R> instead of Reader<R>. In the case where R is a BufRead, there is no longer a std::io::BufReader under RecordReader.

Member

Yeah, I think this is the more flexible one. 👍

arrow-csv/src/reader/mod.rs (outdated)
/// Create a new `Reader` from a non-buffered reader
///
/// If `R: BufRead` consider using [`Self::build_buffered`] to avoid unnecessary additional
/// buffering, as internally this method wraps `reader` in [`std::io::BufReader`]
Member

Thanks for the update.

@tustvold tustvold merged commit 8cab7a2 into apache:master Dec 19, 2022
ursabot commented Dec 19, 2022

Benchmark runs are scheduled for baseline = e664208 and contender = 8cab7a2. 8cab7a2 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
