Make BallistaContext::collect streaming #535

Merged on Jun 11, 2021 (1 commit)

@edrevo commented Jun 10, 2021

Which issue does this PR close?

Closes #534.

Rationale for this change

The `collect` implementation in `BallistaContext` loads the entire result set into memory, even though there is no need for it.

What changes are included in this PR?

`collect` now uses streams end to end, so it no longer hogs memory.
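
To illustrate the difference between buffering and streaming described here, a minimal synchronous sketch follows. It is not Ballista code: a plain iterator of `Vec<u64>` stands in for an async stream of record batches, and `batches`, `sum_buffered`, and `sum_streaming` are hypothetical names introduced only for this example.

```rust
/// Hypothetical batch source: yields `n` batches of `rows` rows each.
/// Stands in for an async stream of record batches.
fn batches(n: usize, rows: usize) -> impl Iterator<Item = Vec<u64>> {
    (0..n).map(move |i| vec![i as u64; rows])
}

/// Eager approach: materialize every batch before doing any work.
/// Returns (sum of all values, peak number of rows held at once).
fn sum_buffered(n: usize, rows: usize) -> (u64, usize) {
    let all: Vec<Vec<u64>> = batches(n, rows).collect();
    let peak: usize = all.iter().map(Vec::len).sum();
    let total: u64 = all.iter().flatten().sum();
    (total, peak)
}

/// Streaming approach: handle one batch at a time and drop it afterwards.
/// Returns (sum of all values, peak number of rows held at once).
fn sum_streaming(n: usize, rows: usize) -> (u64, usize) {
    let mut total = 0u64;
    let mut peak = 0usize;
    for batch in batches(n, rows) {
        peak = peak.max(batch.len());
        total += batch.iter().sum::<u64>();
        // `batch` is dropped here, before the next one is produced.
    }
    (total, peak)
}
```

Both functions compute the same total, but the streaming variant only ever holds a single batch, which is the property this PR is after.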

Are there any user-facing changes?

No

edrevo commented Jun 10, 2021

cc @andygrove

Comment on lines +77 to +102
struct WrappedStream {
    stream: Pin<Box<dyn Stream<Item = ArrowResult<RecordBatch>> + Send + Sync>>,
    schema: SchemaRef,
}

impl RecordBatchStream for WrappedStream {
    fn schema(&self) -> SchemaRef {
        self.schema.clone()
    }
}

impl Stream for WrappedStream {
    type Item = ArrowResult<RecordBatch>;

    fn poll_next(
        mut self: Pin<&mut Self>,
        cx: &mut std::task::Context<'_>,
    ) -> std::task::Poll<Option<Self::Item>> {
        self.stream.poll_next_unpin(cx)
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        self.stream.size_hint()
    }
}

@edrevo commented Jun 10, 2021

I was surprised I couldn't find anything like this. If there is a similar struct that I missed, please do let me know and I'll use that one.

Also, since this is a pretty general wrapper, if you want me to move this to another place and make it public, I can do that too.
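
For readers without the DataFusion types at hand, the wrapper pattern in the snippet under review can be sketched using only the standard library. Everything here is a stand-in and an assumption of this example: the local `Stream` trait mimics just the two methods of `futures::Stream` used in the diff, `BatchStream` plays the role of `RecordBatchStream`, and `SchemaRef`, `Batch`, `IterStream`, `Wrapped`, `wrap`, and `drain` are hypothetical names.

```rust
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Local stand-in for `futures::Stream` (only the two methods used in the diff).
trait Stream {
    type Item;
    fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>>;
    fn size_hint(&self) -> (usize, Option<usize>) {
        (0, None)
    }
}

/// Hypothetical stand-ins for Arrow's `SchemaRef` and `RecordBatch`.
type SchemaRef = Arc<Vec<String>>;
type Batch = Vec<i32>;

/// Stand-in for `RecordBatchStream`: a stream that also reports its schema.
trait BatchStream: Stream<Item = Batch> {
    fn schema(&self) -> SchemaRef;
}

/// A stream backed by an in-memory iterator; it is always ready.
struct IterStream<I>(I);

impl<I: Iterator + Unpin> Stream for IterStream<I> {
    type Item = I::Item;
    fn poll_next(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        Poll::Ready(self.0.next())
    }
    fn size_hint(&self) -> (usize, Option<usize>) {
        self.0.size_hint()
    }
}

/// Pairs a type-erased stream with a schema, delegating all polling to it.
struct Wrapped {
    stream: Pin<Box<dyn Stream<Item = Batch> + Send + Sync>>,
    schema: SchemaRef,
}

impl Stream for Wrapped {
    type Item = Batch;
    fn poll_next(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        // Re-borrow the pinned inner stream and forward the poll.
        self.stream.as_mut().poll_next(cx)
    }
    fn size_hint(&self) -> (usize, Option<usize>) {
        self.stream.size_hint()
    }
}

impl BatchStream for Wrapped {
    fn schema(&self) -> SchemaRef {
        self.schema.clone()
    }
}

/// Builds a `Wrapped` stream over batches that are already in memory.
fn wrap(batches: Vec<Batch>, schema: SchemaRef) -> Wrapped {
    Wrapped { stream: Box::pin(IterStream(batches.into_iter())), schema }
}

/// Waker that does nothing; sufficient because the demo streams never block.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Test helper: polls a stream to completion and gathers its items.
fn drain<S: Stream + Unpin>(mut s: S) -> Vec<S::Item> {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut out = Vec::new();
    loop {
        match Pin::new(&mut s).poll_next(&mut cx) {
            Poll::Ready(Some(item)) => out.push(item),
            Poll::Ready(None) => break,
            // Busy-polls; a real executor would park until woken.
            Poll::Pending => std::thread::yield_now(),
        }
    }
    out
}
```

The wrapper adds no buffering of its own: each `poll_next` call is forwarded to the inner stream, and the schema is carried alongside it. `drain` exists only to exercise the sketch, since collecting into a `Vec` is exactly what a streaming consumer would avoid.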

@codecov-commenter

Codecov Report

Merging #535 (71b51ba) into master (d5bca0e) will decrease coverage by 0.01%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master     #535      +/-   ##
==========================================
- Coverage   76.03%   76.02%   -0.02%     
==========================================
  Files         157      157              
  Lines       26990    26994       +4     
==========================================
  Hits        20521    20521              
- Misses       6469     6473       +4     
| Impacted Files | Coverage Δ |
|---|---|
| ballista/rust/client/src/context.rs | 0.00% <0.00%> (ø) |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@andygrove left a comment

LGTM

@andygrove

I ran the integration tests locally and they passed, so I am going to go ahead and merge this.

@andygrove merged commit 63e3045 into apache:master on Jun 11, 2021
@houqp added the ballista and performance (Make DataFusion faster) labels on Jul 30, 2021
Successfully merging this pull request may close these issues.

Make BallisitaContext::collect streaming