Improve parallel CSV scan #6922

Closed
2010YOUY01 opened this issue Jul 12, 2023 · 4 comments
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@2010YOUY01
Contributor

Is your feature request related to a problem or challenge?

This issue tracks the remaining tasks from the initial parallel CSV scan PR #6801.

The remaining tasks:

  1. Use get_opts() for range reads on the local FS
    get_opts() is the ObjectStore interface for streaming range reads (local FS / cloud storage). Range reads are currently not supported on the local FS: https://github.com/apache/arrow-rs/blob/0d4e6a727f113f42d58650d2dbecab89b22d4e28/object_store/src/lib.rs#L355
    Once this is implemented in arrow-rs, we can use it in the parallel CSV scan implementation and possibly get some performance improvement (the current implementation copies the whole CSV file range into memory at once instead of streaming it).
  2. Use only 1 get operation from ObjectStore for each partition instead of 3 (see the original PR discussion)

Task 2 is easier to do after task 1 is done, because it can then be tested on the local filesystem.
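
To illustrate the difference between copying a whole file range into memory and streaming it, here is a minimal sketch that uses only the Rust standard library. It is not the DataFusion or object_store implementation: `split_ranges` and `read_range_streaming` are hypothetical helpers, and the sketch assumes a local file and ignores the line-boundary alignment that a real CSV scan needs.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::ops::Range;
use std::path::Path;

/// Hypothetical helper: split `file_len` bytes into `n` contiguous byte
/// ranges of near-equal size. Ranges are NOT aligned to line boundaries.
fn split_ranges(file_len: u64, n: u64) -> Vec<Range<u64>> {
    assert!(n > 0, "need at least one partition");
    let chunk = (file_len + n - 1) / n; // ceiling division
    (0..n)
        .map(|i| {
            let start = (i * chunk).min(file_len);
            let end = ((i + 1) * chunk).min(file_len);
            start..end
        })
        .collect()
}

/// Hypothetical helper: stream the bytes of `range` from a local file,
/// handing at most `buf_size` bytes at a time to `on_chunk`, so the whole
/// range is never held in memory at once. Returns the number of bytes read.
fn read_range_streaming<F>(
    path: &Path,
    range: Range<u64>,
    buf_size: usize,
    mut on_chunk: F,
) -> io::Result<u64>
where
    F: FnMut(&[u8]),
{
    assert!(buf_size > 0, "buffer must not be empty");
    let mut file = File::open(path)?;
    file.seek(SeekFrom::Start(range.start))?;
    // `take` caps the reader at the length of the requested range.
    let mut limited = file.take(range.end.saturating_sub(range.start));
    let mut buf = vec![0u8; buf_size];
    let mut total = 0u64;
    loop {
        let n = limited.read(&mut buf)?;
        if n == 0 {
            break; // end of range or end of file
        }
        on_chunk(&buf[..n]);
        total += n as u64;
    }
    Ok(total)
}
```

Each partition would call `read_range_streaming` on its own range from `split_ranges`, so peak memory per partition is bounded by `buf_size` rather than by the size of the range.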

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

@2010YOUY01 2010YOUY01 added the enhancement New feature or request label Jul 12, 2023
@alamb alamb added the good first issue Good for newcomers label Jul 12, 2023
@alamb
Contributor

alamb commented Jul 12, 2023

Since the items on this enhancement are well understood, I think it would be a good one to try for someone who wants to improve things in DataFusion.

@parkma99
Contributor

parkma99 commented Jul 13, 2023

I will take it.

I found that apache/arrow-rs#4352 tracks range reads on the local FS.

I would like to handle this after that issue is closed.

@parkma99
Contributor

parkma99 commented Aug 29, 2023

It was fixed by #7282. cc @alamb

@alamb
Contributor

alamb commented Sep 5, 2023

Thanks for the ping @parkma99

3 participants