Should statistics be collected in parallel, like schema inference? #7573
Comments
Hi @hengfeiyang -- this sounds like a great idea to me. Rather than hard-coding the concurrency, perhaps you can add a config parameter. The rationale for a configuration setting is that the optimal value will likely depend on the network configuration of the system, so there is no good constant that would work in all cases. Perhaps it can default to 10 or 32? cc @Ted-Jiang, who I think was working on some other settings to cache statistics for multiple queries in the same session. See #7570
@alamb You are right, we should add an option for it. Maybe we should change both.
@Ted-Jiang So you will improve this part in your PR, and I don't need to create another PR for this, right?
@Ted-Jiang's PR is merged, so this change would need a follow-on PR.
@alamb Okay, I will do it. Thanks.
Thank you @hengfeiyang -- most appreciated
Closed in #7595
Is your feature request related to a problem or challenge?
When I searched data from S3, I found that DataFusion fetches Parquet file metadata one file at a time, which is slow when there are many files.
The code is here:
https://github.com/apache/arrow-datafusion/blob/31.0.0/datafusion/core/src/datasource/listing/table.rs#L960C1-L985
I found this code uses `iter::then`, so it fetches the data one by one. But I found something here:
https://github.com/apache/arrow-datafusion/blob/31.0.0/datafusion/core/src/datasource/file_format/parquet.rs#L170-L175
When fetching the schema, it uses concurrent requests.
Is it possible to do the same thing here and use concurrent requests for collecting statistics?
Describe the solution you'd like
Actually, I tried changing the code:
https://github.com/apache/arrow-datafusion/blob/31.0.0/datafusion/core/src/datasource/listing/table.rs#L960C1-L985
To this:
And set a const variable:
The search speed is much improved in my local environment because it can now fetch Parquet files from S3 concurrently to collect statistics; earlier it requested files one by one.
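The code snippets from the original post are not preserved here, so the following is only a minimal, std-only sketch of the general idea (bounded concurrency with a configurable limit), not the actual patch. `FileStats`, `fetch_stats`, and `collect_stats` are hypothetical names, and the sleep merely simulates a remote metadata request. Scoped threads stand in for the async version; in async code the same effect would typically come from `StreamExt::buffered` in the `futures` crate.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for the statistics collected per file.
#[derive(Debug, Clone, PartialEq)]
struct FileStats {
    path: String,
    num_rows: usize,
}

/// Hypothetical stand-in for a remote metadata fetch
/// (for example, reading a Parquet footer from object storage).
fn fetch_stats(path: &str) -> FileStats {
    // Simulated network latency; an assumption for illustration only.
    thread::sleep(Duration::from_millis(5));
    FileStats {
        path: path.to_string(),
        num_rows: path.len(),
    }
}

/// Fetch statistics for every path with at most `concurrency` requests
/// in flight at once. Results are returned in input order.
fn collect_stats(paths: &[String], concurrency: usize) -> Vec<FileStats> {
    // Shared cursor: each worker claims the next unprocessed index.
    let next = AtomicUsize::new(0);
    let results: Mutex<Vec<Option<FileStats>>> = Mutex::new(vec![None; paths.len()]);
    // Never spawn zero workers, and never more workers than files.
    let workers = concurrency.max(1).min(paths.len().max(1));

    thread::scope(|s| {
        for _ in 0..workers {
            s.spawn(|| loop {
                let i = next.fetch_add(1, Ordering::Relaxed);
                if i >= paths.len() {
                    break;
                }
                // The slow call happens outside the lock.
                let stats = fetch_stats(&paths[i]);
                results.lock().unwrap()[i] = Some(stats);
            });
        }
    });

    results
        .into_inner()
        .unwrap()
        .into_iter()
        .map(|s| s.expect("every index is filled by exactly one worker"))
        .collect()
}
```

Calling `collect_stats(&paths, 1)` reproduces the one-by-one behaviour, while a larger value such as `collect_stats(&paths, 32)` overlaps the requests. Taking the limit as a parameter rather than a constant matches the suggestion in the comments to make it configurable.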
Describe alternatives you've considered
No response
Additional context
No response