Use Arc<Statistics> rather than Statistics in PartitionedFile #11885
Comments
I think this is a good first issue as the code is relatively simple and mechanical. We'll help do the benchmarking. The benchmarks are described in https://github.com/apache/datafusion/tree/main/benchmarks |
take |
May try the idea mentioned soon after the working experiment. |
It seems easy and I can give it a try @alamb |
Oh, sorry... I already took it and am working on it... In fact it is related to a strange performance problem whose cause I am investigating. And maybe some other approaches should be tried to solve it, based on the profile result. |
@alamb It seems using I can almost make sure that, it is not noise... (slower 400ms, and can reproduce in all tries)...
I try to drop the
|
Is your feature request related to a problem or challenge?
We are trying to improve the speed of DataFusion when running the ClickBench partitioned test (which has 100 files) -- this means the per-file overhead is important to reduce.
One structure that has non-trivial overhead is the Statistics structure, as it has a ScalarValue for each column of each file, so there are at least 100 * (number of columns) * 2 ScalarValues.
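To make the count above concrete, here is a minimal sketch of the arithmetic. It assumes each column of each file carries exactly one minimum and one maximum ScalarValue (hence the factor of 2); the function name `min_scalar_values` is hypothetical and not part of DataFusion.

```rust
/// Lower bound on the number of ScalarValue instances held in file statistics,
/// assuming one min and one max value per column per file.
/// (Hypothetical helper for illustration only.)
fn min_scalar_values(num_files: usize, num_columns: usize) -> usize {
    num_files * num_columns * 2
}
```

For example, with the 100 files of the partitioned test and a table of, say, 50 columns, `min_scalar_values(100, 50)` gives 10,000 values that are deep-copied every time the statistics are cloned.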
Describe the solution you'd like
It would be great to reduce the overhead of passing around these values.
Describe alternatives you've considered
One way to do so is to avoid copying them when the underlying ParquetExec is copied, by using an Option<Arc<Statistics>> here: https://github.com/apache/datafusion/blob/9503456388544788e1a881a0a80a3c61ac015a86/datafusion/core/src/datasource/listing/mod.rs#L81-L80
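The idea can be sketched roughly as follows. The types below are simplified, hypothetical stand-ins rather than DataFusion's real definitions (the actual Statistics, ScalarValue and PartitionedFile have more fields and variants), and the helpers `make_file` and `shares_statistics` exist only for this illustration. The point is that cloning a file entry whose statistics sit behind an Arc only bumps a reference count instead of deep-copying every ScalarValue.

```rust
use std::sync::Arc;

// Hypothetical, simplified stand-ins for DataFusion's types.
#[derive(Clone, Debug, PartialEq)]
enum ScalarValue {
    Int64(Option<i64>),
}

#[derive(Clone, Debug, PartialEq)]
struct ColumnStatistics {
    min_value: Option<ScalarValue>,
    max_value: Option<ScalarValue>,
}

#[derive(Clone, Debug, PartialEq)]
struct Statistics {
    column_statistics: Vec<ColumnStatistics>,
}

// With the statistics behind an Arc, the derived Clone copies a pointer
// and increments a reference count rather than cloning the whole Vec.
#[derive(Clone, Debug)]
struct PartitionedFile {
    path: String,
    statistics: Option<Arc<Statistics>>,
}

// Build a file entry with `num_columns` columns of dummy min/max values.
fn make_file(path: &str, num_columns: usize) -> PartitionedFile {
    let column_statistics = (0..num_columns)
        .map(|i| ColumnStatistics {
            min_value: Some(ScalarValue::Int64(Some(0))),
            max_value: Some(ScalarValue::Int64(Some(i as i64))),
        })
        .collect();
    PartitionedFile {
        path: path.to_string(),
        statistics: Some(Arc::new(Statistics { column_statistics })),
    }
}

// True when both entries point at the same Statistics allocation.
fn shares_statistics(a: &PartitionedFile, b: &PartitionedFile) -> bool {
    match (&a.statistics, &b.statistics) {
        (Some(x), Some(y)) => Arc::ptr_eq(x, y),
        _ => false,
    }
}
```

In this sketch, `let copy = file.clone();` leaves `shares_statistics(&file, &copy)` true, so no ScalarValue is duplicated. Whether this translates into a measurable end-to-end win would still need to be confirmed with the benchmarks.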
Additional context
Interestingly @Rachelint
#11802 (comment)