Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Arc<Statistics> rather than Statistics in PartitionedFile #11885

Open
alamb opened this issue Aug 8, 2024 · 6 comments
Open

Use Arc<Statistics> rather than Statistics in PartitionedFile #11885

alamb opened this issue Aug 8, 2024 · 6 comments
Assignees
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@alamb
Copy link
Contributor

alamb commented Aug 8, 2024

Is your feature request related to a problem or challenge?

We are trying to improve the speed of DataFusion when running the ClickBench partitioned test (which has 100 files) -- this means the per-file overhead is important to redudce

One structure that has non trivial overhead is the Statistics structure (as it has a ScalarValue for each column of each file so there are 100 * (number columns) * 2 at least ScalarValues

Describe the solution you'd like

It would be great to reduce the overhead of passing around these values.

Describe alternatives you've considered

One way to do so is to avoid copying them when the underlying ParquetExec is copied by using an Option<Arc<Statistics>> here:

https://github.com/apache/datafusion/blob/9503456388544788e1a881a0a80a3c61ac015a86/datafusion/core/src/datasource/listing/mod.rs#L81-L80

Additional context

Interestingly @Rachelint
#11802 (comment)

@alamb alamb added enhancement New feature or request good first issue Good for newcomers labels Aug 8, 2024
@alamb
Copy link
Contributor Author

alamb commented Aug 8, 2024

I think this is a good first issue as the code is relatively simple and mechanical. We'll help do the benchmarking

The benchmarks are described in https://github.com/apache/datafusion/tree/main/benchmarks

@Rachelint
Copy link
Contributor

take

@Rachelint
Copy link
Contributor

Rachelint commented Aug 8, 2024

May try the idea mentioned soon after the working experiment.

#11802 (comment)

@albinsabu2023
Copy link

albinsabu2023 commented Aug 8, 2024

its seems easy and i can give a try on it @alamb

@Rachelint
Copy link
Contributor

Rachelint commented Aug 8, 2024

its seems easy and i can give a try on it @alamb

Oh, sorry... alreay took and workig on...

In fact it is related to a strange performance problem I am pursuing the reason.

And maybe some other works should be tried to solve it based on the profile result.

@Rachelint
Copy link
Contributor

Rachelint commented Aug 8, 2024

@alamb It seems using Arc<Statistic> in PartitionedFile actually hurts some long queries (I test the q22 in clickbench as last pr).

I can almost make sure that, it is not noise... (slower 400ms, and can reproduce in all tries)...

--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃      main ┃ arc-statistic ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 0     │ 8775.79ms │     9218.05ms │ 1.05x slower │
└──────────────┴───────────┴───────────────┴──────────────┘

I try to drop the Arc<Statistic> before the actual execution... It still slower the q22...

--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃      main ┃ arc-statistic ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 0     │ 8775.79ms │     9218.05ms │ 1.05x slower │
└──────────────┴───────────┴───────────────┴──────────────┘

https://github.com/apache/datafusion/pull/11894/files#diff-5af0519ea922d2f9e6e4101e0a22ea8b3e64fca9c73d06147a734cc86de797d4R395-R398

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers
Projects
None yet
3 participants