Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support parallel scan in mito engine #2806

Closed
7 tasks done
evenyag opened this issue Nov 23, 2023 · 1 comment
Closed
7 tasks done

Support parallel scan in mito engine #2806

evenyag opened this issue Nov 23, 2023 · 1 comment
Assignees
Labels
C-enhancement Category Enhancements C-performance Category Performance tracking-issue A tracking issue for a feature.

Comments

@evenyag
Copy link
Contributor

evenyag commented Nov 23, 2023

What type of enhancement is this?

Performance

What does the enhancement do?

Currently, the Mito engine only supports single-threaded scanning. We can consider using parallel scanning to improve speed when dealing with larger amounts of data.

Implementation challenges

There are several approaches to parallel scanning:

  • Parallel scanning of each file: If there is only one file to scan, parallelization is not possible.
  • Parallel scanning of row groups: It is important to note that, in order to ensure the final result is sorted, it is necessary to scan the row groups used by the MergeReader. Implementing this approach can be more complex.

We also need to figure out a way to control the parallelism of the query so spawning a task for each file might not be the best solution (We might need some experiments).

Steps

@evenyag evenyag added C-enhancement Category Enhancements C-performance Category Performance labels Nov 23, 2023
@evenyag evenyag self-assigned this Nov 23, 2023
@evenyag evenyag added this to mito2 Nov 27, 2023
@evenyag evenyag moved this to Todo in mito2 Nov 27, 2023
@github-project-automation github-project-automation bot moved this from Todo to Done in mito2 Dec 12, 2023
@evenyag
Copy link
Contributor Author

evenyag commented Apr 25, 2024

I'm going to reopen this issue as file-level parallelism is not enough if the number of parquet files is less than the parallelism. We still need to implement a more fine-grained parallel scan strategy, such as row group level parallel scan. However, parallel scanning row groups might not be able to maintain a sorted order of the SST file. As a result, we can implement this parallel strategy in append only tables that use UnorderedScan. In the future, we might support parallel merge to solve this restriction.

I did some experiments on this branch. I also found that we must support multiple output partitioning to maximize computations. I also fixed an issue that the parquet reader might have unexpected pruning behavior as it always uses the file's schema to create the physical expression.

Another potential optimization is to scan columns in parallel if the number of row groups is not enough like polar-rs. But this is not very necessary.

I'll update this issue to track the implementation of parallel scanning row groups.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Category Enhancements C-performance Category Performance tracking-issue A tracking issue for a feature.
Projects
Status: Done
Development

No branches or pull requests

2 participants