parallel processing in Optimize command #1171

Closed
roeap opened this issue Feb 23, 2023 · 0 comments · Fixed by #1383
Labels
enhancement New feature or request

roeap (Collaborator) commented Feb 23, 2023

Description

Since moving to the async reader when optimizing (binning) files, we can now leverage parallel processing to speed up the operation. Specifically, the loops that execute the MergePlan are straightforward to parallelize.
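
As a rough illustration of what parallelizing those loops could look like, here is a minimal sketch that processes independent bins with a bounded number of workers. It deliberately uses standard-library scoped threads instead of the project's async runtime, and `Bin`, `compact_bin`, and `compact_all` are hypothetical names for illustration, not the actual `MergePlan` API.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Hypothetical stand-in for one bin of a merge plan: the sizes (in bytes)
/// of the small files that would be rewritten into a single file.
pub type Bin = Vec<u64>;

/// Hypothetical stand-in for rewriting one bin. A real task would read and
/// rewrite files; here it only returns the combined size.
fn compact_bin(bin: &[u64]) -> u64 {
    bin.iter().sum()
}

/// Process every bin using at most `max_workers` threads.
/// Results are returned in the same order as `bins`.
pub fn compact_all(bins: &[Bin], max_workers: usize) -> Vec<u64> {
    // Use at least one worker, and never more workers than bins.
    let workers = max_workers.max(1).min(bins.len().max(1));
    // Shared cursor: each worker claims the next unprocessed bin index.
    let next = AtomicUsize::new(0);
    let mut results = vec![0u64; bins.len()];

    thread::scope(|scope| {
        let mut handles = Vec::with_capacity(workers);
        for _ in 0..workers {
            handles.push(scope.spawn(|| {
                let mut done = Vec::new();
                loop {
                    let i = next.fetch_add(1, Ordering::Relaxed);
                    if i >= bins.len() {
                        break;
                    }
                    done.push((i, compact_bin(&bins[i])));
                }
                done
            }));
        }
        // Collect each worker's results back into bin order.
        for handle in handles {
            for (i, merged) in handle.join().expect("worker panicked") {
                results[i] = merged;
            }
        }
    });

    results
}
```

Because the bins do not depend on each other, the only shared state in this sketch is the index cursor, which is why the loop is easy to split across workers.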

Use Case

Related Issue(s)

roeap added the enhancement label Feb 23, 2023
wjones127 added a commit that referenced this issue Jun 3, 2023
# Description

This refactor:

1. Runs compaction tasks in parallel, with parallelism controlled by the
user but defaulting to the number of CPUs. (The `num_cpus` crate is used by
`tokio`, so we already have it transitively.)
2. Turns on zstd compression by default at level 4. In a future PR, we
can make this configurable for Python and maybe benchmark different
levels.
3. Does initial prep for other types of optimize commands.
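
The "user-controlled, defaulting to number of CPUs" behaviour in item 1 can be sketched as below. This is an assumption-laden illustration: it uses the standard library's `std::thread::available_parallelism` rather than the `num_cpus` crate mentioned above, and `resolve_max_concurrent_tasks` is a hypothetical helper name, not the option the PR actually exposes.

```rust
use std::num::NonZeroUsize;
use std::thread;

/// Hypothetical helper: pick how many compaction tasks may run at once.
/// A positive user-supplied value wins; otherwise fall back to the number
/// of CPUs the OS reports, or 1 if that cannot be determined.
pub fn resolve_max_concurrent_tasks(user_choice: Option<usize>) -> usize {
    match user_choice {
        Some(n) if n > 0 => n,
        _ => thread::available_parallelism()
            .map(NonZeroUsize::get)
            .unwrap_or(1),
    }
}
```

Treating `Some(0)` as "use the default" is a design choice of this sketch; an implementation could equally reject it as invalid input.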

However, the writer isn't very good at hitting a target file size,
because the code that checks the size of the written file only knows the
size of the already-serialized row groups, not the row group currently
being written. So if your row groups are 100MB in size and you target
150MB, you will get 200MB files. There is upstream work in
apache/arrow-rs#4280 that will allow us to write
much more exactly sized files, so this will improve in the near future.
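
The overshoot described above can be reproduced with a deliberately simplified model. This is not the real writer: it assumes every row group has the same size and that the size check sees only fully serialized row groups. `simulate_file_size` is a hypothetical name.

```rust
/// Simplified model of the size check: the writer keeps adding row groups
/// until the bytes it can see (fully serialized row groups only) reach the
/// target. Sizes are in any consistent unit, e.g. MB.
pub fn simulate_file_size(row_group_size: u64, target_size: u64) -> u64 {
    assert!(row_group_size > 0, "row groups must have a non-zero size");
    let mut serialized = 0;
    loop {
        // The check cannot see the in-progress row group, so it only
        // compares already-serialized data against the target.
        if serialized >= target_size {
            return serialized;
        }
        // Another whole row group is written before the next check.
        serialized += row_group_size;
    }
}
```

With 100MB row groups and a 150MB target, the check sees 100MB after the first row group, continues, and the file closes at 200MB. In this model the final size is always the target rounded up to a whole number of row groups.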

# Related Issue(s)

closes #1171

# Documentation

<!---
Share links to useful documentation
--->

---------

Co-authored-by: Robert Pack <[email protected]>