# Description
Since moving to the async reader for optimizing (bin-packing) files, we can now leverage parallel processing to speed up the operation. Specifically, the loops that execute the `MergePlan` are straightforward to parallelize; a rough sketch of the idea follows below.
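As a minimal, non-authoritative sketch (not the actual delta-rs code): the bins produced by the plan are independent, so they could be rewritten concurrently with a bounded `buffer_unordered` stream. `MergeBin`, `rewrite_bin`, and `max_concurrent_tasks` below are hypothetical stand-ins for whatever the real implementation uses.

```rust
use futures::stream::{self, StreamExt};

/// Hypothetical stand-in for one bin of files that should be compacted together.
struct MergeBin {
    files: Vec<String>,
}

/// Hypothetical per-bin work: read the listed files and write one merged file,
/// returning how many inputs were rewritten.
async fn rewrite_bin(bin: MergeBin) -> Result<usize, std::io::Error> {
    // The real task would stream record batches through the async reader/writer.
    Ok(bin.files.len())
}

#[tokio::main]
async fn main() -> Result<(), std::io::Error> {
    let plan = vec![
        MergeBin { files: vec!["part-0.parquet".into(), "part-1.parquet".into()] },
        MergeBin { files: vec!["part-2.parquet".into(), "part-3.parquet".into()] },
    ];

    // Instead of a serial loop over the plan, run up to `max_concurrent_tasks`
    // bins at once; the bins are independent, so completion order does not matter.
    let max_concurrent_tasks = num_cpus::get();
    let rewritten: Vec<usize> = stream::iter(plan)
        .map(rewrite_bin)
        .buffer_unordered(max_concurrent_tasks)
        .collect::<Vec<_>>()
        .await
        .into_iter()
        .collect::<Result<_, _>>()?;

    println!("rewrote {} bins", rewritten.len());
    Ok(())
}
```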
# Use Case

# Related Issue(s)
# Description
Refactors the optimize command so that it:
1. Runs compaction tasks in parallel, with the degree of parallelism controlled by the user but defaulting to the number of CPUs. (The `num_cpus` crate is already used by `tokio`, so we have it transitively.)
2. Turns on zstd compression by default at level 4. In a future PR, we can make this configurable from Python and maybe benchmark different levels. (A configuration sketch covering this and item 1 follows after this list.)
3. Does initial prep work to support other types of optimize commands.
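A minimal sketch of the two defaults above. The option names (`OptimizeOptions`, `max_concurrent_tasks`) are hypothetical, and the snippet assumes a recent `parquet` crate where `Compression::ZSTD` carries a `ZstdLevel`; the real delta-rs API may differ.

```rust
use parquet::basic::{Compression, ZstdLevel};
use parquet::file::properties::WriterProperties;

/// Hypothetical options struct for the optimize command.
struct OptimizeOptions {
    /// How many compaction tasks may run at once; defaults to the CPU count.
    max_concurrent_tasks: usize,
    /// Parquet writer settings used for the rewritten files.
    writer_properties: WriterProperties,
}

impl Default for OptimizeOptions {
    fn default() -> Self {
        let writer_properties = WriterProperties::builder()
            // zstd level 4 as a middle ground between speed and compression ratio.
            .set_compression(Compression::ZSTD(
                ZstdLevel::try_new(4).expect("4 is a valid zstd level"),
            ))
            .build();
        Self {
            // `num_cpus` is pulled in transitively via tokio.
            max_concurrent_tasks: num_cpus::get(),
            writer_properties,
        }
    }
}

fn main() {
    let opts = OptimizeOptions::default();
    println!("running up to {} compaction tasks", opts.max_concurrent_tasks);
}
```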
However, the writer isn't very good at hitting a target file size, because the code that checks the size of the written file only sees the row groups that have already been serialized, not the row group currently being buffered. So if your row groups are 100MB in size and you target 150MB, you will get 200MB files. There is upstream work in apache/arrow-rs#4280 that will allow us to write files much closer to the target size, so this should improve in the near future.
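A toy illustration of the arithmetic above, assuming the simplified (hypothetical) rule that a file is closed only once the already-flushed row groups reach the target size:

```rust
/// Hypothetical size check: it can only see row groups that were already
/// flushed, not the row group currently being buffered in memory.
fn should_close_file(flushed_row_group_bytes: &[usize], target_file_size: usize) -> bool {
    flushed_row_group_bytes.iter().sum::<usize>() >= target_file_size
}

fn main() {
    // 100MB row groups with a 150MB target: the check only trips after the
    // second row group has been flushed, so the file ends up at ~200MB.
    let mb = 1024 * 1024;
    let target = 150 * mb;
    let mut flushed: Vec<usize> = Vec::new();
    for row_group in [100 * mb, 100 * mb] {
        flushed.push(row_group);
        println!(
            "flushed {} MB -> close file: {}",
            flushed.iter().sum::<usize>() / mb,
            should_close_file(&flushed, target)
        );
    }
}
```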
# Related Issue(s)
Closes #1171
# Documentation
---------
Co-authored-by: Robert Pack <[email protected]>