parallel processing in Optimize command #1171

roeap · 2023-02-23T05:52:17Z

Description

Since moving to the async reader when optimizing (binning) files, we can now leverage parallel processing to speed up the operation. Specifically, the loops that execute the MergePlan are straightforward to parallelize.
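To make the idea concrete, here is a minimal sketch of running the bins of a merge plan on a bounded number of workers. It is not the delta-rs implementation: `Bin`, `compact_bin`, `execute_plan`, and `default_parallelism` are hypothetical names, the "compaction" just sums file sizes, and plain scoped threads from the standard library stand in for the async machinery the real code would use.

```rust
use std::num::NonZeroUsize;
use std::sync::Mutex;
use std::thread;

/// Hypothetical stand-in for one bin of a merge plan:
/// the sizes (in bytes) of the files that should be merged together.
type Bin = Vec<u64>;

/// Hypothetical placeholder for rewriting one bin into a single file.
/// Here it only returns the combined size of the inputs.
fn compact_bin(bin: &Bin) -> u64 {
    bin.iter().sum()
}

/// Default parallelism: the number of CPUs the OS reports, or 1 if unknown.
fn default_parallelism() -> usize {
    thread::available_parallelism()
        .map(NonZeroUsize::get)
        .unwrap_or(1)
}

/// Runs `compact_bin` over every bin using at most `max_concurrent` workers.
/// Results come back in the same order as the input bins.
fn execute_plan(bins: &[Bin], max_concurrent: usize) -> Vec<u64> {
    // Use at least one worker, and never more workers than bins.
    let workers = max_concurrent.max(1).min(bins.len().max(1));
    let next = Mutex::new(0usize);
    let results = Mutex::new(vec![0u64; bins.len()]);

    thread::scope(|s| {
        for _ in 0..workers {
            s.spawn(|| loop {
                // Claim the index of the next unprocessed bin.
                let i = {
                    let mut guard = next.lock().unwrap();
                    let i = *guard;
                    *guard += 1;
                    i
                };
                if i >= bins.len() {
                    break;
                }
                // The expensive work happens outside of any lock.
                let merged = compact_bin(&bins[i]);
                results.lock().unwrap()[i] = merged;
            });
        }
    });

    results.into_inner().unwrap()
}

fn main() {
    let bins: Vec<Bin> = vec![vec![10, 20], vec![5], vec![7, 8, 9]];
    let merged = execute_plan(&bins, default_parallelism());
    println!("{:?}", merged);
}
```

Because each bin is independent of the others, the only shared state is the work counter and the result vector, which is what makes this loop easy to parallelize.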

Use Case

Related Issue(s)


# Description

Refactors such that:

1. Runs compaction tasks in parallel, with parallelism controlled by the user but defaulting to the number of CPUs. (The `num_cpus` crate is used by `tokio`, so we already have it transitively.)
2. Turns on zstd compression by default at level 4. In a future PR, we can make this configurable for Python and maybe benchmark different levels.
3. Initial prep to have other types of optimize commands.

However, the writer isn't very good at writing for a target row size, because the code that checks the size of the written file only knows the size of the serialized row groups and not the current row group. So if your row groups are 100MB in size, and you target 150MB, you will get 200MB files. There is upstream work in apache/arrow-rs#4280 that will allow us to write much more exactly sized files, so this will improve in the near future.

# Related Issue(s)

closes #1171

# Documentation

---------

Co-authored-by: Robert Pack <[email protected]>
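The 100MB / 150MB / 200MB example above can be reproduced with a toy model. The sketch below is an assumption-laden simplification, not the actual writer: `simulate_file_size` is a hypothetical function that assumes every row group has the same size and that the size check only ever sees row groups that have already been serialized.

```rust
/// Toy model of a writer whose size check only sees serialized row groups.
/// Returns the final file size in MB for fixed-size row groups.
/// Hypothetical helper for illustration; sizes are whole megabytes.
fn simulate_file_size(row_group_mb: u64, target_mb: u64) -> u64 {
    assert!(row_group_mb > 0, "row groups must have a non-zero size");
    // Bytes the size check can observe: fully serialized row groups only.
    let mut serialized_mb = 0;
    loop {
        // The check runs between row groups and cannot see the one in progress.
        if serialized_mb >= target_mb {
            return serialized_mb;
        }
        // The whole next row group is written before the check runs again.
        serialized_mb += row_group_mb;
    }
}

fn main() {
    let size = simulate_file_size(100, 150);
    println!("100MB row groups, 150MB target -> {}MB file", size);
}
```

In this model the file size is always the target rounded up to a whole number of row groups, so the overshoot shrinks as row groups get smaller relative to the target.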

roeap added the enhancement New feature or request label Feb 23, 2023

Blajda mentioned this issue May 15, 2023

vacuum is very slow on Cloudflare R2 #1366

Closed

roeap mentioned this issue Jun 3, 2023

feat: allow concurrent file compaction #1383

Merged

wjones127 closed this as completed in #1383 Jun 3, 2023
