FIX-#6558: Normalize the number of partitions after .read_parquet() #6559

Merged
merged 5 commits into modin-project:master on Sep 16, 2023

Conversation

@dchigarev (Collaborator) commented Sep 14, 2023

What do these changes do?

This PR makes the following changes to how parquet reading is distributed:

  1. The number of column partitions now depends on the number of row partitions. Previously, the implementation tended to create as many column partitions as possible (often producing column partitions of just a single column each), which worked reasonably well when the parquet file had only one row group (and so only one row partition):

    NUM_CPUS=16
    parquet_file # has 1 row group and 16 columns
    pd.read_parquet(parquet_file)._partitions.shape # (1, 16) - 1 row partition and 16 column parts

    However, if there were enough row groups to keep all the workers busy, this excessive number of column partitions produced an over-partitioned frame:

    file1 # has 9 row groups and 16 columns
    pd.read_parquet(file1)._partitions.shape # (9, 16)
    
    file2 # has 24 row groups and 16 columns
    pd.read_parquet(file2)._partitions.shape # (16, 16) - square partitioned frame

    Not only does this logic generate far more reading kernels than the user has cores, slowing down the reading itself, it also tends to produce over-partitioned frames that slow down later operations in the workflow (Reduce amount of remote calls for square-like dataframes #5394).

    This logic was changed: we now first determine how many row partitions the output dataframe will have, and then create column partitions based on the number of partitions remaining (NPartitions.get() / num_row_parts). BUT, if the number of columns is greater than the cfg.MinPartitionSize parameter, the column-splitting logic is the same as in .from_pandas(), allowing more column partitions to be created (up to square partitioning); see the sketch after this list.

    Here are a few examples of how the new splitting logic for .read_parquet() works:

    NUM_CPUS = 16
    MIN_PARTITION_SIZE = 32
    
    parquet file schema -> partitioning of modin df
    (row_grps=1, columns=9) -> (row_parts=1, col_parts=9) # a single row part, so create as many col parts as possible
    (row_grps=1, columns=18) -> (row_parts=1, col_parts=16) # a single row part, so create as many col parts as possible
    (row_grps=9, columns=9) -> (row_parts=9, col_parts=2) # 9 row parts, so only 2 col parts
    (row_grps=9, columns=64) -> (row_parts=9, col_parts=2) # 9 row parts, so only 2 col parts according to the MIN_PARTITION_SIZE param
    (row_grps=9, columns=65) -> (row_parts=9, col_parts=3) # 9 row parts, so 3 col parts according to the MIN_PARTITION_SIZE param
    (row_grps=100, columns=9) -> (row_parts=16, col_parts=1) # 16 row parts, so only 1 col part
    (row_grps=100, columns=32) -> (row_parts=16, col_parts=1) # 16 row parts, so only 1 col part according to the MIN_PARTITION_SIZE param
    (row_grps=100, columns=1_000) -> (row_parts=16, col_parts=16) # 16 row parts and 16 col parts according to the MIN_PARTITION_SIZE param
  2. More evenly distributed reading across the row groups. Previously, if the number of row groups was greater than NPartitions but not evenly divisible by it (e.g., 18 row groups with NPartitions=16), the resulting dataframe would end up with FEWER than 16 row partitions:

    step = ceil(num_row_groups / npartitions)  # ceil(18 / 16) = ceil(1.125) = 2
    row_parts = [
        row_groups[i * step : (i + 1) * step] for i in range(ceil(num_row_groups / step))
    ]  # len(row_parts) == 9

    It was changed to allow unequal row partition sizes so that all the row partitions get filled:

    step = num_row_groups // npartitions  # 18 // 16 = 1
    remainder = num_row_groups % npartitions  # 18 % 16 = 2
    row_partition_sizes = [step] * (npartitions - remainder) + [step + 1] * remainder
    # 14 workers will each read 1 row group and the last 2 workers will each read 2 row groups
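
To make the new rule concrete, here is a minimal sketch that reproduces the splitting behavior from the examples above (split_partitions and the module-level constants are illustrative names for this sketch, not Modin's actual internals):

    import math

    NPARTITIONS = 16         # stand-in for NPartitions.get()
    MIN_PARTITION_SIZE = 32  # stand-in for cfg.MinPartitionSize

    def split_partitions(num_row_groups, num_columns):
        # One row partition per row group, capped at the total partition count.
        row_parts = min(num_row_groups, NPARTITIONS)
        # Fill the remaining partition budget with column partitions, but allow
        # more of them (up to square partitioning) when there are enough
        # columns relative to MinPartitionSize, mirroring .from_pandas().
        col_parts = max(
            math.ceil(NPARTITIONS / row_parts),
            math.ceil(num_columns / MIN_PARTITION_SIZE),
        )
        # Never exceed the partition count or the number of columns itself.
        col_parts = min(col_parts, NPARTITIONS, num_columns)
        return row_parts, col_parts

    # Reproduces the schema -> partitioning table above:
    assert split_partitions(1, 9) == (1, 9)
    assert split_partitions(9, 64) == (9, 2)
    assert split_partitions(9, 65) == (9, 3)
    assert split_partitions(100, 1_000) == (16, 16)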

Performance difference

[performance comparison image]

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves Make the number of column partitions dependent on the number of row groups at .read_parquet() #6558
  • tests added and are passing
  • module layout described at docs/development/architecture.rst is up-to-date

@dchigarev changed the title from FIX-#6558: Normalize the number of column partitions after '.read_par… to FIX-#6558: Normalize the number of column partitions after .read_parquet() on Sep 14, 2023
@dchigarev changed the title from FIX-#6558: Normalize the number of column partitions after .read_parquet() to FIX-#6558: Normalize the number of partitions after .read_parquet() on Sep 15, 2023
@dchigarev dchigarev marked this pull request as ready for review September 15, 2023 14:57
@dchigarev dchigarev requested a review from a team as a code owner September 15, 2023 14:57
@anmyachev (Collaborator) left a comment


LGTM!

modin/core/io/column_stores/column_store_dispatcher.py (2 review threads, outdated, resolved)
modin/core/io/column_stores/parquet_dispatcher.py (1 review thread, outdated, resolved)
dchigarev and others added 3 commits September 15, 2023 16:14
Signed-off-by: Dmitry Chigarev <[email protected]>
Signed-off-by: Dmitry Chigarev <[email protected]>
list[list[ParquetFileToRead]]
Each element in the returned list describes a list of files that a partition has to read.
"""
from modin.core.storage_formats.pandas.parsers import ParquetFileToRead
Collaborator: why can't we import it at the top, without the if TYPE_CHECKING condition?

Collaborator Author (@dchigarev): it then triggers an "import from a partially initialized module" error, i.e. a circular import.
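
For context, the deferred-import pattern the diff settles on looks like this (a generic sketch; the function name here is hypothetical and only illustrates the shape of the fix):

    from typing import TYPE_CHECKING

    if TYPE_CHECKING:
        # Evaluated only by static type checkers; at runtime TYPE_CHECKING is
        # False, so the circular dependency between the dispatcher and parsers
        # modules never materializes.
        from modin.core.storage_formats.pandas.parsers import ParquetFileToRead

    def build_partition_files() -> "list[list[ParquetFileToRead]]":
        # Deferred runtime import: by the time this is called, both modules
        # have finished initializing, so the import succeeds.
        from modin.core.storage_formats.pandas.parsers import ParquetFileToRead
        ...  # build and return the per-partition file lists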

@anmyachev anmyachev merged commit 2880990 into modin-project:master Sep 16, 2023