FIX-#6558: Normalize the number of partitions after .read_parquet()
#6559
Conversation
FIX-#6558: Normalize the number of partitions after '.read_parquet()'
Signed-off-by: Dmitry Chigarev <[email protected]>
LGTM!
Signed-off-by: Dmitry Chigarev <[email protected]>
Co-authored-by: Anatoly Myachev <[email protected]>
Signed-off-by: Dmitry Chigarev <[email protected]>
list[list[ParquetFileToRead]]
    Each element in the returned list describes a list of files that a partition has to read.
"""
from modin.core.storage_formats.pandas.parsers import ParquetFileToRead
Why can't we import it at the top, without the `if TYPE_CHECKING` condition?
It then triggers an "import from a partially initialized module" error (a circular import), which is why the import is deferred into the function body.
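For context, a minimal sketch of the pattern under discussion; `get_files_to_read` is a hypothetical stand-in for the method in the diff above:

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only evaluated by static type checkers, never at runtime,
    # so it cannot participate in a circular import.
    from modin.core.storage_formats.pandas.parsers import ParquetFileToRead


def get_files_to_read() -> list[list[ParquetFileToRead]]:
    # Deferred runtime import: by the time this function is called, both
    # modules are fully initialized, so the "partially initialized module"
    # error does not occur.
    from modin.core.storage_formats.pandas.parsers import ParquetFileToRead  # noqa: F401

    ...
```

With `from __future__ import annotations`, the annotation is never evaluated at runtime, so the `TYPE_CHECKING`-only import is enough for type checkers, while actual use of the class inside the function relies on the deferred import.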
What do these changes do?
This PR brings the following changes to how parquet reading is distributed:
The number of column partitions now depends on the number of row partitions. Previously, the implementation tended to create as many column partitions as possible (often resulting in column partitions of just one column each), which worked reasonably well when the parquet file had no row groups (and so no row partitions).

However, if there were enough row groups to keep all the workers busy, this excessive number of column partitions resulted in an over-partitioned frame.

Not only does this logic generate many more reading kernels than the user is likely to have cores, slowing down the reading itself, but it also tends to produce over-partitioned frames that slow down further operations in the workflow (Reduce amount of remote calls for square-like dataframes #5394).
This logic was changed: we now first determine how many row partitions the output dataframe will have and then, based on the remaining partition budget (`NPartitions.get() / num_row_parts`), create that number of column partitions. BUT, if the number of columns is greater than the `cfg.MinPartitionSize` parameter, the column-splitting logic is the same as in `.from_pandas()`, allowing more column partitions to be created (up to square partitioning). Here are a few examples of how the new splitting logic for `.read_parquet()` works (a rough sketch follows):
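A rough sketch of this selection logic, assuming the behavior described above; the function and parameter names are illustrative, not Modin's actual internals:

```python
import math


def compute_split(num_row_groups, num_cols, n_partitions, min_partition_size):
    """Illustrative sketch: pick (num_row_parts, num_col_parts)."""
    # Row partitions are bounded by the number of row groups and by NPartitions.
    num_row_parts = max(min(num_row_groups, n_partitions), 1)
    # Column partitions get whatever "partition budget" remains.
    num_col_parts = max(n_partitions // num_row_parts, 1)
    # If there are many columns, fall back to `.from_pandas()`-style splitting,
    # which may create more column partitions (up to square partitioning,
    # i.e. NPartitions x NPartitions).
    if num_cols > min_partition_size:
        from_pandas_style = math.ceil(num_cols / min_partition_size)
        num_col_parts = min(max(num_col_parts, from_pandas_style), n_partitions)
    return num_row_parts, num_col_parts
```

For instance, with 4 row groups, 16 partitions, 1000 columns, and `MinPartitionSize=32`, this sketch yields `(4, 16)` — a square-ish grid rather than one column per partition.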
More distributed reading across the row groups. Previously, if the number of row groups was greater than `NPartitions` but not evenly divisible by it (e.g. 18 row groups and 16 `NPartitions`), the resulting dataframe would have FEWER than 16 row partitions. This was changed to allow unequal row partition sizes so that all the row partitions are filled (see the sketch below).
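A sketch of the difference for the 18-row-group / 16-partition case; illustrative code, not the actual Modin implementation:

```python
import math

import numpy as np

row_groups = list(range(18))
n_partitions = 16

# Old behavior (sketch): equal chunks of size ceil(18 / 16) == 2,
# which produces only 9 row partitions instead of 16.
step = math.ceil(len(row_groups) / n_partitions)
old_split = [row_groups[i : i + step] for i in range(0, len(row_groups), step)]
assert len(old_split) == 9

# New behavior (sketch): allow unequal chunk sizes so that all 16
# row partitions are filled -- two partitions get 2 row groups,
# the remaining fourteen get 1 each.
new_split = np.array_split(row_groups, n_partitions)
assert len(new_split) == 16
assert [len(chunk) for chunk in new_split] == [2, 2] + [1] * 14
```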
Performance difference
- passes `flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py`
- passes `black --check modin/ asv_bench/benchmarks scripts/doc_checker.py`
- signed commit with `git commit -s`
- Resolves "Normalize the number of partitions after `.read_parquet()` #6558"
- tests added and are passing
- module layout described at `docs/development/architecture.rst` is up-to-date