Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: Try width and length caches before materializing all partition lengths/widths in Modin frame #4493

Closed
mvashishtha opened this issue May 25, 2022 · 1 comment · Fixed by #4495
Assignees
Labels
Internals Internal modin functionality Performance 🚀 Performance related issues and pull requests.

Comments

@mvashishtha
Copy link
Collaborator

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS Monterey 12.2.1
  • Modin version (modin.__version__): 0f70e82
  • Python Version: 3.9.12

Describe the problem

Currently I see two points in the Modin frame where we compute either all partition lengths for the first column of partitions, or all column widths for the first row of partitions:

  1. def get_axis_lengths(partitions, axis):

The first line, in _copartition, recently caused single-threaded execution for a frame with partitions of Decimal objects. Each frame had a transpose on the queue. Executing a multiply then caused the widths to be computed serially, so each partition's call queue was drained in sequence. The result was that serially, each partition slowly put the transpose result in the object store. (The objects took a while to put in the object store because Decimal data is slow to serialize.) However, in this case the lengths and widths were cache, so there was no need to compute lengths and widths at all.

Attached is a ray timeline and here is an image of the single threaded execution for the transpose in the middle (from a similar script). Screen Shot 2022-05-25 at 5 32 32 AM

Reproduction script

import numpy as np
import modin.pandas as pd
from decimal import Decimal

height = 50_000
width = 751
im = pd.DataFrame(np.random.randint(0, 2, size=(height, width)))
im = im.applymap(lambda x: Decimal(str(x)))
weight_vec = pd.Series(np.random.rand(height)).apply(lambda x: Decimal(str(x)))
print(im.T.multiply(weight_vec))
@mvashishtha mvashishtha added Performance 🚀 Performance related issues and pull requests. Internals Internal modin functionality labels May 25, 2022
@mvashishtha
Copy link
Collaborator Author

mvashishtha commented May 25, 2022

It turns out that in the case of 1) reindexed_base has unknown axis lengths because we might have to add elements along the axis to align with the other frame e.g. for

import pandas as pd

A =  pd.DataFrame([[1]])
B = pd.DataFrame([[2]], index=['b'])
print(A+B)

the new reindexed_base has length 2 ([0, 'b']) instead of 1

To fix the single-threadedness there we will need #4494. I don't see an easy fix. We could maybe some extra code for the case where we don't expect the union with the other frames' indices to change the partition sizes.

For 2) I think we really can use self._column_widths and self._row_lengths. I will make a PR for that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Internals Internal modin functionality Performance 🚀 Performance related issues and pull requests.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant