FEAT-#4725: Make index and columns lazy in Modin DataFrame #4726

vnlitvinov · 2022-07-27T12:20:34Z

Signed-off-by: Vasily Litvinov [email protected]

What do these changes do?

Allow omitting index and columns when constructing PandasDataframe; they are then computed on demand when .index or .columns is accessed.
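A minimal sketch of the on-demand pattern described above, not the code from this PR. `Block`, `LazyLabelsFrame`, and the `label_computations` counter are hypothetical names used only for illustration; the real PandasDataframe works on distributed partitions and is more involved.

```python
class Block:
    """Hypothetical partition: a rectangular chunk with its own labels."""

    def __init__(self, row_labels, col_labels):
        self.row_labels = list(row_labels)
        self.col_labels = list(col_labels)


class LazyLabelsFrame:
    """Hypothetical frame that accepts index=None / columns=None."""

    def __init__(self, blocks, index=None, columns=None):
        self._blocks = blocks      # 2D grid: blocks[row][col]
        self._index = index        # None means "not computed yet"
        self._columns = columns    # None means "not computed yet"
        self.label_computations = 0  # illustration only: counts real work

    @property
    def index(self):
        if self._index is None:
            self.label_computations += 1
            # Row labels come from the first block of every block-row.
            self._index = [
                label for row in self._blocks for label in row[0].row_labels
            ]
        return self._index

    @property
    def columns(self):
        if self._columns is None:
            self.label_computations += 1
            # Column labels come from every block of the first block-row.
            self._columns = [
                label for block in self._blocks[0] for label in block.col_labels
            ]
        return self._columns
```

Constructing the frame does no label work; the first access to `.index` or `.columns` computes and caches the labels, and later accesses reuse the cached value.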

- commit message follows format outlined here
- passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
- passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
- signed commit with git commit -s
- Resolves Support lazy initialization of index and columns in Modin Dataframe #4725
- tests added and passing
- module layout described at docs/development/architecture.rst is up-to-date
- added (Issue Number: PR title (PR Number)) and github username to release notes for next major release

codecov · 2022-07-27T12:32:10Z

Codecov Report

Merging #4726 (9efba8d) into master (cfafbb2) will increase coverage by 4.54%.
The diff coverage is 82.69%.

❗ Current head 9efba8d differs from pull request most recent head a91992a. Consider uploading reports for the commit a91992a to get more accurate results

@@            Coverage Diff             @@
##           master    #4726      +/-   ##
==========================================
+ Coverage   85.26%   89.80%   +4.54%     
==========================================
  Files         259      260       +1     
  Lines       19215    19521     +306     
==========================================
+ Hits        16383    17531    +1148     
+ Misses       2832     1990     -842

Impacted Files	Coverage Δ
modin/experimental/batch/pipeline.py	`100.00% <ø> (+100.00%)`	⬆️
modin/core/dataframe/pandas/dataframe/dataframe.py	`94.91% <82.69%> (-0.66%)`	⬇️
...ns/pandas_on_ray/partitioning/virtual_partition.py	`87.00% <0.00%> (-4.00%)`	⬇️
modin/logging/config.py	`94.59% <0.00%> (-1.30%)`	⬇️
...mentations/pandas_on_ray/partitioning/partition.py	`90.82% <0.00%> (-0.92%)`	⬇️
modin/experimental/batch/test/test_pipeline.py	`100.00% <0.00%> (ø)`
modin/pandas/series.py	`94.33% <0.00%> (+0.24%)`	⬆️
modin/pandas/series_utils.py	`99.43% <0.00%> (+0.56%)`	⬆️
modin/distributed/dataframe/pandas/partitions.py	`88.88% <0.00%> (+0.65%)`	⬆️
... and 37 more


YarShev · 2022-07-27T18:38:44Z

Related discussion on handling metadata (index and columns) in #3673.

mvashishtha

I have some minor style comments. Thanks @vnlitvinov !

vnlitvinov · 2022-07-28T07:25:40Z

> Related discussion on handling metadata (index and columns) in #3673.

That issue is talking about improving pivot speed if we can omit computing index and labels... well, after this PR we will be able to! 😄

YarShev · 2022-07-28T17:00:40Z

> > Related discussion on handling metadata (index and columns) in #3673.
>
> That issue is talking about improving pivot speed if we can omit computing index and labels... well, after this PR we will be able to! 😄

There are some thoughts on handling metadata for PandasOnRay and PandasOnDask executions in the issue.

YarShev · 2022-07-29T19:00:42Z

@vnlitvinov, is there a case for now when we construct PandasDataframe with empty index or columns?

vnlitvinov · 2022-08-01T10:26:01Z

> @vnlitvinov, is there a case for now when we construct PandasDataframe with empty index or columns?

Not yet, though I have another PR in my queue waiting for this one to be merged to improve df[df.col < threshold] query performance.

Also, I'm guessing that other queries such as df1.merge(df2) could get a speedup in a separate PR once parallel shuffle and this PR are merged. Basically, any operation that produces a dataframe whose sizes aren't known until the operation completes should benefit from this optimization.
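To illustrate why an operation with unknown result sizes benefits, here is a rough sketch under simplifying assumptions: partitions are modeled as zero-argument callables returning `(label, value)` rows, and `DeferredFrame` is a hypothetical class, not Modin's API. A filter like `df[df.col < threshold]` can then build its result without evaluating any partition, because the new index is simply left unset.

```python
class DeferredFrame:
    """Hypothetical frame whose partitions are zero-argument callables."""

    def __init__(self, thunks, index=None):
        self._thunks = thunks  # each thunk returns a list of (label, value)
        self._index = index    # None means "not computed yet"

    def filter(self, predicate):
        """Keep rows whose value satisfies `predicate`, without running anything."""

        def deferred(thunk):
            # The filtered partition is itself a thunk; its length is
            # unknown until it is actually evaluated.
            return lambda: [
                (label, value) for label, value in thunk() if predicate(value)
            ]

        # No index is passed: computing it would force every partition now.
        return DeferredFrame([deferred(t) for t in self._thunks])

    @property
    def index(self):
        if self._index is None:
            # First access pays the cost of evaluating the partitions.
            self._index = [
                label for thunk in self._thunks for label, _ in thunk()
            ]
        return self._index
```

With an eagerly required index, `filter` would have to evaluate every partition before returning. In this sketch the partition results are not cached, only the index is; a real implementation would avoid recomputing the data.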

YarShev · 2022-08-01T10:29:27Z

> > @vnlitvinov, is there a case for now when we construct PandasDataframe with empty index or columns?
>
> Not yet, though I have another PR in my queue waiting for this one to be merged to improve df[df.col < threshold] query performance.
>
> Also, I'm guessing that other queries such as df1.merge(df2) could get a speedup in a separate PR once parallel shuffle and this PR are merged. Basically, any operation that produces a dataframe whose sizes aren't known until the operation completes should benefit from this optimization.

Got it, thanks! Let's resolve the rest of the comments and get this PR merged.

vnlitvinov · 2022-08-01T13:28:02Z

@YarShev I think I've addressed everything now.

@modin-project/modin-core is there anything missing? Can we get this merged, so I can submit a df[mask] PR please? 🙃

YarShev · 2022-08-01T13:56:26Z

Some CI jobs have also failed; please take a look.

YarShev

LGTM!

vnlitvinov · 2022-08-01T16:25:17Z

@prutskov another use case for this "lazy index" approach: it could help with ingestion such as read_csv. We would no longer have to wait for the reads to finish before returning from pd.read_csv(), because the computation of df.index can be postponed.
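A hedged sketch of that idea using only the standard library; it does not reflect how Modin's read_csv is implemented. `parse_chunk`, `PendingFrame`, `lazy_read`, and the `gate` parameter (which only simulates slow I/O) are all hypothetical.

```python
import csv
import io
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


def parse_chunk(text: str, gate: Optional[threading.Event] = None):
    """Parse one CSV chunk into rows; `gate` stands in for slow I/O."""
    if gate is not None:
        gate.wait()  # block until the simulated read is allowed to finish
    return list(csv.reader(io.StringIO(text)))


class PendingFrame:
    """Hypothetical frame backed by futures of parsed chunks."""

    def __init__(self, futures):
        self._futures = futures
        self._index = None  # not computed at construction time

    @property
    def index(self):
        if self._index is None:
            # Only here do we wait for the reads; the first CSV field of
            # each row is used as its label in this sketch.
            self._index = [
                row[0] for future in self._futures for row in future.result()
            ]
        return self._index


def lazy_read(chunks, executor: ThreadPoolExecutor, gate=None) -> PendingFrame:
    """Submit all chunk parses and return without waiting for any of them."""
    futures = [executor.submit(parse_chunk, chunk, gate) for chunk in chunks]
    return PendingFrame(futures)
```

`lazy_read` returns as soon as the work is submitted; the caller blocks only when `.index` is first accessed.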

mvashishtha · 2022-08-01T16:27:27Z

@vnlitvinov I was thinking the exact same thing 😄

YarShev · 2022-08-01T20:26:49Z

Merging the changes as CI failures do not relate to them. See more in #4745.

…me (modin-project#4726) Co-authored-by: Mahesh Vashishtha <[email protected]> Co-authored-by: Yaroslav Igoshev <[email protected]> Signed-off-by: Vasily Litvinov <[email protected]>

vnlitvinov force-pushed the lazy-index-in-mdf branch 2 times, most recently from 89056d3 to 9e0f50b Compare July 27, 2022 12:28

FEAT-modin-project#4725: Make index and columns lazy in Modin DataFrame

2c702b0

Signed-off-by: Vasily Litvinov <[email protected]>

vnlitvinov force-pushed the lazy-index-in-mdf branch from 9e0f50b to 2c702b0 Compare July 27, 2022 12:49

vnlitvinov mentioned this pull request Jul 27, 2022

PERF-#4494: Get partition widths/lengths in parallel instead of serially #4683

Draft

8 tasks

vnlitvinov added the Ready for review label Jul 27, 2022

vnlitvinov marked this pull request as ready for review July 27, 2022 13:55

vnlitvinov requested a review from a team as a code owner July 27, 2022 13:55

mvashishtha self-assigned this Jul 27, 2022

mvashishtha reviewed Jul 28, 2022

modin/core/dataframe/pandas/dataframe/dataframe.py (3 review threads, outdated, resolved)

mvashishtha assigned vnlitvinov and unassigned mvashishtha Jul 28, 2022

Apply suggestions from code review

fda1cdf

Co-authored-by: Mahesh Vashishtha <[email protected]> Signed-off-by: Vasily Litvinov <[email protected]>

vnlitvinov force-pushed the lazy-index-in-mdf branch from a0fd1e2 to fda1cdf Compare July 28, 2022 07:31

YarShev reviewed Jul 28, 2022

mvashishtha previously approved these changes Jul 29, 2022

Apply suggestions from code review

e09525c

Co-authored-by: Yaroslav Igoshev <[email protected]>

vnlitvinov dismissed mvashishtha’s stale review via e09525c July 29, 2022 12:57

YarShev reviewed Jul 29, 2022

modin/core/dataframe/pandas/dataframe/dataframe.py (1 review thread, resolved)

YarShev reviewed Jul 29, 2022

modin/core/dataframe/pandas/dataframe/dataframe.py (1 review thread, resolved)

Address Yaroslav comments

8538076

Signed-off-by: Vasily Litvinov <[email protected]>

vnlitvinov force-pushed the lazy-index-in-mdf branch from ad65ad1 to 8538076 Compare July 29, 2022 16:18

YarShev reviewed Jul 29, 2022

modin/core/dataframe/pandas/dataframe/dataframe.py (3 review threads, outdated, resolved)

Address another set of Yaroslav comments

a91992a

Co-authored-by: Yaroslav Igoshev <[email protected]> Signed-off-by: Vasily Litvinov <[email protected]>

vnlitvinov force-pushed the lazy-index-in-mdf branch from 9efba8d to a91992a Compare August 1, 2022 13:25

YarShev reviewed Aug 1, 2022

modin/core/dataframe/pandas/dataframe/dataframe.py (1 review thread, resolved)

YarShev approved these changes Aug 1, 2022

mvashishtha mentioned this pull request Aug 1, 2022

PERF: unnecessary (expensive) concat #4740

Open

YarShev merged commit adb16a1 into modin-project:master Aug 1, 2022

vnlitvinov deleted the lazy-index-in-mdf branch August 2, 2022 09:10

vnlitvinov mentioned this pull request Sep 6, 2022

PERF: Stop recomputing both indices for user-defined and dict-like axis-wide applies #4445

Open
