Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEAT-#4725: Make index and columns lazy in Modin DataFrame #4726

Merged
merged 5 commits into from
Aug 1, 2022

Conversation

vnlitvinov
Copy link
Collaborator

@vnlitvinov vnlitvinov commented Jul 27, 2022

Signed-off-by: Vasily Litvinov [email protected]

What do these changes do?

Allow not specifying index and columns when constructing PandasDataframe, they would be computed on-demand when accessing .index and .columns.

  • commit message follows format outlined here
  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves Support lazy initialization of index and columns in Modin Dataframe #4725
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date
  • added (Issue Number: PR title (PR Number)) and github username to release notes for next major release

@vnlitvinov vnlitvinov force-pushed the lazy-index-in-mdf branch 2 times, most recently from 89056d3 to 9e0f50b Compare July 27, 2022 12:28
@codecov
Copy link

codecov bot commented Jul 27, 2022

Codecov Report

Merging #4726 (9efba8d) into master (cfafbb2) will increase coverage by 4.54%.
The diff coverage is 82.69%.

❗ Current head 9efba8d differs from pull request most recent head a91992a. Consider uploading reports for the commit a91992a to get more accurate results

@@            Coverage Diff             @@
##           master    #4726      +/-   ##
==========================================
+ Coverage   85.26%   89.80%   +4.54%     
==========================================
  Files         259      260       +1     
  Lines       19215    19521     +306     
==========================================
+ Hits        16383    17531    +1148     
+ Misses       2832     1990     -842     
Impacted Files Coverage Δ
modin/experimental/batch/pipeline.py 100.00% <ø> (+100.00%) ⬆️
modin/core/dataframe/pandas/dataframe/dataframe.py 94.91% <82.69%> (-0.66%) ⬇️
...ns/pandas_on_ray/partitioning/virtual_partition.py 87.00% <0.00%> (-4.00%) ⬇️
modin/logging/config.py 94.59% <0.00%> (-1.30%) ⬇️
...mentations/pandas_on_ray/partitioning/partition.py 90.82% <0.00%> (-0.92%) ⬇️
modin/experimental/batch/test/test_pipeline.py 100.00% <0.00%> (ø)
modin/pandas/series.py 94.33% <0.00%> (+0.24%) ⬆️
modin/pandas/series_utils.py 99.43% <0.00%> (+0.56%) ⬆️
modin/distributed/dataframe/pandas/partitions.py 88.88% <0.00%> (+0.65%) ⬆️
... and 37 more

📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more

@vnlitvinov vnlitvinov marked this pull request as ready for review July 27, 2022 13:55
@vnlitvinov vnlitvinov requested a review from a team as a code owner July 27, 2022 13:55
@YarShev
Copy link
Collaborator

YarShev commented Jul 27, 2022

Related discussion on handling metadata (index and columns) in #3673.

@mvashishtha mvashishtha self-assigned this Jul 27, 2022
Copy link
Collaborator

@mvashishtha mvashishtha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some minor style comments. Thanks @vnlitvinov !

modin/core/dataframe/pandas/dataframe/dataframe.py Outdated Show resolved Hide resolved
modin/core/dataframe/pandas/dataframe/dataframe.py Outdated Show resolved Hide resolved
modin/core/dataframe/pandas/dataframe/dataframe.py Outdated Show resolved Hide resolved
@vnlitvinov
Copy link
Collaborator Author

Related discussion on handling metadata (index and columns) in #3673.

That issue is talking about improving pivot speed if we can omit computing index and labels... well, after this PR we will be able to! 😄

Co-authored-by: Mahesh Vashishtha <[email protected]>
Signed-off-by: Vasily Litvinov <[email protected]>
@YarShev
Copy link
Collaborator

YarShev commented Jul 28, 2022

Related discussion on handling metadata (index and columns) in #3673.

That issue is talking about improving pivot speed if we can omit computing index and labels... well, after this PR we will be able to! 😄

There are some thoughts on handling metadata for PandasOnRay and PandasOnDask executions in the issue.

mvashishtha
mvashishtha previously approved these changes Jul 29, 2022
Co-authored-by: Yaroslav Igoshev <[email protected]>
Signed-off-by: Vasily Litvinov <[email protected]>
@YarShev
Copy link
Collaborator

YarShev commented Jul 29, 2022

@vnlitvinov, is there a case for now when we construct PandasDataframe with empty index or columns?

@vnlitvinov
Copy link
Collaborator Author

@vnlitvinov, is there a case for now when we construct PandasDataframe with empty index or columns?

Not yet, though I have another PR in my queue waiting for this one to be merged to improve df[df.col < threshold] query performance.

Also I'm guessing that some other queries like df1.merge(df2) could get a speedup as a separate PR when parallel shuffle and this one are merged - basically any operation which produces some dataframe which sizes aren't known before the operation is complete should be benefitting from this optimization.

@YarShev
Copy link
Collaborator

YarShev commented Aug 1, 2022

@vnlitvinov, is there a case for now when we construct PandasDataframe with empty index or columns?

Not yet, though I have another PR in my queue waiting for this one to be merged to improve df[df.col < threshold] query performance.

Also I'm guessing that some other queries like df1.merge(df2) could get a speedup as a separate PR when parallel shuffle and this one are merged - basically any operation which produces some dataframe which sizes aren't known before the operation is complete should be benefitting from this optimization.

Got it, thanks! Let's resolve the rest of the comments and get this PR merged.

Co-authored-by: Yaroslav Igoshev <[email protected]>
Signed-off-by: Vasily Litvinov <[email protected]>
@vnlitvinov
Copy link
Collaborator Author

@YarShev I think I've addressed everything now.

@modin-project/modin-core is there anything missing? Can we get this merged, so I can submit a df[mask] PR please? 🙃

@YarShev
Copy link
Collaborator

YarShev commented Aug 1, 2022

There are also some CI jobs failed, please take a look.

Copy link
Collaborator

@YarShev YarShev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@vnlitvinov
Copy link
Collaborator Author

@prutskov another use case for this "lazy index" thing is it could help with ingest like read_csv - we no longer would have to wait for the reads to happen before returning from pd.read_csv() as we can postpone the computation of df.index.

@mvashishtha
Copy link
Collaborator

@vnlitvinov I was thinking the exact same thing 😄

@YarShev
Copy link
Collaborator

YarShev commented Aug 1, 2022

Merging the changes as CI failures do not relate to them. See more in #4745.

@YarShev YarShev merged commit adb16a1 into modin-project:master Aug 1, 2022
@vnlitvinov vnlitvinov deleted the lazy-index-in-mdf branch August 2, 2022 09:10
YarShev added a commit to YarShev/modin that referenced this pull request Aug 2, 2022
…me (modin-project#4726)

Co-authored-by: Mahesh Vashishtha <[email protected]>
Co-authored-by: Yaroslav Igoshev <[email protected]>
Signed-off-by: Vasily Litvinov <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support lazy initialization of index and columns in Modin Dataframe
3 participants