Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX-#6540: Correct handling of range indices in read_parquet #6545

Merged
merged 3 commits into from
Sep 12, 2023

Conversation

zmbc
Copy link
Contributor

@zmbc zmbc commented Sep 9, 2023

This change also fixes #6543.

What do these changes do?

Corrects the handling of range indices in read_parquet.

Specifically:

  • Use all the pandas metadata present, including start, step, and name

  • Emulate pyarrow and fastparquet behavior in the case of invalid metadata for a directory (which was actually the common case, or at least the only one we were testing)

  • Add tests for all this functionality, including more cases such as (correctly) shared metadata between files in a directory

  • first commit message and PR title follow format outlined here

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py

  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py

  • signed commit with git commit -s

  • Resolves BUG: read_parquet with filters doesn't handle indices correctly #6540

  • tests added and passing

  • module layout described at docs/development/architecture.rst is up-to-date

@zmbc zmbc requested a review from a team as a code owner September 9, 2023 01:50
@@ -1453,6 +1454,65 @@ def comparator(df1, df2):
comparator=comparator,
)

@pytest.mark.parametrize("columns", [None, ["col1"]])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: I don't know very much about how Modin thinks about its "test budget." This PR adds several hundred new tests. I'm not clear on whether that is too many and/or whether some of them should be marked @pytest.mark.exclude_in_sanity.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We aim at PR time to be under 40-45 minutes per job, and I see this PR mostly fitting in there, so it should be fine. That said, if you could measure the timings of newly added tests, it might show that some of those might be better to exclude.

I won't exclude without measuring first anyway.

Copy link
Contributor Author

@zmbc zmbc Sep 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On my machine:

  • test_read_parquet_range_index (new) takes about 0.5 minutes
  • test_read_parquet_directory (existing) takes 8.5 minutes after this change, which I assume was roughly doubled by my added parameterization (+4.25 minutes)
  • test_read_parquet_directory_range_index (new) takes about 3.5 minutes
  • test_read_parquet_directory_range_index_consistent_metadata (new) takes about 1.75 minutes
  • test_read_parquet_partitioned_directory (existing) takes about 40 seconds after this change, which I assume was roughly quadrupled by my added parameterizations (+0.5 minutes)

I'll defer to you on whether it makes sense to exclude something here from the sanity check or parameterize less. Note that test_read_parquet_directory is already excluded.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep them added. We can always reduce the set later on if needed.

Copy link
Collaborator

@vnlitvinov vnlitvinov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, left a few questions before stamping approval

Also limits the race condition check to Windows only,
and removes the unnecessary dtype checking.
@zmbc
Copy link
Contributor Author

zmbc commented Sep 12, 2023

@vnlitvinov I believe I've addressed all your comments now.

Copy link
Collaborator

@vnlitvinov vnlitvinov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @zmbc!

@vnlitvinov vnlitvinov merged commit 7ec9fdb into modin-project:master Sep 12, 2023
zmbc added a commit to zmbc/modin that referenced this pull request Sep 19, 2023
…ames in read_parquet (modin-project#6545)

This change also fixes modin-project#6543.

Signed-off-by: Zeb Burke-Conte <[email protected]>
zmbc added a commit to zmbc/modin that referenced this pull request Sep 19, 2023
…ames in read_parquet (modin-project#6545)

This change also fixes modin-project#6543.

Signed-off-by: Zeb Burke-Conte <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants