FIX-#6540: Correct handling of range indices in read_parquet #6545

zmbc · 2023-09-09T01:50:09Z

This change also fixes #6543.

What do these changes do?

Corrects the handling of range indices in read_parquet.

Specifically:

Use all the pandas metadata present, including start, step, and name
Emulate pyarrow and fastparquet behavior when the metadata for a directory is invalid (which was actually the common case, or at least the only one we were testing)
Add tests for all this functionality, including more cases such as (correctly) shared metadata between files in a directory
first commit message and PR title follow format outlined here
passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
signed commit with git commit -s
Resolves BUG: read_parquet with filters doesn't handle indices correctly #6540
tests added and passing
module layout described at docs/development/architecture.rst is up-to-date
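For readers unfamiliar with the range-index metadata mentioned above: pandas-aware Parquet writers record a RangeIndex as a small descriptor in the file's "pandas" metadata instead of writing it as a data column. The following is a minimal sketch of rebuilding such an index from that descriptor using only the standard library. It is not Modin's implementation; the function name `range_index_from_pandas_metadata` and the example values are assumptions made for illustration.

```python
import json


def range_index_from_pandas_metadata(pandas_metadata_json):
    """Return (name, range) for the first range-kind index, or None.

    Hypothetical helper for illustration only; not part of Modin.
    """
    metadata = json.loads(pandas_metadata_json)
    for descriptor in metadata.get("index_columns", []):
        # A range index is stored as a descriptor dict. String entries
        # refer to real columns that hold index values instead.
        if isinstance(descriptor, dict) and descriptor.get("kind") == "range":
            index = range(
                descriptor["start"], descriptor["stop"], descriptor["step"]
            )
            # Keep the name too, so it is not silently dropped.
            return descriptor.get("name"), index
    return None


# Example metadata with a non-default start, step, and name (made-up values).
example = json.dumps(
    {
        "index_columns": [
            {"kind": "range", "name": "idx", "start": 5, "stop": 11, "step": 2}
        ]
    }
)
name, index = range_index_from_pandas_metadata(example)
print(name, list(index))
```

A reader that assumed `start=0`, `step=1`, and no name would produce a different index for this metadata, which is the kind of mismatch the description above refers to.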

…quet This change also fixes modin-project#6543. Signed-off-by: Zeb Burke-Conte <[email protected]>

zmbc · 2023-09-09T02:11:23Z

modin/pandas/test/test_io.py

@@ -1453,6 +1454,65 @@ def comparator(df1, df2):
                comparator=comparator,
            )

+    @pytest.mark.parametrize("columns", [None, ["col1"]])


Note: I don't know very much about how Modin thinks about its "test budget." This PR adds several hundred new tests. I'm not clear on whether that is too many and/or whether some of them should be marked @pytest.mark.exclude_in_sanity.

At PR time we aim to keep each job under 40-45 minutes, and this PR looks like it mostly fits within that, so it should be fine. That said, if you could measure the timings of the newly added tests, it might show that some of them would be better excluded.

I won't exclude without measuring first anyway.

On my machine:

test_read_parquet_range_index (new) takes about 0.5 minutes

test_read_parquet_directory (existing) takes 8.5 minutes after this change, which I assume was roughly doubled by my added parameterization (+4.25 minutes)

test_read_parquet_directory_range_index (new) takes about 3.5 minutes

test_read_parquet_directory_range_index_consistent_metadata (new) takes about 1.75 minutes

test_read_parquet_partitioned_directory (existing) takes about 40 seconds after this change, which I assume was roughly quadrupled by my added parameterizations (+0.5 minutes)

I'll defer to you on whether it makes sense to exclude something here from the sanity check or parameterize less. Note that test_read_parquet_directory is already excluded.
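The rough arithmetic behind the estimates above can be sketched as follows. Adding a two-valued parameter such as `columns` (taken from the diff) doubles the number of generated cases, and the per-test figures sum to the total extra time. The pre-existing case names are hypothetical placeholders; the minute values are the approximate ones quoted in this comment, measured on one machine.

```python
from itertools import product

# Values of the newly added parameter, as shown in the diff.
columns_values = [None, ["col1"]]

# Hypothetical stand-ins for the cases a test already had.
existing_cases = ["case_a", "case_b", "case_c"]

cases_before = len(existing_cases)
# Stacked parametrization produces the cross product of all values.
cases_after = len(list(product(existing_cases, columns_values)))
growth_factor = cases_after // cases_before

# Approximate added minutes per test, as quoted above.
added_minutes = {
    "test_read_parquet_range_index": 0.5,
    "test_read_parquet_directory": 4.25,
    "test_read_parquet_directory_range_index": 3.5,
    "test_read_parquet_directory_range_index_consistent_metadata": 1.75,
    "test_read_parquet_partitioned_directory": 0.5,
}
total_added = sum(added_minutes.values())

# test_read_parquet_directory is already excluded from the sanity check.
sanity_added = total_added - added_minutes["test_read_parquet_directory"]

print(growth_factor, total_added, sanity_added)
```

Two independent two-valued parameters would give a factor of four by the same reasoning, which matches the "roughly quadrupled" estimate for the partitioned directory test.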

Let's keep them added. We can always reduce the set later on if needed.

vnlitvinov

Overall LGTM, left a few questions before stamping approval

modin/pandas/test/test_io.py

Also limits the race condition check to Windows only, and removes the unnecessary dtype checking.

zmbc · 2023-09-12T00:06:58Z

@vnlitvinov I believe I've addressed all your comments now.

vnlitvinov

LGTM, thanks @zmbc!

…ames in read_parquet (modin-project#6545) This change also fixes modin-project#6543. Signed-off-by: Zeb Burke-Conte <[email protected]>

FIX-modin-project#6540: Correct handling of range indices in read_par…

edf2b4f

…quet This change also fixes modin-project#6543. Signed-off-by: Zeb Burke-Conte <[email protected]>

zmbc requested a review from a team as a code owner September 9, 2023 01:50

zmbc commented Sep 9, 2023

vnlitvinov reviewed Sep 11, 2023

zmbc added 2 commits September 11, 2023 10:56

Extract common logic from parquet file tests

f1e0b8d

Also limits the race condition check to Windows only, and removes the unnecessary dtype checking.

nit: Rename test for consistency

0905b2d

vnlitvinov approved these changes Sep 12, 2023

vnlitvinov merged commit 7ec9fdb into modin-project:master Sep 12, 2023
