-
Notifications
You must be signed in to change notification settings - Fork 655
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX-#6540: Correct handling of range indices in read_parquet #6545
Conversation
…quet This change also fixes modin-project#6543. Signed-off-by: Zeb Burke-Conte <[email protected]>
@@ -1453,6 +1454,65 @@ def comparator(df1, df2): | |||
comparator=comparator, | |||
) | |||
|
|||
@pytest.mark.parametrize("columns", [None, ["col1"]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: I don't know very much about how Modin thinks about its "test budget." This PR adds several hundred new tests. I'm not clear on whether that is too many and/or whether some of them should be marked @pytest.mark.exclude_in_sanity
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We aim at PR time to be under 40-45 minutes per job, and I see this PR mostly fitting in there, so it should be fine. That said, if you could measure the timings of newly added tests, it might show that some of those might be better to exclude.
I won't exclude without measuring first anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On my machine:
test_read_parquet_range_index
(new) takes about 0.5 minutestest_read_parquet_directory
(existing) takes 8.5 minutes after this change, which I assume was roughly doubled by my added parameterization (+4.25 minutes)test_read_parquet_directory_range_index
(new) takes about 3.5 minutestest_read_parquet_directory_range_index_consistent_metadata
(new) takes about 1.75 minutestest_read_parquet_partitioned_directory
(existing) takes about 40 seconds after this change, which I assume was roughly quadrupled by my added parameterizations (+0.5 minutes)
I'll defer to you on whether it makes sense to exclude something here from the sanity check or parameterize less. Note that test_read_parquet_directory
is already excluded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep them added. We can always reduce the set later on if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM, left a few questions before stamping approval
Also limits the race condition check to Windows only, and removes the unnecessary dtype checking.
@vnlitvinov I believe I've addressed all your comments now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @zmbc!
…ames in read_parquet (modin-project#6545) This change also fixes modin-project#6543. Signed-off-by: Zeb Burke-Conte <[email protected]>
…ames in read_parquet (modin-project#6545) This change also fixes modin-project#6543. Signed-off-by: Zeb Burke-Conte <[email protected]>
This change also fixes #6543.
What do these changes do?
Corrects the handling of range indices in read_parquet.
Specifically:
Use all the pandas metadata present, including start, step, and name
Emulate pyarrow and fastparquet behavior in the case of invalid metadata for a directory (which was actually the common case, or at least the only one we were testing)
Add tests for all this functionality, including more cases such as (correctly) shared metadata between files in a directory
first commit message and PR title follow format outlined here
passes
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
passes
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
signed commit with
git commit -s
Resolves BUG: read_parquet with filters doesn't handle indices correctly #6540
tests added and passing
module layout described at
docs/development/architecture.rst
is up-to-date