Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore the order locally for repartition tests #163

Merged
merged 2 commits into from
Jun 13, 2020
Merged

Ignore the order locally for repartition tests #163

merged 2 commits into from
Jun 13, 2020

Conversation

firestarman
Copy link
Collaborator

This PR is to fix a comparison issue for repartition tests.

Repartition is likely to mess up the order of the rows in a DataFrame, especially when running on multiple executors, so better to ignore the order locally when comparing the result.

@@ -37,6 +38,7 @@ def test_coalesce_df(num_parts, length):

@pytest.mark.parametrize('num_parts', [1, 10, 100, 1000, 2000], ids=idfn)
@pytest.mark.parametrize('length', [0, 2048, 4096], ids=idfn)
@ignore_order('local')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why local? local sort does not scale well. Is it because of floating point -0.0 vs 0.0 comparison. If so we should mark it as such with a comment.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, just try to avoid involving any extra operation on the input DataFrame to make the test pure, since sort on Spark will shuffle the data again as far as i know.
Verifed sort on Spark also has the test passed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you prefer, i can remove the 'local'

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The local sort is very slow compared to the distributed sort, and it does not scale for larger tests. I am fine with it being local, especially because this does deal with partitioning, but I would prefer a comment explaining why you picked local. Perhaps something like.

@ignore_order('local') # don't repartition again for sort

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to add that this is just a nit if you want to merge it in as-is that is fine.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Bobby, really a good suggestion, updated to add a comment.

@revans2
Copy link
Collaborator

revans2 commented Jun 12, 2020

build

@revans2
Copy link
Collaborator

revans2 commented Jun 12, 2020

build

revans2
revans2 previously approved these changes Jun 13, 2020
@revans2
Copy link
Collaborator

revans2 commented Jun 13, 2020

build

@revans2 revans2 merged commit 4f12285 into NVIDIA:branch-0.1 Jun 13, 2020
@sameerz sameerz added the test Only impacts tests label Jun 15, 2020
@sameerz sameerz added this to the Jun 8 - Jun 19 milestone Jun 16, 2020
@firestarman firestarman deleted the ignore_order_repart branch June 17, 2020 02:46
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* Ignore the order locally for repartition tests

* Update repart_test.py

Co-authored-by: Liangcai Li <[email protected]>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* Ignore the order locally for repartition tests

* Update repart_test.py

Co-authored-by: Liangcai Li <[email protected]>
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023
Signed-off-by: spark-rapids automation <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
test Only impacts tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants