Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raise and avoid data loss of meta provided to P2P shuffle is wrong #8520

Merged
merged 3 commits into from
Feb 22, 2024

Conversation

fjetter
Copy link
Member

@fjetter fjetter commented Feb 22, 2024

Closes #8519

The KeyError was used for control flow but I believe one should use dedicated exception types for this use case and keep the std exceptions clear for error reporting. This logic swallowed a very obvious error.

I'll see if I can reproduce #8519 even with artificial input somehow. Even if I don't have a reproducer I believe this change is sensible.

Comment on lines +2761 to +2771
ddf = dd.from_delayed(
[data_gen()] * 2, meta=[("a", int), ("b", int)], verify_meta=False
)

with raises_with_cause(
RuntimeError,
r"shuffling \w* failed",
ValueError,
"meta",
):
await c.gather(c.compute(ddf.shuffle(on="a")))
Copy link
Member Author

@fjetter fjetter Feb 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On main this indeed causes data loss and is returning an empty frame. While the from_delayed path above might be slightly artificial I don't think it's that difficult to produce a slight meta drift in general.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

read_csv is an example where this could happen in real world use cases

@fjetter fjetter changed the title Use dedicated exception class if data is not in buffer Avoid data loss of meta provided to P2P shuffle is wrong Feb 22, 2024
@fjetter fjetter changed the title Avoid data loss of meta provided to P2P shuffle is wrong Raise and avoid data loss of meta provided to P2P shuffle is wrong Feb 22, 2024
Copy link
Contributor

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    27 files  ± 0      27 suites  ±0   9h 54m 19s ⏱️ - 3m 46s
 3 995 tests + 1   3 883 ✅ + 3    110 💤 ±0  2 ❌  - 2 
50 241 runs  +29  47 946 ✅ +37  2 293 💤  - 6  2 ❌  - 2 

For more details on these failures, see this check.

Results for commit 67eb545. ± Comparison against base commit a5a6e99.

@fjetter fjetter merged commit 1211e79 into dask:main Feb 22, 2024
29 of 34 checks passed
@fjetter fjetter deleted the shuffle_keyerror_nodata branch February 22, 2024 15:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

P2P shuffle silently dropping data if provided meta includes columns that do not exist in data
2 participants