Fix MG PLC algos intermittent hang #2607

Merged

Conversation

jnke2016
Contributor

Dask doesn't always release inactive futures fast enough. This is a problem when running the same algo several times with the same PLC graph: the next iteration can cache those stale futures, which causes a hang if some of them get released midway.

This PR manually deletes inactive futures.
closes #2568
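
A minimal sketch of the lifetime issue described above, using only the standard library. `FakeFuture` and `run_iterations` are hypothetical stand-ins and are not Dask or cuGraph APIs; the sketch only assumes that a future's data can be released once the last reference to the future object is dropped. Without an explicit `del`, the previous iteration's futures are still alive when the next iteration starts.

```python
import gc
import weakref


class FakeFuture:
    """Hypothetical stand-in for a distributed future (not Dask's API)."""

    def __init__(self, key):
        self.key = key


def run_iterations(n, explicit_delete):
    """Return, per iteration, how many older futures were still alive at its start."""
    alive = weakref.WeakSet()  # tracks futures without keeping them alive
    stale_counts = []
    for it in range(n):
        gc.collect()
        stale_counts.append(len(alive))
        # Stand-in for submitting the algo to two workers.
        result = [FakeFuture(f"iter{it}-part{i}") for i in range(2)]
        alive.update(result)
        if explicit_delete:
            # Drop the last reference now instead of waiting for `result`
            # to be rebound on the next iteration.
            del result
    return stale_counts


print(run_iterations(3, explicit_delete=False))  # [0, 2, 2]
print(run_iterations(3, explicit_delete=True))   # [0, 0, 0]
```

In the first call, each iteration after the first starts with two stale futures still referenced; in the second, the explicit `del` leaves nothing behind for the next iteration to pick up.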

@jnke2016 jnke2016 requested a review from a team as a code owner August 23, 2022 01:36
@codecov-commenter

codecov-commenter commented Aug 23, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.10@2a40c07).
Patch has no changes to coverable lines.

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-22.10    #2607   +/-   ##
===============================================
  Coverage                ?   60.07%           
===============================================
  Files                   ?      112           
  Lines                   ?     6154           
  Branches                ?        0           
===============================================
  Hits                    ?     3697           
  Misses                  ?     2457           
  Partials                ?        0           

@rlratzel rlratzel added the bug (Something isn't working) and non-breaking (Non-breaking change) labels Aug 23, 2022
@rlratzel rlratzel added this to the 22.10 milestone Aug 23, 2022
Member

@VibhuJawa VibhuJawa left a comment

I am requesting that we also delete cudf_result; looks good otherwise.

# the same PLC graph, the current iteration might try to cache
# the past iteration's futures and this can cause a hang if some
# of those futures get released midway
del result
Member

I think removing result is still good practice: once cudf_result is completed, result would otherwise still be in memory, so deleting it makes sense from a memory-relief standpoint.


ddf = dask_cudf.from_delayed(cudf_result).persist()
wait(ddf)

Member

@VibhuJawa VibhuJawa Aug 26, 2022

I suggest we also del cudf_result here to save memory. There is a renumbering call after this, so result does not go out of scope immediately, hence it's important.
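
A rough sketch of the memory argument in this review thread, again with standard-library stand-ins only. `Payload` and `pipeline` are hypothetical; the three objects merely play the roles of `result`, `cudf_result`, and the persisted `ddf`, and the sketch assumes they do not reference one another. It reports which intermediates would still be held at the point where a follow-up step (such as the renumbering call mentioned above) would run.

```python
import gc
import weakref


class Payload:
    """Hypothetical stand-in for a list of futures or delayed objects."""


def pipeline(delete_intermediates):
    """Return the names of the objects still alive before the follow-up step."""
    live = weakref.WeakValueDictionary()  # does not keep its values alive

    result = Payload()       # plays the role of the raw per-worker futures
    live["result"] = result
    cudf_result = Payload()  # plays the role of the converted per-worker frames
    live["cudf_result"] = cudf_result
    ddf = Payload()          # plays the role of the persisted final collection
    live["ddf"] = ddf

    if delete_intermediates:
        # Release the intermediates before any later work starts.
        del result, cudf_result

    gc.collect()
    # A follow-up step would run here while the names above are still in scope.
    return sorted(live.keys())


print(pipeline(delete_intermediates=False))  # ['cudf_result', 'ddf', 'result']
print(pipeline(delete_intermediates=True))   # ['ddf']
```

Because the function has not returned yet when the follow-up step runs, local names keep their objects alive unless they are deleted explicitly.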

@jnke2016
Contributor Author

rerun tests

1 similar comment
@jnke2016
Contributor Author

rerun tests

@jnke2016 jnke2016 requested a review from a team as a code owner August 29, 2022 19:33
@jnke2016
Contributor Author

rerun tests

@jnke2016 jnke2016 changed the title from "Fix MG PLC algos intermittent hang" to "Fix MG PLC algos intermittent hang (DO NOT MERGE)" Aug 30, 2022
@jnke2016
Contributor Author

rerun tests

@jnke2016 jnke2016 changed the title from "Fix MG PLC algos intermittent hang (DO NOT MERGE)" to "Fix MG PLC algos intermittent hang" Aug 31, 2022
@rlratzel
Contributor

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 821571d into rapidsai:branch-22.10 Aug 31, 2022
@jnke2016 jnke2016 deleted the branch-22.10_fix-mg-plc-algos-hang branch September 24, 2022 23:06
Labels
bug (Something isn't working), non-breaking (Non-breaking change)
Development

Successfully merging this pull request may close these issues.

[BUG] Failure occurs when running mg PLC algos numerous times
5 participants