[FEA] Avoid memory over usage on GPU nodes in the SparkPlan #7252
Labels
epic
Issue that encompasses a significant feature or body of work
reliability
Features to improve reliability or bugs that severely impact the reliability of the plugin
Is your feature request related to a problem? Please describe.
The goal of this epic is to provide a framework and update a few SparkPlan nodes so that they can intelligently retry tasks when OOM failures are encountered.
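To make the idea concrete, here is a minimal sketch of one possible retry strategy: run the work on a batch and, if it fails with an out-of-memory error, split the batch in half and retry each half. This is only an illustration and not the plugin's actual framework or API. The names `SimulatedGpuOom`, `with_split_retry`, and `fake_gpu_sum` are hypothetical, and the OOM condition is simulated with a size limit rather than a real GPU allocation.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class SimulatedGpuOom(MemoryError):
    """Hypothetical stand-in for a GPU out-of-memory failure."""


def with_split_retry(batch: Sequence[T],
                     work: Callable[[Sequence[T]], R]) -> List[R]:
    """Run `work` on `batch`; on OOM, split the batch in half and retry.

    Returns one result per sub-batch that eventually succeeded, in input
    order. Re-raises the OOM if a batch of one element still does not fit.
    """
    pending: List[Sequence[T]] = [batch]
    results: List[R] = []
    while pending:
        current = pending.pop(0)
        try:
            results.append(work(current))
        except SimulatedGpuOom:
            if len(current) <= 1:
                # Nothing left to split, so the failure is not recoverable here.
                raise
            mid = len(current) // 2
            # Put both halves at the front of the queue to preserve ordering.
            pending[:0] = [current[:mid], current[mid:]]
    return results


def fake_gpu_sum(rows: Sequence[int]) -> int:
    """Hypothetical operator that "runs out of memory" above 2 rows."""
    if len(rows) > 2:
        raise SimulatedGpuOom(f"cannot fit {len(rows)} rows")
    return sum(rows)


if __name__ == "__main__":
    print(with_split_retry([1, 2, 3, 4, 5], fake_gpu_sum))
```

In this sketch the caller gets partial results per sub-batch and is responsible for combining them. A real implementation would also have to make the input spillable and release GPU memory before retrying, which is left out here.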