-
Notifications
You must be signed in to change notification settings - Fork 6.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve unschedulable task warning messages by integrating with the autoscaler #18724
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 17 of 17 files at r1, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @AmeerHajAli, @DmitriGekhtman, @ericl, @ijrsvt, @pcmoritz, @raulchen, @robertnishihara, and @wuisawesome)
python/ray/worker.py, line 1091 at r1 (raw file):
yield ("Tip: use `ray status` to view detailed " "cluster status. To disable these " "messages, set RAY_SCHEDULER_EVENTS=0.")
Right now AUTOSCALER_EVENTS are not documented.
Do we want to document RAY_SCHEDULER_EVENTS?
python/ray/autoscaler/_private/autoscaler.py, line 255 at r1 (raw file):
def schedule_node_termination(node_id: NodeID, reason_opt: Optional[str]) -> None: if self.provider.is_readonly():
It is a bit confusing that we allow mutable operations for readonly providers, ignoring them.
Also, I think that we re-check is_readonly() below (which obsolete if we skip here).
python/ray/autoscaler/_private/monitor.py, line 237 at r1 (raw file):
mirror_node_types = {} resource_deadlock = False
resource_deadlock sounds like a bug, maybe rename to something like
not_enought_resources?
Agree to support legacy logs for now, I'll add a feature flag. |
Done with pass on comments. @AmeerHajAli I attached a "ray status" output in the PR description for readonly cluster status (autoscaler status is unchanged). You can also check out the asserts in test_cli.py and test_output.py |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
There are some tests to patch up.
Windows build seems to not trigger correctly, but yolo |
Why are these changes needed?
Today, we raise warnings when tasks / actors are not schedulable immediately. These warnings are confusing since they don't take into account future possible autoscaling, and hence can be false positives. False positives are bad since:
Library users like Serve today disable these warnings by using placement groups, which is not ideal.
This PR eliminates these false positives via integration with the autoscaler. Instead of the raylet printing messages when resources are not schedulable, it defers to the autoscaler. The autoscaler can determine if a task will be infeasible even after autoscaling. This PR:
In future PRs, we can close the loop by raising exceptions for "permanently infeasible" tasks. This would require the autoscaler to send statuses back to the scheduler about what task types are infeasible.
PRD doc: https://docs.google.com/document/d/1OT6m4xQDN8UtsBgnAMpX6nhXpNAfdeHJVve-iGhw1WI/edit#
Sample output on laptop (autoscaler output is unchanged):
Related issue number
Closes #15933
TODO: