[train v2+tune] Add Train v2 + Tune integration test #49601
Conversation
Signed-off-by: Justin Yu <[email protected]>
Thanks! Left some comments.
Signed-off-by: Justin Yu <[email protected]>
```python
ray.train.report({"loss": 0.1}, checkpoint=checkpoint)


def launch_training(tune_config):
    # TODO: Add TuneReportCallback to report intermediate metrics
```
Would this report all metrics or only those with checkpoints?
All metrics. Will only append checkpoint path as another metric if a checkpoint was reported.
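For illustration, a minimal sketch of the behavior described above; `_relay_to_tune` and the `checkpoint_path` key are hypothetical names for this sketch, not APIs from this PR:

```python
from typing import Optional


def _relay_to_tune(metrics: dict, checkpoint_path: Optional[str]) -> dict:
    """Forward every metric reported from the Train loop to Tune.

    If a checkpoint was reported alongside the metrics, append its path
    as one more metric; otherwise pass the metrics through unchanged.
    """
    tune_metrics = dict(metrics)
    if checkpoint_path is not None:
        tune_metrics["checkpoint_path"] = checkpoint_path
    return tune_metrics
```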
```python
param_space={
    "train_loop_config": {
        "trial_idx": ray.tune.grid_search(list(range(num_trials)))
    }
},
```
This is nice!! If we're using this as an "example" it might also be good to include more params outside of `train_loop_config`, e.g. `num_workers_per_trial`.
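A minimal sketch of what that suggestion might look like; the top-level `num_workers` key and the `lr` hyperparameter are illustrative assumptions, not names confirmed by this PR:

```python
import ray.tune

num_trials = 2  # stands in for the value used by the surrounding test

# Sketch: search over a parameter outside of train_loop_config as well.
param_space = {
    # Assumed key; the real integration may expose worker count under a
    # different name (e.g. the suggested num_workers_per_trial).
    "num_workers": ray.tune.grid_search([2, 4]),
    "train_loop_config": {
        "trial_idx": ray.tune.grid_search(list(range(num_trials))),
        # Illustrative hyperparameter inside the train loop config.
        "lr": ray.tune.loguniform(1e-4, 1e-2),
    },
}
```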
…_revamp/working_integration
Signed-off-by: Justin Yu <[email protected]>
LGTM with a minor question.
```
@@ -64,6 +65,17 @@ def setup(self, config):
        )
        self._last_training_result: Optional[_TrainingResult] = None

        # NOTE: This environment variable is used to disable the
```
It makes sense to me that the controller is not another new actor with Tune. I just want to double check that structured logging still looks good. I believe structured logging is turned on by default, so a `ray.train` logger will be configured on the FunctionTrainable/train_driver/train_controller process. On that process, Train library calls will generate structured logs, but Tune functions will not be captured. I think that should be OK, though.
Gotcha. Yeah, I think that should be ok -- structuring Tune logs would need to be handled as a separate effort.
…_revamp/working_integration
Signed-off-by: Justin Yu <[email protected]>
Summary
Add an integration test for Train v2 + Tune.
This PR also sets the environment variable introduced in #49522 to not run `TrainController` as an actor when running in a Tune trainable actor.
Remaining Issues
This raylet error message appears flakily when the Trainable / Ray Train driver exits. It complains about an actor task being killed, and it always points to `SynchronizationActor.__init__`. The `SynchronizationActor` should be killed during worker group shutdown, so it is unclear why this message is being printed. From what I can tell, it does not cause any major issues, though.