[train v2+tune] Add an environment variable to disable running the TrainController
as an actor
#49522
Signed-off-by: Justin Yu <[email protected]>
    scheduling_strategy=NodeAffinitySchedulingStrategy(
        node_id=ray.get_runtime_context().get_node_id(), soft=False
    ),
    runtime_env={"env_vars": get_env_vars_to_propagate()},
Unrelated to this PR, but would this merge with environment variables set by a user-defined runtime env?
I don't know actually, I'll test it out
This will update a user-defined runtime env, which is as expected.
import os

import ray

# User-defined runtime env sets the variable to "0".
ray.init(runtime_env={"env_vars": {"RAY_TRAIN_V2_ENABLED": "0"}})

import ray.train
from ray.train.v2.api.data_parallel_trainer import DataParallelTrainer

# The driver process environment sets the same variable to "1".
os.environ["RAY_TRAIN_V2_ENABLED"] = "1"

trainer = DataParallelTrainer(
    # Each worker prints the value it actually sees.
    lambda: print(os.getenv("RAY_TRAIN_V2_ENABLED")),
    scaling_config=ray.train.ScalingConfig(num_workers=2),
)
trainer.fit()
(RayTrainWorker pid=69742) 1
(RayTrainWorker pid=69742)
(RayTrainWorker pid=69743) 1
(RayTrainWorker pid=69743)
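The output above is consistent with a dict-style merge in which the propagated variables take precedence over the user-defined ones. A minimal sketch of that precedence follows. It is not Ray's actual implementation: `merge_runtime_env_vars` is a hypothetical helper, and the `MY_FLAG` variable is made up for illustration.

```python
from typing import Dict, Optional


def merge_runtime_env_vars(
    user_runtime_env: Optional[dict],
    propagated_env_vars: Dict[str, str],
) -> dict:
    """Return a new runtime env whose env_vars combine both sources.

    Hypothetical helper (an assumption, not a Ray API): on a key conflict
    the propagated value wins, matching the worker output above ("1", not "0").
    """
    merged = dict(user_runtime_env or {})
    # Copy so the caller's dict is never mutated.
    env_vars = dict(merged.get("env_vars", {}))
    env_vars.update(propagated_env_vars)
    merged["env_vars"] = env_vars
    return merged


# User-defined runtime env: disables Train v2 and sets an unrelated variable.
user_env = {"env_vars": {"RAY_TRAIN_V2_ENABLED": "0", "MY_FLAG": "x"}}

# Variables propagated from the driver process override the conflicting key
# and leave the unrelated one in place.
result = merge_runtime_env_vars(user_env, {"RAY_TRAIN_V2_ENABLED": "1"})
print(result["env_vars"])
```

Variables that only the user set are kept, so the update is additive rather than a wholesale replacement of the user's `env_vars`.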
…_revamp/run_controller_as_actor
Signed-off-by: Justin Yu <[email protected]>
…rainController` as an actor (ray-project#49522) For the Train v2 + Tune integration, the `TrainController` cannot run as a separate actor, since callbacks would run in a separate process and would not be able to call `ray.tune.report` to propagate intermediate metrics/checkpoints to Tune. Therefore, Train needs to be able to run in a mode where the `TrainController` just runs on the process that `trainer.fit()` was called in. For Tune, this is the function Trainable that acts as the Ray Train driver. This is an internal implementation detail, which is why I introduce this as an environment variable that Ray Tune will set automatically. --------- Signed-off-by: Justin Yu <[email protected]>
Add an integration test for Train v2 + Tune. This PR also sets the environment variable introduced in #49522 to not run `TrainController` as an actor when running in a Tune trainable actor. --------- Signed-off-by: Justin Yu <[email protected]>
Why are these changes needed?
Add the `RAY_TRAIN_RUN_CONTROLLER_AS_ACTOR` environment variable, which disables running the `TrainController` as an actor.

For the Train v2 + Tune integration, the `TrainController` cannot run as a separate actor, since callbacks would run in a separate process and would not be able to call `ray.tune.report` to propagate intermediate metrics/checkpoints to Tune. Therefore, Train needs to be able to run in a mode where the `TrainController` runs in the same process that `trainer.fit()` was called in. For Tune, this is the function Trainable that acts as the Ray Train driver.

This is an internal implementation detail, which is why I introduce this as an environment variable that Ray Tune will set automatically.
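The toggle described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: only the variable name `RAY_TRAIN_RUN_CONTROLLER_AS_ACTOR` comes from the description, while the default of `"1"`, the accepted "off" values, and the helpers `should_run_controller_as_actor` and `fit` are hypothetical. The actor launch is passed in as a plain callable so the sketch runs without a Ray cluster.

```python
import os
from typing import Callable, Mapping, Optional

# Name from the PR description. The default value and parsing are assumptions.
RUN_CONTROLLER_AS_ACTOR_ENV_VAR = "RAY_TRAIN_RUN_CONTROLLER_AS_ACTOR"


def should_run_controller_as_actor(
    environ: Optional[Mapping[str, str]] = None,
) -> bool:
    """Hypothetical helper: True unless the variable is set to an 'off' value."""
    environ = os.environ if environ is None else environ
    raw = environ.get(RUN_CONTROLLER_AS_ACTOR_ENV_VAR, "1")
    return raw.strip().lower() not in {"0", "false"}


def fit(
    run_controller: Callable[[], str],
    launch_as_actor: Callable[[Callable[[], str]], str],
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Hypothetical dispatch between the two modes.

    launch_as_actor stands in for starting the controller in a separate
    actor process. Otherwise the controller runs in the calling process,
    so its callbacks can reach process-local state such as a Tune session.
    """
    if should_run_controller_as_actor(environ):
        return launch_as_actor(run_controller)
    return run_controller()


# Stand-ins for the real controller loop and the real actor launch.
def controller_loop() -> str:
    return "controller finished"


def fake_actor_launch(fn: Callable[[], str]) -> str:
    return "actor: " + fn()


default_mode = fit(controller_loop, fake_actor_launch, environ={})
in_process_mode = fit(
    controller_loop,
    fake_actor_launch,
    environ={RUN_CONTROLLER_AS_ACTOR_ENV_VAR: "0"},
)
print(default_mode)
print(in_process_mode)
```

Because the switch is an environment variable rather than a public argument, Ray Tune can set it inside the trainable without changing the user-facing `trainer.fit()` signature.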