[train v2+tune] Add an environment variable to disable running the `TrainController` as an actor #49522

justinvyu · 2025-01-01T02:13:42Z

Why are these changes needed?

Add the `RAY_TRAIN_RUN_CONTROLLER_AS_ACTOR` environment variable, which can be used to disable running the `TrainController` as an actor.

For the Train v2 + Tune integration, the `TrainController` cannot run as a separate actor: its callbacks would run in a separate process and could not call `ray.tune.report` to propagate intermediate metrics/checkpoints to Tune. Train therefore needs a mode in which the `TrainController` runs in the same process that called `trainer.fit()`. For Tune, this is the function Trainable that acts as the Ray Train driver. This is an internal implementation detail, which is why I introduce it as an environment variable that Ray Tune will set automatically.
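To make the toggle concrete, here is a minimal sketch of how such a flag could be read at `fit()` time. Only the variable name `RAY_TRAIN_RUN_CONTROLLER_AS_ACTOR` comes from this PR. The helper names, the default of running as an actor when the variable is unset, and the accepted "off" values are assumptions for illustration, not the actual Ray Train implementation.

```python
import os
from typing import Callable, TypeVar

T = TypeVar("T")

# Name taken from this PR; everything else below is a hypothetical sketch.
RUN_CONTROLLER_AS_ACTOR_ENV_VAR = "RAY_TRAIN_RUN_CONTROLLER_AS_ACTOR"


def should_run_controller_as_actor(default: bool = True) -> bool:
    """Return whether the controller should run as a separate actor.

    Assumption: an unset variable means "run as an actor", and the
    strings "0" and "false" (case-insensitive) disable it.
    """
    raw = os.environ.get(RUN_CONTROLLER_AS_ACTOR_ENV_VAR)
    if raw is None:
        return default
    return raw.strip().lower() not in ("0", "false")


def fit(run_in_actor: Callable[[], T], run_in_process: Callable[[], T]) -> T:
    """Dispatch to one of two hypothetical execution paths.

    The flag is read when fit() is called, not at import time, so a caller
    (for example a Tune trainable) can set it just before calling fit().
    """
    if should_run_controller_as_actor():
        return run_in_actor()
    return run_in_process()
```

Reading the flag inside `fit()` rather than at import time lets the enclosing process set it after the module has already been imported.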

Signed-off-by: Justin Yu <[email protected]>

python/ray/train/v2/api/data_parallel_trainer.py

matthewdeng · 2025-01-03T20:57:12Z

python/ray/train/v2/api/data_parallel_trainer.py

+                scheduling_strategy=NodeAffinitySchedulingStrategy(
+                    node_id=ray.get_runtime_context().get_node_id(), soft=False
+                ),
+                runtime_env={"env_vars": get_env_vars_to_propagate()},


Unrelated to this PR, but would this merge with environment variables set by a user-defined runtime env?

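justinvyu · Jan 3, 2025

I don't know actually, I'll test it out.

justinvyu · Jan 4, 2025

The propagated environment variables update a user-defined runtime env, which is the expected behavior. In the test below, the driver's value ("1") overrides the value set in the user's runtime env ("0"):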

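```python
import ray
import os

# The user-defined runtime env sets the variable to "0".
ray.init(runtime_env={"env_vars": {"RAY_TRAIN_V2_ENABLED": "0"}})

import ray.train
from ray.train.v2.api.data_parallel_trainer import DataParallelTrainer

# The driver process sets the variable to "1" before creating the trainer.
os.environ["RAY_TRAIN_V2_ENABLED"] = "1"

trainer = DataParallelTrainer(
    # Each worker prints the value it sees.
    lambda: print(os.getenv("RAY_TRAIN_V2_ENABLED")),
    scaling_config=ray.train.ScalingConfig(num_workers=2),
)
trainer.fit()
```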

```
(RayTrainWorker pid=69742) 1
(RayTrainWorker pid=69742)
(RayTrainWorker pid=69743) 1
(RayTrainWorker pid=69743)
```
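The observed result is consistent with a dictionary update in which propagated variables take precedence over the user's `env_vars`. The sketch below only illustrates that precedence. `merge_env_vars` and `MY_FLAG` are hypothetical names, and the assumption that non-overlapping user keys are preserved is not something the test above verifies. This is not Ray's actual merge logic.

```python
from typing import Dict


def merge_env_vars(
    user_env_vars: Dict[str, str], propagated_env_vars: Dict[str, str]
) -> Dict[str, str]:
    """Hypothetical merge: propagated values win on overlapping keys."""
    merged = dict(user_env_vars)  # copy so the user's dict is not mutated
    merged.update(propagated_env_vars)  # later values override earlier ones
    return merged


# Mirrors the test above: the user's runtime env says "0", the driver says "1".
user_runtime_env = {"env_vars": {"RAY_TRAIN_V2_ENABLED": "0", "MY_FLAG": "x"}}
propagated = {"RAY_TRAIN_V2_ENABLED": "1"}

merged = merge_env_vars(user_runtime_env["env_vars"], propagated)
```

Under these assumptions, `merged["RAY_TRAIN_V2_ENABLED"]` holds the driver's value, matching what the workers printed.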

python/ray/train/v2/tests/test_data_parallel_trainer.py

…rainController` as an actor (ray-project#49522)

For the Train v2 + Tune integration, the `TrainController` cannot run as a separate actor: its callbacks would run in a separate process and could not call `ray.tune.report` to propagate intermediate metrics/checkpoints to Tune. Train therefore needs a mode in which the `TrainController` runs in the same process that called `trainer.fit()`. For Tune, this is the function Trainable that acts as the Ray Train driver. This is an internal implementation detail, which is why I introduce it as an environment variable that Ray Tune will set automatically.

Signed-off-by: Justin Yu <[email protected]>

Add an integration test for Train v2 + Tune. This PR also sets the environment variable introduced in #49522 so that the `TrainController` does not run as an actor when running in a Tune trainable actor.

Signed-off-by: Justin Yu <[email protected]>

justinvyu added 2 commits December 31, 2024 18:11

add an env var for running the controller as an actor

6935ddc

Signed-off-by: Justin Yu <[email protected]>

fix lint

35f12f7

Signed-off-by: Justin Yu <[email protected]>

justinvyu requested review from hongpeng-guo, matthewdeng, raulchen and woshiyyya as code owners January 1, 2025 02:13

matthewdeng approved these changes Jan 3, 2025


justinvyu added 2 commits January 3, 2025 16:47

Merge branch 'master' of https://github.com/ray-project/ray into tune…

c50f16d

…_revamp/run_controller_as_actor

read env var on fit

af89b51

Signed-off-by: Justin Yu <[email protected]>

justinvyu enabled auto-merge (squash) January 4, 2025 01:32

github-actions bot added the `go` label (add ONLY when ready to merge, run all tests) Jan 4, 2025

justinvyu merged commit 644bc08 into ray-project:master Jan 4, 2025
7 checks passed

justinvyu deleted the tune_revamp/run_controller_as_actor branch January 6, 2025 17:57

justinvyu mentioned this pull request Jan 6, 2025

[train v2+tune] Add Train v2 + Tune integration test #49601

Merged

justinvyu mentioned this pull request Jan 17, 2025

[train v2+tune] Add TuneReportCallback for propagating intermediate Train results to Tune #49927

Merged
