[train v2+tune] Add an environment variable to disable running the TrainController as an actor #49522

Merged

@justinvyu justinvyu commented Jan 1, 2025

Why are these changes needed?

Add the RAY_TRAIN_RUN_CONTROLLER_AS_ACTOR environment variable that disables the TrainController running as an actor.

For the Train v2 + Tune integration, the TrainController cannot run as a separate actor, because its callbacks would then run in a separate process and could not call ray.tune.report to propagate intermediate metrics and checkpoints to Tune. Train therefore needs a mode in which the TrainController runs in the same process that called trainer.fit(). For Tune, this is the function Trainable that acts as the Ray Train driver. This is an internal implementation detail, which is why I introduce it as an environment variable that Ray Tune will set automatically.
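A minimal sketch of how such a switch could be read and acted on. Only the variable name comes from the PR description; the default of "1" (actor mode), the use of "0" to disable it, and the helper names `controller_runs_as_actor` and `run_controller` are assumptions made for illustration, not Ray's actual implementation.

```python
import os

# Variable name from the PR description. The default and the accepted
# values below are assumptions made for illustration.
RUN_CONTROLLER_AS_ACTOR_ENV_VAR = "RAY_TRAIN_RUN_CONTROLLER_AS_ACTOR"


def controller_runs_as_actor(environ=None):
    """Return True unless the variable is explicitly set to "0"."""
    environ = os.environ if environ is None else environ
    return environ.get(RUN_CONTROLLER_AS_ACTOR_ENV_VAR, "1") != "0"


def run_controller(controller_fn, environ=None):
    """Hypothetical dispatcher: decide where the controller logic runs."""
    if controller_runs_as_actor(environ):
        # A real implementation would launch a separate actor process here;
        # this sketch only reports the decision.
        return ("actor", None)
    # In-process mode: the controller (and its callbacks) run in the caller's
    # process, so they can reach caller-side state such as a Tune session.
    return ("in_process", controller_fn())
```

With the variable set to "0", `run_controller` calls `controller_fn` directly in the current process, which is the property the Tune integration relies on.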

scheduling_strategy=NodeAffinitySchedulingStrategy(
    node_id=ray.get_runtime_context().get_node_id(), soft=False
),
runtime_env={"env_vars": get_env_vars_to_propagate()},
Contributor

Unrelated to this PR, but would this merge with environment variables set by a user-defined runtime env?

Contributor Author

I don't know actually, I'll test it out

Contributor Author

This will update a user-defined runtime env, which is as expected.

import os

import ray

# The user-defined runtime env sets the variable to "0".
ray.init(runtime_env={"env_vars": {"RAY_TRAIN_V2_ENABLED": "0"}})


import ray.train
from ray.train.v2.api.data_parallel_trainer import DataParallelTrainer

# The driver process sets the same variable to "1" after ray.init().
os.environ["RAY_TRAIN_V2_ENABLED"] = "1"

trainer = DataParallelTrainer(
    # Each worker prints the value it sees for the variable.
    lambda: print(os.getenv("RAY_TRAIN_V2_ENABLED")),
    scaling_config=ray.train.ScalingConfig(num_workers=2),
)
trainer.fit()

Output:

(RayTrainWorker pid=69742) 1
(RayTrainWorker pid=69742)
(RayTrainWorker pid=69743) 1
(RayTrainWorker pid=69743)
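
The behavior observed above amounts to a key-wise merge in which the propagated value wins over the user-defined one for the same key. A rough sketch of that merge using plain dictionaries; `merge_env_vars` and the `MY_FLAG` entry are hypothetical and only illustrate the observed result, not the code path Ray uses internally.

```python
import copy


def merge_env_vars(user_runtime_env, propagated_env_vars):
    """Return a copy of the runtime env with propagated variables merged in.

    Keys present in both places take the propagated value; all other
    user-defined variables are kept unchanged.
    """
    merged = copy.deepcopy(user_runtime_env)
    env_vars = dict(merged.get("env_vars", {}))
    env_vars.update(propagated_env_vars)
    merged["env_vars"] = env_vars
    return merged


# MY_FLAG is a made-up variable showing that unrelated keys survive the merge.
user_runtime_env = {"env_vars": {"RAY_TRAIN_V2_ENABLED": "0", "MY_FLAG": "x"}}
merged = merge_env_vars(user_runtime_env, {"RAY_TRAIN_V2_ENABLED": "1"})
print(merged["env_vars"])
```

The input dictionary is copied rather than mutated, so the user-defined runtime env is left as it was.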

@justinvyu justinvyu enabled auto-merge (squash) January 4, 2025 01:32
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Jan 4, 2025
@justinvyu justinvyu merged commit 644bc08 into ray-project:master Jan 4, 2025
7 checks passed
@justinvyu justinvyu deleted the tune_revamp/run_controller_as_actor branch January 6, 2025 17:57
roshankathawate pushed a commit to roshankathawate/ray that referenced this pull request Jan 7, 2025
…rainController` as an actor (ray-project#49522)

For the Train v2 + Tune integration, the `TrainController` cannot run as
a separate actor, since callbacks would run in a separate process and
would not be able to call `ray.tune.report` to propagate intermediate
metrics/checkpoints to Tune. Therefore, Train needs to be able to run in
a mode where the `TrainController` just runs on the process that
`trainer.fit()` was called in. For Tune, this is the function Trainable
that acts as the Ray Train driver. This is an internal implementation
detail, which is why I introduce this as an environment variable that
Ray Tune will set automatically.

---------

Signed-off-by: Justin Yu <[email protected]>
roshankathawate pushed a commit to roshankathawate/ray that referenced this pull request Jan 9, 2025
…rainController` as an actor (ray-project#49522)
Signed-off-by: Justin Yu <[email protected]>
Signed-off-by: Roshan Kathawate <[email protected]>
justinvyu added a commit that referenced this pull request Jan 16, 2025
Add an integration test for Train v2 + Tune.

This PR also sets the environment variable introduced in
#49522 to not run
`TrainController` as an actor when running in a Tune trainable actor.

---------

Signed-off-by: Justin Yu <[email protected]>
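
One way a trainable could set the variable around the call into Train is a small context manager that restores the previous value afterwards. This is a sketch under assumptions: the commit message says only that the variable is set when running in a Tune trainable actor, so the value "0", the restore-on-exit behavior, and the names `controller_in_process` and `trainable` are illustrative.

```python
import contextlib
import os

ENV_VAR = "RAY_TRAIN_RUN_CONTROLLER_AS_ACTOR"


@contextlib.contextmanager
def controller_in_process():
    """Temporarily set the variable to "0" (assumed to mean 'not an actor')."""
    previous = os.environ.get(ENV_VAR)
    os.environ[ENV_VAR] = "0"
    try:
        yield
    finally:
        # Restore whatever was there before, including "unset".
        if previous is None:
            os.environ.pop(ENV_VAR, None)
        else:
            os.environ[ENV_VAR] = previous


def trainable(config, fit_fn):
    """Hypothetical trainable body; fit_fn stands in for trainer.fit()."""
    with controller_in_process():
        return fit_fn(config)
```

Any code called through `fit_fn` sees the variable as "0", and the process environment is unchanged once `trainable` returns.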
srinathk10 pushed a commit that referenced this pull request Feb 2, 2025
Add an integration test for Train v2 + Tune.

This PR also sets the environment variable introduced in
#49522 to not run
`TrainController` as an actor when running in a Tune trainable actor.

---------

Signed-off-by: Justin Yu <[email protected]>
anyadontfly pushed a commit to anyadontfly/ray that referenced this pull request Feb 13, 2025
…rainController` as an actor (ray-project#49522)
Signed-off-by: Justin Yu <[email protected]>
Signed-off-by: Puyuan Yao <[email protected]>
anyadontfly pushed a commit to anyadontfly/ray that referenced this pull request Feb 13, 2025
Add an integration test for Train v2 + Tune.

This PR also sets the environment variable introduced in
ray-project#49522 to not run
`TrainController` as an actor when running in a Tune trainable actor.

---------

Signed-off-by: Justin Yu <[email protected]>
Signed-off-by: Puyuan Yao <[email protected]>