
[train v2+tune] Add Train v2 + Tune integration test #49601

Merged

Conversation

@justinvyu (Contributor) commented Jan 6, 2025

Summary

Add an integration test for Train v2 + Tune.

This PR also sets the environment variable introduced in #49522 so that the `TrainController` is not run as an actor when running inside a Tune trainable actor.
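For context, the pattern being tested looks roughly like the sketch below: a Tune function trainable launches a Train v2 run for each trial. This is a minimal illustration based on the public Tune/Train APIs, not the test added in this PR; the trainer class, the per-worker function body, and the Tuner wiring are assumptions.

```python
import ray.train
import ray.train.torch
import ray.tune


def train_fn_per_worker(config):
    # Runs on each Ray Train worker; report a dummy metric like the test does.
    ray.train.report({"loss": 0.1})


def launch_training(tune_config):
    # Runs inside each Tune trial (a function trainable). With the environment
    # variable from #49522 applied, the Train controller executes in this
    # process instead of being scheduled as yet another actor.
    trainer = ray.train.torch.TorchTrainer(
        train_fn_per_worker,
        train_loop_config=tune_config["train_loop_config"],
        scaling_config=ray.train.ScalingConfig(num_workers=2),
    )
    trainer.fit()


tuner = ray.tune.Tuner(
    launch_training,
    param_space={"train_loop_config": {"trial_idx": ray.tune.grid_search([0, 1])}},
)
tuner.fit()
```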

Remaining Issues

This raylet error message appears flakily when the Trainable / Ray Train driver exits. It complains about an actor task being killed, and it always points to `SynchronizationActor.__init__`. The SynchronizationActor should be killed during worker group shutdown, so it's unclear why this message is being printed. It does not seem to cause any major issues, though.

```
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff780b169b8fdb5d40fa0c077101000000 Worker ID: afa2477441ea81b0ad3cf11ac5f7a796482b26741e38789cdb083f13 Node ID: 033c8d8f54a1ccc37a31dc52500baff1a7c7d4c0102472c2e2c71e13 Worker IP address: 127.0.0.1 Worker port: 61062 Worker PID: 64260 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker exits unexpectedly by a signal. SystemExit is raised (sys.exit is called). Exit code: 1. The process receives a SIGTERM.
```

@justinvyu justinvyu changed the title [train v2+tune] Add working Tune integration test [train v2+tune] Add Train v2 + Tune integration test Jan 6, 2025
@hongpeng-guo (Contributor) left a comment:

Thanks! Left some comments.

Signed-off-by: Justin Yu <[email protected]>
```python
ray.train.report({"loss": 0.1}, checkpoint=checkpoint)


def launch_training(tune_config):
    # TODO: Add TuneReportCallback to report intermediate metrics
```
Contributor:

Would this report all metrics or only those with checkpoints?

@justinvyu (Contributor, Author):

All metrics. Will only append checkpoint path as another metric if a checkpoint was reported.
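In other words, the forwarding behavior described above amounts to something like this small sketch (a hypothetical helper for illustration, not the actual TuneReportCallback implementation):

```python
from typing import Optional


def build_tune_metrics(metrics: dict, checkpoint_path: Optional[str]) -> dict:
    # Forward every reported metric as-is; append the checkpoint path as an
    # extra metric only when a checkpoint accompanied the report.
    forwarded = dict(metrics)
    if checkpoint_path is not None:
        forwarded["checkpoint_path"] = checkpoint_path
    return forwarded
```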

Comment on lines 71 to 75:

```python
param_space={
    "train_loop_config": {
        "trial_idx": ray.tune.grid_search(list(range(num_trials)))
    }
},
```
Contributor:

This is nice!! If we're using this as an "example" it might also be good to include more params outside of `train_loop_config`, e.g. `num_workers_per_trial`.
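As a sketch of that suggestion (the `num_workers_per_trial` key and its mapping onto the ScalingConfig are illustrative assumptions, not part of this PR's diff), the param space could carry a resource-level parameter that the launch function consumes outside of `train_loop_config`:

```python
import ray.train
import ray.train.torch
import ray.tune

num_trials = 2  # illustrative


def train_fn_per_worker(config):
    ray.train.report({"loss": 0.1})


param_space = {
    # Resource-level parameter swept outside of train_loop_config.
    "num_workers_per_trial": ray.tune.grid_search([1, 2]),
    "train_loop_config": {
        "trial_idx": ray.tune.grid_search(list(range(num_trials)))
    },
}


def launch_training(tune_config):
    # Map the resource-level parameter onto the ScalingConfig; everything
    # under train_loop_config still flows to the per-worker train function.
    trainer = ray.train.torch.TorchTrainer(
        train_fn_per_worker,
        train_loop_config=tune_config["train_loop_config"],
        scaling_config=ray.train.ScalingConfig(
            num_workers=tune_config["num_workers_per_trial"]
        ),
    )
    trainer.fit()
```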

@hongpeng-guo (Contributor) left a comment:

LGTM with a minor question.

```
@@ -64,6 +65,17 @@ def setup(self, config):
        )
        self._last_training_result: Optional[_TrainingResult] = None

        # NOTE: This environment variable is used to disable the
```
Contributor:

It makes sense to me that the controller is not run as another new actor with Tune. I just want to double-check that the structured logging still looks good. I think structured logging is turned on by default, so a `ray.train` logger will be configured on the FunctionTrainable / train driver / train controller process. On this process, Train library calls will generate structured logs, but Tune functions will not be captured. But I think it should be OK.

@justinvyu (Contributor, Author):

Gotcha. Yeah, I think that should be ok -- structuring Tune logs would need to be handled as a separate effort.
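For readers unfamiliar with the mechanism, here is a hypothetical illustration of the kind of environment-variable guard discussed in this thread. The variable name below is made up for the sketch; the real one is introduced in ray-project#49522 and is intentionally not reproduced here.

```python
import os

# Hypothetical variable name, for illustration only.
_RUN_CONTROLLER_AS_ACTOR_ENV_VAR = "EXAMPLE_RUN_CONTROLLER_AS_ACTOR"


def run_controller_as_actor() -> bool:
    # Default: run the TrainController as its own actor. A Tune trainable's
    # setup() can set this to "0" so the controller runs in-process instead
    # of spawning another actor inside the trial.
    return os.environ.get(_RUN_CONTROLLER_AS_ACTOR_ENV_VAR, "1") == "1"
```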

@justinvyu justinvyu enabled auto-merge (squash) January 15, 2025 20:12
@github-actions github-actions bot added the go label (add ONLY when ready to merge, run all tests) Jan 15, 2025
Signed-off-by: Justin Yu <[email protected]>
@github-actions github-actions bot disabled auto-merge January 16, 2025 00:32
@justinvyu justinvyu merged commit 9a65d0c into ray-project:master Jan 16, 2025
5 checks passed
@justinvyu justinvyu deleted the tune_revamp/working_integration branch January 16, 2025 17:49
srinathk10 pushed a commit that referenced this pull request Feb 2, 2025
anyadontfly pushed a commit to anyadontfly/ray that referenced this pull request Feb 13, 2025
Labels: go (add ONLY when ready to merge, run all tests)
3 participants