
[train v2+tune] Add Train v2 + Tune integration test #49601

Merged

Conversation

@justinvyu (Contributor) commented Jan 6, 2025

Summary

Add an integration test for Train v2 + Tune.

This PR also sets the environment variable introduced in #49522 so that the `TrainController` is not run as an actor when running inside a Tune trainable actor.
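For context, the pattern being tested looks roughly like the sketch below: a Tune function trainable launches a Train v2 run for each trial. This is a minimal illustration based on the public Tune/Train APIs, not the test added in this PR; the trainer class, the per-worker function body, and the Tuner wiring are assumptions.

```python
import ray.train
import ray.train.torch
import ray.tune


def train_fn_per_worker(config):
    # Runs on each Ray Train worker; report a dummy metric like the test does.
    ray.train.report({"loss": 0.1})


def launch_training(tune_config):
    # Runs inside each Tune trial (a function trainable). With the environment
    # variable from #49522 applied, the Train controller executes in this
    # process instead of being scheduled as yet another actor.
    trainer = ray.train.torch.TorchTrainer(
        train_fn_per_worker,
        train_loop_config=tune_config["train_loop_config"],
        scaling_config=ray.train.ScalingConfig(num_workers=2),
    )
    trainer.fit()


tuner = ray.tune.Tuner(
    launch_training,
    param_space={"train_loop_config": {"trial_idx": ray.tune.grid_search([0, 1])}},
)
tuner.fit()
```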

Remaining Issues

This raylet error message appears flakily when the Trainable / Ray Train driver exits. It complains about an actor task being killed, and it always points to `SynchronizationActor.__init__`. The SynchronizationActor should be killed during worker group shutdown, so it's unclear why this message is being printed. It does not seem to cause any major issues, though.

```
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff780b169b8fdb5d40fa0c077101000000 Worker ID: afa2477441ea81b0ad3cf11ac5f7a796482b26741e38789cdb083f13 Node ID: 033c8d8f54a1ccc37a31dc52500baff1a7c7d4c0102472c2e2c71e13 Worker IP address: 127.0.0.1 Worker port: 61062 Worker PID: 64260 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker exits unexpectedly by a signal. SystemExit is raised (sys.exit is called). Exit code: 1. The process receives a SIGTERM.
```

@justinvyu justinvyu changed the title [train v2+tune] Add working Tune integration test [train v2+tune] Add Train v2 + Tune integration test Jan 6, 2025
@hongpeng-guo (Contributor) left a comment:

Thanks! Left some comments.

Signed-off-by: Justin Yu <[email protected]>
```python
ray.train.report({"loss": 0.1}, checkpoint=checkpoint)


def launch_training(tune_config):
    # TODO: Add TuneReportCallback to report intermediate metrics
```
Contributor:

Would this report all metrics or only those with checkpoints?

@justinvyu (Contributor, Author):

All metrics. Will only append checkpoint path as another metric if a checkpoint was reported.
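In other words, the forwarding behavior described above amounts to something like this small sketch (a hypothetical helper for illustration, not the actual TuneReportCallback implementation):

```python
from typing import Optional


def build_tune_metrics(metrics: dict, checkpoint_path: Optional[str]) -> dict:
    # Forward every reported metric as-is; append the checkpoint path as an
    # extra metric only when a checkpoint accompanied the report.
    forwarded = dict(metrics)
    if checkpoint_path is not None:
        forwarded["checkpoint_path"] = checkpoint_path
    return forwarded
```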

Comment on lines 71 to 75:

```python
param_space={
    "train_loop_config": {
        "trial_idx": ray.tune.grid_search(list(range(num_trials)))
    }
},
```
Contributor:

This is nice!! If we're using this as an "example" it might also be good to include more params outside of `train_loop_config`, e.g. `num_workers_per_trial`.
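As a sketch of that suggestion (the `num_workers_per_trial` key and its mapping onto the ScalingConfig are illustrative assumptions, not part of this PR's diff), the param space could carry a resource-level parameter that the launch function consumes outside of `train_loop_config`:

```python
import ray.train
import ray.train.torch
import ray.tune

num_trials = 2  # illustrative


def train_fn_per_worker(config):
    ray.train.report({"loss": 0.1})


param_space = {
    # Resource-level parameter swept outside of train_loop_config.
    "num_workers_per_trial": ray.tune.grid_search([1, 2]),
    "train_loop_config": {
        "trial_idx": ray.tune.grid_search(list(range(num_trials)))
    },
}


def launch_training(tune_config):
    # Map the resource-level parameter onto the ScalingConfig; everything
    # under train_loop_config still flows to the per-worker train function.
    trainer = ray.train.torch.TorchTrainer(
        train_fn_per_worker,
        train_loop_config=tune_config["train_loop_config"],
        scaling_config=ray.train.ScalingConfig(
            num_workers=tune_config["num_workers_per_trial"]
        ),
    )
    trainer.fit()
```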

@hongpeng-guo (Contributor) left a comment:

LGTM with a minor question.

```
@@ -64,6 +65,17 @@ def setup(self, config):
        )
        self._last_training_result: Optional[_TrainingResult] = None

        # NOTE: This environment variable is used to disable the
```
Contributor:

It makes sense to me that the controller is not run as another new actor with Tune. I just want to double-check that the structured logging still looks good. I think structured logging is turned on by default, so a `ray.train` logger will be configured on the FunctionTrainable / train driver / train controller process. On this process, Train library calls will generate structured logs, but Tune functions will not be captured. But I think it should be OK.

@justinvyu (Contributor, Author):

Gotcha. Yeah, I think that should be ok -- structuring Tune logs would need to be handled as a separate effort.
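For readers unfamiliar with the mechanism, here is a hypothetical illustration of the kind of environment-variable guard discussed in this thread. The variable name below is made up for the sketch; the real one is introduced in ray-project#49522 and is intentionally not reproduced here.

```python
import os

# Hypothetical variable name, for illustration only.
_RUN_CONTROLLER_AS_ACTOR_ENV_VAR = "EXAMPLE_RUN_CONTROLLER_AS_ACTOR"


def run_controller_as_actor() -> bool:
    # Default: run the TrainController as its own actor. A Tune trainable's
    # setup() can set this to "0" so the controller runs in-process instead
    # of spawning another actor inside the trial.
    return os.environ.get(_RUN_CONTROLLER_AS_ACTOR_ENV_VAR, "1") == "1"
```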

@justinvyu justinvyu enabled auto-merge (squash) January 15, 2025 20:12
@github-actions github-actions bot added the go label (add ONLY when ready to merge, run all tests) Jan 15, 2025
Signed-off-by: Justin Yu <[email protected]>
@github-actions github-actions bot disabled auto-merge January 16, 2025 00:32
@justinvyu justinvyu merged commit 9a65d0c into ray-project:master Jan 16, 2025
5 checks passed
@justinvyu justinvyu deleted the tune_revamp/working_integration branch January 16, 2025 17:49
srinathk10 pushed a commit that referenced this pull request Feb 2, 2025
anyadontfly pushed a commit to anyadontfly/ray that referenced this pull request Feb 13, 2025
Labels: go (add ONLY when ready to merge, run all tests)
3 participants