Train + Tune API Revamp #57
Conversation
This is legendary! Thanks for putting everything together.
Will approve before getting public discussion feedback.
```python
trainer = TorchTrainer(
    train_fn_per_worker,
    ...,
    run_config=ray.train.RunConfig(...),
)
```
A few questions:

1. If there is nothing to restore, e.g., this is the first time the job is submitted, then `unique_run_name` will be `None` and the `TorchTrainer` will see that there is no checkpoint to load from and will not load anything?
2. What's the reason for removing the explicit `restore` call?
3. Why do I need the storage info in `RunConfig`? Couldn't that all just go into `get_checkpoint`?
1. That's right.
2. This is actually something I was debating back and forth about. This section outlines some of the issues with the existing `TorchTrainer.restore` API:
   - Re-implementing with exact API parity would require us to pickle the user code and Python objects again.
   - The explicit `Trainer.restore` API also had some UX friction (needing the user to manually concatenate their `storage_path` with the run `name`).
   - I took this chance to improve the usability by pulling the "restore path" from the `RunConfig` directly and avoiding all pickling. So, the `RunConfig(storage_path, name, storage_filesystem)` tuple is what uniquely defines a Ray Train run, and users should pass in a unique `name` per job rather than re-use the same `name` for multiple runs.
   - Using a colliding `name` runs into other problems, such as checkpoints being uploaded to the same directory with possible overwriting, so it's already unsupported and undefined behavior.
   - We can also definitely add a `restore_run_if_exists` flag to make the job restoration behavior explicit if users find this behavior too magical.
3. The storage info sets the location where checkpoints and Ray Train's driver state get saved. Without it, we can't load the state when restoring the run on a new Ray cluster, and we don't know what checkpoint to populate `get_checkpoint` with.
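To illustrate the behavior being proposed, here is a minimal sketch (the run name, storage bucket, and `train_fn_per_worker` are placeholders, and the automatic restore-from-`RunConfig` semantics are as described above rather than a finalized API):

```python
import ray.train
from ray.train.torch import TorchTrainer

# Hypothetical unique-per-job run name, e.g. derived from the job submission ID.
unique_run_name = "finetune-job-2025-01-01"

trainer = TorchTrainer(
    train_fn_per_worker,  # placeholder training function
    scaling_config=ray.train.ScalingConfig(num_workers=4, use_gpu=True),
    run_config=ray.train.RunConfig(
        # The (storage_path, name, storage_filesystem) tuple uniquely identifies the run.
        storage_path="s3://my-bucket/train-results",  # assumed bucket
        name=unique_run_name,
        # If this (storage_path, name) location already contains driver state from a
        # previous attempt of the same job, the run resumes from the latest checkpoint;
        # otherwise it starts fresh.
    ),
)
result = trainer.fit()
```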
* **Execution Control**:
  * It triggers the distributed training function across the workers in the `WorkerGroup` and shuts down the workers when training has either finished or failed.
* **Polling & Health Checking**:
  * The Controller regularly polls the workers to assess their health and training progress. Based on the worker statuses, it can handle errors and make scaling decisions.
This looks awesome. A few questions.
Training health monitoring is becoming more and more important and can be quite complex and compute intensive (not just looking at lots of metrics, but also running evals). Where does this logic live?
If the health monitoring takes a while to run but eventually decides that we need to restart from an earlier checkpoint, how does that get implemented?
> Where does this logic live?

This logic lives in the Train driver right now, where the controller actor periodically pings the worker actors with health-check tasks and handles any errors raised by those tasks.

> If the health monitoring takes a while to run but eventually decides that we need to restart from an earlier checkpoint, how does that get implemented?

Health checks run every few seconds at the moment, so the decision happens pretty quickly after Ray detects that a node/actor has died. We also handle more extreme edge cases where the underlying RPC times out randomly, in which case we'll wait for some configurable time before restarting from the latest checkpoint.

> (not just looking at lots of metrics, but also running evals)

Oh, maybe this part was a bit misleading. At the moment, "health and training progress" just refers to the underlying actor/node health and reported checkpoints, rather than the health of the training job itself (we do NOT implement anything to handle how well things are converging, how validation metrics are doing, etc.).
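For reference, the user-facing knob for this restart behavior is the failure config on the run. A hedged sketch of how that looks (the checkpoint-loading logic inside `train_fn_per_worker` is assumed to follow the `ray.train.get_checkpoint()` pattern shown elsewhere in this thread):

```python
import ray.train
from ray.train.torch import TorchTrainer

trainer = TorchTrainer(
    train_fn_per_worker,  # placeholder training function
    scaling_config=ray.train.ScalingConfig(num_workers=4, use_gpu=True),
    run_config=ray.train.RunConfig(
        # When the controller's health checks detect a dead worker/node, retry the
        # run (resuming from the latest reported checkpoint) up to 3 times.
        failure_config=ray.train.FailureConfig(max_failures=3),
    ),
)
```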
```python
def train_fn_per_worker(config: dict):
    # Equivalent behavior that is explicit and more flexible.
    checkpoint = (
        ray.train.get_checkpoint()
        or config.get("resume_from_checkpoint")
    )
```
Minor, but isn't `config.get("resume_from_checkpoint")` enough? Why do you also need `ray.train.get_checkpoint()`?
`ray.train.get_checkpoint` is still needed if you want to support fault tolerance, in addition to a checkpoint to start fine-tuning from initially. Otherwise, if a node fails, the training progress always gets reset to the initial checkpoint, throwing away all fine-tuning progress.
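To make the distinction concrete, here is a hedged sketch of the worker-side pattern (the `resume_from_checkpoint` config key and the `load_model_state` helper are illustrative, not part of the API): on the first run, `ray.train.get_checkpoint()` returns `None` and the worker falls back to the user-provided fine-tuning checkpoint; after a failure, it returns the latest reported checkpoint so fine-tuning progress isn't lost.

```python
import ray.train

def train_fn_per_worker(config: dict):
    # Prefer the latest checkpoint reported during this run (fault tolerance);
    # otherwise fall back to the user-provided checkpoint to fine-tune from.
    checkpoint = (
        ray.train.get_checkpoint()
        or config.get("resume_from_checkpoint")  # illustrative: a Checkpoint passed by the user
    )

    start_epoch = 0
    if checkpoint:
        with checkpoint.as_directory() as ckpt_dir:
            # `load_model_state` is a placeholder for user code that restores
            # model/optimizer state and returns the epoch to resume from.
            start_epoch = load_model_state(ckpt_dir)

    for epoch in range(start_epoch, config["num_epochs"]):
        ...  # training loop; periodically call ray.train.report(..., checkpoint=...)
```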
This REP was implemented and referenced across several follow-up PRs:

- #49376: Ray Tune and Ray Train have been tightly coupled since Ray 2.0, when Ray Tune became the common execution engine for both libraries. Ray Train execution invokes Tune’s execution logic under the hood, which leads to a complex, layered system. ray-project/enhancements#57 proposed a much clearer design to improve the **Usability**, **Extensibility**, **Interoperability**, and **Testability**. This PR contains the implementation of the REP for the revamped Ray Train, contained in the `python/ray/train/v2` directory.
- #49317: Adds the Tune APIs proposed in ray-project/enhancements#57, which are mostly just wrappers that pass through to existing Ray Train V1 classes/methods. Deprecation warnings are added for methods that need their import changed when using the old `ray.train` imports.
- #49519: To use the revamped Ray Train, users should set the `RAY_TRAIN_V2_ENABLED=1` feature flag environment variable on their job driver. However, when using Ray Tune to launch Ray Train jobs, this environment variable does not get propagated from the driver process to the Ray actor that now acts as the Ray Train driver process. This PR propagates the environment variable automatically.
- #49455: Populates all the deprecation warnings in V2 code and links to the migration issue (#49454). This covers all the dropped APIs mentioned in the REP.
- #50322: Enables deprecation and migration messages for the API changes listed in ray-project/enhancements#57.
- #50435: #49317 initiated the decoupling of the Ray Train and Ray Tune top-level APIs. This PR updates all internal usage in Ray Tune examples and tests to switch from `ray.air` (super outdated) and `ray.train` imports to `ray.tune` imports.
Summary
Ray Tune and Ray Train have been tightly coupled since Ray 2.0, when Ray Tune became the common execution engine for both libraries.
Ray Train execution invokes Tune’s execution logic under the hood, which leads to a complex, layered system.
The original intention behind this was to increase the interoperability of the two libraries, but the dependency of Ray Train on Ray Tune has led to many usability and stability issues, and it has stalled feature development.
The goal of these changes is to improve the **Usability**, **Extensibility**, **Interoperability**, and **Testability** of Ray Train and Ray Tune.