
Train + Tune API Revamp #57

Merged 21 commits into main on Dec 14, 2024

Conversation

@justinvyu (Contributor) commented Oct 18, 2024

Summary

Ray Tune and Ray Train have been tightly coupled since Ray 2.0, when Ray Tune became the common execution engine for both libraries.

Ray Train execution invokes Tune’s execution logic under the hood, which leads to a complex, layered system.
The original intention behind this was to increase the interoperability of the two libraries, but the dependency of Ray Train on Ray Tune has led to many usability and stability issues, and it has stalled feature development.

The goal of these changes is to improve:

  • Usability: largely preserve feature parity while introducing more intuitive APIs in place of inherited Ray Tune APIs that do not fit the context of Ray Train.
  • Extensibility: introduce more modular execution components that can be customized more easily.
  • Interoperability: make using Ray Tune and Ray Train together more natural, while keeping a clean separation between the two libraries.
  • Testability: rely on proper unit tests rather than hundreds of mini end-to-end tests.

@justinvyu justinvyu marked this pull request as ready for review October 24, 2024 01:50
@hongpeng-guo previously approved these changes Oct 24, 2024 and left a comment:

This is legendary! Thanks for putting everything together.

@hongpeng-guo dismissed their stale review October 24, 2024 18:09

Will approve after getting public discussion feedback.

@justinvyu justinvyu changed the title [WIP] Train + Tune API Revamp Train + Tune API Revamp Oct 30, 2024
trainer = TorchTrainer(
    train_fn_per_worker,
    ...,
    run_config=ray.train.RunConfig(
        name=unique_run_name,
        ...,
    ),
)

A few questions

  1. If there is nothing to restore, e.g., this is the first time the job is submitted, then unique_run_name will be None and the TorchTrainer will see that there is no checkpoint to load from and will not load anything?
  2. What's the reason for removing the explicit restore call?
  3. Why do I need the storage info in RunConfig? Couldn't that all just go into get_checkpoint?

@justinvyu (author) replied:

  1. That's right.
  2. This is actually something I was debating back and forth about. This section outlines some of the issues with the existing TorchTrainer.restore API.
    • Re-implementing with exact API parity would require us to pickle the user code and Python objects again.
    • The explicit Trainer.restore API also had some UX friction (it needed the user to manually concatenate their storage_path with the run name).
    • I took this chance to improve usability by pulling the "restore path" from the RunConfig directly and avoiding all pickling. So the RunConfig(storage_path, name, storage_filesystem) tuple is what uniquely defines a Ray Train run, and users should pass in a unique name per job rather than re-use the same name for multiple runs (see the sketch below).
    • Using a colliding name runs into other problems, such as checkpoints being uploaded to the same directory with possible overwriting, so it's already unsupported, undefined behavior.
    • We can also definitely add a restore_run_if_exists flag to make the job restoration behavior explicit if users find this behavior too magical.
  3. The storage info sets the location where checkpoints and Ray Train's driver state get saved. Without it, we can't load the state when restoring the run from a new Ray cluster, and we don't know which checkpoint to populate get_checkpoint with.
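
For illustration, a minimal sketch of this pattern; the run name and storage path values here are hypothetical:

```python
import ray.train
from ray.train.torch import TorchTrainer

def train_fn_per_worker(config: dict):
    ...  # distributed training logic

# The RunConfig(storage_path, name, storage_filesystem) tuple uniquely
# identifies a run: re-submitting a job with the same values restores driver
# state and the latest checkpoint, while a fresh, unique name starts a new run.
trainer = TorchTrainer(
    train_fn_per_worker,
    run_config=ray.train.RunConfig(
        name="finetune-run-001",                   # unique name per job
        storage_path="s3://my-bucket/train-runs",  # checkpoints + driver state
    ),
)
result = trainer.fit()
```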

* **Execution Control**: It triggers the distributed training function across the workers in the `WorkerGroup` and shuts down the workers when training has either finished or failed.
* **Polling & Health Checking**: The Controller regularly polls the workers to assess their health and training progress. Based on the worker statuses, it can handle errors and make scaling decisions.

This looks awesome. A few questions.
Training health monitoring is becoming more and more important and can be quite complex and compute-intensive (not just looking at lots of metrics, but also running evals). Where does this logic live?

If the health monitoring takes a while to run but eventually decides that we need to restart from an earlier checkpoint, how does that get implemented?

@justinvyu (author) replied Dec 13, 2024:

> Where does this logic live?

This logic lives in the Train driver right now, where the controller actor periodically pings the worker actor tasks and handles any errors raised by those health-check tasks.

> If the health monitoring takes a while to run but eventually decides that we need to restart from an earlier checkpoint, how does that get implemented?

Health checks run every few seconds at the moment, so the decisions happen pretty quickly after Ray detects that a node/actor has died. We also handle more extreme edge cases where the underlying RPC times out randomly, in which case we'll wait for some configurable time before restarting from the latest checkpoint (see the sketch below).

> (not just looking at lots of metrics, but also running evals)

Oh, maybe this part was a bit misleading. At the moment, "health and training progress" just refers to the underlying actor/node health and reported checkpoints, rather than the health of the training job itself (we do NOT implement anything to track how well things are converging, how validation metrics are doing, etc.).
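
For illustration, a schematic sketch of the polling/restart loop described above; all class, method, and parameter names here are hypothetical stand-ins, not the actual Ray Train internals:

```python
import time

class WorkerGroupError(Exception):
    """Raised when a worker health check fails or times out (hypothetical)."""

def control_loop(worker_group, ckpt_manager,
                 poll_interval_s=2.0, rpc_timeout_s=30.0):
    # Poll worker health every few seconds; on a failure or RPC timeout,
    # restart the whole group from the latest reported checkpoint.
    while True:
        try:
            statuses = worker_group.poll_status(timeout_s=rpc_timeout_s)
        except WorkerGroupError:
            worker_group.restart(checkpoint=ckpt_manager.latest_checkpoint())
            continue
        if all(status.finished for status in statuses):
            break
        time.sleep(poll_interval_s)
```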

def train_fn_per_worker(config: dict):
    # Equivalent behavior that is explicit and more flexible.
    checkpoint = (
        ray.train.get_checkpoint()
        or config.get("resume_from_checkpoint")
    )

Minor, but isn't config.get("resume_from_checkpoint") enough? Why do you also need ray.train.get_checkpoint()?

@justinvyu (author) replied:

ray.train.get_checkpoint is still needed if you want to support fault tolerance in addition to a checkpoint to start fine-tuning from initially. Otherwise, if a node fails, training always gets reset to the initial checkpoint, throwing away all fine-tuning progress (see the sketch below).
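
For illustration, a minimal sketch of that pattern; `load_state` is a hypothetical helper that restores model/optimizer state from a checkpoint directory:

```python
import ray.train

def train_fn_per_worker(config: dict):
    # Prefer the checkpoint reported during this run (fault-tolerance restore);
    # otherwise fall back to the user-provided checkpoint to fine-tune from.
    checkpoint = (
        ray.train.get_checkpoint()
        or config.get("resume_from_checkpoint")
    )
    start_epoch = 0
    if checkpoint is not None:
        with checkpoint.as_directory() as ckpt_dir:
            start_epoch = load_state(ckpt_dir)  # hypothetical helper
    for epoch in range(start_epoch, config["num_epochs"]):
        ...  # train one epoch, then report metrics/checkpoints
```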

@pcmoritz merged commit 1459667 into main on Dec 14, 2024 (1 check passed)
justinvyu pushed a commit to ray-project/ray that referenced this pull request Dec 23, 2024
…ne API Revamp" REP (#49376)

Ray Tune and Ray Train have been tightly coupled since Ray 2.0, when Ray
Tune became the common execution engine for both libraries.

Ray Train execution invokes Tune’s execution logic under the hood, which
leads to a complex, layered system. The original intention behind this
was to increase the interoperability of the two libraries, but the
dependency of Ray Train on Ray Tune has led to many usability and
stability issues, and it has stalled feature development.

ray-project/enhancements#57 proposed a much
clearer design to improve the **Usability**, **Extensibility**,
**Interoperability**, and **Testability**.

This PR contains the implementation of the above REP for the revamped
Ray Train. This implementation is contained in the `python/ray/train/v2`
directory. These changes pave the way for improved feature development
and enhanced user experience. Please refer to the REP for details on the
design, as well as the remaining changes which will be added shortly in
follow-up PRs.

---------

Signed-off-by: Hongpeng Guo <[email protected]>
justinvyu added a commit to ray-project/ray that referenced this pull request Jan 4, 2025
…49317)

This PR adds the Tune APIs proposed in
ray-project/enhancements#57, which are mostly
just wrappers that pass through to existing Ray Train V1
classes/methods.

Deprecation warnings are added for methods that need to have their
import changed, when using the old `ray.train` imports.

---------

Signed-off-by: Justin Yu <[email protected]>
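
For illustration, the import migration that triggers those deprecation warnings looks roughly like this; the classes shown are examples of the shared APIs:

```python
# Before (deprecated for Tune users): shared classes imported from ray.train.
from ray.train import RunConfig, CheckpointConfig

# After: Ray Tune exposes its own top-level copies of these APIs.
from ray.tune import RunConfig, CheckpointConfig
```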
justinvyu added a commit to ray-project/ray that referenced this pull request Jan 6, 2025
…rain driver (#49519)

To use the new revamped Ray Train proposed in
ray-project/enhancements#57, users should set
the `RAY_TRAIN_V2_ENABLED=1` feature flag environment variable on their
job driver. However, if using Ray Tune to launch Ray Train jobs, this
environment variable does not get propagated from the driver process to
the Ray actor that is now acting as the Ray Train driver process. This
PR propagates this environment variable automatically.

Signed-off-by: Justin Yu <[email protected]>
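
For reference, a minimal sketch of opting in on the job driver; the flag name comes from the PR above, and setting it before importing `ray.train` is assumed here to be the safe ordering:

```python
import os

# Opt in to the revamped Ray Train (V2) for this job driver process.
os.environ["RAY_TRAIN_V2_ENABLED"] = "1"

import ray.train  # imported after the flag is set
```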
justinvyu added a commit to ray-project/ray that referenced this pull request Jan 24, 2025
Populates all the deprecation warnings in V2 code and links to the
migration issue: #49454

This covers all the dropped APIs that are mentioned in the REP:
ray-project/enhancements#57

---------

Signed-off-by: Justin Yu <[email protected]>
justinvyu added a commit to ray-project/ray that referenced this pull request Feb 8, 2025
#50322)

Enable deprecation and migration messages for the API changes listed
here: ray-project/enhancements#57

---------

Signed-off-by: Justin Yu <[email protected]>
justinvyu added a commit to ray-project/ray that referenced this pull request Feb 14, 2025
…0435)

#49317 initiated the decoupling
of Ray Train and Ray Tune top-level APIs. This PR updates all of the
internal usage in Ray Tune examples and tests to switch from `ray.air`
(super outdated) and `ray.train` imports to `ray.tune` imports instead.

See ray-project/enhancements#57 for context
around the changes.

---------

Signed-off-by: Justin Yu <[email protected]>