Train + Tune API Revamp #57
Conversation
This is legendary! Thanks for putting everything together.
Will approve before getting public discussion feedback.
```python
trainer = TorchTrainer(
    train_fn_per_worker,
    ...,
    run_config=ray.train.RunConfig(...),
)
```
A few questions:

1. If there is nothing to restore, e.g., this is the first time the job is submitted, then `unique_run_name` will be `None` and the `TorchTrainer` will see that there is no checkpoint to load from and will not load anything?
2. What's the reason for removing the explicit `restore` call?
3. Why do I need the storage info in `RunConfig`? Couldn't that all just go into `get_checkpoint`?
1. That's right.
2. This is actually something I was debating back and forth about. This section outlines some of the issues with the existing `TorchTrainer.restore` API:
   - Re-implementing with exact API parity would require us to pickle the user code and Python objects again.
   - The explicit `Trainer.restore` API also had some UX friction (needing the user to manually concatenate their `storage_path` with the run `name`).
   - I took this chance to improve the usability by pulling the "restore path" from the `RunConfig` directly and avoiding all pickling. So, the `RunConfig(storage_path, name, storage_filesystem)` tuple is what uniquely defines a Ray Train run, and users should pass in a unique `name` per job rather than re-use the same `name` for multiple runs.
   - Using a colliding `name` runs into other problems, such as checkpoints being uploaded to the same directory with possible overwriting, so it's already unsupported and undefined behavior.
   - We can also definitely add a `restore_run_if_exists` flag to make the job restoration behavior explicit if users find this behavior too magical.
3. The storage info sets the location where checkpoints and Ray Train's driver state get saved. Without it, we can't load the state when restoring the run on a new Ray cluster, and we don't know what checkpoint to populate `get_checkpoint` with.
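To illustrate the behavior being proposed, here is a minimal sketch (the run name, storage bucket, and `train_fn_per_worker` are placeholders, and the automatic restore-from-`RunConfig` semantics are as described above rather than a finalized API):

```python
import ray.train
from ray.train.torch import TorchTrainer

# Hypothetical unique-per-job run name, e.g. derived from the job submission ID.
unique_run_name = "finetune-job-2025-01-01"

trainer = TorchTrainer(
    train_fn_per_worker,  # placeholder training function
    scaling_config=ray.train.ScalingConfig(num_workers=4, use_gpu=True),
    run_config=ray.train.RunConfig(
        # The (storage_path, name, storage_filesystem) tuple uniquely identifies the run.
        storage_path="s3://my-bucket/train-results",  # assumed bucket
        name=unique_run_name,
        # If this (storage_path, name) location already contains driver state from a
        # previous attempt of the same job, the run resumes from the latest checkpoint;
        # otherwise it starts fresh.
    ),
)
result = trainer.fit()
```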
* **Execution Control**:
  * It triggers the distributed training function across the workers in the `WorkerGroup` and shuts down the workers when training has either finished or failed.
* **Polling & Health Checking**:
  * The Controller regularly polls the workers to assess their health and training progress. Based on the worker statuses, it can handle errors and make scaling decisions.
This looks awesome. A few questions.
Training health monitoring is becoming more and more important and can be quite complex and compute intensive (not just looking at lots of metrics, but also running evals). Where does this logic live?
If the health monitoring takes a while to run but eventually decides that we need to restart from an earlier checkpoint, how does that get implemented?
> Where does this logic live?

This logic lives in the Train driver right now, where the controller actor periodically pings the worker actors with health-check tasks and handles any errors raised by those tasks.

> If the health monitoring takes a while to run but eventually decides that we need to restart from an earlier checkpoint, how does that get implemented?

Health checks run every few seconds at the moment, so the decision happens pretty quickly after Ray detects that a node/actor has died. We also handle more extreme edge cases where the underlying RPC times out randomly, in which case we'll wait for some configurable time before restarting from the latest checkpoint.

> (not just looking at lots of metrics, but also running evals)

Oh, maybe this part was a bit misleading. At the moment, "health and training progress" just refers to the underlying actor/node health and reported checkpoints, rather than the health of the training job itself (we do NOT implement anything to handle how well things are converging, how validation metrics are doing, etc.).
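For reference, the user-facing knob for this restart behavior is the failure config on the run. A hedged sketch of how that looks (the checkpoint-loading logic inside `train_fn_per_worker` is assumed to follow the `ray.train.get_checkpoint()` pattern shown elsewhere in this thread):

```python
import ray.train
from ray.train.torch import TorchTrainer

trainer = TorchTrainer(
    train_fn_per_worker,  # placeholder training function
    scaling_config=ray.train.ScalingConfig(num_workers=4, use_gpu=True),
    run_config=ray.train.RunConfig(
        # When the controller's health checks detect a dead worker/node, retry the
        # run (resuming from the latest reported checkpoint) up to 3 times.
        failure_config=ray.train.FailureConfig(max_failures=3),
    ),
)
```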
```python
def train_fn_per_worker(config: dict):
    # Equivalent behavior that is explicit and more flexible.
    checkpoint = (
        ray.train.get_checkpoint()
        or config.get("resume_from_checkpoint")
    )
```
Minor, but isn't `config.get("resume_from_checkpoint")` enough? Why do you also need `ray.train.get_checkpoint()`?
`ray.train.get_checkpoint` is still needed if you want to support fault tolerance, in addition to a checkpoint to start fine-tuning from initially. Otherwise, if a node fails, the training progress always gets reset to the initial checkpoint, throwing away all fine-tuning progress.
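To make the distinction concrete, here is a hedged sketch of the worker-side pattern (the `resume_from_checkpoint` config key and the `load_model_state` helper are illustrative, not part of the API): on the first run, `ray.train.get_checkpoint()` returns `None` and the worker falls back to the user-provided fine-tuning checkpoint; after a failure, it returns the latest reported checkpoint so fine-tuning progress isn't lost.

```python
import ray.train

def train_fn_per_worker(config: dict):
    # Prefer the latest checkpoint reported during this run (fault tolerance);
    # otherwise fall back to the user-provided checkpoint to fine-tune from.
    checkpoint = (
        ray.train.get_checkpoint()
        or config.get("resume_from_checkpoint")  # illustrative: a Checkpoint passed by the user
    )

    start_epoch = 0
    if checkpoint:
        with checkpoint.as_directory() as ckpt_dir:
            # `load_model_state` is a placeholder for user code that restores
            # model/optimizer state and returns the epoch to resume from.
            start_epoch = load_model_state(ckpt_dir)

    for epoch in range(start_epoch, config["num_epochs"]):
        ...  # training loop; periodically call ray.train.report(..., checkpoint=...)
```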
This REP was implemented and referenced across several follow-up PRs:

- #49376: Ray Tune and Ray Train have been tightly coupled since Ray 2.0, when Ray Tune became the common execution engine for both libraries. Ray Train execution invokes Tune’s execution logic under the hood, which leads to a complex, layered system. ray-project/enhancements#57 proposed a much clearer design to improve the **Usability**, **Extensibility**, **Interoperability**, and **Testability**. This PR contains the implementation of the REP for the revamped Ray Train, contained in the `python/ray/train/v2` directory.
- #49317: Adds the Tune APIs proposed in ray-project/enhancements#57, which are mostly just wrappers that pass through to existing Ray Train V1 classes/methods. Deprecation warnings are added for methods that need their import changed when using the old `ray.train` imports.
- #49519: To use the revamped Ray Train, users should set the `RAY_TRAIN_V2_ENABLED=1` feature flag environment variable on their job driver. However, when using Ray Tune to launch Ray Train jobs, this environment variable does not get propagated from the driver process to the Ray actor that now acts as the Ray Train driver process. This PR propagates the environment variable automatically.
- #49455: Populates all the deprecation warnings in V2 code and links to the migration issue (#49454). This covers all the dropped APIs mentioned in the REP.
- #50322: Enables deprecation and migration messages for the API changes listed in ray-project/enhancements#57.
- #50435: #49317 initiated the decoupling of the Ray Train and Ray Tune top-level APIs. This PR updates all internal usage in Ray Tune examples and tests to switch from `ray.air` (super outdated) and `ray.train` imports to `ray.tune` imports.
Summary
Ray Tune and Ray Train have been tightly coupled since Ray 2.0, when Ray Tune became the common execution engine for both libraries.
Ray Train execution invokes Tune’s execution logic under the hood, which leads to a complex, layered system.
The original intention behind this was to increase the interoperability of the two libraries, but the dependency of Ray Train on Ray Tune has led to many usability and stability issues, and it has stalled feature development.
The goal of these changes is to improve the **Usability**, **Extensibility**, **Interoperability**, and **Testability** of Ray Train and Ray Tune.