
Commit

Merge pull request #2084 from mohantym:patch-3
PiperOrigin-RevId: 452657099
copybara-github committed Jun 3, 2022
2 parents c89eca0 + 27d363e commit 0dce67e
Showing 1 changed file with 4 additions and 4 deletions.
site/en/tutorials/distribute/multi_worker_with_keras.ipynb: 4 additions & 4 deletions
@@ -90,7 +90,7 @@
"* _Synchronous training_, where the steps of training are synced across the workers and replicas, such as `tf.distribute.MirroredStrategy`, `tf.distribute.TPUStrategy`, and `tf.distribute.MultiWorkerMirroredStrategy`. All workers train over different slices of input data in sync, and aggregating gradients at each step.\n",
"* _Asynchronous training_, where the training steps are not strictly synced, such as `tf.distribute.experimental.ParameterServerStrategy`. All workers are independently training over the input data and updating variables asynchronously.\n",
"\n",
"If you are looking for multi-worker synchronous training without TPU, then `tf.distribute.MultiWorkerMirroredStrategy` is your choice. It creates copies of all variables in the model's layers on each device across all workers. It uses `CollectiveOps`, a TensorFlow op for collective communication, to aggregate gradients and keeps the variables in sync. For those interested, check out the `tf.distribute.experimental.CommunicationOptions` parameter for the collective implementation options we are providing.\n",
"If you are looking for multi-worker synchronous training without TPU, then `tf.distribute.MultiWorkerMirroredStrategy` is your choice. It creates copies of all variables in the model's layers on each device across all workers. It uses `CollectiveOps`, a TensorFlow op for collective communication, to aggregate gradients and keeps the variables in sync. For those interested, check out the `tf.distribute.experimental.CommunicationOptions` parameter for the collective implementation options.\n",
"\n",
"For an overview of `tf.distribute.Strategy` APIs, refer to [Distributed training in TensorFlow](../../guide/distributed_training.ipynb)."
]
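The hunk above mentions `tf.distribute.MultiWorkerMirroredStrategy` and the `tf.distribute.experimental.CommunicationOptions` parameter. Below is a minimal sketch, not part of this commit, of how the two are typically combined; the NCCL choice, the model architecture, and the hyperparameters are illustrative assumptions only.

```python
import tensorflow as tf

# Optionally pick a collective implementation (NCCL here, an assumption for
# GPU workers); the default AUTO lets TensorFlow choose.
communication_options = tf.distribute.experimental.CommunicationOptions(
    implementation=tf.distribute.experimental.CommunicationImplementation.NCCL)

strategy = tf.distribute.MultiWorkerMirroredStrategy(
    communication_options=communication_options)

with strategy.scope():
  # Variables created under the scope are mirrored on each device across all
  # workers and kept in sync via collective ops.
  model = tf.keras.Sequential([
      tf.keras.layers.Flatten(input_shape=(28, 28)),
      tf.keras.layers.Dense(64, activation='relu'),
      tf.keras.layers.Dense(10)
  ])
  model.compile(
      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
      optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
      metrics=['accuracy'])
```

In a real multi-worker job, each worker would also need a `TF_CONFIG` environment variable describing the cluster before the strategy is created.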
@@ -843,7 +843,7 @@
"\n",
"A repeated dataset (by calling `tf.data.Dataset.repeat`) is recommended for evaluation.\n",
"\n",
"Alternatively, you can also create another task that periodically reads checkpoints and runs the evaluation. This is what Estimator does. But this is not a recommended way to perform evaluation and thus its details are omitted."
"Alternatively, you can also create another task that periodically reads checkpoints and runs the evaluation. This is what an Estimator does. But this is not a recommended way to perform evaluation and thus its details are omitted."
]
},
{
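As a companion to the evaluation note in the hunk above, here is a rough sketch of passing a repeated dataset as validation data with an explicit `validation_steps`; the placeholder validation arrays, batch size, and step counts are assumptions, and `model` and `train_dataset` are carried over from the earlier sketch.

```python
import numpy as np
import tensorflow as tf

# Placeholder validation data; in practice this would be the real
# validation split for the multi-worker job.
x_val = np.random.random((320, 28, 28)).astype('float32')
y_val = np.random.randint(0, 10, size=(320,))

eval_batch_size = 32
# Repeat the dataset so it never runs out of data mid-evaluation.
eval_dataset = (tf.data.Dataset.from_tensor_slices((x_val, y_val))
                .batch(eval_batch_size)
                .repeat())

# With a repeated dataset, Keras must be told how many batches make up
# one evaluation pass.
model.fit(train_dataset,
          epochs=3,
          steps_per_epoch=70,
          validation_data=eval_dataset,
          validation_steps=len(x_val) // eval_batch_size)
```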
@@ -892,7 +892,7 @@
"\n",
"When a worker becomes unavailable, other workers will fail (possibly after a timeout). In such cases, the unavailable worker needs to be restarted, as well as other workers that have failed.\n",
"\n",
"Note: Previously, the `ModelCheckpoint` callback provided a mechanism to restore the training state upon a restart from a job failure for multi-worker training. The TensorFlow team are introducing a new [`BackupAndRestore`](#scrollTo=kmH8uCUhfn4w) callback, which also adds the support to single-worker training for a consistent experience, and removed the fault tolerance functionality from existing `ModelCheckpoint` callback. From now on, applications that rely on this behavior should migrate to the new `BackupAndRestore` callback."
"Note: Previously, the `ModelCheckpoint` callback provided a mechanism to restore the training state upon a restart from a job failure for multi-worker training. The TensorFlow team is introducing a new [`BackupAndRestore`](#scrollTo=kmH8uCUhfn4w) callback, which also adds the support to single-worker training for a consistent experience, and removed the fault tolerance functionality from existing `ModelCheckpoint` callback. From now on, applications that rely on this behavior should migrate to the new `BackupAndRestore` callback."
]
},
{
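For reference, a minimal sketch of the `BackupAndRestore` callback that the note above points to (available as `tf.keras.callbacks.BackupAndRestore` in recent TensorFlow releases); the backup directory and step counts are placeholders, and `model` and `train_dataset` are assumed from the earlier sketches.

```python
import tensorflow as tf

callbacks = [
    # The callback writes temporary checkpoints to backup_dir; if a worker is
    # restarted after a failure, training resumes from the last saved epoch.
    tf.keras.callbacks.BackupAndRestore(backup_dir='/tmp/backup')
]

model.fit(train_dataset,
          epochs=3,
          steps_per_epoch=70,
          callbacks=callbacks)
```

Once training finishes successfully, the temporary backup files are deleted automatically.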
@@ -907,7 +907,7 @@
"\n",
"The `ModelCheckpoint` callback can still be used to save checkpoints. But with this, if training was interrupted or successfully finished, in order to continue training from the checkpoint, the user is responsible to load the model manually.\n",
"\n",
"Optionally the user can choose to save and restore model/weights outside `ModelCheckpoint` callback."
"Optionally, users can choose to save and restore model/weights outside `ModelCheckpoint` callback."
]
},
{
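To illustrate the manual-restore responsibility described above, here is a rough sketch using `ModelCheckpoint` together with `tf.train.latest_checkpoint`; the checkpoint directory and filename pattern are assumptions, and `model` and `train_dataset` are carried over from the earlier sketches.

```python
import tensorflow as tf

checkpoint_dir = '/tmp/ckpt'  # placeholder directory
checkpoint_path = checkpoint_dir + '/weights-{epoch:02d}'

# Save only the weights at the end of every epoch.
model.fit(train_dataset,
          epochs=3,
          steps_per_epoch=70,
          callbacks=[tf.keras.callbacks.ModelCheckpoint(
              filepath=checkpoint_path, save_weights_only=True)])

# After an interruption (or to keep training later), the user reloads the
# latest saved weights manually before calling fit() again.
latest = tf.train.latest_checkpoint(checkpoint_dir)
if latest:
  model.load_weights(latest)
```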