Fix bad interaction with GcsFileSystem caching #1225
The bad caching behavior that was on by default was removed in tensorflow/tensorflow@e43b946 (TF 1.15+) and replaced with a normal readahead cache that should be fine for TensorBoard usage, which also renders the env var we're setting a no-op. We might as well still set the env var to protect against bad behavior in case users are using recent TensorBoard with TF 1.9-1.14, since it's harmless to do so. I'm going to go ahead and close this issue; someday when we no longer care about old versions of TF at all, we can remove the env var entirely.
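A minimal sketch of the env-var approach, assuming the `GCS_READ_CACHE_MAX_SIZE_MB` knob exposed by TF 1.x's `GcsFileSystem` (the exact variable TensorBoard sets is not named in this thread, so treat the name as an assumption):

```python
# Sketch: cap the GcsFileSystem read cache at 0 MB, effectively disabling it,
# before TensorFlow is imported. GCS_READ_CACHE_MAX_SIZE_MB is a knob read by
# TF 1.x's GCS filesystem; whether TensorBoard sets exactly this variable is
# an assumption here.
import os

os.environ.setdefault("GCS_READ_CACHE_MAX_SIZE_MB", "0")

import tensorflow as tf  # must come after the env var is set

# TF 1.x-style read through tf.gfile; the bucket/object is a placeholder.
with tf.gfile.GFile("gs://some-public-bucket/logdir/README", "rb") as f:
    print(f.read(12))
```

On TF 1.15+ this is a no-op, as noted above, since the pathological cache was replaced with a plain readahead cache.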
Summary: This patch implements the extent of the Google Cloud Storage protocol that TensorBoard needs: list objects in a bucket with a given prefix, and read partial contents of an object. It turns out to be really easy. For comparison, [TensorFlow also rolls its own GCS client][tf]. Theirs is more complex because it needs to handle writable files and support general-purpose caching patterns. By contrast, we have a simple one-pass read pattern and already assume that files are append-only, so we avoid both the complexity and pathological interactions like #1225. For now, this only serves public buckets and objects. Authentication is also easy (and doesn’t require crypto or anything complicated), but, for ease of review, we defer it to a future patch.

[tf]: https://github.com/tensorflow/tensorflow/tree/r2.4/tensorflow/core/platform/cloud

Test Plan: Included a simple client that supports `gsutil ls` and `gsutil cat`. Run with `RUST_LOG=debug cargo run --release --bin gsutil` and more args:

- `ls tensorboard-bench-logs` to list all 33K objects in the bucket, across 34 pages of list operations (3.3s on my machine);
- `ls tensorboard-bench-logs --prefix mnist/` to list just a single logdir, which should be much faster (0.1 seconds on my machine, which includes setting up the keep-alive connection);
- `ls tensorboard-bench-logs --prefix nopenope/` to check the case where there are no matching results;
- `cat tensorboard-bench-logs mnist/README --to=11` to print the first 12 bytes (`Range: bytes=0-11` inclusive) of an object;
- `cat tensorboard-bench-logs mnist/README --from=9999` to print nothing, since the object is shorter than 9999 bytes.

wchargin-branch: rust-gcs-client
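For illustration only (this is not the Rust code from the patch), a rough sketch of the two GCS JSON API operations the commit message describes: listing objects under a prefix with pagination, and a ranged read of object media. It is unauthenticated, so it only works against publicly readable buckets and objects.

```python
# Sketch of the subset of the GCS protocol described above, via the public
# JSON API. Unauthenticated: public buckets/objects only.
import json
import urllib.parse
import urllib.request

API = "https://www.googleapis.com/storage/v1"


def list_objects(bucket, prefix=""):
    """Yield object names under `prefix`, following nextPageToken pagination."""
    page_token = None
    while True:
        params = {"prefix": prefix}
        if page_token:
            params["pageToken"] = page_token
        url = "%s/b/%s/o?%s" % (API, bucket, urllib.parse.urlencode(params))
        with urllib.request.urlopen(url) as resp:
            body = json.load(resp)
        for item in body.get("items", []):
            yield item["name"]
        page_token = body.get("nextPageToken")
        if not page_token:
            return


def read_range(bucket, name, start, end):
    """Read bytes [start, end] (inclusive) of an object, like `--from`/`--to`."""
    url = "%s/b/%s/o/%s?alt=media" % (API, bucket, urllib.parse.quote(name, safe=""))
    req = urllib.request.Request(url, headers={"Range": "bytes=%d-%d" % (start, end)})
    with urllib.request.urlopen(req) as resp:
        return resp.read()


# Usage mirroring the test plan:
#   for name in list_objects("tensorboard-bench-logs", prefix="mnist/"): print(name)
#   print(read_range("tensorboard-bench-logs", "mnist/README", 0, 11))
```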
TensorBoard reads log directories using TensorFlow's `tf.gfile` API, which has built-in support for Google Cloud Storage (GCS) via `GcsFileSystem`. Unfortunately, in TF 1.4+ this filesystem added a shared LRU cache whose default configuration behaves pathologically badly for TensorBoard's read pattern (specifically, interleaved sequential consumption of 3 or more files that may be receiving newly appended data). This can cause various issues when running TensorBoard against a GCS logdir.
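As a rough sketch (not TensorBoard's actual loading code, and with hypothetical paths), the problematic access pattern looks like this: several append-only event files, each consumed sequentially, with reads interleaved across files.

```python
# Simplified illustration of the access pattern that trips up the shared LRU
# cache: round-robin, sequential reads from several append-only files.
# Paths and chunk size are placeholders; TF 1.x tf.gfile API.
import tensorflow as tf

event_files = [  # hypothetical GCS paths
    "gs://my-bucket/logdir/run1/events.out.tfevents.0",
    "gs://my-bucket/logdir/run2/events.out.tfevents.0",
    "gs://my-bucket/logdir/run3/events.out.tfevents.0",
]
CHUNK = 1 << 20  # 1 MiB per read

readers = [(path, tf.gfile.GFile(path, "rb")) for path in event_files]
while readers:
    still_open = []
    for path, f in readers:
        data = f.read(CHUNK)  # each file is read sequentially...
        if data:
            still_open.append((path, f))  # ...but reads interleave across files
    readers = still_open
```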
This will be fixed in the 1.9 release of TensorBoard, but only when it is used with TensorFlow 1.9+. GCS logdir users who can update to TensorFlow 1.9 should do so.
If you cannot use TF 1.9, we recommend avoiding GCS logdirs except from TensorBoard instances running within the same Google Cloud Platform location, where GCS egress traffic is free of charge. For example, you can run TensorBoard on a GCE instance in the same region as your GCS bucket, and optionally port forward using SSH if you want to continue to access TensorBoard at `localhost:6006`.