Fix bad interaction with GcsFileSystem caching #1225

Closed

nfelt opened this issue Jun 4, 2018 · 1 comment
Comments

nfelt (Contributor) commented Jun 4, 2018

TensorBoard reads log directories using TensorFlow's tf.gfile API, which has built-in support for Google Cloud Storage (GCS) via GcsFileSystem. Unfortunately, in TF 1.4+ this filesystem added a shared LRU cache whose default configuration behaves pathologically for TensorBoard's read pattern (specifically, interleaved sequential consumption of three or more files that may be receiving newly appended data).
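For illustration, here is a minimal sketch of that read pattern, assuming a hypothetical `gs://` logdir path (this is not TensorBoard's actual reader code): TensorBoard round-robins over several append-only event files, reading each one sequentially and periodically polling for new data, which defeats a small shared LRU block cache because each file's reads evict the blocks cached for the others.

```python
# Sketch only: hypothetical logdir path and chunk size, not TensorBoard's
# actual reader. Uses the TF 1.x tf.gfile API named above (in TF 2.x the
# equivalent is tf.io.gfile).
import tensorflow as tf

LOGDIR = "gs://my-bucket/my-logdir"  # hypothetical bucket/path
CHUNK_SIZE = 1 << 20  # read 1 MiB at a time

readers = [
    tf.gfile.GFile(path, "rb")
    for path in tf.gfile.Glob(LOGDIR + "/*/events.out.tfevents.*")
]
for _ in range(10):  # a few polling rounds, for illustration
    for reader in readers:
        data = reader.read(CHUNK_SIZE)  # sequential read; b"" at current EOF
        # ... parse TFRecord-framed Event protos from `data`; an empty read
        # just means no new data has been appended to that file yet ...
```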

This can cause various issues when running TensorBoard against a GCS logdir.

This will be fixed in the 1.9 release of TensorBoard, but only when used with TensorFlow 1.9+. GCS logdir users who can update to TensorFlow 1.9 should do so.

If you cannot use TF 1.9, we recommend avoiding GCS logdirs except from TensorBoard instances running within the same Google Cloud Platform location, where GCS egress traffic is free of charge. For example, you can run TensorBoard on a GCE instance in the same region as your GCS bucket, and optionally port forward over SSH if you want to keep accessing TensorBoard at localhost:6006.

nfelt (Contributor, Author) commented Dec 17, 2019

The bad caching behavior that was on by default was removed in tensorflow/tensorflow@e43b946 (TF 1.15+) and replaced with a normal read-ahead cache that should be fine for TensorBoard usage; this also makes the env var we're setting a no-op.

We might as well keep setting the env var to protect against the bad behavior in case users run a recent TensorBoard with TF 1.9–1.14, since doing so is harmless.
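For reference, a minimal sketch of what that mitigation amounts to, under the assumption that the switch is the GCS read-cache environment variable named below (the exact name is an assumption, not confirmed here) and that it must be set before TensorFlow's GcsFileSystem is first used:

```python
# Sketch only: the exact environment variable name is an assumption.
# It must be in the environment before GcsFileSystem is initialized,
# which is why it is set before TensorFlow is imported.
import os

os.environ.setdefault("GCS_READ_CACHE_DISABLED", "1")

import tensorflow as tf  # noqa: E402  -- imported after the env var is set
```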

I'm going to go ahead and close this issue; someday when we no longer care about old versions of TF at all, we can remove the env var entirely.

nfelt closed this as completed Dec 17, 2019
wchargin added a commit that referenced this issue Feb 3, 2021
Summary:
This patch implements the extent of the Google Cloud Storage protocol
that TensorBoard needs: list objects in a bucket with a given prefix,
and read partial contents of an object. It turns out to be really easy.

For comparison, [TensorFlow also rolls its own GCS client][tf]. Theirs
is more complex because it needs to handle writable files and support
general-purpose caching patterns. By contrast, we have a simple one-pass
read pattern and already assume that files are append-only, so we avoid
both the complexity and pathological interactions like #1225.

For now, this only serves public buckets and objects. Authentication is
also easy (and doesn’t require crypto or anything complicated), but, for
ease of review, we defer it to a future patch.

[tf]: https://github.com/tensorflow/tensorflow/tree/r2.4/tensorflow/core/platform/cloud

Test Plan:
Included a simple client that supports `gsutil ls` and `gsutil cat`. Run
with `RUST_LOG=debug cargo run --release --bin gsutil` and more args:

  - `ls tensorboard-bench-logs` to list all 33K objects in the bucket,
    across 34 pages of list operations (3.3s on my machine);
  - `ls tensorboard-bench-logs --prefix mnist/` to list just a single
    logdir, which should be much faster (0.1 seconds on my machine,
    which includes setting up the keep-alive connection);
  - `cat tensorboard-bench-logs mnist/README --to=11` to print the first
    12 bytes (`Range: bytes=0-11` inclusive) of an object;
  - `cat tensorboard-bench-logs mnist/README --from=9999` to print
    nothing, since the object is shorter than 9999 bytes.

wchargin-branch: rust-gcs-client
wchargin-source: d9e404df57ecf5ee80089b810835a241084ffbc8
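For illustration alongside the commit above: a minimal Python sketch of the two operations it describes, listing objects under a prefix and reading a byte range from a public object via the GCS JSON API. The actual client in the patch is Rust, and the function names here (`list_objects`, `read_range`) are hypothetical.

```python
# Sketch only: a Python rendition of the two GCS operations described in
# the commit message (the real implementation is Rust). Public buckets
# only, no authentication, matching the commit's current scope.
import json
import urllib.error
import urllib.parse
import urllib.request

API = "https://storage.googleapis.com/storage/v1"


def list_objects(bucket, prefix=""):
    """Yield object names in `bucket` under `prefix`, following list pages."""
    page_token = None
    while True:
        params = {"prefix": prefix, "fields": "items(name),nextPageToken"}
        if page_token:
            params["pageToken"] = page_token
        url = f"{API}/b/{bucket}/o?{urllib.parse.urlencode(params)}"
        with urllib.request.urlopen(url) as resp:
            body = json.load(resp)
        for item in body.get("items", []):
            yield item["name"]
        page_token = body.get("nextPageToken")
        if not page_token:
            return


def read_range(bucket, name, start=0, end=None):
    """Read bytes [start, end] (inclusive, like `Range: bytes=0-11`) of an object."""
    object_path = urllib.parse.quote(name, safe="")
    url = f"{API}/b/{bucket}/o/{object_path}?alt=media"
    range_header = f"bytes={start}-{'' if end is None else end}"
    req = urllib.request.Request(url, headers={"Range": range_header})
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read()
    except urllib.error.HTTPError as e:
        if e.code == 416:  # requested range past end of object: nothing to read
            return b""
        raise


# Mirrors the test plan above, e.g.:
#   list(list_objects("tensorboard-bench-logs", prefix="mnist/"))
#   read_range("tensorboard-bench-logs", "mnist/README", 0, 11)
#   read_range("tensorboard-bench-logs", "mnist/README", start=9999)  # -> b""
```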
wchargin added a commit that referenced this issue Feb 4, 2021
Summary:
This patch implements the extent of the Google Cloud Storage protocol
that TensorBoard needs: list objects in a bucket with a given prefix,
and read partial contents of an object. It turns out to be really easy.

For comparison, [TensorFlow also rolls its own GCS client][tf]. Theirs
is more complex because it needs to handle writable files and support
general-purpose caching patterns. By contrast, we have a simple one-pass
read pattern and already assume that files are append-only, so we avoid
both the complexity and pathological interactions like #1225.

For now, this only serves public buckets and objects. Authentication is
also easy (and doesn’t require crypto or anything complicated), but, for
ease of review, we defer it to a future patch.

[tf]: https://github.com/tensorflow/tensorflow/tree/r2.4/tensorflow/core/platform/cloud

Test Plan:
Included a simple client that supports `gsutil ls` and `gsutil cat`. Run
with `RUST_LOG=debug cargo run --release --bin gsutil` and more args:

  - `ls tensorboard-bench-logs` to list all 33K objects in the bucket,
    across 34 pages of list operations (3.3s on my machine);
  - `ls tensorboard-bench-logs --prefix mnist/` to list just a single
    logdir, which should be much faster (0.1 seconds on my machine,
    which includes setting up the keep-alive connection);
  - `ls tensorboard-bench-logs --prefix nopenope/` to check the case
    where there are no matching results;
  - `cat tensorboard-bench-logs mnist/README --to=11` to print the first
    12 bytes (`Range: bytes=0-11` inclusive) of an object;
  - `cat tensorboard-bench-logs mnist/README --from=9999` to print
    nothing, since the object is shorter than 9999 bytes.

wchargin-branch: rust-gcs-client