rust: add GCS listing and reading #4645

Merged: wchargin merged 4 commits into master from wchargin-rust-gcs-client on Feb 4, 2021

wchargin (Contributor) commented Feb 3, 2021

Summary:
This patch implements the extent of the Google Cloud Storage protocol
that TensorBoard needs: list objects in a bucket with a given prefix,
and read partial contents of an object. It turns out to be really easy.
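As a rough illustration of the listing half, here is a minimal sketch of following pagination until the server stops returning a continuation token. It is not the code from this patch: `ListPage`, `list_all`, and the `fetch_page` callback are hypothetical names, and the callback stands in for the actual HTTP request so that the example stays free of external crates. The only detail taken from the GCS JSON API is that a list response carries its results in `items` and, when more results remain, a `nextPageToken`.

```rust
/// One page of a list response. The fields mirror `items` and
/// `nextPageToken` from the GCS JSON API; the struct itself is hypothetical.
#[derive(Debug, Clone, PartialEq)]
pub struct ListPage {
    pub names: Vec<String>,
    pub next_page_token: Option<String>,
}

/// Collects object names across all pages.
///
/// `fetch_page` is an assumed stand-in for the HTTP call: it receives the
/// page token to send (`None` for the first page) and returns one page.
pub fn list_all<F, E>(mut fetch_page: F) -> Result<Vec<String>, E>
where
    F: FnMut(Option<&str>) -> Result<ListPage, E>,
{
    let mut names = Vec::new();
    let mut token: Option<String> = None;
    loop {
        let page = fetch_page(token.as_deref())?;
        names.extend(page.names);
        match page.next_page_token {
            // More results remain: request the next page with this token.
            Some(next) => token = Some(next),
            // No token means this was the last page.
            None => return Ok(names),
        }
    }
}
```

Passing the fetcher as a closure keeps the loop testable without a network connection; a real client would issue one request per iteration over a shared keep-alive connection.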

For comparison, TensorFlow also rolls its own GCS client. Theirs
is more complex because it needs to handle writable files and support
general-purpose caching patterns. By contrast, we have a simple one-pass
read pattern and already assume that files are append-only, so we avoid
both the complexity and pathological interactions like #1225.
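To make the "one-pass, append-only" assumption concrete, the sketch below tracks a single read offset and only ever asks for bytes past it, so no cache or invalidation logic is needed. This is an illustration under stated assumptions, not the reader implemented in this patch: `TailReader` and the `read_from` callback are hypothetical, and the callback stands in for a ranged read that returns whatever bytes currently exist from the given offset onward.

```rust
/// Tracks how far into an append-only object we have read so far.
/// Hypothetical type for illustration only.
pub struct TailReader {
    offset: u64,
}

impl TailReader {
    /// Starts reading from the beginning of the object.
    pub fn new() -> Self {
        TailReader { offset: 0 }
    }

    /// Fetches bytes that have not been seen yet and advances the offset.
    ///
    /// `read_from` is an assumed stand-in for a ranged read starting at the
    /// given byte offset. Because the object is append-only, bytes before
    /// the offset never change and are never requested again.
    pub fn poll<F>(&mut self, mut read_from: F) -> Vec<u8>
    where
        F: FnMut(u64) -> Vec<u8>,
    {
        let chunk = read_from(self.offset);
        self.offset += chunk.len() as u64;
        chunk
    }

    /// Number of bytes consumed so far.
    pub fn offset(&self) -> u64 {
        self.offset
    }
}
```

A poll that finds no new data simply returns an empty chunk and leaves the offset unchanged, which is the whole of the "cache" state in this model.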

For now, this only serves public buckets and objects. Authentication is
also easy (and doesn’t require crypto or anything complicated), but, for
ease of review, we defer it to a future patch.

Test Plan:
Included a simple client that supports `gsutil ls` and `gsutil cat`. Run
with `RUST_LOG=debug cargo run --release --bin gsutil` and more args:

  • `ls tensorboard-bench-logs` to list all 33K objects in the bucket,
    across 34 pages of list operations (3.3 seconds on my machine);
  • `ls tensorboard-bench-logs --prefix mnist/` to list just a single
    logdir, which should be much faster (0.1 seconds on my machine,
    which includes setting up the keep-alive connection);
  • `ls tensorboard-bench-logs --prefix nopenope/` to check the case
    where there are no matching results;
  • `cat tensorboard-bench-logs mnist/README --to=11` to print the first
    12 bytes (`Range: bytes=0-11` inclusive) of an object;
  • `cat tensorboard-bench-logs mnist/README --from=9999` to print
    nothing, since the object is shorter than 9999 bytes.
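The two `cat` cases hinge on HTTP byte ranges being inclusive at both ends. The sketch below shows one way the `--from` and `--to` arguments could map onto a `Range` header value; `range_header` and `range_len` are hypothetical helpers for illustration and are not taken from the patch.

```rust
/// Builds the value of an HTTP `Range` header from optional inclusive
/// bounds. Returns `None` when neither bound is given, meaning the whole
/// object should be read. Hypothetical helper for illustration.
pub fn range_header(from: Option<u64>, to: Option<u64>) -> Option<String> {
    match (from, to) {
        (None, None) => None,
        // Open-ended range: everything from `from` to the end of the object.
        (Some(from), None) => Some(format!("bytes={}-", from)),
        // Closed range: a missing lower bound defaults to the first byte.
        (from, Some(to)) => Some(format!("bytes={}-{}", from.unwrap_or(0), to)),
    }
}

/// Number of bytes covered by an inclusive range, or `None` if `to < from`.
pub fn range_len(from: u64, to: u64) -> Option<u64> {
    to.checked_sub(from).map(|n| n + 1)
}
```

With these, `--to=11` yields `bytes=0-11`, which covers 12 bytes, and `--from=9999` yields `bytes=9999-`. When the start of a range lies past the end of the object, an HTTP server may answer with 416 Range Not Satisfiable, which a client can treat as an empty read.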

wchargin-branch: rust-gcs-client

Summary:
The [`reqwest`] crate provides a high-level HTTP interface, similar in
spirit to Python’s `requests` package. It has built-in support for JSON
serialization and deserialization via `serde` (which we already use) and
is based on the `hyper` stack (which we also already use), so it should
fit in nicely. We’ll use it to make requests to GCS.

[`reqwest`]: https://crates.io/crates/reqwest

Test Plan:
It builds: `bazel build //third_party/rust:reqwest`.

wchargin-branch: rust-dep-reqwest
wchargin-source: d05d3de3a3d44974e282574ec965ad6bd0024246
Summary:
This patch implements the extent of the Google Cloud Storage protocol
that TensorBoard needs: list objects in a bucket with a given prefix,
and read partial contents of an object. It turns out to be really easy.

For comparison, [TensorFlow also rolls its own GCS client][tf]. Theirs
is more complex because it needs to handle writable files and support
general-purpose caching patterns. By contrast, we have a simple one-pass
read pattern and already assume that files are append-only, so we avoid
both the complexity and pathological interactions like #1225.

For now, this only serves public buckets and objects. Authentication is
also easy (and doesn’t require crypto or anything complicated), but, for
ease of review, we defer it to a future patch.

[tf]: https://github.com/tensorflow/tensorflow/tree/r2.4/tensorflow/core/platform/cloud

Test Plan:
Included a simple client that supports `gsutil ls` and `gsutil cat`. Run
with `RUST_LOG=debug cargo run --release --bin gsutil` and more args:

  - `ls tensorboard-bench-logs` to list all 33K objects in the bucket,
    across 34 pages of list operations (3.3 seconds on my machine);
  - `ls tensorboard-bench-logs --prefix mnist/` to list just a single
    logdir, which should be much faster (0.1 seconds on my machine,
    which includes setting up the keep-alive connection);
  - `cat tensorboard-bench-logs mnist/README --to=11` to print the first
    12 bytes (`Range: bytes=0-11` inclusive) of an object;
  - `cat tensorboard-bench-logs mnist/README --from=9999` to print
    nothing, since the object is shorter than 9999 bytes.

wchargin-branch: rust-gcs-client
wchargin-source: d9e404df57ecf5ee80089b810835a241084ffbc8
@wchargin wchargin added type:feature core:rustboard //tensorboard/data/server/... labels Feb 3, 2021
@google-cla google-cla bot added the cla: yes label Feb 3, 2021
@wchargin wchargin requested a review from stephanwlee February 3, 2021 04:40
wchargin-branch: rust-gcs-client
wchargin-source: 48943e73d17f4dadaeb7aa83b6eeaa3ec8f78707
Base automatically changed from wchargin-rust-dep-reqwest to master February 3, 2021 22:56
wchargin-branch: rust-gcs-client
wchargin-source: de3a575ea49828b24aabb92d3d6cfa8150237f1d
@wchargin wchargin merged commit cd69a8c into master Feb 4, 2021
@wchargin wchargin deleted the wchargin-rust-gcs-client branch February 4, 2021 01:32