Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tensorstore key value store driver not Retrying on 429 Responses with S3 API #194

Open
AdityaCohere opened this issue Sep 3, 2024 · 8 comments

Comments

@AdityaCohere
Copy link

I'm using Tensorstore's key value store driver with OCI object storage using their S3 compatibility layer and am seeing that when the OCI Object Storage service sends 429s Tensorstore fails immediately without any retries. This was tested with debug logging enabled which also didn't show any retries happening either with the default value of 32 or with s3_request_retries set to a custom number.
I also tried testing this @sjperkins 's PR here to see if that may enable retries to work here however I'm hitting bazel build issues on a M1 Mac of:

configurable attribute "content" in @com_github_aws_c_common//:write_config_h doesn't match this configuration. Would a default condition help?

I was wondering if there was anything I could do to debug the lack of retries further or any modifications I could make to force retries.

@laramiel
Copy link
Collaborator

laramiel commented Sep 3, 2024

The retry decision happens here (or one of the other methods in the file):

absl::Status AwsHttpResponseToStatus(const HttpResponse& response,

You could turn on logging. TENSORSTORE_VERBOSE_LOGGING=s3=1.

@sjperkins
Copy link
Contributor

Unfortunately #149 doesn't integrate with any kvstore at present.

@AdityaCohere
Copy link
Author

The retry decision happens here (or one of the other methods in the file):

absl::Status AwsHttpResponseToStatus(const HttpResponse& response,

You could turn on logging. TENSORSTORE_VERBOSE_LOGGING=s3=1.

Yea, I've tried turning on Logging as well with no output specifying any retries. I see that the 429 response is received and nothing beyond that other than a failure.
I will repro this and get the debug logs

@AdityaCohere
Copy link
Author

Unfortunately #149 doesn't integrate with any kvstore at present.

Ah, this is unfortunate to hear, would it non-trivial to integrate?

@sjperkins
Copy link
Contributor

Unfortunately #149 doesn't integrate with any kvstore at present.

Ah, this is unfortunate to hear, would it non-trivial to integrate?

It'd take some work, I guess the amount of code required would be smaller than the current s3 kvstore as it could defer to the C++ S3 SDK.

However, the C++ SDK is a fairly heavy dependency: the S3 C CRT looks much lighter. That's next, when I find the time.

@laramiel
Copy link
Collaborator

laramiel commented Sep 4, 2024

You could turn on logging. TENSORSTORE_VERBOSE_LOGGING=s3=1.

Yea, I've tried turning on Logging as well with no output specifying any retries. I see that the 429 response is received and nothing beyond that other than a failure. I will repro this and get the debug logs

With s3=1 you should see a line from s3_key_value_store.cc:435 ReadTask or s3_key_value_store.cc:717 WriteTask or such. Could you attach those lines?

Also, you could attempt to run localstack_test against OCI object storage.

git clone https://github.com/google/tensorstore.git
cd tenstorstore
bazelisk.py build tensorstore/kvstore/s3:localstack_test
./bazel-bin/tensorstore/kvstore/s3/localstack_test --tensorstore_verbose_logging=all=2 --localstack_endpoint=...  <other flags>

@AdityaCohere
Copy link
Author

Hey, I've tried turning on logging and I don't seem to see any logs regarding s3 being printed.
I get a rate limited error but no read/write tasks.
I've tried TENSORSTORE_VERBOSE_LOGGING=s3=1 as well as TENSORSTORE_VERBOSE_LOGGING=all
Is there anything else I can try?

@laramiel
Copy link
Collaborator

Ok, so there was an issue that tensorstore logging when used from python didn't initialize the absl logging level in a way which provides output. This was fixed by 041a999 and I will make another release soon. I would expect that logging in gunit tests would work, though you may need gunit specific flags for that such as --gmock_verbose=info

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants