Opaque retry errors make debugging difficult #5923

Closed
drmorr0 opened this issue Jun 20, 2024 · 5 comments
Labels
enhancement Any new improvement worthy of an entry in the changelog

Comments

drmorr0 commented Jun 20, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I am trying to use object_store to write to the three major cloud provider storage solutions; when the put request fails (or, I assume, any other request), it enters a retry loop that looks like the following (with debug logs enabled):

2024-06-20T16:02:04.415522Z DEBUG /build/.build/cargo/registry/src/index.crates.io-6f17d22bba15001f/reqwest-0.12.5/src/connect.rs:497: starting new connection: https://s3.us-east-1.amazonaws.com/
2024-06-20T16:02:04.415898Z DEBUG /build/.build/cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-util-0.1.5/src/client/legacy/connect/dns.rs:122: resolving host="s3.us-east-1.amazonaws.com"
2024-06-20T16:02:04.449611Z DEBUG /build/.build/cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-util-0.1.5/src/client/legacy/connect/http.rs:631: connecting to 52.217.123.24:443
2024-06-20T16:02:04.529885Z DEBUG /build/.build/cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-util-0.1.5/src/client/legacy/connect/http.rs:634: connected to 52.217.123.24:443
2024-06-20T16:02:04.616961Z  INFO /build/.build/cargo/git/checkouts/arrow-rs-d677d14273e07e8b/49d128d/object_store/src/client/retry.rs:331: Encountered transport error backing off for 0.1 seconds, retry 1 of 10: error sending request for url (https://s3.us-east-1.amazonaws.com/XXXXX)
2024-06-20T16:02:04.718218Z DEBUG /build/.build/cargo/registry/src/index.crates.io-6f17d22bba15001f/reqwest-0.12.5/src/connect.rs:497: starting new connection: https://s3.us-east-1.amazonaws.com/
2024-06-20T16:02:04.718742Z DEBUG /build/.build/cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-util-0.1.5/src/client/legacy/connect/dns.rs:122: resolving host="s3.us-east-1.amazonaws.com"
2024-06-20T16:02:04.723407Z DEBUG /build/.build/cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-util-0.1.5/src/client/legacy/connect/http.rs:631: connecting to 52.216.54.104:443
2024-06-20T16:02:04.803010Z DEBUG /build/.build/cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-util-0.1.5/src/client/legacy/connect/http.rs:634: connected to 52.216.54.104:443
2024-06-20T16:02:04.891240Z  INFO /build/.build/cargo/git/checkouts/arrow-rs-d677d14273e07e8b/49d128d/object_store/src/client/retry.rs:331: Encountered transport error backing off for 0.15082853 seconds, retry 2 of 10: error sending request for url (https://s3.us-east-1.amazonaws.com/XXXXX)
...

There's no indication in there as to what actually went wrong or how to fix it. The only way I was able to get the needed information was by hacking my own fork of object_store that printed out the raw error message, and then I discovered that the Docker container I was running in didn't have the needed certificates installed.

Describe the solution you'd like
Users shouldn't have to fork and patch the repo to discover why their requests are failing. Can we add more logging output around retries?
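For concreteness, here is the kind of change I have in mind. This is purely an illustrative sketch, not object_store's actual retry code, and the names (log_retry, backoff, retry, max_retries) are made up:

use std::error::Error;
use std::time::Duration;

// Hypothetical helper: build "outer message: cause: ...: root cause" by
// walking Error::source(), so the retry log line carries the root cause.
fn log_retry(err: &dyn Error, backoff: Duration, retry: usize, max_retries: usize) {
    let mut chain = err.to_string();
    let mut source = err.source();
    while let Some(cause) = source {
        chain.push_str(": ");
        chain.push_str(&cause.to_string());
        source = cause.source();
    }
    tracing::info!(
        "Encountered transport error, backing off for {:?}, retry {} of {}: {}",
        backoff, retry, max_retries, chain
    );
}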

Additional context
I'm happy to submit a PR for this, but I've seen some (I think related) issues that have been closed or suggest that the current behaviour is as expected, so I wanted to submit the issue first and make sure that folks agree about the proposed fix.

Related links: #5345, #5607, #5013

drmorr0 added the enhancement label Jun 20, 2024
tustvold (Contributor) commented Jun 20, 2024

I believe the error will be returned once the max retries is exceeded, but I don't see any issue with making the log more verbose.

I would, however, suggest you adopt a slightly less pejorative tone.

drmorr0 (Author) commented Jun 20, 2024

Oh, uh, apologies. I'm not really sure what got interpreted as pejorative in there but it definitely wasn't intended!

Anyways, the only error I get back when max retries is exceeded is

Generic S3 error: Error after 10 retries in 3.848121029s, max_retries:10, retry_timeout:180s, source:error sending request for url (https://s3.us-east-1.amazonaws.com/XXXXX)

The information I needed to troubleshoot was the fact that there was an invalid certificate. This is what I got out of printing e.source():

Some(hyper_util::client::legacy::Error(Connect, Custom { kind: Other, error: Custom { kind: InvalidData, error: InvalidCertificate(UnknownIssuer) } }))

tustvold (Contributor) commented:

> Oh, uh, apologies. I'm not really sure what got interpreted as pejorative in there but it definitely wasn't intended!

I was referring to "Users shouldn't have to fork and patch the repo to discover why their requests are failing".

> This is what I got out of printing e.source():

Generally it is considered best practice to either include an error as a source or include it in the Display implementation. In this case reqwest/hyper is choosing the former option. To display an error in full you should walk the source chain and print each error in it. If you use something like anyhow, it will do this for you automatically when you print its Debug representation.

Now we blur this a little in some places for ergonomics reasons, but in general we want to leave it up to the caller to decide the level of verbosity they want.
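For illustration, a minimal example (assuming the anyhow crate; the error values below are made up, not the ones from this issue):

fn main() {
    // A made-up two-level error: a transport-style message wrapping a
    // certificate-style root cause, purely for illustration.
    let root = std::io::Error::new(std::io::ErrorKind::InvalidData, "invalid certificate (example)");
    let err = anyhow::Error::from(root).context("error sending request for url (example)");

    println!("{err}");   // Display: only the outermost message
    println!("{err:?}"); // Debug: the message plus a "Caused by:" list of its sources

    // Or walk the chain by hand (it yields the error itself, then each source):
    for cause in err.chain() {
        println!("-> {cause}");
    }
}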

> The information I needed to troubleshoot was the fact that there was an invalid certificate

Did you try enabling more verbose logging? Hyper is quite chatty at the debug and trace levels, and I would have expected it to provide further context on this. Perhaps we just need to better document how to get more verbose logging information out of tracing?
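For example, something like this (a minimal sketch assuming the tracing-subscriber crate with its env-filter feature; the exact filter string is just an illustration) turns hyper/reqwest up to trace while leaving everything else at info:

use tracing_subscriber::EnvFilter;

fn init_logging() {
    // Honour RUST_LOG if set, e.g.
    // RUST_LOG="info,hyper=trace,hyper_util=trace,reqwest=trace",
    // otherwise fall back to the illustrative per-crate filter below.
    let filter = EnvFilter::try_from_default_env()
        .unwrap_or_else(|_| EnvFilter::new("info,hyper=trace,hyper_util=trace,reqwest=trace"));
    tracing_subscriber::fmt().with_env_filter(filter).init();
}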

drmorr0 (Author) commented Jun 24, 2024

> I was referring to "Users shouldn't have to fork and patch the repo to discover why their requests are failing".

Sorry again -- was not intending this to be pejorative.

> Generally it is considered best practice to either include an error as a source or include it in the Display implementation.

So a (very simplified) version of my code looks like this:

use object_store::{aws::AmazonS3Builder, path::Path, ObjectStore};

async fn do_stuff() -> anyhow::Result<()> {
  // configure the store from the environment plus the bucket in the URL;
  // the object key is addressed separately via Path
  let store = AmazonS3Builder::from_env()
    .with_url("s3://my-bucket")
    .build()?;
  let path = Path::from("foo/bar");
  let data = store.get(&path).await?;
  // do stuff with data
  Ok(())
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
  do_stuff().await
}

Is your suggestion here that, instead of falling back to the try operator (?), I should parse the error that is returned from store.get?

> Did you try enabling more verbose logging?

Yes, the logs that I included in the original comment were captured with DEBUG logging enabled. I guess I didn't specifically try bumping it to trace level, but what I expect as a user is that, if there is an explicit error, there is (hopefully) enough information in the error logs to understand what caused it.

drmorr0 (Author) commented Jun 25, 2024

Ok so I'm a dumbass.

After a bunch more investigation, I discovered that what I thought was happening (as posted above) is not actually what was happening: there was another lib sitting in between that was swallowing the anyhow error chain. This wasn't at all obvious to me (in part because I didn't understand the anyhow chain behaviour), but it doesn't seem like an issue with object_store.

Very sorry to waste your time on this; I will close the issue :(

drmorr0 closed this as completed Jun 25, 2024.