
storage: Advice regarding behaviour when GCS is unavailable #3522

Closed
horgh opened this issue Jan 11, 2021 · 10 comments
horgh commented Jan 11, 2021

I'm using v1.12.0 with go1.15.6.

Hello,

I make GCS API calls from a daemon that itself serves HTTP requests. I had a case where a client of my daemon would repeatedly time out due to the GCS request taking too long.

To try to avoid this, the daemon runs a health check that verifies it can contact GCS, and it marks itself offline if the check fails for too long so that requests go to other instances of the daemon. (It checks whether a bucket exists, in a loop with sleeps.)
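
Roughly, the check looks like this (simplified; the bucket name, intervals, failure threshold, and the setHealthy hook are all illustrative):

// Simplified health-check loop: probe a bucket with a short timeout and
// mark the daemon unhealthy after several consecutive failures.
func healthCheck(ctx context.Context, client *storage.Client, setHealthy func(bool)) {
	const (
		checkInterval = 10 * time.Second
		checkTimeout  = 5 * time.Second
		maxFailures   = 3
	)
	failures := 0
	for {
		cctx, cancel := context.WithTimeout(ctx, checkTimeout)
		_, err := client.Bucket("my-health-check-bucket").Attrs(cctx)
		cancel()
		if err != nil {
			failures++
		} else {
			failures = 0
		}
		setHealthy(failures < maxFailures)
		time.Sleep(checkInterval)
	}
}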

This check fails periodically for unknown reasons; my guess is that GCS has occasional issues, which I probably have to expect.

The check is also flawed in that there are multiple GCS IPs (and no doubt multiple hosts), and the HTTP/2 connections are cached, so some IPs/connections could be fine while others are not.

In debugging this, I noticed that a cached HTTP/2 connection sticks around for ~4 minutes, even when it consistently fails to return a response (I have a context timeout in my health check). This makes me wonder whether I should disable keepalives and/or add a shorter timeout and retry in my own code, to avoid these connections that could be causing my clients to time out. Neither option seems attractive, but waiting on a dead connection when a new one might succeed doesn't seem ideal either.

Basically, from my clients' perspective there could be an outage due to a cached dead connection (among other possible causes), and I wonder whether I could improve things somehow.

A couple questions then:

  • Do you have any advice on how I should check that GCS is available from my daemon? Or should I rely on retries, even though clients may time out?
  • Are there settings I could adjust so that connections that end up dead or timing out are closed more quickly? Or any other similar tweaks?
product-auto-label bot added the api: storage label Jan 11, 2021
codyoss added the type: question label Jan 11, 2021
tritone (Contributor) commented Jan 12, 2021

Thanks for your question. A few thoughts on this:

  • Generally I wouldn't advise micromanaging the HTTP/2 connections used by the client if you can avoid it; the idea is that things like closing stalled connections get handled inside the net/http client, so you shouldn't have to worry about them at the application level.
  • You can try disabling HTTP/2 by setting GODEBUG=http2client=0 and see whether that changes the behavior you see at all.
  • It's not clear to me that your health check failing indicates that the service will actually be unavailable when a separate GCS request is made. I don't think it's that common for all requests to stall unless you're experiencing general networking issues on the instance where your application runs.
  • It's definitely recommended to use timeouts and retries with exponential backoff on requests to GCS. For most operations, this library automatically retries failed requests that return a retryable error code, up until the context deadline. However, if you're seeing hangs in your requests to GCS, you can also implement a shorter timeout and retry wrapper around your library calls (see the sketch after the sample below). This will likely be more effective than the health check (again, unless there is a networking issue on your side). Also, if you're seeing issues with specific library methods, feel free to file issue(s) with the specifics.
  • If you do want to try changing the HTTP client transport configuration, we do provide a way to do that through the library using option.WithHTTPClient (though it is a bit clunky). You can then configure the settings on the net/http Transport as you like. Here's a sample of how you'd do that:
package main

import (
	"context"
	"net/http"

	"cloud.google.com/go/storage"
	"google.golang.org/api/option"
	raw "google.golang.org/api/storage/v1"
	htransport "google.golang.org/api/transport/http"
)

func main() {
	ctx := context.Background()

	// Standard way to initialize the client:
	// client, err := storage.NewClient(ctx)
	// if err != nil {
	// 	// Handle error.
	// }

	// Instead, create an http.Client using a custom transport.
	base := http.DefaultTransport.(*http.Transport).Clone()

	// TODO: configure base as desired.

	trans, err := htransport.NewTransport(ctx, base,
		option.WithScopes(raw.DevstorageFullControlScope),
		option.WithUserAgent("custom-user-agent"))
	if err != nil {
		// Handle error.
	}

	c := http.Client{Transport: trans}

	// Supply this client to storage.NewClient.
	client, err := storage.NewClient(ctx, option.WithHTTPClient(&c))
	if err != nil {
		// Handle error.
	}
	_ = client // Use the client as usual.
}
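
And on the shorter-timeout-plus-retry point above, a rough sketch might look like this (the per-attempt timeout, attempt count, and backoff are arbitrary, and a bucket attrs call just stands in for whatever operation you're making):

// attrsWithRetry wraps a single library call in a short per-attempt timeout
// with a simple backoff, so one stalled connection doesn't consume the whole
// request deadline.
func attrsWithRetry(ctx context.Context, bkt *storage.BucketHandle) (*storage.BucketAttrs, error) {
	var lastErr error
	for attempt := 0; attempt < 3; attempt++ {
		actx, cancel := context.WithTimeout(ctx, 2*time.Second)
		attrs, err := bkt.Attrs(actx)
		cancel()
		if err == nil {
			return attrs, nil
		}
		lastErr = err
		// Back off briefly before retrying, unless the caller's context is done.
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(time.Duration(attempt+1) * 500 * time.Millisecond):
		}
	}
	return nil, lastErr
}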

horgh (Author) commented Jan 12, 2021

Thank you for your thoughts!

  • Yeah, I agree trying to manage the connections is not desirable.

  • Regarding disabling HTTP/2: I tried it, and surprisingly there was a difference. With HTTP/2, the dead cached connection (I blocked it with iptables) keeps being used and my requests repeatedly time out (for ~4 minutes):

2021/01/12 15:57:02 success
# Added iptables rule here
2021/01/12 15:57:13 error: error retrieving bucket attributes: context deadline exceeded
2021/01/12 15:57:24 error: error retrieving bucket attributes: context deadline exceeded
2021/01/12 15:57:35 error: error retrieving bucket attributes: context deadline exceeded

But if I disable HTTP/2, the dead connection seems to be recycled after it fails:

2021/01/12 15:57:55 success
# Added iptables rule here
2021/01/12 15:58:06 error: error retrieving bucket attributes: context deadline exceeded
2021/01/12 15:58:13 success

I wonder whether that suggests the net/http client could be improved in this situation. It also suggests that disabling keepalives, though not ideal, would help: that way I could at least control the timeout via the dial timeout options, and rely on my context timeout plus a retry actually trying a new connection. It's not clear to me what timeout to adjust for a cached HTTP/2 connection.

  • Regarding the health check not being indicative of subsequent requests failing - agreed. It may be better than nothing as things stand, though. Last evening, for example, two of my Compute Engine VMs (different zones, same region) had apparent GCS issues leading to client timeouts, and the check did eventually pull both out of rotation for a time.

  • Yeah, I think retrying with a shorter timeout in my application code could help. My concern is that a dead HTTP/2 connection will keep being used, as above, making the retry somewhat pointless. Combined with disabling either HTTP/2 or keepalives, though, it might at least be an improvement...

horgh (Author) commented Jan 12, 2021

This issue sounds related to what I'm seeing with cached HTTP/2 connections being reused: golang/go#36026. See also golang/go#30702.

tritone (Contributor) commented Jan 12, 2021

That's interesting about disabling HTTP/2, though I guess it's not entirely surprising given that HTTP/1.1 doesn't keep long-lived connections in the same way. I would definitely file an issue against net/http (or comment on one of the existing ones with your experience).

If you use my code snippet above to create your storage client, you can play with some of these settings yourself via fields in Transport and see if that helps (e.g. DisableKeepAlives or IdleConnTimeout). You could also manually close idle open connections in the transport via Transport.CloseIdleConnections when your health check fails. I'd be curious to hear if any of these lead to an improvement on your side.
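
For example, starting from the snippet above, the relevant part might look like this (the values are only illustrative):

// Configure the cloned transport before passing it to htransport.NewTransport.
base := http.DefaultTransport.(*http.Transport).Clone()
base.DisableKeepAlives = true           // one option: no connection reuse at all
base.IdleConnTimeout = 30 * time.Second // or: recycle idle connections sooner

// And/or, when the health check fails, drop whatever idle connections
// are currently cached:
base.CloseIdleConnections()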

tritone (Contributor) commented Jan 19, 2021

@horgh I looked at the golang issue, and it looks like you used ReadIdleTimeout as a workaround; is that through golang.org/x/net/http2? I'm curious how you did that.

horgh (Author) commented Jan 19, 2021

Thanks for looking :). Yeah, through golang.org/x/net/http2 together with your example:

	// Inside the function that builds my storage client. http2 here is
	// golang.org/x/net/http2 and htransport is google.golang.org/api/transport/http.
	httpTransport := http.DefaultTransport.(*http.Transport).Clone()

	http2Transport, err := http2.ConfigureTransports(httpTransport)
	if err != nil {
		return nil, errors.Wrap(err, "error configuring http/2 transport")
	}
	// Setting this enables the ping timeout functionality.
	http2Transport.ReadIdleTimeout = 5 * time.Second

	htrans, err := htransport.NewTransport(ctx, httpTransport)
	if err != nil {
		return nil, errors.Wrap(err, "error creating transport")
	}

	httpClient := &http.Client{
		Transport: htrans,
	}

	client, err := storage.NewClient(ctx, option.WithHTTPClient(httpClient))
	if err != nil {
		return nil, errors.Wrap(err, "error creating storage client")
	}

tritone (Contributor) commented Jan 19, 2021

Awesome, it's very helpful to have this example!

I don't think we can do this by default because it would require depending on golang.org/x/net/http2, but I can definitely suggest it to others who run into a similar issue.

tritone closed this as completed Jan 19, 2021
tritone (Contributor) commented Jan 19, 2021

By the way, the one change we do make in the default transport is to increase MaxIdleConnsPerHost from 2 to 100. So you might want to do the same in your setup if your workloads require high read throughput.
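
With the custom-transport setup from my earlier sample, that's a single extra line on the cloned transport:

base := http.DefaultTransport.(*http.Transport).Clone()
base.MaxIdleConnsPerHost = 100 // the library's default; tune for your workload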

tritone (Contributor) commented Apr 22, 2021

Just to circle back on this closed issue: after some internal discussion, we decided to set a ReadIdleTimeout of 15s on the transport by default (for Go 1.16+). This change went out today in the latest release of the storage client, v1.15.0.

horgh (Author) commented Apr 22, 2021

That's awesome! Thank you for letting me/everyone know :-).
