storage: multipart: stream error: stream ID 1; INTERNAL_ERROR #987

Closed
rayrutjes opened this issue May 2, 2018 · 7 comments
Labels: api: storage · priority: p1 · 🚨 This issue needs some love. · type: bug

Comments


rayrutjes commented May 2, 2018

We have a serious issue on our production machines.
We recorded the following errors while sending objects:

2018/05/02 09:47:38 Post https://www.googleapis.com/upload/storage/v1/b/***/o?alt=json&projection=full&uploadType=multipart: stream error: stream ID 1; INTERNAL_ERROR
2018/05/02 10:11:51 Post https://www.googleapis.com/upload/storage/v1/b/***/o?alt=json&projection=full&uploadType=multipart: stream error: stream ID 11; INTERNAL_ERROR
2018/05/02 10:17:25 Post https://www.googleapis.com/upload/storage/v1/b/***/o?alt=json&projection=full&uploadType=multipart: write tcp xxx.xxx.xx.xx:61038->xxx.xxx.xx.xx:443: write: broken pipe

The main issue is that when this arises, it takes 47 minutes for the operation to fail. As a result, our operations are all running very late.

After this error, it seems that something is left in a weird state: we are unable to continue using the client properly and need to manually restart our services.

  • Could you tell us what that internal error is?
  • Also, when we have such errors, what should we do regarding bucket handles and existing clients? Should we throw them away?
  • Could it be that the retry strategy implemented in the client is kicking in here and not behaving as expected?
  • If we were to cancel the context 2 minutes in, would we avoid this error? Is that a good practice?

We are using version 0.20 of the library.

Any help would be much appreciated as this is currently affecting a lot of our production servers.

@jeanbza added the type: bug, api: storage, and priority: p1 labels on May 2, 2018

jba commented May 2, 2018

The internal error ultimately comes from the server. We don't know why it's sending that.

I see the upload type is multipart. That either means you're sending data that is smaller than your chunk size, or you've set Writer.ChunkSize to zero. Which one?
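
For reference, a minimal sketch of the knob in question, with placeholder bucket and object names (the inline comments paraphrase the behavior described above; this is an illustration, not the library's internals):

package main

import (
	"context"
	"io"
	"log"
	"strings"

	"cloud.google.com/go/storage"
)

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	w := client.Bucket("my-bucket").Object("my-object").NewWriter(ctx)
	// ChunkSize == 0 buffers the whole object and sends it in a single multipart request.
	// Data smaller than ChunkSize is also sent as a single request; anything larger is
	// uploaded in ChunkSize pieces via the resumable protocol.
	w.ChunkSize = 256 * 1024

	if _, err := io.Copy(w, strings.NewReader("payload")); err != nil {
		log.Fatal(err)
	}
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}
}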

If you find that just creating a new Writer and retrying isn't sufficient, then it seems you will have to create a new client.

Canceling the context and retrying should work and is a reasonable workaround.
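
A hedged sketch of that workaround, bounding each attempt with its own timeout and retrying with a brand-new Writer (the 2-minute bound, attempt count, and helper names are illustrative, not something the library prescribes):

package gcsupload

import (
	"context"
	"time"

	"cloud.google.com/go/storage"
)

// uploadWithRetry bounds every attempt with a context timeout and retries with
// a fresh Writer. Canceling the per-attempt context aborts the upload, so no
// partial object is stored.
func uploadWithRetry(ctx context.Context, obj *storage.ObjectHandle, data []byte) error {
	var err error
	for attempt := 0; attempt < 3; attempt++ {
		attemptCtx, cancel := context.WithTimeout(ctx, 2*time.Minute)
		err = uploadOnce(attemptCtx, obj, data)
		cancel()
		if err == nil {
			return nil
		}
	}
	return err
}

func uploadOnce(ctx context.Context, obj *storage.ObjectHandle, data []byte) error {
	w := obj.NewWriter(ctx) // a brand-new Writer for every attempt
	if _, err := w.Write(data); err != nil {
		return err
	}
	return w.Close()
}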

Anything you can provide that will let us reproduce this issue would be really helpful.

rayrutjes commented May 2, 2018

Thank you so much for the fast answer @jba; this is strongly impacting us, and we were about to open a ticket in the console.

In our case, we didn't override the ChunkSize, so it defaults to 8 MB; however, the files are mostly smaller than that.

> Canceling the context and retrying should work and is a reasonable workaround.

We deployed that fix an hour ago; unfortunately, it doesn't help. The context consistently hits the 2-minute timeout.

Note that this has only been affecting us since 4am UTC today, not during the three weeks this piece of code has been deployed.
We saw a very large increase in the internal errors returned by the servers at 4am today.

I see my options as:

  • setting the ChunkSize to the minimum possible value
  • if that doesn't work, creating a new instance of the client each time we hit this kind of error (see the sketch after this list). I'm not a huge fan of this one, but we will need a fix, even a temporary one.
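
For what it's worth, a rough sketch of the second option (all names and the single-retry shape are illustrative): once an upload keeps failing, throw the old client away and retry with a fresh one.

package gcsupload

import (
	"context"

	"cloud.google.com/go/storage"
)

// storeWithClientReset tries once with the existing client; if that fails, it
// discards the client, builds a new one, and retries. It returns whichever
// client should be used for subsequent calls.
func storeWithClientReset(ctx context.Context, cl *storage.Client, bucket, name string, data []byte) (*storage.Client, error) {
	if err := write(ctx, cl, bucket, name, data); err == nil {
		return cl, nil
	}
	cl.Close() // drop the client that keeps returning INTERNAL_ERROR
	fresh, err := storage.NewClient(ctx)
	if err != nil {
		return nil, err
	}
	return fresh, write(ctx, fresh, bucket, name, data)
}

func write(ctx context.Context, cl *storage.Client, bucket, name string, data []byte) error {
	w := cl.Bucket(bucket).Object(name).NewWriter(ctx)
	if _, err := w.Write(data); err != nil {
		return err
	}
	return w.Close()
}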

I'd be very open to sharing more information about our project and more specifics if needed.

I feel like this really is related to the GCS server side, as on our side neither the data sizes nor the code changed.

Let me know what the best way to move forward would be. I'm pretty much dedicated to fixing this issue, given that it currently delays the processing of our production services.


jba commented May 2, 2018

I just tried again to reproduce this issue by writing a thousand 1M files concurrently. They all succeeded. Can you suggest another pattern that may match what you're doing better?
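
For anyone trying to reproduce, a rough sketch of that kind of test; the bucket name, object count, and payload size are placeholders matching the description above:

package main

import (
	"bytes"
	"context"
	"fmt"
	"io"
	"log"

	"cloud.google.com/go/storage"
	"golang.org/x/sync/errgroup"
)

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	bucket := client.Bucket("my-test-bucket")   // placeholder bucket
	payload := bytes.Repeat([]byte("x"), 1<<20) // ~1M per object

	var g errgroup.Group
	for i := 0; i < 1000; i++ {
		i := i
		g.Go(func() error {
			w := bucket.Object(fmt.Sprintf("repro/%04d", i)).NewWriter(ctx)
			if _, err := io.Copy(w, bytes.NewReader(payload)); err != nil {
				return err
			}
			return w.Close()
		})
	}
	if err := g.Wait(); err != nil {
		log.Fatal(err)
	}
}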

If you didn't change your code, then it does sound like a server-side issue. I would recommend opening a ticket.

If the context isn't helping, then I don't think the long delay is due to retries. But you can add logging around here to make sure.


rayrutjes commented May 2, 2018

In my current use case, I don't even write files concurrently.
Files range from very tiny up to 400 MB.

Here is our current implementation (only relevant part):

// Store stores a gzip-compressed file in GCS and publishes a message to a Pub/Sub topic according to the log kind.
func (s *LegacyStore) Store(ctx context.Context, kind LogKind, r io.Reader, ts time.Time) (written int64, err error) {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Minute)

	// When the context is canceled, GCS writers will abort their job and ensure no object is stored.
	// It is important NOT to Close the GCS writer if something went wrong.
	// See: https://godoc.org/cloud.google.com/go/storage#Writer.CloseWithError
	defer cancel()

	w := s.writer.Resolve(ctx, s.bucket, s.StoragePath(kind, ts))

	gzipWriter := gzip.NewWriter(w)
	written, err = io.Copy(gzipWriter, r)

	if err != nil {
		gzipWriter.Close()
		return
	}

	err = gzipWriter.Close()
	if err != nil {
		return
	}

	err = w.Close()
	if err != nil {
		return
	}

	err = s.notifier.Notify(ctx, kind, s.hostname, ts)
	return
}

// Resolve resolves a writer that writes an object in Google Cloud Storage.
func (w *GCSWriterResolver) Resolve(ctx context.Context, bucket string, name string) io.WriteCloser {
	b := w.client.Bucket(bucket)
	o := b.Object(name)

	writer := o.NewWriter(ctx)
	writer.ContentType = "application/x-gzip" // We know this is wrong, and we will adapt in the future.

	return writer
}

Is there anything that looks wrong to you?
If not, we'll move ahead and open a ticket.


jba commented May 2, 2018

I don't see anything wrong with your code.

@JustinBeckwith added and then removed the 🚨 This issue needs some love. label on Jun 8, 2018

jba commented Jul 20, 2018

I took another look at this, to see if the information in golang/go#26338 was helpful. There the problem was that writing took too long and tripped a timeout in the Go HTTPS server. Although GCS doesn't use Go, I did something similar anyway. I slowed down writing so that it took me almost 12 hours to write a 20M file to GCS. The write succeeded.

Do you have any further updates from your end? Are you still encountering the problem?

I'm inclined to close this issue because, whatever the problem is, it's not in the storage client.

rayrutjes commented

@jba thanks for the investigation.
We have no more information to provide here.
We still see it happening every day; however, we now use a context with a proper timeout in order to retry faster, which helps us mitigate the problem.

I indeed think the issue doesn't come from this client. I'll close the issue.
