storage: multipart: stream error: stream ID 1; INTERNAL_ERROR #987

Closed
rayrutjes opened this issue May 2, 2018 · 7 comments
Labels: api: storage · priority: p1 · 🚨 This issue needs some love. · type: bug

Comments


rayrutjes commented May 2, 2018

We have a serious issue on our production machines.
We recorded the following errors while sending objects:

2018/05/02 09:47:38 Post https://www.googleapis.com/upload/storage/v1/b/***/o?alt=json&projection=full&uploadType=multipart: stream error: stream ID 1; INTERNAL_ERROR
2018/05/02 10:11:51 Post https://www.googleapis.com/upload/storage/v1/b/***/o?alt=json&projection=full&uploadType=multipart: stream error: stream ID 11; INTERNAL_ERROR
2018/05/02 10:17:25 Post https://www.googleapis.com/upload/storage/v1/b/***/o?alt=json&projection=full&uploadType=multipart: write tcp xxx.xxx.xx.xx:61038->xxx.xxx.xx.xx:443: write: broken pipe

The main issue is that when this arises, it takes 47 minutes for the operation to fail. As a result, our operations are all running very late.

After this error, it seems that something is left in a weird state: we are unable to continue using the client properly and need to manually restart our services.

  • Could you tell us what that internal error is?
  • Also, when we have such errors, what should we do regarding bucket handles and existing clients? Should we throw them away?
  • Could it be that the retry strategy implemented in the client is kicking in here and not behaving as expected?
  • If we were to cancel the context 2 minutes in, would we avoid this error? Is that a good practice?

We are using version 0.20 of the library.

Any help would be much appreciated as this is currently affecting a lot of our production servers.

@jeanbza added the type: bug, api: storage, and priority: p1 labels on May 2, 2018

jba commented May 2, 2018

The internal error ultimately comes from the server. We don't know why it's sending that.

I see the upload type is multipart. That either means you're sending data that is smaller than your chunk size, or you've set Writer.ChunkSize to zero. Which one?
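
For reference, a minimal sketch of the knob in question, with placeholder bucket and object names (the inline comments paraphrase the behavior described above; this is an illustration, not the library's internals):

package main

import (
	"context"
	"io"
	"log"
	"strings"

	"cloud.google.com/go/storage"
)

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	w := client.Bucket("my-bucket").Object("my-object").NewWriter(ctx)
	// ChunkSize == 0 buffers the whole object and sends it in a single multipart request.
	// Data smaller than ChunkSize is also sent as a single request; anything larger is
	// uploaded in ChunkSize pieces via the resumable protocol.
	w.ChunkSize = 256 * 1024

	if _, err := io.Copy(w, strings.NewReader("payload")); err != nil {
		log.Fatal(err)
	}
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}
}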

If you find that just creating a new Writer and retrying isn't sufficient, then it seems you will have to create a new client.

Canceling the context and retrying should work and is a reasonable workaround.
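
A hedged sketch of that workaround, bounding each attempt with its own timeout and retrying with a brand-new Writer (the 2-minute bound, attempt count, and helper names are illustrative, not something the library prescribes):

package gcsupload

import (
	"context"
	"time"

	"cloud.google.com/go/storage"
)

// uploadWithRetry bounds every attempt with a context timeout and retries with
// a fresh Writer. Canceling the per-attempt context aborts the upload, so no
// partial object is stored.
func uploadWithRetry(ctx context.Context, obj *storage.ObjectHandle, data []byte) error {
	var err error
	for attempt := 0; attempt < 3; attempt++ {
		attemptCtx, cancel := context.WithTimeout(ctx, 2*time.Minute)
		err = uploadOnce(attemptCtx, obj, data)
		cancel()
		if err == nil {
			return nil
		}
	}
	return err
}

func uploadOnce(ctx context.Context, obj *storage.ObjectHandle, data []byte) error {
	w := obj.NewWriter(ctx) // a brand-new Writer for every attempt
	if _, err := w.Write(data); err != nil {
		return err
	}
	return w.Close()
}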

Anything you can provide that will let us reproduce this issue would be really helpful.

rayrutjes commented May 2, 2018

Thank you so much for the fast answer @jba; this is strongly impacting us, and we were about to open a ticket in the console.

In our case, we didn't override the ChunkSize, so it defaults to 8 MB; however, the files are mostly smaller than that.

> Canceling the context and retrying should work and is a reasonable workaround.

We deployed that fix an hour ago; unfortunately, it doesn't help. The context consistently hits the 2-minute timeout.

Note that this has only been affecting us since 4am UTC today, not during the three weeks this piece of code has been deployed.
We saw a very large increase in the internal errors returned by the servers at 4am today.

I see my options as:

  • setting the ChunkSize to the minimum possible value
  • if that doesn't work, creating a new instance of the client each time we hit this kind of error (see the sketch after this list). I'm not a huge fan of this one, but we will need a fix, even a temporary one.
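
For what it's worth, a rough sketch of the second option (all names and the single-retry shape are illustrative): once an upload keeps failing, throw the old client away and retry with a fresh one.

package gcsupload

import (
	"context"

	"cloud.google.com/go/storage"
)

// storeWithClientReset tries once with the existing client; if that fails, it
// discards the client, builds a new one, and retries. It returns whichever
// client should be used for subsequent calls.
func storeWithClientReset(ctx context.Context, cl *storage.Client, bucket, name string, data []byte) (*storage.Client, error) {
	if err := write(ctx, cl, bucket, name, data); err == nil {
		return cl, nil
	}
	cl.Close() // drop the client that keeps returning INTERNAL_ERROR
	fresh, err := storage.NewClient(ctx)
	if err != nil {
		return nil, err
	}
	return fresh, write(ctx, fresh, bucket, name, data)
}

func write(ctx context.Context, cl *storage.Client, bucket, name string, data []byte) error {
	w := cl.Bucket(bucket).Object(name).NewWriter(ctx)
	if _, err := w.Write(data); err != nil {
		return err
	}
	return w.Close()
}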

I'd be very open to sharing more information about our project and more specifics if needed.

I feel like this really is related to the GCS server side, as on our side neither the data sizes nor the code changed.

Let me know what the best way to move forward would be. I'm pretty much dedicated to fixing this issue, given that it currently delays the processing of our production services.


jba commented May 2, 2018

I just tried again to reproduce this issue by writing a thousand 1M files concurrently. They all succeeded. Can you suggest another pattern that may match what you're doing better?
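
For anyone trying to reproduce, a rough sketch of that kind of test; the bucket name, object count, and payload size are placeholders matching the description above:

package main

import (
	"bytes"
	"context"
	"fmt"
	"io"
	"log"

	"cloud.google.com/go/storage"
	"golang.org/x/sync/errgroup"
)

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	bucket := client.Bucket("my-test-bucket")   // placeholder bucket
	payload := bytes.Repeat([]byte("x"), 1<<20) // ~1M per object

	var g errgroup.Group
	for i := 0; i < 1000; i++ {
		i := i
		g.Go(func() error {
			w := bucket.Object(fmt.Sprintf("repro/%04d", i)).NewWriter(ctx)
			if _, err := io.Copy(w, bytes.NewReader(payload)); err != nil {
				return err
			}
			return w.Close()
		})
	}
	if err := g.Wait(); err != nil {
		log.Fatal(err)
	}
}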

If you didn't change your code, then it does sound like a server-side issue. I would recommend opening a ticket.

If the context isn't helping, then I don't think the long delay is due to retries. But you can add logging around here to make sure.


rayrutjes commented May 2, 2018

In my current use case, I don't even write files concurrently.
Files range from very tiny up to 400 MB.

Here is our current implementation (only relevant part):

// Store stores a gzip-compressed file in GCS and publishes a message to a Pub/Sub topic according to the log kind.
func (s *LegacyStore) Store(ctx context.Context, kind LogKind, r io.Reader, ts time.Time) (written int64, err error) {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Minute)

	// When the context is canceled, GCS writers will abort their job and ensure no object is stored.
	// It is important NOT to Close the GCS writer if something went wrong.
	// See: https://godoc.org/cloud.google.com/go/storage#Writer.CloseWithError
	defer cancel()

	w := s.writer.Resolve(ctx, s.bucket, s.StoragePath(kind, ts))

	gzipWriter := gzip.NewWriter(w)
	written, err = io.Copy(gzipWriter, r)

	if err != nil {
		gzipWriter.Close()
		return
	}

	err = gzipWriter.Close()
	if err != nil {
		return
	}

	err = w.Close()
	if err != nil {
		return
	}

	err = s.notifier.Notify(ctx, kind, s.hostname, ts)
	return
}

// Resolve resolves a writer that writes an object in Google Cloud Storage.
func (w *GCSWriterResolver) Resolve(ctx context.Context, bucket string, name string) io.WriteCloser {
	b := w.client.Bucket(bucket)
	o := b.Object(name)

	writer := o.NewWriter(ctx)
	writer.ContentType = "application/x-gzip" // We know this is wrong, and we will adapt in the future.

	return writer
}

Is there anything that looks wrong to you?
If not, we'll move ahead and open a ticket.


jba commented May 2, 2018

I don't see anything wrong with your code.

@JustinBeckwith added and then removed the 🚨 This issue needs some love. label on Jun 8, 2018

jba commented Jul 20, 2018

I took another look at this, to see if the information in golang/go#26338 was helpful. There the problem was that writing took too long and tripped a timeout in the Go HTTPS server. Although GCS doesn't use Go, I did something similar anyway. I slowed down writing so that it took me almost 12 hours to write a 20M file to GCS. The write succeeded.

Do you have any further updates from your end? Are you still encountering the problem?

I'm inclined to close this issue because, whatever the problem is, it's not in the storage client.

rayrutjes commented

@jba thanks for the investigation.
We have no more information to provide here.
We still see it happening every day; however, we now use a context with a proper timeout in order to retry faster, which helps us mitigate the problem.

I indeed think the issue doesn't come from this client. I'll close the issue.
