
okhttp fails with IOException: gzip finished without exhausting source but GZIPInputStream works #3759

Closed
asarkar opened this issue Jan 5, 2018 · 44 comments
Labels
bug Bug in existing code

@asarkar

asarkar commented Jan 5, 2018

What kind of issue is this?

This issue can't be reproduced in a test. I'll do my best to explain.

>> GET http://myserver.mycompany.com/.../businesses.20180104.json.gz
<< 200 OK
<< connection -> [keep-alive]
<< accept-ranges -> [bytes]
<< content-disposition -> [attachment; filename="businesses.20180104.json.gz"; filename*=UTF-8''businesses.20180104.json.gz] 
<< content-type -> [application/x-gzip]
<< content-length -> [3384998203]
<< date -> [Fri, 05 Jan 2018 00:43:32 GMT]
<< etag -> [0e49d5fa7ba9f68058bfbb4a98bef032c3a73871]
<< last-modified -> [Thu, 04 Jan 2018 23:54:26 GMT]
<< x-artifactory-id -> [9732f56568ea1e3d:59294f65:160b8066066:-8000]
<< x-checksum-md5 -> [451ca1b1414e7b511de874e61fd33eb2]
<< x-artifactory-filename -> [businesses.20180104.json.gz]
<< server -> [Artifactory/5.3.0]
<< x-checksum-sha1 -> [0e49d5fa7ba9f68058bfbb4a98bef032c3a73871]

As you can see, the server doesn't set a Content-Encoding = gzip header, so I do that in an interceptor.
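
A minimal sketch of one way to do that (my reconstruction; the actual interceptor isn't shown in the report). Labeling the response as gzip causes OkHttp's transparent decompression, which uses GzipSource, to kick in:

import okhttp3.OkHttpClient;
import okhttp3.Response;

OkHttpClient client = new OkHttpClient.Builder()
    .addNetworkInterceptor(chain -> {
      Response response = chain.proceed(chain.request());
      // Re-label the body as gzip so OkHttp decompresses it on the fly.
      return response.newBuilder()
          .header("Content-Encoding", "gzip")
          .build();
    })
    .build();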

Each record is a newline-delimited JSON string that is inserted into Couchbase. There are around 12 million records in total. Using okhttp, processing fails after about 130,000 with the following exception:

Caused by: java.io.IOException: gzip finished without exhausting source
	at okio.GzipSource.read(GzipSource.java:100)
	at okio.RealBufferedSource$1.read(RealBufferedSource.java:430)

However, if I don't set the Content-Encoding header (thus skipping GzipSource), and wrap the input stream with GZIPInputStream, everything works as expected. I've also tried setting Transfer-Encoding = chunked on the response and removing the Content-Length header, but to no avail.
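
For reference, a sketch of that working fallback (my code, not the reporter's; the URL path is elided in the report and kept elided here):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

OkHttpClient client = new OkHttpClient();
Request request = new Request.Builder()
    .url("http://myserver.mycompany.com/.../businesses.20180104.json.gz") // "..." stands for the elided path
    .build();
try (Response response = client.newCall(request).execute();
     BufferedReader reader = new BufferedReader(new InputStreamReader(
         new GZIPInputStream(response.body().byteStream())))) {
  String record;
  while ((record = reader.readLine()) != null) {
    // each line is one newline-delimited JSON record to insert into Couchbase
  }
}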

So, the question is: if GZIPInputStream doesn't have a problem, why does GzipSource? And since it does, why won't it report what it thinks the issue is? I have a test that runs on a smaller file, 100 records, and it works.

I've seen #3457, but unlike the reporter, it's not possible for me to capture the hex body of a 3.4 GB stream.

@asarkar asarkar changed the title from "okhttp fails with IOException: gzip finished without exhausting source but" to "okhttp fails with IOException: gzip finished without exhausting source but GZIPInputStream works" Jan 5, 2018
@yschimke
Collaborator

yschimke commented Jan 5, 2018

This isn't easily reproducible without a test case or a URL to test against. I'm going to guess it's one of two cases:

a) that the server is incorrectly sending the original content-length in the response. Is the content compressed on the fly? If so it seems strange that the server can correctly predict the content length. [edit: this case is unlikely since you tested removing the length header]

content-length -> [3384998203]

b) it is just serving a previously compressed file, which contains some trailing bytes (or the server is incorrectly adding something). You should be able to debug this case by looking at what is left in the buffer at the point Okio is about to throw. In this case Okio is doing an additional validity check that happens to expose a slightly corrupted file.

@yschimke
Collaborator

yschimke commented Jan 5, 2018

The hot take on this is that the server seems to be sending back a file that just happens to be compressed. As such, adding "Content-Encoding = gzip", while convenient for you, is testing an edge case and more likely to break some normal web assumptions. Why not just handle this as you are doing, via GZIPInputStream on the original response?

@asarkar
Author

asarkar commented Jan 5, 2018

just handle this as you are doing via GZIPInputStream

That's what I've been forced to do, but when something doesn't work, I like to understand why, not just write it off as bad luck 😃

incorrectly sending the original content-length

The Content-Length returned is correct; I know that because I can download the file using curl or a browser, and it is the stated size on disk. Also, like you said, I've tried without the Content-Length header as well.

contains some trailing bytes (or server is incorrectly adding something)...left in the buffer

I've collected exactly that using Netty (I've been trying multiple clients to rule out a potential okhttp bug). I can post that, or if you tell me how to collect what's left in the buffer using okhttp, I will do that.

@JakeWharton
Collaborator

That makes it sound like the content length is incorrect. If it's gzip-encoded then the length should be the gzip'd size and not the uncompressed size.

This seems like a pretty straightforward unit test to set up from the template so we can see exactly what you're doing.
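
For instance, a sketch of such a test with MockWebServer (mine, not the reporter's; it serves two raw-concatenated gzip members labeled Content-Encoding: gzip, which should reproduce the exception when the body is read):

import java.io.IOException;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;
import okhttp3.mockwebserver.MockResponse;
import okhttp3.mockwebserver.MockWebServer;
import okio.Buffer;
import okio.BufferedSink;
import okio.GzipSink;
import okio.Okio;

public final class ConcatenatedGzipTest {
  // Gzip a string into a single gzip member.
  static Buffer gzip(String data) throws IOException {
    Buffer result = new Buffer();
    BufferedSink sink = Okio.buffer(new GzipSink(result));
    sink.writeUtf8(data);
    sink.close();
    return result;
  }

  public static void main(String[] args) throws IOException {
    MockWebServer server = new MockWebServer();
    Buffer body = gzip("HELLO\n");
    body.writeAll(gzip("WORLD\n")); // second gzip member, raw-concatenated
    server.enqueue(new MockResponse()
        .addHeader("Content-Encoding", "gzip")
        .setBody(body));
    server.start();

    OkHttpClient client = new OkHttpClient();
    Request request = new Request.Builder().url(server.url("/")).build();
    try (Response response = client.newCall(request).execute()) {
      // Expected: IOException "gzip finished without exhausting source".
      System.out.println(response.body().string());
    } finally {
      server.shutdown();
    }
  }
}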

@asarkar
Author

asarkar commented Jan 5, 2018

I meant the compressed size only, not uncompressed. When I download the file, the archive (businesses.20180104.json.gz) size on disk is consistent with the reported one.

I have a unit test; it doesn't fail. Of course, if I report a wrong content length in the response, it might, but I'm not testing GzipSource.

@asarkar
Author

asarkar commented Jan 5, 2018

yelp-load.txt

Last packet I captured from Netty.

Wikipedia says:

its file format also allows for multiple such streams to be concatenated

Commons Compress supports this:

For the bzip2, gzip and xz formats a single compressed file may actually consist of several streams that will be concatenated by the command line utilities when decompressing them. Starting with Commons Compress 1.4 the *CompressorInputStreams for these formats support concatenating streams as well

But I don't see anything in GzipSource to accommodate this. Especially this section:

if (section == SECTION_TRAILER) {
  consumeTrailer();
  section = SECTION_DONE;

  // Gzip streams self-terminate: they return -1 before their underlying
  // source returns -1. Here we attempt to force the underlying stream to
  // return -1 which may trigger it to release its resources. If it doesn't
  // return -1, then our Gzip data finished prematurely!
  if (!source.exhausted()) {
    throw new IOException("gzip finished without exhausting source");
  }
}

Due to okhttp code reaching into the guts of other classes by abusing package access, it's not possible for me to easily update the code and find out if my theory is correct.

@yschimke
Collaborator

yschimke commented Jan 6, 2018

If the problem is the multiple streams, then I think you are hitting a subtle difference between using gzip to compress and store content like tar archives of files on disk, and usage in a webserver to compress a single resource being served on the fly.

These headers are no longer valid after OkHttp automatically decompresses the content:

<< content-disposition -> [attachment; filename="businesses.20180104.json.gz"; filename*=UTF-8''businesses.20180104.json.gz] 
<< content-type -> [application/x-gzip]
<< x-checksum-md5 -> [451ca1b1414e7b511de874e61fd33eb2]
<< x-artifactory-filename -> [businesses.20180104.json.gz]
<< x-checksum-sha1 -> [0e49d5fa7ba9f68058bfbb4a98bef032c3a73871]

I would argue that adding the Content-Encoding header to cause OkHttp to decompress these previously compressed files on the fly is the problem.

It's been a great learning experience, but I don't think we should try to handle this case as it doesn't appear to be a normal situation in the wild with real webservers.

@asarkar
Author

asarkar commented Jan 6, 2018

we should try to handle this case

I'd grudgingly agree, but I still don't know the problem. My goal here is to understand the issue, and get help from people who understand low-level networking better than I do. If the issue turns out to be GZIP non-compliance, I'll happily walk away knowing that I learned something. But without knowing the problem, the following statement can't be proven.

it doesn't appear to be a normal situation

@yschimke
Collaborator

yschimke commented Jan 6, 2018

You've pointed to a real edge case of the Gzip spec that (I'm assuming) makes sense for a number of reasons when used for tar archives, e.g. .tar.gz: append-only additions of content, or something similar. I don't specifically know, and I can't see the stream marker (1f 8b) in the packet you captured from Netty to confirm.

But pragmatically it doesn't appear to happen without a contrived situation, e.g. manually adding a header.

SMITH: Doctor, it hurts when I do this.
DALE: Don't do that.

@asarkar
Author

asarkar commented Jan 6, 2018

@yschimke
I think there was a mistake in how I took the previous hex dump. This time, I did the following:

1. protected void decode(ChannelHandlerContext ctx, ByteBuf in, List<Object> out) throws Exception {
2.    if (finished) {
3.        in.skipBytes(in.readableBytes());
4.        return;
5.    }
    ...
}

Put a breakpoint on line 3 of JdkZlibDecoder.decode above. For every invocation of it, dump the contents of the ByteBuf to a file by manually invoking the following method that I wrote:
ByteBufUtils.dumpByteBuf("yelp-dump.txt", in)

// Imports needed: static io.netty.buffer.ByteBufUtil.appendPrettyHexDump,
// io.netty.util.internal.StringUtil, static java.nio.file.Files.newBufferedWriter,
// java.nio.file.Paths, static java.nio.charset.StandardCharsets.UTF_8,
// and static java.nio.file.StandardOpenOption.{APPEND, CREATE}.
public static void dumpByteBuf(String out, ByteBuf msg) {
    StringBuilder buf = new StringBuilder(StringUtil.NEWLINE);
    appendPrettyHexDump(buf, msg);

    try (BufferedWriter w = newBufferedWriter(Paths.get(out), UTF_8, APPEND, CREATE)) {
        w.write(buf.toString());
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
}

That produced the attached dump, and I see that it starts with 1f 8b, as well as contains the same sequence more than once.
Does this prove my theory of multiple streams?

yelp-dump.txt

@swankjesse
Collaborator

I introduced this behavior and can explain it.

Gzip is a self-terminating format. The content of the stream itself indicates where you've read everything.

If ever there's data beyond the self-reported end, this data is effectively unreachable. This is potentially problematic for two reasons:

  • HTTP/1 connection pooling. If we don't consume the entire response body of call N, we can't use the connection for call (N+1).
  • HTTP response caching. We only persist response values once they're completely downloaded.

I made things strict to help to detect problems like this. It's possible this check is too strict and we should silently ignore the extra data.

@asarkar
Author

asarkar commented Jan 7, 2018

@swankjesse It's the strict check that put me on the right track for debugging, so it's certainly useful. Other clients (Netty/Apache) silently ignore the data, causing a lot of hair tearing and "WTF is going on?" flying around.

I suggest that, like Commons Compress, a variable concatenate is introduced, set to false by default for backward compatibility. If concatenate = true, there is more data to be read, and the next 2 bytes are 1f 8b, simply reset the CRC and make a recursive call; otherwise, throw an exception. If you want to be lenient on clients, there could be a second flag, ignoreSilently, used to either ignore the extra data silently (if true) or throw an exception (default behavior).

How does that sound?
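
A minimal sketch of the concatenating behavior as a standalone wrapper (hypothetical; ConcatenatedGzipSource is my name, not an Okio API, and it leans on catching the strictness exception rather than the in-GzipSource flag proposed above):

import java.io.IOException;
import okio.Buffer;
import okio.BufferedSource;
import okio.ByteString;
import okio.GzipSource;
import okio.Source;
import okio.Timeout;

final class ConcatenatedGzipSource implements Source {
  private static final ByteString GZIP_MAGIC = ByteString.decodeHex("1f8b");

  private final BufferedSource source;
  private GzipSource member;

  ConcatenatedGzipSource(BufferedSource source) {
    this.source = source;
    this.member = new GzipSource(source);
  }

  @Override public long read(Buffer sink, long byteCount) throws IOException {
    try {
      return member.read(sink, byteCount); // returns -1 only on a clean end of stream
    } catch (IOException e) {
      // GzipSource throws "gzip finished without exhausting source" when a member
      // ends with data left over. If that data starts with the gzip magic bytes,
      // chain to the next member; otherwise propagate the failure. (A production
      // version would also distinguish CRC failures from this condition.)
      if (source.exhausted() || !source.rangeEquals(0, GZIP_MAGIC)) throw e;
      member = new GzipSource(source);
      return read(sink, byteCount);
    }
  }

  @Override public Timeout timeout() { return source.timeout(); }

  @Override public void close() throws IOException { member.close(); }
}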

@yschimke
Collaborator

yschimke commented Jan 7, 2018

(Sounding like a broken record) I don't think we should change OkHttp behaviour here. I think what should probably happen is that OkHttp continues to process only the first stream until we find a mainstream HTTP server that actually encodes multiple streams on the fly.

It isn't clear to me what the semantics of multiple streams are. Raw concatenation? Exposing each stream for processing?

Clients should continue to post-process these Gzip responses, so the one question for me is whether Okio should support it. That seems useful for completeness, and Okio could make a nice API for clients processing each stream in turn.

@asarkar
Author

asarkar commented Jan 7, 2018

Why don’t you think that okhttp should complete the job, rather than doing it half-a$$ed? It makes no sense for clients to process n - 1 streams. As I quoted earlier, command-line tools do this transparently.
Unfortunately, the spec doesn’t clarify the parser behavior in this case. I’ve emailed the maintainers of the RFC; we'll see.

@yschimke
Collaborator

yschimke commented Jan 7, 2018

Because you haven't shown a non-contrived, real-world example. As I argued before, you are causing this by fudging the response headers; without that, everything works in the wild.

@asarkar
Author

asarkar commented Jan 8, 2018

Sorry, but this is as real as it gets. You may wish for all servers to be well-behaved HTTP citizens, but the real world doesn't work that way. Who sets the header is completely irrelevant here; what is relevant is that okhttp parses the GZIP stream in an incomplete manner.

BTW, here's more real stuff, from RFC 1952

A gzip file consists of a series of "members" (compressed data sets). The format of each member is specified in the following section. The members simply appear one after another in the file, with no additional information before, between, or after them.

We don't have to agree with each other, that's fine. But let's not undermine the issue here.

@yschimke
Collaborator

yschimke commented Jan 8, 2018

I've been a bit blunt in my responses and obviously caused you some offense or frustration. This wasn't my aim.

Stating the obvious: it's really up to Okio whether to handle this automatically, so there isn't really a fix for OkHttp here. I'm sure they would welcome a PR, and it seems like something that a GzipSource in Okio should handle. At which point the whole argument is moot.

@madler

madler commented Jan 8, 2018

It isn't clear to me what the semantics of multiple streams is? Raw concatenation?

Yes, simple concatenation. A concatenation of valid gzip streams is also a valid gzip stream. Only decompressing the first gzip member and ignoring the subsequent data is non-compliant with the standard.

It's not my place to say where in the code this should be fixed. However, waiting until a known bug causes a problem before you will consider fixing it does not seem like a good strategy to me.

@swankjesse
Collaborator

Can fix. Probably in Okio. I'm under the impression this is also broken in Android’s GzipInputStream, from which GzipSource was derived.

@swankjesse swankjesse added the bug Bug in existing code label Jan 9, 2018
@yschimke
Collaborator

yschimke commented Jan 9, 2018

@madler @asarkar noted. Please let me know if you find an example service (not just a gzipped file) that we can test against, I'll add a regression test to OkHttp.

@asarkar
Author

asarkar commented Jan 9, 2018

@yschimke Commons Compress has test cases for this scenario: GZipTestCase.java

The file they seem to be using is multiple.gz

@yschimke
Collaborator

yschimke commented Feb 5, 2018

I'm recording a mental note to test this issue #1517 (comment) (specifically sourceforge test url) after a fix lands in Okio.

@swankjesse
Collaborator

swankjesse commented Feb 11, 2018

https://gist.github.com/swankjesse/025118cadfa93d7f0904cb96d80aaf68

I tried Chrome, Safari, Firefox, and curl. All but Chrome read only the first gzip member. Chrome also includes the subsequent member, but doesn’t properly Gzip-decompress it, and so the resulting page is malformed.

[three screenshots of the browser results]

Because major web browsers don’t support this, I don’t think OkHttp should support it either.

@swankjesse
Collaborator

Admittedly, gunzip does this. But I don’t think we want to be consistent with command line tools; we want to be consistent with Chrome and Firefox.

~ echo 'HELLO ' | gzip > helloworld.gz
~ echo 'WORLD ' | gzip >> helloworld.gz
~ gunzip helloworld.gz
~ cat helloworld
HELLO
WORLD

@madler

madler commented Feb 11, 2018

The major web browsers are not compliant with the standard. Their bugs should be fixed as well.

@yschimke
Collaborator

@swankjesse Would you accept a PR for Okio that fixes it there? I'm not offering, just giving an option if it's hampering your usage. But I agree this isn't a priority for OkHttp directly.

@swankjesse
Collaborator

It's a pretty tough thing to get right. For example, do we provide a dictionary for the 2nd member? It would be worthwhile to track what GzipInputStream does.

@asarkar
Author

asarkar commented Feb 11, 2018

@swankjesse I don't get the "Because major web browsers don’t support this, I don’t think OkHttp should support it either" argument. OkHttp isn't a browser, and even if it were, "I'll keep doing the wrong thing, even though I know it's wrong, just because others do it" doesn't make a lot of sense.

FWIW, I submitted a PR for this bug in Netty, which was merged. I also reported it to Apache HttpClient, and they introduced tests to prove that they are compliant with the RFC. If you knowingly leave this bug in at this point, OkHttp will be the only major HTTP client to be non-compliant.

@madler

madler commented Feb 11, 2018

Dictionary? There are no dictionaries in gzip streams.

It is trivial to get right. If you get more data after the end of the first gzip member, then simply do exactly the same thing you did at the start of the first member -- decode a gzip member. Repeat until the input is exhausted.

gzip is not a self-terminating format, since there is no indication of the last member other than the exhaustion of the data.

@swankjesse
Collaborator

Alright, I’m on board. We’ll need to fix in Okio.

@swankjesse swankjesse reopened this Feb 11, 2018
@swankjesse
Collaborator

Also: are there tracking bugs for Chrome, Firefox, and Safari? They ought to fix this too, and if they don’t, we should learn why not.

@asarkar
Author

asarkar commented Feb 12, 2018

My use case is a Java client, so I reported to all of the Java clients that I found having this issue (and fixed one, so I'm not merely complaining). I don't care about Browsers, just like I don't care if some Python HTTP client in the wild has this issue too. Not saying those shouldn't be fixed, but I'm not gonna chase them for a fix.

@Dkhusainov

Dkhusainov commented Oct 24, 2019

Are concatenated gzip streams still on the table for okio? I'm trying to write an aggregating gateway which sequentially streams gzipped data from multiple sources to the client. The problem is that my client wouldn't be able to read it with okhttp/okio.
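
For illustration, a sketch of what such a gateway would emit (hypothetical code, mine): each upstream chunk becomes an independent gzip member, and per RFC 1952 their raw concatenation is itself a valid gzip stream:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

class GatewayBody {
  // Compress one upstream chunk into a single gzip member.
  static byte[] gzipMember(String data) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
      gzip.write(data.getBytes(StandardCharsets.UTF_8));
    }
    return bytes.toByteArray();
  }

  static void writeBody(OutputStream out) throws IOException {
    out.write(gzipMember("data from source A\n"));
    out.write(gzipMember("data from source B\n")); // okhttp/okio clients currently fail here
  }
}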

@swankjesse
Collaborator

@Dkhusainov can you confirm Chrome or Firefox would interpret the data as you'd prefer here? The gist above will get you started.

@Dkhusainov

Browsers have nothing to do with the issue. As mentioned above, the gzip spec dictates that multiple consecutive gzip payloads are a valid gzip payload.

To not break backwards compatibility, this can be done with a separate MultiGzipSource, and a flag (false by default) in OkHttpClient.Builder to use it.

This issue hanging in an unresolved state isn't helping anyone. Either close it so people who need it know it's not coming and will do it themselves, or give some estimate of when it will be implemented.

@swankjesse
Collaborator

As you wish! We want to be consistent with browsers and so here we are.

@asarkar
Author

asarkar commented Oct 31, 2019

It’s unfortunate that mere laziness had to be wrapped in a senseless excuse, “we want to be consistent with the browsers”, in order to avoid doing the right thing. OkHttp isn’t a browser. The author of the GZ protocol clearly stated that the behavior is wrong; the authors of OkHttp don’t know better than he does.

@square square temporarily blocked asarkar Oct 31, 2019
@swankjesse
Collaborator

Don't make personal attacks in this bug tracker. I've blocked @asarkar for 24 hours for accusing us of being lazy. Further violations of our code of conduct will result in a permanent block.

@rsandx

rsandx commented Dec 28, 2019

Recently I've been intermittently getting IOException: CRC checkEqual failure at okio.GzipSource.checkEqual(GzipSource.kt:197) when testing an Android app (very frustrating, as it can't be fixed from my side), and the issue discussed here seems to be the cause. So I really wish this known issue were fixed in OkHttp, considering how widely it's used in the Android community - "As of Android 5.0, OkHttp is part of the Android platform and is used for all HTTP calls." It doesn't sound right to me that OkHttp should follow the behavior of browsers in this case. @swankjesse, could you guys please reconsider it?

@yschimke
Collaborator

@rsandx The OkHttp that is built into the Android platform is patched under AOSP, so fixing it here is unlikely to fix that one.

OkHttp 3 & 4 are opt-in, give a lot of additional features, and are designed to set a higher goal for security and correctness. This includes following web standards (both formal RFCs and browser-based ones).

@yschimke
Collaborator

@rsandx Another option here if you feel so strongly about this: transparent gzip is disabled if you take care of it yourself.

Maybe you can publish an open-source library for better Gzip support; it's obviously not just you affected by these issues. Here is an example external compression interceptor that supports gzip and brotli:

https://github.com/square/okhttp/blob/master/okhttp-brotli/src/main/java/okhttp3/brotli/BrotliInterceptor.kt

But unfortunately it still suffers from the issue you currently face.
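
For what it's worth, a sketch of that opt-out (mine; the URL is hypothetical): setting Accept-Encoding yourself disables OkHttp's transparent gzip, so the raw compressed bytes reach your code and you can decode them with a concatenation-aware decoder:

import okhttp3.Request;

Request request = new Request.Builder()
    .url("https://example.com/data.gz") // hypothetical URL
    .header("Accept-Encoding", "gzip")
    .build();
// response.body().source() now delivers the undecoded gzip bytes.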

@rsandx

rsandx commented Dec 28, 2019

@yschimke Thanks for the info. My app uses OkHttp 3 with Retrofit, and does have communication issues like this and "unexpected end of stream" errors from time to time, especially when the response body contains more data than a simple message. I'm really struggling to find a way to make the app work stably.

@swankjesse
Collaborator

@rsandx what server are you using? Might be a bug where it isn't sending all the bytes it needs to send.

@rsandx

rsandx commented Dec 28, 2019

@swankjesse I'm using Flask for testing, and I'm aware people say "unexpected end of stream" is a server issue, but the same URL returns data correctly in browsers like Firefox and tools like Postman; of course I've tried various solutions (on both the server and app sides) people suggested on Stack Overflow, and the issue still persists.
