Avoid quadratic complexity in GzDecoder #278

catenacyber · 2021-07-19T09:29:26Z

Quadratic complexity happens when the underlying reader
supplies bytes one at a time and would block.
And if header flag FNAME is set and we never have a null
byte to end the filename, read_gz_header keeps reading
all of the partial filename and computing the CRC...

The solution is to keep a more complex structure than a
simple buffer for what has already been parsed in the
header (as is done in C zlib)

@alexcrichton I will put some comments in the code for your review

Quadratic complexity happens when the underlying reader supplies bytes one at a time and would block. And if header flag FNAME is set and we never have a null byte to end the filename, read_gz_header keeps reading all of the partial filename and computing the CRC... The solution is to keep a more complex structure than a simple buffer for what has already been parsed in the header (as is done in C zlib)

catenacyber · 2021-07-19T09:30:47Z

src/gz/bufread.rs

-                        let mut reader = Buffer::new(&mut buf, reader.get_mut().get_mut());
-                        read_gz_header(&mut reader)
+                        let mut reader = Buffer::new(&mut part, reader.get_mut().get_mut());
+                        read_gz_header_part(&mut reader)


Why is src/gz/write.rs using read_gz_header ?
I guess we should have only one function instead of both read_gz_header and read_gz_header_part

catenacyber · 2021-07-19T09:32:47Z

src/gz/bufread.rs

-            Ok(hdr) => {
-                header = Some(hdr);
+            Ok(()) => {
+                header = Some(part.take_header());


read_gz_header_part cannot return directly the GzHeader structure because it would move part of the GzHeaderPartial which is behind a mutable reference.

Thus, the need to break in 2 phases :
1 return Ok
2 take ownership of the partial header fields

catenacyber · 2021-07-19T09:34:33Z

src/gz/bufread.rs

            self.buf_cur += len;
            Ok(len)
        }
    }
 }

+impl<'a, T> Buffer<'a, T> where T: std::io::Read {
+    // If we manage to read all the bytes, we reset the buffer
+    fn read_once(&mut self, buf: &mut [u8]) -> io::Result<usize> {


I did not see any good fit in neither BufRead nor Read for this...

What I want is
read_exact but buffer it there is not enough data, and clear the buffer if it succeeded

Is there some more idiomatic rust way ?

catenacyber · 2021-07-19T09:35:33Z

src/gz/bufread.rs

+            }
+            GzHeaderParsingState::GzHeaderParsingFilename => {
+                if r.part.flg & FNAME != 0 {
+                    for byte in r.reader.bytes() {


I did not find any more idiomatic way to do this

Maybe something like r.part.filename = r.read_until_zero()?; could be done ?

catenacyber · 2021-07-19T09:37:29Z

src/gz/bufread.rs

+    fn read_once(&mut self, buf: &mut [u8]) -> io::Result<usize> {
+        self.read_exact(buf)?;
+        let rlen = buf.len();
+        self.crc.update(buf);


I did not find a way to chain with CrcReader because of this read_once logic

catenacyber · 2021-07-19T09:44:05Z

I guess there should be some additions to the test framework as well. Like testing for some inputs that we get the same result if the reader is supplying bytes one at a time and we call multiple times read with treating the case io::ErrorKind::WouldBlock by making one more byte available...

alexcrichton

Thanks!

Can you explain more about the read_once aspect? I'm not sure that the internal usage of read_exact will work well with async I/O where it may fail to make progress but may make progress later. It looks like in-progress bytes may be lost?

Overall this looks pretty reasonable but I think the byte buffering could use some tweaks and/or documentation improvements.

src/gz/bufread.rs

alexcrichton · 2021-07-19T14:45:55Z

src/gz/bufread.rs

+    mtime: u32,
+    extra: Option<Vec<u8>>,
+    filename: Vec<u8>,
+    comment: Vec<u8>,


Could this perhaps internally store a GzHeader and then it just appends a buf and state field?

Good idea, will try

Ah, we need at least flg and xlen in GzHeaderPartial
So, should I still try to put a GzHeader in GzHeaderPartial ?

I don't know without doing it myself, but I'm trying to ideally reduce duplication here instead of listing the fields twice in a number of places.

Right, doing so

src/gz/bufread.rs

catenacyber · 2021-07-19T15:42:18Z

Can you explain more about the read_once aspect? I'm not sure that the internal usage of read_exact will work well with async I/O where it may fail to make progress but may make progress later. It looks like in-progress bytes may be lost?

It is possible I missed some points.
What I want read_once to do is to try to read the exact number of bytes of the buffer, and if there are not enough, save them into the structure's buffer...
Relooking at it, it looks like it does not do that yet... Will try something else

This reverts commit e463973.

catenacyber · 2021-07-19T15:57:59Z

I think read_once was correct because it is used with Buffer and not a generic Read which just keeps buffering the read data (until the buffer is reset/truncated in read_once)

alexcrichton · 2021-07-20T15:03:29Z

I don't think read_once is working as intended. The read_exact method you're calling may call read multiple times and if it ends in an error then it discards all previously read data and doesn't return how much was read.

catenacyber · 2021-07-21T09:26:57Z

I don't think read_once is working as intended.

You were right.

I added a test, and fixed read_once accordingly

catenacyber · 2021-07-21T09:29:08Z

src/gz/bufread.rs

+
+    //a cursor turning EOF into blocking errors
+    #[derive(Debug)]
+    pub struct BlockingCursor {


I do not know if there is such a structure ready in rust implementing Read and where we can specify at some points that we received more data

catenacyber · 2021-07-21T09:31:25Z

src/gz/bufread.rs

+        r.set_position(pos);
+
+        // Third read : still incomplete for 7 bytes (we have 5)
+        let mut reader2 = Buffer::new(&mut part, &mut r);


I need to redefine reader, otherwise I get

error[E0499]: cannot borrow `r` as mutable more than once at a time --> src/gz/bufread.rs:831:19 | 810 | let mut reader = Buffer::new(&mut part, &mut r); | ------ first mutable borrow occurs here ... 831 | let pos = r.position(); | ^ second mutable borrow occurs here ... 842 | match reader.read_once(&mut out) { | ------ first borrow later used here

I do not understand why pos a u64 is considered a mutable borrow

alexcrichton

I'm ok w/ whatever is necessary for tests, I don't mind too much about their idioms. I think read_once is still suffering the same partial-read bugs from before?

src/gz/bufread.rs

catenacyber · 2021-07-21T20:37:14Z

I think read_once is still suffering the same partial-read bugs from before?

The test should prove that read_once works at stage Fourth read : now succesful for 7 bytes
read_once calls read_exact which calls read multiple times, but each time the data gets buffered (for subsequent calls to read_once who did not finish)

alexcrichton · 2021-07-22T14:25:57Z

Hm ok there's some unusual stuff going on with Buffer type I don't remember at all, this will take a bit longer to page everything back in and try to understand...

catenacyber · 2021-07-22T15:04:30Z

Hm ok there's some unusual stuff going on with Buffer type I don't remember at all, this will take a bit longer to page everything back in and try to understand...

Yes Buffer's read just saves every successful read bytes from the subsequent Read into an internal buffer
And in some case, it begins to return these saved bytes, before reading more from the Read

alexcrichton · 2021-07-26T15:36:30Z

src/gz/bufread.rs

+        let rlen = buf.len();
+        self.crc.update(buf);
+        self.part.buf.truncate(0);
+        self.buf_max = 0;


Trying to read over this some more, I don't think I understand why this method needs to exist. I don't think that the manipulation of the fields of the structure here (e.g. the updates of buf_{cur,max} and the truncate are right. That seems like it's discarding data which might otherwise be succesfully read by subsequent calls to read_once.

This seems like callers of this function should be calling read_exact and then on success updating the crc manually for the header? Other than that I'm not sure why this function is needed?

The problem with read_exact is that we do not reset the buffer, leading to the quadratic complexity as we keep parsing all over and again.

What I want is
read_exact but buffer it there is not enough data (that was already ok), and clear the buffer if it succeeded

For example, the reader first provides 7 bytes, read_gz_header_part buffers these 7 bytes and returns ErrorKind::WouldBlock

Then a second call to read_gz_header_part, the reader provides 4 more bytes
There is a successful read of the 10 bytes header
As we have flag FEXTRA, we try to read a u16 but fail with ErrorKind::WouldBlock
The problem with the current code is that we have a 11 byte buffer instead of a 1 byte one for later calls to read_gz_header_part

By the way, I see a problem with my code that arc should belong to GzHeaderPartial as it lives longer than Buffer

Nah yeah I understand the quadratic part, and I forgot that this is where the fix for that is going. To fix the quadratic bits, though, I think that the .bytes() iterator can be replaced when reading the nul-terminated string with custom logic to empty the buffer. Basically when reading the C strings for the various fields I think that we should consume the bytes and not buffer them in the general intermediate buffer, but rather buffer them in the location where the field is read from (basically store them directly where they're gonna go).

Basically when reading the C strings for the various fields I think that we should consume the bytes and not buffer them in the general intermediate buffer, but rather buffer them in the location where the field is read from (basically store them directly where they're gonna go).

I did that, so Buffer::read looks at GzHeaderParsingState to decide where to buffer the data.

The point I do not like about this is that the logic is now split between read_gz_header_part and Buffer::read, ie we do not see in read_gz_header_part that we are filling the filename

alexcrichton · 2021-07-26T15:37:38Z

src/gz/bufread.rs

+                        r.part.header.filename = Some(vec);
+                    };
+                    if let Some(ref mut b) = r.part.header.filename {
+                        for byte in r.reader.bytes() {


I don't think that reaching into the reader here is also quite right since it bypasses the buffering that's otherwise happening in Buffer?

I also think this is bad style even if it is intentional as the buffering takes directly place in the GzHeader
Should I rather add a method read_until to the Buffer ?

I don't have a shape of code in mind, but there's a lot of confusing buffer-this-sometimes-dont logic going on here and there's very little documentation. I think something needs to be improved (code structure or similar) to make this more understandable.

I agree that this is not nice.
I just pushed something new.
The difficulty is r.bytes() is a mutable borrow and we want another mutable borrow for r.part.header.filename
I did not manage to have a mutable reference to r.part.header.filename in Buffer because of lifetime issues such as

error[E0623]: lifetime mismatch --> src/gz/bufread.rs:89:39 | 46 | fn read_gz_header_part<R: Read>(r: &mut Buffer<R>) -> io::Result<()> { | -------------- | | | these two types are declared with different lifetimes... ... 89 | r.direct_buffer = r.part.header.filename.as_ref(); | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ...but data from `r` flows into `r` here

So, in the end Buffer::read looks at the GzHeaderParsingState to decide where to buffer the data

Another (not so nice) pattern could be

r.direct = true; let mut bytebuf = [0; 1]; loop { r.read_exact(&mut bytebuf)?; if bytebuf[0] == 0 { r.direct = false; break; } if let Some(ref mut b) = r.part.header.filename { b.push(bytebuf[0]); } }

catenacyber · 2021-08-26T14:57:37Z

Friendly ping @alexcrichton
I changed Buffer's Read to use self.part.state to know which buffer to write into...

alexcrichton · 2021-08-27T19:29:21Z

Sorry this fell off my radar.

I thihnk this looks good for now though, thanks for the PR!

catenacyber · 2021-08-29T19:37:33Z

Thanks @alexcrichton
When would this be available in a release ?

alexcrichton · 2021-08-30T14:27:58Z

I'll go ahead and release now

catenacyber · 2021-08-30T20:10:34Z

Thanks Alex

catenacyber commented Jul 19, 2021

View reviewed changes

alexcrichton reviewed Jul 19, 2021

View reviewed changes

catenacyber added 4 commits July 19, 2021 17:42

fixup rustfmt

4325f0a

fixup GzHeaderParsingState

bb1cfc5

fixup read_once

e463973

Revert "fixup read_once"

7a29a23

This reverts commit e463973.

catenacyber added 2 commits July 21, 2021 11:07

fixup using GzHeader in GzHeaderPartial

bb87533

Fixup read_gz_header_part used by read_gz_header

159e374

fixup read_once and add test about it

63ca743

catenacyber commented Jul 21, 2021

View reviewed changes

alexcrichton reviewed Jul 21, 2021

View reviewed changes

src/gz/bufread.rs Show resolved Hide resolved

alexcrichton reviewed Jul 26, 2021

View reviewed changes

catenacyber added 3 commits July 27, 2021 11:44

fixup moce crc to GzHeaderPartial

348aa05

dummy lifetime

9499b48

It works

70b023e

catenacyber force-pushed the quadratic-gz-read branch from 2ec2c71 to 70b023e Compare July 30, 2021 20:22

alexcrichton merged commit 7212da8 into rust-lang:master Aug 27, 2021

jongiddy mentioned this pull request Nov 25, 2022

Add write::MultiGzDecoder for multi-member gzip data #325

Merged

Avoid quadratic complexity in GzDecoder #278

Avoid quadratic complexity in GzDecoder #278

Conversation

catenacyber commented Jul 19, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

catenacyber commented Jul 19, 2021

alexcrichton left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

catenacyber commented Jul 19, 2021

catenacyber commented Jul 19, 2021

alexcrichton commented Jul 20, 2021

catenacyber commented Jul 21, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alexcrichton left a comment

Choose a reason for hiding this comment

catenacyber commented Jul 21, 2021

alexcrichton commented Jul 22, 2021

catenacyber commented Jul 22, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

catenacyber commented Aug 26, 2021

alexcrichton commented Aug 27, 2021

catenacyber commented Aug 29, 2021

alexcrichton commented Aug 30, 2021

catenacyber commented Aug 30, 2021