
Avoid quadratic complexity in GzDecoder #278

Merged: 11 commits into rust-lang:master on Aug 27, 2021

Conversation

@catenacyber (Contributor)

Quadratic complexity happens when the underlying reader
supplies bytes one at a time and would block.
If header flag FNAME is set and we never see a null
byte ending the filename, read_gz_header keeps re-reading
the whole partial filename and recomputing the CRC...

The solution is to keep a more complex structure than a
simple buffer for what has already been parsed in the
header (as is done in C zlib).
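
For illustration only, a toy model of the problem (not the crate's actual code): if each retry after a WouldBlock restarts parsing from the first byte, the total work grows quadratically in the length of the pending field.

    use std::io::{self, ErrorKind};

    // Toy model: each retry re-parses everything seen so far. With one new
    // byte arriving per retry, retry k scans (and would re-CRC) k bytes,
    // so n retries cost 1 + 2 + ... + n = n(n+1)/2 operations: O(n^2).
    fn parse_filename_from_scratch(seen: &[u8]) -> io::Result<usize> {
        match seen.iter().position(|&b| b == 0) {
            Some(end) => Ok(end),                      // found the NUL terminator
            None => Err(ErrorKind::WouldBlock.into()), // need more bytes
        }
    }

    fn main() -> io::Result<()> {
        let mut seen = Vec::new();
        let incoming = b"a-rather-long-file-name\0";
        let mut work = 0usize;
        for &b in incoming {
            seen.push(b);       // one more byte becomes available
            work += seen.len(); // the whole prefix is scanned again
            if let Ok(len) = parse_filename_from_scratch(&seen) {
                println!("filename length {}, total bytes touched {}", len, work);
                break;
            }
        }
        Ok(())
    }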

@alexcrichton I will put some comments in the code for your review

let mut reader = Buffer::new(&mut buf, reader.get_mut().get_mut());
read_gz_header(&mut reader)
let mut reader = Buffer::new(&mut part, reader.get_mut().get_mut());
read_gz_header_part(&mut reader)
@catenacyber (Contributor Author)

Why is src/gz/write.rs using read_gz_header?
I guess we should have only one function instead of both read_gz_header and read_gz_header_part.

Ok(hdr) => {
header = Some(hdr);
Ok(()) => {
header = Some(part.take_header());
@catenacyber (Contributor Author)

read_gz_header_part cannot return the GzHeader structure directly because that would move part of the GzHeaderPartial, which is behind a mutable reference.

Hence the need to split this into two phases:
1. return Ok
2. take ownership of the partial header fields
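
As an illustrative sketch of that two-phase pattern (simplified stand-ins for GzHeader and GzHeaderPartial, not the actual types), the finished header can be moved out of the partial state with std::mem::take once parsing has returned Ok:

    use std::mem;

    #[derive(Default)]
    struct Header {
        filename: Option<Vec<u8>>,
        comment: Option<Vec<u8>>,
    }

    #[derive(Default)]
    struct HeaderPartial {
        header: Header,
        // ...parsing state and the intermediate buffer would live here...
    }

    impl HeaderPartial {
        // Phase 2: after the parse function has returned Ok(()), move the
        // finished header out of the partial state without needing to move
        // out from behind the `&mut` reference.
        fn take_header(&mut self) -> Header {
            mem::take(&mut self.header)
        }
    }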

self.buf_cur += len;
Ok(len)
}
}
}

impl<'a, T> Buffer<'a, T> where T: std::io::Read {
// If we manage to read all the bytes, we reset the buffer
fn read_once(&mut self, buf: &mut [u8]) -> io::Result<usize> {
@catenacyber (Contributor Author)

I did not see a good fit in either BufRead or Read for this...

What I want is read_exact, but buffering if there is not enough data, and clearing the buffer if it succeeded.

Is there a more idiomatic Rust way?

}
GzHeaderParsingState::GzHeaderParsingFilename => {
if r.part.flg & FNAME != 0 {
for byte in r.reader.bytes() {
@catenacyber (Contributor Author)

I did not find any more idiomatic way to do this

@catenacyber (Contributor Author)

Maybe something like r.part.filename = r.read_until_zero()?; could be done?

fn read_once(&mut self, buf: &mut [u8]) -> io::Result<usize> {
self.read_exact(buf)?;
let rlen = buf.len();
self.crc.update(buf);
@catenacyber (Contributor Author)

I did not find a way to chain with CrcReader because of this read_once logic

@catenacyber (Contributor Author)

I guess there should be some additions to the test framework as well, like testing, for some inputs, that we get the same result when the reader supplies bytes one at a time and we call read multiple times, handling io::ErrorKind::WouldBlock by making one more byte available...
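
A sketch of that kind of test driver (names hypothetical; reveal_one_more_byte stands for whatever mechanism the test uses to make more input available):

    use std::io::{self, Read};

    // Keep calling `read`; whenever the decoder reports WouldBlock, make one
    // more input byte available and try again. The result can then be compared
    // with decoding the same input in a single pass.
    fn read_with_trickle<R: Read, F: FnMut()>(
        decoder: &mut R,
        mut reveal_one_more_byte: F,
    ) -> io::Result<Vec<u8>> {
        let mut out = Vec::new();
        let mut chunk = [0u8; 64];
        loop {
            match decoder.read(&mut chunk) {
                Ok(0) => return Ok(out),
                Ok(n) => out.extend_from_slice(&chunk[..n]),
                Err(ref e) if e.kind() == io::ErrorKind::WouldBlock => reveal_one_more_byte(),
                Err(e) => return Err(e),
            }
        }
    }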

@alexcrichton (Member) left a comment

Thanks!

Can you explain more about the read_once aspect? I'm not sure that the internal usage of read_exact will work well with async I/O where it may fail to make progress but may make progress later. It looks like in-progress bytes may be lost?

Overall this looks pretty reasonable but I think the byte buffering could use some tweaks and/or documentation improvements.

mtime: u32,
extra: Option<Vec<u8>>,
filename: Vec<u8>,
comment: Vec<u8>,
@alexcrichton (Member)

Could this perhaps internally store a GzHeader and then it just appends a buf and state field?

@catenacyber (Contributor Author)

Good idea, will try

@catenacyber (Contributor Author)

Ah, we need at least flg and xlen in GzHeaderPartial.
So, should I still try to put a GzHeader in GzHeaderPartial?

@alexcrichton (Member)

I don't know without doing it myself, but ideally I'm trying to reduce duplication here instead of listing the fields twice in a number of places.

@catenacyber (Contributor Author)

Right, doing so
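
Roughly the shape being discussed, as an illustrative sketch (simplified names; the real GzHeader and GzHeaderPartial live in the crate):

    // The partial-parse state embeds the header being built instead of
    // duplicating all of its fields, and keeps only the extra raw values
    // still needed while parsing.
    struct HeaderSketch {
        mtime: u32,
        extra: Option<Vec<u8>>,
        filename: Option<Vec<u8>>,
        comment: Option<Vec<u8>>,
    }

    enum ParseState {
        Fixed,    // magic, flags, mtime, xfl, os
        ExtraLen, // the two xlen bytes (FEXTRA)
        Extra,
        Filename,
        Comment,
        Crc,
        Done,
    }

    struct HeaderPartialSketch {
        header: HeaderSketch, // the GzHeader-like value being filled in
        flg: u8,              // raw flag byte, still needed while parsing
        xlen: u16,            // FEXTRA length, still needed while parsing
        buf: Vec<u8>,         // bytes of the current fixed-size piece read so far
        state: ParseState,    // how far parsing has progressed
    }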

@catenacyber (Contributor Author)

Can you explain more about the read_once aspect? I'm not sure that the internal usage of read_exact will work well with async I/O where it may fail to make progress but may make progress later. It looks like in-progress bytes may be lost?

It is possible I missed some points.
What I want read_once to do is to try to read exactly as many bytes as the buffer holds, and if there are not enough, save what was read into the structure's buffer...
Looking at it again, it does not seem to do that yet... Will try something else.

@catenacyber (Contributor Author)

I think read_once was correct because it is used with Buffer rather than a generic Read; Buffer just keeps buffering the read data (until the buffer is reset/truncated in read_once).

@alexcrichton (Member)

I don't think read_once is working as intended. The read_exact method you're calling may call read multiple times and if it ends in an error then it discards all previously read data and doesn't return how much was read.

@catenacyber (Contributor Author)

I don't think read_once is working as intended.

You were right.

I added a test and fixed read_once accordingly.


//a cursor turning EOF into blocking errors
#[derive(Debug)]
pub struct BlockingCursor {
@catenacyber (Contributor Author)

I do not know whether Rust already provides such a structure: one that implements Read and lets us specify, at certain points, that more data has been received.
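
A minimal sketch of such a helper (not necessarily the PR's exact implementation): wrap a Cursor and report WouldBlock instead of end-of-data, so a test can push more bytes and retry.

    use std::io::{self, Cursor, Read};

    // Behaves like Cursor<Vec<u8>>, except that reaching the end of the
    // currently available data yields WouldBlock instead of Ok(0), so the
    // caller can supply more bytes later and read again.
    #[derive(Debug)]
    pub struct BlockingCursorSketch {
        cursor: Cursor<Vec<u8>>,
    }

    impl BlockingCursorSketch {
        pub fn new(data: Vec<u8>) -> Self {
            BlockingCursorSketch { cursor: Cursor::new(data) }
        }

        // Make one more byte of input available for subsequent reads.
        pub fn push(&mut self, byte: u8) {
            self.cursor.get_mut().push(byte);
        }
    }

    impl Read for BlockingCursorSketch {
        fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
            match self.cursor.read(buf) {
                Ok(0) if !buf.is_empty() => Err(io::ErrorKind::WouldBlock.into()),
                other => other,
            }
        }
    }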

r.set_position(pos);

// Third read : still incomplete for 7 bytes (we have 5)
let mut reader2 = Buffer::new(&mut part, &mut r);
@catenacyber (Contributor Author)

I need to redefine reader, otherwise I get:

error[E0499]: cannot borrow `r` as mutable more than once at a time
   --> src/gz/bufread.rs:831:19
    |
810 |         let mut reader = Buffer::new(&mut part, &mut r);
    |                                                 ------ first mutable borrow occurs here
...
831 |         let pos = r.position();
    |                   ^ second mutable borrow occurs here
...
842 |         match reader.read_once(&mut out) {
    |               ------ first borrow later used here

I do not understand why pos, a u64, is considered a mutable borrow.

@alexcrichton (Member) left a comment

I'm ok w/ whatever is necessary for tests, I don't mind too much about their idioms. I think read_once is still suffering the same partial-read bugs from before?

@catenacyber (Contributor Author)

I think read_once is still suffering the same partial-read bugs from before?

The test should prove that read_once works at the stage "Fourth read: now successful for 7 bytes".
read_once calls read_exact, which calls read multiple times, but each time the data gets buffered (for subsequent calls to read_once that did not finish).

@alexcrichton (Member)

Hm ok there's some unusual stuff going on with Buffer type I don't remember at all, this will take a bit longer to page everything back in and try to understand...

@catenacyber (Contributor Author)

Hm ok there's some unusual stuff going on with Buffer type I don't remember at all, this will take a bit longer to page everything back in and try to understand...

Yes, Buffer's read just saves every successfully read byte from the underlying Read into an internal buffer.
And in some cases, it begins by returning these saved bytes before reading more from the Read.
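
A simplified sketch of that behavior (hypothetical field names, not the PR's exact code): read replays already-saved bytes first, then pulls fresh bytes from the inner reader and records them, so an attempt interrupted by WouldBlock can be retried from the same data; read_once only discards the saved bytes once a full read_exact has succeeded.

    use std::io::{self, Read};

    struct ReplayBuffer<R> {
        inner: R,
        saved: Vec<u8>, // every byte pulled from `inner` for the current item
        cur: usize,     // how much of `saved` has been replayed in this attempt
    }

    impl<R: Read> Read for ReplayBuffer<R> {
        fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
            // First replay bytes saved by earlier, incomplete attempts.
            if self.cur < self.saved.len() {
                let n = buf.len().min(self.saved.len() - self.cur);
                buf[..n].copy_from_slice(&self.saved[self.cur..self.cur + n]);
                self.cur += n;
                return Ok(n);
            }
            // Then read fresh bytes and remember them, in case this attempt
            // ends in WouldBlock and has to be retried later.
            let n = self.inner.read(buf)?;
            self.saved.extend_from_slice(&buf[..n]);
            self.cur += n;
            Ok(n)
        }
    }

    impl<R: Read> ReplayBuffer<R> {
        // Like read_exact, but on success drop the saved bytes so the next
        // item starts with an empty buffer; on WouldBlock the bytes stay
        // saved for the next attempt.
        fn read_once(&mut self, buf: &mut [u8]) -> io::Result<()> {
            self.cur = 0; // replay this item from the beginning
            self.read_exact(buf)?;
            self.saved.clear();
            self.cur = 0;
            Ok(())
        }
    }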

let rlen = buf.len();
self.crc.update(buf);
self.part.buf.truncate(0);
self.buf_max = 0;
@alexcrichton (Member)

Trying to read over this some more, I don't think I understand why this method needs to exist. I don't think that the manipulation of the structure's fields here (e.g. the updates of buf_{cur,max} and the truncate) is right. That seems like it's discarding data which might otherwise be successfully read by subsequent calls to read_once.

It seems like callers of this function should be calling read_exact and then, on success, updating the crc manually for the header? Other than that I'm not sure why this function is needed?

@catenacyber (Contributor Author)

The problem with read_exact is that we do not reset the buffer, leading to the quadratic complexity as we keep re-parsing the same bytes over and over.

What I want is read_exact, but buffering if there is not enough data (that part was already ok), and clearing the buffer if it succeeded.

For example, the reader first provides 7 bytes, read_gz_header_part buffers these 7 bytes and returns ErrorKind::WouldBlock.

Then, on a second call to read_gz_header_part, the reader provides 4 more bytes.
There is a successful read of the 10-byte header.
As we have flag FEXTRA, we try to read a u16 but fail with ErrorKind::WouldBlock.
The problem with the current code is that we are left with an 11-byte buffer instead of a 1-byte one for later calls to read_gz_header_part.

By the way, I see a problem with my code: crc should belong to GzHeaderPartial, as it lives longer than Buffer.

@alexcrichton (Member)

Nah yeah I understand the quadratic part, and I forgot that this is where the fix for that is going. To fix the quadratic bits, though, I think that the .bytes() iterator can be replaced when reading the nul-terminated string with custom logic to empty the buffer. Basically when reading the C strings for the various fields I think that we should consume the bytes and not buffer them in the general intermediate buffer, but rather buffer them in the location where the field is read from (basically store them directly where they're gonna go).
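
An illustrative sketch of that suggestion, assuming the reader exposes BufRead (the PR's Buffer only implements Read, so this is a simplification): drain the NUL-terminated field straight into its destination vector instead of the generic scratch buffer.

    use std::io::{self, BufRead};

    // read_until appends to `dst` and consumes those bytes from the reader,
    // so bytes seen before a WouldBlock stay in `dst` and are not re-read
    // on the next attempt.
    fn read_c_string<R: BufRead>(r: &mut R, dst: &mut Vec<u8>) -> io::Result<()> {
        r.read_until(0, dst)?; // may return WouldBlock mid-field
        if dst.last() == Some(&0) {
            dst.pop(); // drop the terminating NUL
            Ok(())
        } else {
            Err(io::ErrorKind::UnexpectedEof.into())
        }
    }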

@catenacyber (Contributor Author)

Basically when reading the C strings for the various fields I think that we should consume the bytes and not buffer them in the general intermediate buffer, but rather buffer them in the location where the field is read from (basically store them directly where they're gonna go).

I did that, so Buffer::read looks at GzHeaderParsingState to decide where to buffer the data.

The point I do not like about this is that the logic is now split between read_gz_header_part and Buffer::read, i.e. we do not see in read_gz_header_part that we are filling the filename.

r.part.header.filename = Some(vec);
};
if let Some(ref mut b) = r.part.header.filename {
for byte in r.reader.bytes() {
@alexcrichton (Member)

I don't think that reaching into the reader here is quite right, since it bypasses the buffering that's otherwise happening in Buffer?

@catenacyber (Contributor Author)

I also think this is bad style, even if it is intentional, as the buffering takes place directly in the GzHeader.
Should I rather add a read_until method to Buffer?

@alexcrichton (Member)

I don't have a shape of code in mind, but there's a lot of confusing buffer-this-sometimes-don't logic going on here and there's very little documentation. I think something needs to be improved (code structure or similar) to make this more understandable.

@catenacyber (Contributor Author)

I agree that this is not nice.
I just pushed something new.
The difficulty is that r.bytes() is a mutable borrow and we want another mutable borrow for r.part.header.filename.
I did not manage to have a mutable reference to r.part.header.filename in Buffer because of lifetime issues such as:

error[E0623]: lifetime mismatch
  --> src/gz/bufread.rs:89:39
   |
46 | fn read_gz_header_part<R: Read>(r: &mut Buffer<R>) -> io::Result<()> {
   |                                    --------------
   |                                    |
   |                                    these two types are declared with different lifetimes...
...
89 |                     r.direct_buffer = r.part.header.filename.as_ref();
   |                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ...but data from `r` flows into `r` here

So, in the end, Buffer::read looks at the GzHeaderParsingState to decide where to buffer the data.

Another (not so nice) pattern could be:

                    // switch Buffer into "direct" mode and read the filename one byte at a time
                    r.direct = true;
                    let mut bytebuf = [0; 1];
                    loop {
                        r.read_exact(&mut bytebuf)?;
                        if bytebuf[0] == 0 {
                            // NUL terminator: the filename is complete
                            r.direct = false;
                            break;
                        }
                        if let Some(ref mut b) = r.part.header.filename {
                            b.push(bytebuf[0]);
                        }
                    }

@catenacyber (Contributor Author)

Friendly ping @alexcrichton
I changed Buffer's Read to use self.part.state to know which buffer to write into...
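
A small sketch of that dispatch (types and names are illustrative, not the PR's exact code): depending on the current parsing state, freshly read bytes are recorded either in the generic scratch buffer or directly in the destination field, so the variable-length fields are consumed once and never re-parsed.

    enum WriteTarget {
        Scratch,
        Filename,
        Comment,
    }

    fn record(
        target: &WriteTarget,
        filename: &mut Vec<u8>,
        comment: &mut Vec<u8>,
        scratch: &mut Vec<u8>,
        fresh: &[u8],
    ) {
        match target {
            // Variable-length C strings go straight to their destination.
            WriteTarget::Filename => filename.extend_from_slice(fresh),
            WriteTarget::Comment => comment.extend_from_slice(fresh),
            // Fixed-size pieces stay in the scratch buffer so an interrupted
            // read can be replayed from the start of that piece.
            WriteTarget::Scratch => scratch.extend_from_slice(fresh),
        }
    }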

@alexcrichton (Member)

Sorry this fell off my radar.

I think this looks good for now though, thanks for the PR!

@alexcrichton merged commit 7212da8 into rust-lang:master on Aug 27, 2021
@catenacyber (Contributor Author)

Thanks @alexcrichton
When would this be available in a release?

@alexcrichton (Member)

I'll go ahead and release now

@catenacyber (Contributor Author)

Thanks Alex
