-
Notifications
You must be signed in to change notification settings - Fork 163
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid quadratic complexity in GzDecoder #278
Avoid quadratic complexity in GzDecoder #278
Conversation
Quadratic complexity happens when the underlying reader supplies bytes one at a time and would block. And if header flag FNAME is set and we never have a null byte to end the filename, read_gz_header keeps reading all of the partial filename and computing the CRC... The solution is to keep a more complex structure than a simple buffer for what has already been parsed in the header (as is done in C zlib)
let mut reader = Buffer::new(&mut buf, reader.get_mut().get_mut()); | ||
read_gz_header(&mut reader) | ||
let mut reader = Buffer::new(&mut part, reader.get_mut().get_mut()); | ||
read_gz_header_part(&mut reader) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is src/gz/write.rs using read_gz_header
?
I guess we should have only one function instead of both read_gz_header
and read_gz_header_part
Ok(hdr) => { | ||
header = Some(hdr); | ||
Ok(()) => { | ||
header = Some(part.take_header()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
read_gz_header_part
cannot return directly the GzHeader
structure because it would move part of the GzHeaderPartial
which is behind a mutable reference.
Thus, the need to break in 2 phases :
1 return Ok
2 take ownership of the partial header fields
src/gz/bufread.rs
Outdated
self.buf_cur += len; | ||
Ok(len) | ||
} | ||
} | ||
} | ||
|
||
impl<'a, T> Buffer<'a, T> where T: std::io::Read { | ||
// If we manage to read all the bytes, we reset the buffer | ||
fn read_once(&mut self, buf: &mut [u8]) -> io::Result<usize> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not see any good fit in neither BufRead
nor Read
for this...
What I want is
read_exact
but buffer it there is not enough data, and clear the buffer if it succeeded
Is there some more idiomatic rust way ?
src/gz/bufread.rs
Outdated
} | ||
GzHeaderParsingState::GzHeaderParsingFilename => { | ||
if r.part.flg & FNAME != 0 { | ||
for byte in r.reader.bytes() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not find any more idiomatic way to do this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe something like r.part.filename = r.read_until_zero()?;
could be done ?
src/gz/bufread.rs
Outdated
fn read_once(&mut self, buf: &mut [u8]) -> io::Result<usize> { | ||
self.read_exact(buf)?; | ||
let rlen = buf.len(); | ||
self.crc.update(buf); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not find a way to chain with CrcReader because of this read_once
logic
I guess there should be some additions to the test framework as well. Like testing for some inputs that we get the same result if the reader is supplying bytes one at a time and we call multiple times |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Can you explain more about the read_once
aspect? I'm not sure that the internal usage of read_exact
will work well with async I/O where it may fail to make progress but may make progress later. It looks like in-progress bytes may be lost?
Overall this looks pretty reasonable but I think the byte buffering could use some tweaks and/or documentation improvements.
src/gz/bufread.rs
Outdated
mtime: u32, | ||
extra: Option<Vec<u8>>, | ||
filename: Vec<u8>, | ||
comment: Vec<u8>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this perhaps internally store a GzHeader
and then it just appends a buf
and state
field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, will try
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, we need at least flg
and xlen
in GzHeaderPartial
So, should I still try to put a GzHeader
in GzHeaderPartial
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know without doing it myself, but I'm trying to ideally reduce duplication here instead of listing the fields twice in a number of places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, doing so
It is possible I missed some points. |
This reverts commit e463973.
I think |
I don't think |
You were right. I added a test, and fixed |
|
||
//a cursor turning EOF into blocking errors | ||
#[derive(Debug)] | ||
pub struct BlockingCursor { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not know if there is such a structure ready in rust implementing Read
and where we can specify at some points that we received more data
r.set_position(pos); | ||
|
||
// Third read : still incomplete for 7 bytes (we have 5) | ||
let mut reader2 = Buffer::new(&mut part, &mut r); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to redefine reader
, otherwise I get
error[E0499]: cannot borrow `r` as mutable more than once at a time
--> src/gz/bufread.rs:831:19
|
810 | let mut reader = Buffer::new(&mut part, &mut r);
| ------ first mutable borrow occurs here
...
831 | let pos = r.position();
| ^ second mutable borrow occurs here
...
842 | match reader.read_once(&mut out) {
| ------ first borrow later used here
I do not understand why pos
a u64
is considered a mutable borrow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ok w/ whatever is necessary for tests, I don't mind too much about their idioms. I think read_once
is still suffering the same partial-read bugs from before?
The test should prove that |
Hm ok there's some unusual stuff going on with |
Yes |
let rlen = buf.len(); | ||
self.crc.update(buf); | ||
self.part.buf.truncate(0); | ||
self.buf_max = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trying to read over this some more, I don't think I understand why this method needs to exist. I don't think that the manipulation of the fields of the structure here (e.g. the updates of buf_{cur,max}
and the truncate
are right. That seems like it's discarding data which might otherwise be succesfully read by subsequent calls to read_once
.
This seems like callers of this function should be calling read_exact
and then on success updating the crc manually for the header? Other than that I'm not sure why this function is needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with read_exact
is that we do not reset the buffer, leading to the quadratic complexity as we keep parsing all over and again.
What I want is
read_exact
but buffer it there is not enough data (that was already ok), and clear the buffer if it succeeded
For example, the reader first provides 7 bytes, read_gz_header_part
buffers these 7 bytes and returns ErrorKind::WouldBlock
Then a second call to read_gz_header_part
, the reader provides 4 more bytes
There is a successful read of the 10 bytes header
As we have flag FEXTRA
, we try to read a u16
but fail with ErrorKind::WouldBlock
The problem with the current code is that we have a 11 byte buffer instead of a 1 byte one for later calls to read_gz_header_part
By the way, I see a problem with my code that arc should belong to GzHeaderPartial
as it lives longer than Buffer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nah yeah I understand the quadratic part, and I forgot that this is where the fix for that is going. To fix the quadratic bits, though, I think that the .bytes()
iterator can be replaced when reading the nul-terminated string with custom logic to empty the buffer. Basically when reading the C strings for the various fields I think that we should consume the bytes and not buffer them in the general intermediate buffer, but rather buffer them in the location where the field is read from (basically store them directly where they're gonna go).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically when reading the C strings for the various fields I think that we should consume the bytes and not buffer them in the general intermediate buffer, but rather buffer them in the location where the field is read from (basically store them directly where they're gonna go).
I did that, so Buffer::read
looks at GzHeaderParsingState to decide where to buffer the data.
The point I do not like about this is that the logic is now split between read_gz_header_part
and Buffer::read
, ie we do not see in read_gz_header_part
that we are filling the filename
src/gz/bufread.rs
Outdated
r.part.header.filename = Some(vec); | ||
}; | ||
if let Some(ref mut b) = r.part.header.filename { | ||
for byte in r.reader.bytes() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that reaching into the reader
here is also quite right since it bypasses the buffering that's otherwise happening in Buffer
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think this is bad style even if it is intentional as the buffering takes directly place in the GzHeader
Should I rather add a method read_until
to the Buffer
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a shape of code in mind, but there's a lot of confusing buffer-this-sometimes-dont logic going on here and there's very little documentation. I think something needs to be improved (code structure or similar) to make this more understandable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that this is not nice.
I just pushed something new.
The difficulty is r.bytes()
is a mutable borrow and we want another mutable borrow for r.part.header.filename
I did not manage to have a mutable reference to r.part.header.filename
in Buffer
because of lifetime issues such as
error[E0623]: lifetime mismatch
--> src/gz/bufread.rs:89:39
|
46 | fn read_gz_header_part<R: Read>(r: &mut Buffer<R>) -> io::Result<()> {
| --------------
| |
| these two types are declared with different lifetimes...
...
89 | r.direct_buffer = r.part.header.filename.as_ref();
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ...but data from `r` flows into `r` here
So, in the end Buffer::read
looks at the GzHeaderParsingState
to decide where to buffer the data
Another (not so nice) pattern could be
r.direct = true;
let mut bytebuf = [0; 1];
loop {
r.read_exact(&mut bytebuf)?;
if bytebuf[0] == 0 {
r.direct = false;
break;
}
if let Some(ref mut b) = r.part.header.filename {
b.push(bytebuf[0]);
}
}
2ec2c71
to
70b023e
Compare
Friendly ping @alexcrichton |
Sorry this fell off my radar. I thihnk this looks good for now though, thanks for the PR! |
Thanks @alexcrichton |
I'll go ahead and release now |
Thanks Alex |
Quadratic complexity happens when the underlying reader
supplies bytes one at a time and would block.
And if header flag FNAME is set and we never have a null
byte to end the filename, read_gz_header keeps reading
all of the partial filename and computing the CRC...
The solution is to keep a more complex structure than a
simple buffer for what has already been parsed in the
header (as is done in C zlib)
@alexcrichton I will put some comments in the code for your review