Fix poll_read leaving the stream in a non-valid state #26
Conversation
…` API. This causes issues when trying to read the `ZipEntry` data using any function other than `read_to_end_crc` or `copy_to_end_crc`; in other words, anything that uses the `AsyncRead` trait to poll data (Tokio, Futures-rs, and other async readers).
…s not consumed before. This means that it does have the descriptor in the `Local file header`, but we still need to return the additional bytes read.
Is it even possible to read zip files with a data descriptor and no compressed length without using the central directory? You have no idea how long the file is, and the file could easily contain the signature bytes, so you could have false positives, and you'd have no real way to verify whether it's correct. Zip-rs just fails when you try to read a file with a data descriptor while streaming: https://github.com/zip-rs/zip/blob/master/src/read.rs#L1128. I just don't see how they can be read without a seekable reader and a parsed central directory that contains the length.
Yes, I think you are right (I was probably going too far on the "SHOULD" part of the documentation). We wouldn't know where to stop reading the zip data. Also, for Zip64, Compressed size and Uncompressed size will be 8 bytes each.

I've pushed some additional changes: to fix the compilation (to match the MSRV of the CI/CD, because I tested on Rust 1.62), and to remove the note about the Data Descriptor's absence; I don't know if those zip files would even be possible (maybe I was too tired when I wrote the notes). I may open a PR for full Zip64 support after this one, but I need to do more investigation (and study) before that.
If we can't stream things with a data descriptor, then do we even need to read the data descriptor? I've looked at zip-rs and a few C implementations, and they never seem to read it except for two exceptions: […] So in normal operation, nothing I've seen uses it.
I don't know if I understood it right, but I'm only considering the streaming scenario where the data is being transferred as it compresses: if the "Compressed" and "Uncompressed" fields are not set, the Data Descriptor must exist, otherwise we would not be able to decompress while downloading the Zip file (which is something I was trying to do in my project). If we have a seekable reader, we surely don't even need the Data Descriptor, even if the "Compressed" and "Uncompressed" fields are not set, because they will be present in the Central Directory. But if we are transferring as we compress the data, we would not be able to seek to the Central Directory, since we can't seek a TcpStream.
I think that is the reason for the Data Descriptor to exist: for cases where you don't have the Central Directory yet. I've found this issue: zip-rs/zip-old#16; maybe it's relevant to the discussion.
Sorry for the follow-up comment; however, I think it is relevant enough to keep separate.
After some research, I think I'm now able to answer this question: Deflate is self-terminating. You don't need to know where the file ends; you just need to feed it to the Deflate decoder, and it will stop where the data ends. From there (where Deflate left off), you can check for the Data Descriptor signature. So, in fact, the Data Descriptor may not exist, and we could be tolerant of this case; I just don't know exactly how to design the implementation to do so, I have only a vague idea.
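A rough sketch of that idea (my own illustration, not this crate's code; it assumes `flate2`'s `bufread::DeflateDecoder`, which should leave the bytes after the Deflate stream unconsumed in the inner reader):

```rust
use std::io::{Cursor, Read};
use flate2::bufread::DeflateDecoder;

const DESCRIPTOR_SIGNATURE: u32 = 0x0807_4b50;

fn read_entry(compressed_and_rest: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut decoder = DeflateDecoder::new(Cursor::new(compressed_and_rest));
    let mut data = Vec::new();
    // Deflate is self-terminating: this stops at the end of the stream and
    // leaves whatever follows it unconsumed in the inner reader.
    decoder.read_to_end(&mut data)?;

    let mut rest = decoder.into_inner();
    let mut next = [0u8; 4];
    rest.read_exact(&mut next)?;
    if u32::from_le_bytes(next) == DESCRIPTOR_SIGNATURE {
        // Data Descriptor present: crc-32 and the sizes follow.
    } else {
        // No signature: these 4 bytes may already be the crc-32, or the
        // next entry's local file header, so be tolerant here.
    }
    Ok(data)
}
```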
I did a quick search, and it seems like all supported compression algorithms are self-terminating, well, except for stored (no compression).
@mobad Do you think that fits this PR? I could make these changes to better handle the Data Descriptor; I just need to figure out how the current logic works, but I don't think it would be hard to do.
Not really; I think what you're doing makes sense and is probably the best way of doing it. The only alternative I can think of is doing this in `entry_reader` and consuming the data descriptor of the previous entry, but that has its own problems.
We are here to improve, so any suggestion is welcome, and I'm also very uncomfortable with the fact that if the descriptor magic number appears in the compressed data stream, it may be confused with a Data Descriptor and break the process.
The only problem I can think of is that you would not be able to do the CRC32 check, because it would only be available when you start reading the next entry, which doesn't make sense. But it also bothers me that, if the Data Descriptor is invalid, the entry read process fails, even if the compressed data is valid and correctly decompressed. I can think of some solutions, like lazily reading the Data Descriptor when trying to do a CRC32 check or retrieve the compressed/uncompressed sizes. However, since you can't start reading the next entry if you don't read the Data Descriptor, we would also need to consume the Data Descriptor if the …
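To make the lazy-reading idea concrete, here is a hypothetical shape it could take (types, fields, and methods here are mine, not the crate's; Zip64 and the optional signature are ignored for brevity):

```rust
// Hypothetical "lazy descriptor" entry: the descriptor is only pulled from
// the stream when something actually needs it, then cached.
struct Entry<R> {
    reader: R,
    descriptor: Option<DataDescriptor>,
}

struct DataDescriptor {
    crc32: u32,
    compressed: u64,
    uncompressed: u64,
}

impl<R: std::io::Read> Entry<R> {
    // Returning a Result instead of panicking also covers the case where
    // the descriptor turns out to be absent or malformed.
    fn compare_crc(&mut self, computed: u32) -> std::io::Result<bool> {
        if self.descriptor.is_none() {
            self.descriptor = Some(self.read_descriptor()?);
        }
        Ok(self.descriptor.as_ref().unwrap().crc32 == computed)
    }

    fn read_descriptor(&mut self) -> std::io::Result<DataDescriptor> {
        // 12 bytes: crc-32 + compressed size + uncompressed size (non-Zip64,
        // no signature); real code must handle the other layouts too.
        let mut buf = [0u8; 12];
        self.reader.read_exact(&mut buf)?;
        Ok(DataDescriptor {
            crc32: u32::from_le_bytes(buf[0..4].try_into().unwrap()),
            compressed: u32::from_le_bytes(buf[4..8].try_into().unwrap()) as u64,
            uncompressed: u32::from_le_bytes(buf[8..12].try_into().unwrap()) as u64,
        })
    }
}
```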
Branch updated from e91bf88 to 36a4927.
Update: I've synced the PR with the main branch. (Also accidentally pushed WIP changes around removing …
I'm unsure about the direction I want this to be taken in as well; it's an interesting one. I want to be as spec-compliant as possible, but there doesn't really seem to be an optimal solution that doesn't involve some sort of sacrifice. Either way, I'll close this and have a look at the other PR, given it's based on this branch. :)
Trying to read the `stream::ZipEntry` with the regular async `poll_read` API causes the stream to be left in a non-valid state.

Reason
That's caused mainly because we use fixed buffer sizes (like Tokio's 8192-byte buffer), and we don't know beforehand how many bytes an entry has (or will have) to set the read boundary, so we read not only the data of the current entry but also the data of the next entry. That in itself is not a problem: `AsyncDelimiterReader` provides access to the inner buffer, so we can just take the exceeding bytes and prepend them using `AsyncPrependReader`.

The real problem is that the function that does this is `reset_reader`, and it must be called before trying to read the next entry, so that the bytes stolen from the next `Entry` return to the inner `Reader` as non-consumed bytes. The other problem is that this function uses the `async` mechanism, which cannot be used inside `poll_*` functions (not only because they are not `async`, but also because they require pinning).
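As a side note, a minimal synchronous sketch of that prepend idea (the real `AsyncPrependReader` is async; this simplified version and its names are illustrative only):

```rust
use std::io::{self, Read};

// Simplified, synchronous illustration of the prepend idea: bytes that were
// over-read from the next entry are handed back and served before the inner
// reader is touched again.
struct PrependReader<R> {
    prepended: Vec<u8>,
    pos: usize,
    inner: R,
}

impl<R: Read> PrependReader<R> {
    fn new(inner: R) -> Self {
        Self { prepended: Vec::new(), pos: 0, inner }
    }

    // Give back bytes that actually belong to the next entry.
    fn prepend(&mut self, bytes: &[u8]) {
        let mut fresh = bytes.to_vec();
        fresh.extend_from_slice(&self.prepended[self.pos..]);
        self.prepended = fresh;
        self.pos = 0;
    }
}

impl<R: Read> Read for PrependReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        // Serve previously prepended bytes first, then fall through.
        if self.pos < self.prepended.len() {
            let n = (self.prepended.len() - self.pos).min(buf.len());
            buf[..n].copy_from_slice(&self.prepended[self.pos..self.pos + n]);
            self.pos += n;
            return Ok(n);
        }
        self.inner.read(buf)
    }
}
```

The async version has to do the same bookkeeping, but from inside `poll_read`, which is what the state machine described below is for.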
Solution

The entire `reset_reader` function was removed; `read_to_end_crc` and `read_to_string_crc` do not call it anymore. The entire mechanism was moved to `poll_data_descriptor`, so any caller that uses the `AsyncRead` API automatically triggers the whole logic of consuming the Data Descriptor bytes and returning the stolen bytes to the inner `Reader`.

The implementation is "a bit" more complex (actually, considerably more complex IMO): it uses a state machine to track the state the last `Poll::Pending` left, and the data already consumed. There are 3 states (sketched below):

- `ReadData`
- `ReadDescriptor`: we need to consume the next bytes to construct the Data Descriptor and return the stolen bytes.
- `PrepareNext`: without it, the code would need more scope delimiters, which I had before, but it was too messy to keep.
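For illustration, here is roughly the shape that state machine takes (a sketch of mine, not the PR's exact code; the field names and the 12-byte placeholder are assumptions):

```rust
use std::task::Poll;

// Each `Poll::Pending` return must leave enough state behind to resume
// exactly where the previous poll stopped.
enum State {
    // Streaming the entry's data to the caller.
    ReadData,
    // Delimiter hit: accumulate descriptor bytes across wakeups.
    ReadDescriptor { consumed: Vec<u8> },
    // Hand the bytes stolen from the next entry back to the inner reader.
    PrepareNext,
}

fn poll_step(state: &mut State) -> Poll<()> {
    match state {
        State::ReadData => {
            // ...copy data out; once the delimiter is found, transition.
            *state = State::ReadDescriptor { consumed: Vec::new() };
            Poll::Pending
        }
        State::ReadDescriptor { consumed } => {
            // `consumed` survives every Poll::Pending in between; 12 bytes
            // is a placeholder for the non-Zip64, unsigned descriptor.
            if consumed.len() >= 12 {
                *state = State::PrepareNext;
            }
            Poll::Pending
        }
        State::PrepareNext => {
            // ...prepend the excess bytes; the entry is fully consumed.
            Poll::Ready(())
        }
    }
}
```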
Considerations
During the implementation of the fix, I found some issues:
- For Zip64, the Data Descriptor's `compressed` and `uncompressed` fields are 8 bytes each, but the current implementation is using `u32` (4 bytes) to store those values (see the sketch after this list).
- The Data Descriptor signature is only a recommendation ("SHOULD") in the spec: the signature bytes are not really required, and surely there are some wild Zips which do not follow the recommendation. It would be good to be more tolerant of these cases, but I don't know if it's worth it (although the effort to handle this is very low).
- `compare_crc` may panic. I don't know if it still does, but it did before the fix (because of the Data Descriptor absence). Maybe it should return a `Result<bool>`, mainly if there's a plan to be tolerant of the absence of the Data Descriptor. Also, it is always a good idea for APIs to never panic: if they panic inside a thread managed by a runtime like Tokio, it's not a big deal, but if they panic in event-loop-based implementations, or in anything that runs on the main thread, it would be very dangerous.
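On that first point, a minimal hypothetical sketch of size parsing that tolerates both widths (this helper is mine, not the crate's API; it assumes the caller already knows whether the entry is Zip64):

```rust
// The descriptor's size fields are 8 bytes each in Zip64 archives and
// 4 bytes otherwise, so storing them as u64 covers both cases.
fn parse_descriptor_sizes(buf: &[u8], zip64: bool) -> (u64, u64) {
    if zip64 {
        (
            u64::from_le_bytes(buf[0..8].try_into().unwrap()),
            u64::from_le_bytes(buf[8..16].try_into().unwrap()),
        )
    } else {
        (
            u32::from_le_bytes(buf[0..4].try_into().unwrap()) as u64,
            u32::from_le_bytes(buf[4..8].try_into().unwrap()) as u64,
        )
    }
}
```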
Final words
That's my first contribution to this repo (actually, my first contribution to an open-source project after a while); any suggestions/criticism are welcome.