
Fix poll_read leaving the stream in a non-valid state #26

Closed


JonathanxD
Contributor

Trying to read the stream::ZipEntry with the regular async poll_read API causes the stream to be left in an invalid state.

Reason

That's caused mainly because we use fixed buffer sizes (like Tokio's 8192-byte buffer),
and we don't know beforehand how many bytes an entry has (or will have) to set the read boundary,
so we read not only the data of the current entry, but also the data of the next entry.

That by itself is not a problem: AsyncDelimiterReader provides access to the inner buffer, so we can just take the exceeding bytes and prepend them using AsyncPrependReader.

The real problem is that the function that does this is reset_reader, and it must be called before trying to read the next entry,
so that the bytes stolen from the next entry are returned to the inner reader as non-consumed bytes. The other problem is that this function uses the async mechanism, which cannot be used inside poll_* functions (not only because they are not async, but also because they require pinning).

Solution

The entire reset_reader function was removed; read_to_end_crc and read_to_string_crc no longer call it. The whole mechanism was moved to poll_data_descriptor, so any caller that uses the AsyncRead API automatically triggers the logic of consuming the Data Descriptor bytes and returning the stolen bytes to the inner reader.

The implementation is “a bit” more complex (actually, considerably more complex IMO): it uses a state machine to track where the last Poll::Pending left off and how much data has already been consumed (a rough sketch of its shape follows the list below).

There are 3 states:

  • ReadData
    • Data is still being read; we haven't reached the end of the compressed content yet.
  • ReadDescriptor
    • We've reached the end of the compressed content and we don't have a Data Descriptor set,
      so we need to consume the next bytes to construct the Data Descriptor and return the stolen bytes.
  • PrepareNext
    • We've read the Data Descriptor section and now we need to return the stolen bytes, or
    • We already have the Data Descriptor (found in the Local file header), so no Data Descriptor section will be present, but we still need to return the stolen bytes.
    • Actually, the Data Descriptor is set in this step, not in the previous one, just to conform to Rust's borrowing rules; otherwise the code would need more scope delimiters, which I tried before, but it was too messy to keep.
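
A minimal sketch of the state machine's shape, just to illustrate the idea; the names and fields here are hypothetical, not the exact types used in this PR:

```rust
// Illustrative only: a state enum that remembers where the last
// Poll::Pending left off, so the next poll can resume from that point.
enum DescriptorState {
    /// Compressed data is still being read; the end of the entry has not
    /// been reached yet.
    ReadData,
    /// End of the compressed content reached and no Data Descriptor is
    /// known yet; `read` counts how many descriptor bytes have already
    /// been consumed across successive polls.
    ReadDescriptor { read: usize, buf: [u8; 16] },
    /// The descriptor is known (parsed in the previous state or taken
    /// from the Local file header); the bytes stolen from the next entry
    /// still have to be returned to the inner reader.
    PrepareNext,
}
```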

Considerations

During the implementation of the fix, I found some issues:

  • Zip64 is not entirely supported: it uses 64-bit (8-byte) compressed and uncompressed size fields, but the current implementation uses u32 (4 bytes) to store those values.
  • The absence of the Data Descriptor is not handled (it wasn't before, and it isn't in the current implementation).
    • Although it is expected to exist when those values are not present in the Local file header,
      it is not strictly required, and surely there are some wild zips out there that don't follow the recommendations.
      It would be good to be more tolerant of these cases, but I don't know whether it's worth it (although the effort to handle it is very low).
  • compare_crc may panic; I don't know if it still can, but it did before the fix (because of the Data Descriptor absence).
    • I think it would be better for it to return Result<bool>, especially if there's a plan to be tolerant of a missing
      Data Descriptor (a rough sketch of that signature follows this list).
      Also, it is always a good idea for APIs to never panic: if they panic inside a thread managed by a runtime like Tokio it's not a big deal, but if they panic in event-loop based implementations, or in anything that runs on the main thread, it would be very dangerous.
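
As mentioned above, a rough sketch of what a non-panicking compare_crc could look like; the error type and variant names are assumptions for illustration, not the crate's actual API:

```rust
// Hypothetical error type, for illustration only.
#[derive(Debug)]
enum ZipError {
    MissingDataDescriptor,
}

// Sketch: return a recoverable error instead of panicking when the
// expected CRC (normally taken from the Data Descriptor) is missing.
fn compare_crc(expected: Option<u32>, computed: u32) -> Result<bool, ZipError> {
    match expected {
        Some(crc) => Ok(crc == computed),
        None => Err(ZipError::MissingDataDescriptor),
    }
}
```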

Final words

That's my first contribution to this repo (actually, my first contribution to an open-source project in a while); any suggestions/criticism are welcome.

There may be some typos in this PR, it's very late here and the PR is a little extensive, so forgive me for any grammatical confusion.

…` API.

This causes issues when trying to read the `ZipEntry` data using any function other than `read_to_end_crc` or `copy_to_end_crc`; in other words, anything that uses the `AsyncRead` trait to poll data (Tokio, futures-rs, other async readers).
…s not consumed before

This means that it does have the descriptor in the `Local file header`, but we still
need to return the additional bytes read.
@mobad
Contributor

mobad commented Jul 2, 2022

Is it even possible to read zip files with a data descriptor and no compressed length without using the central directory?

You have no idea how long the file is, and the file could easily contain the signature bytes, so you could get false positives and you'd have no real way to verify whether they're correct.

Zip-rs just fails when you try to read a file with a data descriptor while streaming: https://github.com/zip-rs/zip/blob/master/src/read.rs#L1128
So IMO, if you're streaming and the local file header doesn't have a compressed length, then you should fail.

I just don't see how they could be read without a seekable reader and a parsed central directory that contains the length.
But maybe I'm just misunderstanding something...
I guess it could be possible with certain self-terminating compression modes, but Stored is a counterexample, as you have no way to tell where it ends.

@JonathanxD
Contributor Author

You have no idea how long the file is and the file could easily contain the signature bytes so you could have false positives and you'd have no real way to verify whether it's correct.

Zip-rs just fails when you try to read a file with a data descriptor while streaming https://github.com/zip-rs/zip/blob/master/src/read.rs#L1128 So imo if you're streaming and the local file header doesn't have a compressed length then you should fail.

Yes, I think you are right (I was probably going too far on the "SHOULD" part of the documentation). We wouldn't know where to stop reading the zip data. Also, for Zip64, Compressed size and Uncompressed size will be 0xffffffff so the Data Descriptor must be present.

I've pushed some additional changes to fix the compilation (to match the MSRV of the CI/CD, because I tested on Rust 1.62) and to remove the note about the Data Descriptor's absence; I don't know if those zip files would even be possible (maybe I was too tired when I wrote the notes).

I may open a PR for full Zip64 support after this one, but I need to do more investigation (and study) before this.

@mobad
Contributor

mobad commented Jul 2, 2022

If we can't stream things with a data descriptor, then do we even need to read the data descriptor?
If we have a seekable data source and it has a data descriptor, then we can just read the central directory and use a normal read with a .take(compressed_size), not even look at the data descriptor, and then just seek to the start of the next file when the next by_index or whatever is called.
Unless I'm missing something, I really don't know why the spec even includes it.
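
For illustration, a minimal sketch of that seek-and-take flow, assuming Tokio's AsyncReadExt/AsyncSeekExt and that the offset and compressed size come from an already-parsed central directory (the function and parameter names are made up):

```rust
use std::io::SeekFrom;
use tokio::io::{AsyncRead, AsyncReadExt, AsyncSeek, AsyncSeekExt};

// Sketch: read exactly `compressed_size` bytes starting at the entry's
// data offset (both taken from the central directory), ignoring any
// trailing Data Descriptor.
async fn read_entry_data<R>(
    mut reader: R,
    data_offset: u64,
    compressed_size: u64,
) -> std::io::Result<Vec<u8>>
where
    R: AsyncRead + AsyncSeek + Unpin,
{
    reader.seek(SeekFrom::Start(data_offset)).await?;
    let mut data = Vec::with_capacity(compressed_size as usize);
    // `.take()` bounds the read so we never run into the descriptor or
    // the next entry's local file header.
    reader.take(compressed_size).read_to_end(&mut data).await?;
    Ok(data)
}
```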

I've looked at zip-rs and a few C implementations, and they never seem to read it, with two exceptions:
Minizip uses it to try to recover a corrupted central directory https://github.com/zlib-ng/minizip-ng/blob/47b8449fec2c7a575a506ecbe6399ad0603b5992/mz_zip.c#L1219
Zip has a separate validation function which uses it to validate the zip file https://github.com/kuba--/zip/blob/master/src/miniz.h#L1556

So in normal operation nothing I've seen uses it.

@JonathanxD
Contributor Author

JonathanxD commented Jul 2, 2022

If we can't stream things with a data descriptor then do we even need to read the data descriptor?

I don't know if I understood it right, but I'm only taking into consideration the streaming scenario where the data is being transferred as it is compressed. If the “Compressed” and “Uncompressed” fields are not set, the Data Descriptor must exist; otherwise we would not be able to decompress while downloading the Zip file (which is something I was trying to do in my project).

If we have a seekable reader, surely we don't even need the Data Descriptor, even if the “Compressed” and “Uncompressed” fields are not set, because they will be present in the Central Directory. But if we are transferring as we compress the data, we would not be able to seek to the Central Directory, since we can't seek a TcpStream.

Actually, for Zip64, it doesn't have to exist: we should parse the sizes from the Extra Fields, but if we are streaming, the sizes will be in the Data Descriptor.
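
Just for reference, a minimal sketch of pulling the 64-bit sizes out of the Zip64 extended-information extra field (header ID 0x0001); it assumes both 32-bit size fields in the Local file header were 0xFFFFFFFF, so both 64-bit values are present and appear in this order:

```rust
// Sketch: scan a local file header's extra-field data for the Zip64
// extended information record and read the 8-byte uncompressed and
// compressed sizes.
fn parse_zip64_sizes(extra: &[u8]) -> Option<(u64, u64)> {
    let mut i = 0;
    while i + 4 <= extra.len() {
        let id = u16::from_le_bytes([extra[i], extra[i + 1]]);
        let len = u16::from_le_bytes([extra[i + 2], extra[i + 3]]) as usize;
        let data = extra.get(i + 4..i + 4 + len)?;
        // 0x0001 is the Zip64 extended information extra field.
        if id == 0x0001 && data.len() >= 16 {
            let uncompressed = u64::from_le_bytes(data[0..8].try_into().ok()?);
            let compressed = u64::from_le_bytes(data[8..16].try_into().ok()?);
            return Some((uncompressed, compressed));
        }
        i += 4 + len;
    }
    None
}
```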

I think that is the reason for the Data Descriptor to exist: for cases where you don't have the Central Directory yet.

I've found this issue: zip-rs/zip-old#16, maybe it's relevant for the discussion.

@JonathanxD
Contributor Author

JonathanxD commented Jul 2, 2022

Sorry for the follow-up comment; however, I think it's relevant to keep it separate.

You have no idea how long the file is and the file could easily contain the signature bytes so you could have false positives and you'd have no real way to verify whether it's correct.

After some research, I think I'm now able to answer this question: Deflate is self-terminating, so you don't need to know where the file ends; you just feed it to the Deflate decoder and it will stop where the data ends. From there (where the deflate stream left off), you can check for the Data Descriptor signature.

So, in fact, the Data Descriptor may not exist and we could be tolerant of this case; I just don't know exactly how to design the implementation to do so, I only have a vague idea.
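
To illustrate the idea, a minimal sketch of reading a (non-Zip64) Data Descriptor at the point where the decoder stopped, handling the optional 0x08074b50 signature; the function name is made up and it uses Tokio's AsyncReadExt helpers:

```rust
use tokio::io::{AsyncRead, AsyncReadExt};

// Sketch: read a non-Zip64 Data Descriptor right after the deflate
// decoder has stopped. The 4-byte signature 0x08074b50 is optional per
// the spec, so the first word is treated as the CRC if it doesn't match.
async fn read_data_descriptor<R: AsyncRead + Unpin>(
    reader: &mut R,
) -> std::io::Result<(u32, u32, u32)> {
    let first = reader.read_u32_le().await?;
    let crc = if first == 0x0807_4b50 {
        reader.read_u32_le().await?
    } else {
        first
    };
    let compressed_size = reader.read_u32_le().await?;
    let uncompressed_size = reader.read_u32_le().await?;
    Ok((crc, compressed_size, uncompressed_size))
}
```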

@mobad
Contributor

mobad commented Jul 2, 2022

I did a quick search, and it seems like all supported compression algorithms are self-terminating, except for Stored (no compression).
So if we wanted to support streaming with data descriptors properly, I think we'd have to get rid of AsyncDelimiterReader (as it could prematurely end the stream if it randomly finds the delimiter in the file, which would happen every couple of GB decompressed assuming uniform randomness) and just read until the decompressor tells us we're done.
But if it's Stored, then we just have to error out.

@JonathanxD
Contributor Author

@mobad Do you think that fits this PR? I could make these changes to better handle the Data Descriptor; I just need to figure out how the current logic works, but I don't think it would be hard to do.

@mobad
Contributor

mobad commented Jul 3, 2022

Not really, I think what you're doing makes sense and is probably the best way of doing it.
I'm just a random, I can't approve or anything though.

The only alternative I can think of is doing this in entry_reader and consuming the data descriptor of the previous entry, but that has its own problems.

@JonathanxD
Contributor Author

I'm just a random, I can't approve or anything though.

We are here to improve, so any suggestion is welcome. I'm also very uncomfortable with the fact that if the descriptor magic number appears in the compressed data stream, it may be confused with a Data Descriptor and break the process.

The only alternative I can think of is doing this in entry_reader and consuming the data descriptor of the previous entry, but that has its own problems.

The only problem I can think of is that you would not be able to do the CRC32 check, because it would only be available when you start reading the next entry, which doesn't make sense. It also bothers me that, if the Data Descriptor is invalid, the entry read fails even if the compressed data is valid and correctly decompressed.

I can think of some solutions, like lazily reading the Data Descriptor when trying to do a CRC32 check or retrieve the compressed/uncompressed sizes. However, since you can't start reading the next entry without reading the Data Descriptor, we would also need to consume it when entry_reader is called and the previous entry's Data Descriptor hasn't been consumed yet.

@JonathanxD JonathanxD force-pushed the fix/data-descriptor-on-stream branch from e91bf88 to 36a4927 on July 3, 2022 05:34
@JonathanxD
Contributor Author

Update: I've synced the PR with the main branch. (Also accidentally pushed WIP changes around removing AsyncDelimiterReader use in Stream, fixed it with a force-push).

@Majored
Owner

Majored commented Jul 8, 2022

I'm unsure about the direction I want this to be taken as well - it's an interesting one. I want to be as spec-compliant as possible, but there doesn't really seem to be an optimal solution that doesn't involve some sort of sacrifice.

Either way, I'll close this and have a look at the other PR given it's based on this branch. :)
