Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected result when comparing file content in and out of zip #172

Open
mtkennerly opened this issue Aug 17, 2022 · 6 comments
Open

Unexpected result when comparing file content in and out of zip #172

mtkennerly opened this issue Aug 17, 2022 · 6 comments
Labels
bug Something isn't working

Comments

@mtkennerly
Copy link

mtkennerly commented Aug 17, 2022

Hi! I may have missed or misunderstood something from the documentation, but I just ran into this and found it surprising.

I have a backup utility, and I want to check the previous backup archive to see if the content is the same as the latest scan. It works for some files, but not others, and I'm not sure why. I've created a minimal example of the issue: https://gist.github.com/mtkennerly/981622694bb9313253a91ea8e4ebde9d

Output:

assets/text.txt matches = true
assets/line.jpg matches = true
assets/screenshot.jpg matches = false

I'm not sure why screenshot.jpg doesn't compare equal. It works if I change CompressionMethod::Deflated to CompressionMethod::Stored or CompressionMethod::Bzip2, whereas it has the same issue if I use CompressionMethod::Zstd. How could I accomplish what I'm trying to do?

I'm using a buffered read of 1 KiB at a time since the actual files may be too large to load into memory. I also considered comparing CRC hashes with ZipFile::crc32, but I wasn't sure how to compute a compatible hash for an unarchived file, since there seem to be a lot of different CRC variations.

I'm using Windows 11, Rust 1.63.0, and Zip 0.6.2.

Assets

Place these in an assets folder next to src:

@zamazan4ik
Copy link
Contributor

@mtkennerly thanks for the issue. Could you please try to load whole into the memory instead of using buffering? Previously we had the same issue with buffering. I guess it could be somehow related to it.

@mtkennerly
Copy link
Author

mtkennerly commented Aug 17, 2022

Looks like you're right:

Diff

diff --git a/src/main.rs b/src/main.rs
index 1a8dc9c..249cc3d 100644
--- a/src/main.rs
+++ b/src/main.rs
@@ -16,8 +16,8 @@ fn try_same_content_as_zip(original: &str, archived: &mut ZipFile) -> Result<boo
     let handle = File::open(original)?;
     let mut reader = BufReader::new(handle);

-    let mut disk_buffer = [0; 1024];
-    let mut zip_buffer = [0; 1024];
+    let mut disk_buffer = vec![];
+    let mut zip_buffer = vec![];
     loop {
         let read_disk = reader.read(&mut disk_buffer[..])?;
         let read_zip = archived.read(&mut zip_buffer[..])?;

Output

assets/text.txt matches = true
assets/line.jpg matches = true
assets/screenshot.jpg matches = true

@zamazan4ik
Copy link
Contributor

@mtkennerly Fortunately (or unfortunately) I was right... What I could propose to you. If you are really interested in the library, try to debug zip-rs library, find the root cause, fix and send the PR - I will review it and then merge. I have no time to do it, sorry. The library is almost dead, so...

More lazy way - just don't use buffering :)

@zamazan4ik zamazan4ik added the bug Something isn't working label Aug 17, 2022
@mtkennerly
Copy link
Author

That's fair! I took a very brief look, and I wonder if it might be an issue in flate2 and zstd since zip's code seemed to be generic over the readers from those libraries, but I don't think I'm prepared to drill down into it. I'll probably just start storing hashes for each file at backup time, then compare with those hashes during the next backup instead of comparing with the zip content directly. Thanks for the fast response on this.

@Plecra
Copy link
Member

Plecra commented Feb 1, 2023

Yikes, this is bad. And it's probably our fault tbh. Thankyou for the reproducible case, I'll try to hunt this down!

@superhedge22
Copy link

superhedge22 commented May 27, 2023

i confirm i have the same issue, zipping up csv files end up with mangled files, it seems like sometimes some bytes are removed (i guess the exact size of the buffer) and i end up with a few lines missing data/eating up the new line:

2023-03-04 00:16:00,0.2302,0.2302,0.23,0.2301,45967
2023-03-04 00:17:00,0.2301,0.2301,02:13:00,0.2299,0.23,0.2299,0.2299,10910
2023-03-04 02:14:00,0.2299,0.23,0.2299,0.2299,12225

this is my code:

    let mut zip = ZipWriter::new(output_file);
    let options = FileOptions::default()
        .compression_method(Deflated)
        .large_file(true)
        .compression_level(Some(9)) // or Bzip2 for better compression
        .unix_permissions(0o755);

    zip.start_file(path.file_name().unwrap().to_str().unwrap(), options)?;
    let mut buffer = [0; 8192];
    let mut reader = std::io::BufReader::new(file);
  
    loop {
        let bytes_read = reader.read(&mut buffer)?;
        if bytes_read == 0 {
            break;
        }
        // pb.inc(bytes_read as u64); // Increase the progress bar
        zip.write(&buffer[..bytes_read])?;
    }

i confirm this doesn't seem to be happening with sevenz_rust nor flate2-rs.

@Pr0methean Pr0methean transferred this issue from zip-rs/zip-old Jun 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants