Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression: ZipFile/ZipArchive entry name decoding not working correctly when entryNameEncoding is specified #92283

Closed
deng0 opened this issue Sep 19, 2023 · 7 comments · Fixed by #103271
Labels
area-System.IO.Compression bug in-pr There is an active PR which will close this issue when it is merged
Milestone

Comments

@deng0
Copy link

deng0 commented Sep 19, 2023

Description

The documentation for ZipFile.Open contains the following remarks:

When you open a zip archive file for reading and entryNameEncoding is set to a value other than null, entry names are decoded according to the following rules:
When the language encoding flag is not set, the specified entryNameEncoding is used to decode the entry name.
When the language encoding flag is set, UTF-8 is used to decode the entry name.

This is how it always worked, but in .NET 7 and 8 there seems to be a bug, so that the last rule is no longer applied.
It seems entryNameEncoding is always used, even when the zip file entry has the language encoding flag set.
In my opinion this is a serious regression.

Reproduction Steps

Her are some test zip files, to reproduce the problem:
test_win.zip
test_dotnet.zip

The first zip was created by the windows 11 file explorer and the second was created by .NET without specifying entryNameEncoding. The windows file explorer does not set the language encoding flag, but .NET does.
The problems begin when you try to read those zip files with .NET.

When reading a zip file with .NET you always had to specify the entryNameEncoding, otherwise the special file name characters would not be read correctly. Something like this:

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Encoding entryNameEncoding = Encoding.GetEncoding(850);
ZipFile.ExtractToDirectory(@"C:\temp\test_win.zip", @"c:\temp\test_win_extracted", entryNameEncoding, true);
ZipFile.ExtractToDirectory(@"C:\temp\test_donet.zip", @"c:\temp\test_dotnet_extracted", entryNameEncoding, true);

Expected behavior

For both zip files the name of the extracted file should be "Nürburgring.txt"

Actual behavior

The zip file created by the file explorer is correctly extracted, but the other file is not correctly extracted "N├╝rburgring.txt" when using .NET 7/8.

Regression?

In .NET Framework and .NET 6 this worked correctly.
You could always specify an entryNameEncoding and .NET correctly respected the language encoding flag.

Known Workarounds

No response

Configuration

No response

Other information

No response

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Sep 19, 2023
@ghost
Copy link

ghost commented Sep 19, 2023

Tagging subscribers to this area: @dotnet/area-system-io-compression
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

The documentation for ZipFile.Open contains the following remarks:

When you open a zip archive file for reading and entryNameEncoding is set to a value other than null, entry names are decoded according to the following rules:
When the language encoding flag is not set, the specified entryNameEncoding is used to decode the entry name.
When the language encoding flag is set, UTF-8 is used to decode the entry name.

This is how it always worked, but in .NET 7 and 8 there seems to be a bug, so that the last rule is no longer applied.
It seems entryNameEncoding is always used, even when the zip file entry has the language encoding flag set.
In my opinion this is a serious regression.

Reproduction Steps

Her are some test zip files, to reproduce the problem:
test_win.zip
test_dotnet.zip

The first zip was created by the windows 11 file explorer and the second was created by .NET without specifying entryNameEncoding. The windows file explorer does not set the language encoding flag, but .NET does.
The problems begin when you try to read those zip files with .NET.

When reading a zip file with .NET you always had to specify the entryNameEncoding, otherwise the special file name characters would not be read correctly. Something like this:

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Encoding entryNameEncoding = Encoding.GetEncoding(850);
ZipFile.ExtractToDirectory(@"C:\temp\test_win.zip", @"c:\temp\test_win_extracted", entryNameEncoding, true);
ZipFile.ExtractToDirectory(@"C:\temp\test_donet.zip", @"c:\temp\test_dotnet_extracted", entryNameEncoding, true);

Expected behavior

For both zip files the name of the extracted file should be "Nürburgring.txt"

Actual behavior

The zip file created by the file explorer is correctly extracted, but the other file is not correctly extracted "N├╝rburgring.txt" when using .NET 7/8.

Regression?

In .NET Framework and .NET 6 this worked correctly.
You could always specify an entryNameEncoding and .NET correctly respected the language encoding flag.

Known Workarounds

No response

Configuration

No response

Other information

No response

Author: deng0
Assignees: -
Labels:

area-System.IO.Compression, untriaged

Milestone: -

@carlossanlop carlossanlop added this to the 9.0.0 milestone Sep 19, 2023
@carlossanlop carlossanlop removed the untriaged New issue has not been triaged by the area owner label Sep 19, 2023
@deng0
Copy link
Author

deng0 commented Sep 20, 2023

I would like to emphasize that this is not some kind of special case. It's pretty normal that you have to specify entryNameEncoding, otherwise almost all zip files, containing filenames with special characters, (including those created by windows file explorer) will not be read correctly. So not specifying entryNameEncoding is not really an option either.
This is an issue of itself, but this has always been the case with .NET. The problem is the regression in .NET 7/8 that the specified entryNameEncoding is used even when the entries are clearly marked as being UTF8.

I've taken a look at the source code and my suspicion was confirmed.

In .NET 6 the filename was decoded like this:

        private string DecodeEntryName(byte[] entryNameBytes)
        {
            Debug.Assert(entryNameBytes != null);

            Encoding readEntryNameEncoding;
            if ((_generalPurposeBitFlag & BitFlagValues.UnicodeFileName) == 0)
            {
                readEntryNameEncoding = _archive == null ?
                    Encoding.UTF8 :
                    _archive.EntryNameEncoding ?? Encoding.UTF8;
            }
            else
            {
                readEntryNameEncoding = Encoding.UTF8;
            }

            return readEntryNameEncoding.GetString(entryNameBytes);
        }

But in .NET 7/8 it's done like this:

_storedEntryName = (_archive.EntryNameAndCommentEncoding ?? Encoding.UTF8).GetString(_storedEntryNameBytes);

In the new version when EntryNameAndEncoding is specified, it's always used, regardless of the _generalPurposeBitFlag having the BitFlagValues.UnicodeFileName flag set.
In my opinion this is clearly a bug and should be fixed as soon as possible and not just be looked at for .NET 9

Probably should be changed to something like this:

 _storedEntryName = (_archive.EntryNameAndCommentEncoding != null && (_generalPurposeBitFlag & BitFlagValues.UnicodeFileName) == 0 ? _archive.EntryNameAndCommentEncoding : Encoding.UTF8).GetString(_storedEntryNameBytes);

@neelabo
Copy link

neelabo commented Dec 19, 2023

I ran into the same problem. Currently, it seems that the UTF8 flags included in the ZIP file are not used at all in the read.
Specifically, BitFlagValues.UnicodeFileNameAndComment is being ignored.

@mikaelwallen
Copy link

We recently ran into this issue as well and agree it is a pretty serious regression. We are now basically forced to stop using the dot net built-in functionality to uncompress zip files and start using some other library, probably SharpZipLib.

@danmoseley
Copy link
Member

Not my area, but I'm guessing if someone is interested in offering a PR this might meet the bar for servicing.

@branko-d
Copy link

branko-d commented May 14, 2024

I can confirm that we are seeing the same bug. ZipArchiveEntry always applies the same encoding, regardless of whether the ZIP entry bit that governs encoding is set or not. This has already caused us problems in production that we are currently trying to work around.

Here is the relevant portion of the PKWARE's .ZIP File Format Specification:

APPENDIX D - Language Encoding (EFS)

D.1 The ZIP format has historically supported only the original IBM PC character
encoding set, commonly referred to as IBM Code Page 437. This limits storing
file name characters to only those within the original MS-DOS range of values
and does not properly support file names in other character encodings, or
languages. To address this limitation, this specification will support the
following change.

D.2 If general purpose bit 11 is unset, the file name and comment SHOULD conform
to the original ZIP character encoding. If general purpose bit 11 is set, the
filename and comment MUST support The Unicode Standard, Version 4.1.0 or
greater using the character encoding form defined by the UTF-8 storage
specification. The Unicode Standard is published by the The Unicode
Consortium (www.unicode.org). UTF-8 encoded data stored within ZIP files
is expected to not include a byte order mark (BOM).

So basically: when the bit is unset use CP437, when set use UTF-8.

In case it helps, here are some sample ZIP archives whose names should decode correctly when passing null to the entryNameEncoding parameter of ZipArchive constructor (but Spanish doesn't in .NET 7 & 8):

Documentación.zip - the bit is unset
行げくと活断工ずル供高チオユ本作作こ.zip - the bit is set

Passing CP437 should also work in both cases, but this time it fails on the Japanese ZIP.

Here is a simple console project illustrating the problem:

UnzipNames.zip

@malaterre
Copy link
Contributor

Here is a quick unit test:

    [Fact]
    public async Task encoding3()
    {
        string[] entryNames = {"finalePräsentation.pdf", "münchen.pdf", "Übersicht.pdf"};
        string comment = string.Join("-", entryNames);
        var latin1Stream = new MemoryStream();
        var utf8Stream = new MemoryStream();
        {
            using (var archiveOut = new ZipArchive(latin1Stream, ZipArchiveMode.Create, true, Encoding.Latin1))
            {
                archiveOut.Comment = comment;
                foreach (var entryName in entryNames)
                {
                    /*
                     * When you write to archive files and entryNameEncoding is set to a value other than null,
                     * the specified entryNameEncoding is used to encode the entry names into bytes.
                     * The language encoding flag (in the general-purpose bit flag of the local file header)
                     * is set only when the specified encoding is a UTF-8 encoding.
                     */
                    var entry = archiveOut.CreateEntry(entryName);
                    await using var writer = new StreamWriter(entry.Open());
                    await writer.WriteAsync("Hello World!");
                }
            }

            latin1Stream.Position = 0;

            using (var archiveOut = new ZipArchive(utf8Stream, ZipArchiveMode.Create, true, null))
            {
                archiveOut.Comment = comment;
                foreach (var entryName in entryNames)
                {
                    var containsOnlyAscii = entryName.All(char.IsAscii);
                    if (!containsOnlyAscii)
                        output.WriteLine("language encoding flag is set, and entry names are encoded by using UTF-8.");
                    else
                        output.WriteLine(
                            "the language encoding flag is not set, and entry names are encoded by using the current system default code page");
                    var entry = archiveOut.CreateEntry(entryName);
                    await using var writer = new StreamWriter(entry.Open());
                    await writer.WriteAsync("Hello World!");
                }
            }

            utf8Stream.Position = 0;
        }
        {
            output.WriteLine("ZipArchive.Latin1:");
            using (var archiveIn = new ZipArchive(latin1Stream, ZipArchiveMode.Read, true, Encoding.Latin1))
            {
                Assert.Equal(comment, archiveIn.Comment);
                foreach (var (entryName, entry) in entryNames.Zip(archiveIn.Entries))
                    Assert.Equal(entryName, entry.FullName);
            }

            output.WriteLine("ZipArchive.Default:");
            // When you open a zip archive file for reading and entryNameEncoding is set to a value other than null,
            using (var archiveIn = new ZipArchive(utf8Stream, ZipArchiveMode.Read, true, Encoding.Latin1))
            {
                // When the language encoding flag is set, UTF-8 is used to decode the entry name.
                Assert.Equal(comment, archiveIn.Comment);
                foreach (var (entryName, entry) in entryNames.Zip(archiveIn.Entries))
                    Assert.Equal(entryName, entry.FullName);
            }
        }
    }

ref:

@ericstj ericstj added the in-pr There is an active PR which will close this issue when it is merged label Aug 6, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Sep 7, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.IO.Compression bug in-pr There is an active PR which will close this issue when it is merged
Projects
None yet
9 participants