Regression: ZipFile/ZipArchive entry name decoding not working correctly when entryNameEncoding is specified #92283

deng0 · 2023-09-19T15:52:14Z

Description

The documentation for ZipFile.Open contains the following remarks:

When you open a zip archive file for reading and entryNameEncoding is set to a value other than null, entry names are decoded according to the following rules:
When the language encoding flag is not set, the specified entryNameEncoding is used to decode the entry name.
When the language encoding flag is set, UTF-8 is used to decode the entry name.

This is how it always worked, but in .NET 7 and 8 there seems to be a bug, so that the last rule is no longer applied.
It seems entryNameEncoding is always used, even when the zip file entry has the language encoding flag set.
In my opinion this is a serious regression.

Reproduction Steps

Her are some test zip files, to reproduce the problem:
test_win.zip
test_dotnet.zip

The first zip was created by the windows 11 file explorer and the second was created by .NET without specifying entryNameEncoding. The windows file explorer does not set the language encoding flag, but .NET does.
The problems begin when you try to read those zip files with .NET.

When reading a zip file with .NET you always had to specify the entryNameEncoding, otherwise the special file name characters would not be read correctly. Something like this:

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Encoding entryNameEncoding = Encoding.GetEncoding(850);
ZipFile.ExtractToDirectory(@"C:\temp\test_win.zip", @"c:\temp\test_win_extracted", entryNameEncoding, true);
ZipFile.ExtractToDirectory(@"C:\temp\test_donet.zip", @"c:\temp\test_dotnet_extracted", entryNameEncoding, true);

Expected behavior

For both zip files the name of the extracted file should be "Nürburgring.txt"

Actual behavior

The zip file created by the file explorer is correctly extracted, but the other file is not correctly extracted "N├╝rburgring.txt" when using .NET 7/8.

Regression?

In .NET Framework and .NET 6 this worked correctly.
You could always specify an entryNameEncoding and .NET correctly respected the language encoding flag.

Known Workarounds

No response

Configuration

No response

Other information

No response

The text was updated successfully, but these errors were encountered:

ghost · 2023-09-19T15:52:21Z

Tagging subscribers to this area: @dotnet/area-system-io-compression
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

The documentation for ZipFile.Open contains the following remarks:

When you open a zip archive file for reading and entryNameEncoding is set to a value other than null, entry names are decoded according to the following rules:
When the language encoding flag is not set, the specified entryNameEncoding is used to decode the entry name.
When the language encoding flag is set, UTF-8 is used to decode the entry name.

This is how it always worked, but in .NET 7 and 8 there seems to be a bug, so that the last rule is no longer applied.
It seems entryNameEncoding is always used, even when the zip file entry has the language encoding flag set.
In my opinion this is a serious regression.

Reproduction Steps

Her are some test zip files, to reproduce the problem:
test_win.zip
test_dotnet.zip

The first zip was created by the windows 11 file explorer and the second was created by .NET without specifying entryNameEncoding. The windows file explorer does not set the language encoding flag, but .NET does.
The problems begin when you try to read those zip files with .NET.

When reading a zip file with .NET you always had to specify the entryNameEncoding, otherwise the special file name characters would not be read correctly. Something like this:

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Encoding entryNameEncoding = Encoding.GetEncoding(850);
ZipFile.ExtractToDirectory(@"C:\temp\test_win.zip", @"c:\temp\test_win_extracted", entryNameEncoding, true);
ZipFile.ExtractToDirectory(@"C:\temp\test_donet.zip", @"c:\temp\test_dotnet_extracted", entryNameEncoding, true);

Expected behavior

For both zip files the name of the extracted file should be "Nürburgring.txt"

Actual behavior

The zip file created by the file explorer is correctly extracted, but the other file is not correctly extracted "N├╝rburgring.txt" when using .NET 7/8.

Regression?

In .NET Framework and .NET 6 this worked correctly.
You could always specify an entryNameEncoding and .NET correctly respected the language encoding flag.

Known Workarounds

No response

Configuration

No response

Other information

No response

Author:	deng0
Assignees:	-
Labels:	`area-System.IO.Compression`, `untriaged`
Milestone:	-

deng0 · 2023-09-20T11:43:11Z

I would like to emphasize that this is not some kind of special case. It's pretty normal that you have to specify entryNameEncoding, otherwise almost all zip files, containing filenames with special characters, (including those created by windows file explorer) will not be read correctly. So not specifying entryNameEncoding is not really an option either.
This is an issue of itself, but this has always been the case with .NET. The problem is the regression in .NET 7/8 that the specified entryNameEncoding is used even when the entries are clearly marked as being UTF8.

I've taken a look at the source code and my suspicion was confirmed.

In .NET 6 the filename was decoded like this:

        private string DecodeEntryName(byte[] entryNameBytes)
        {
            Debug.Assert(entryNameBytes != null);

            Encoding readEntryNameEncoding;
            if ((_generalPurposeBitFlag & BitFlagValues.UnicodeFileName) == 0)
            {
                readEntryNameEncoding = _archive == null ?
                    Encoding.UTF8 :
                    _archive.EntryNameEncoding ?? Encoding.UTF8;
            }
            else
            {
                readEntryNameEncoding = Encoding.UTF8;
            }

            return readEntryNameEncoding.GetString(entryNameBytes);
        }

But in .NET 7/8 it's done like this:

_storedEntryName = (_archive.EntryNameAndCommentEncoding ?? Encoding.UTF8).GetString(_storedEntryNameBytes);

In the new version when EntryNameAndEncoding is specified, it's always used, regardless of the _generalPurposeBitFlag having the BitFlagValues.UnicodeFileName flag set.
In my opinion this is clearly a bug and should be fixed as soon as possible and not just be looked at for .NET 9

Probably should be changed to something like this:

 _storedEntryName = (_archive.EntryNameAndCommentEncoding != null && (_generalPurposeBitFlag & BitFlagValues.UnicodeFileName) == 0 ? _archive.EntryNameAndCommentEncoding : Encoding.UTF8).GetString(_storedEntryNameBytes);

neelabo · 2023-12-19T12:46:54Z

I ran into the same problem. Currently, it seems that the UTF8 flags included in the ZIP file are not used at all in the read.
Specifically, BitFlagValues.UnicodeFileNameAndComment is being ignored.

mikaelwallen · 2024-03-05T11:08:55Z

We recently ran into this issue as well and agree it is a pretty serious regression. We are now basically forced to stop using the dot net built-in functionality to uncompress zip files and start using some other library, probably SharpZipLib.

danmoseley · 2024-03-05T13:31:17Z

Not my area, but I'm guessing if someone is interested in offering a PR this might meet the bar for servicing.

branko-d · 2024-05-14T11:40:08Z

I can confirm that we are seeing the same bug. ZipArchiveEntry always applies the same encoding, regardless of whether the ZIP entry bit that governs encoding is set or not. This has already caused us problems in production that we are currently trying to work around.

Here is the relevant portion of the PKWARE's .ZIP File Format Specification:

APPENDIX D - Language Encoding (EFS)

D.1 The ZIP format has historically supported only the original IBM PC character
encoding set, commonly referred to as IBM Code Page 437. This limits storing
file name characters to only those within the original MS-DOS range of values
and does not properly support file names in other character encodings, or
languages. To address this limitation, this specification will support the
following change.

D.2 If general purpose bit 11 is unset, the file name and comment SHOULD conform
to the original ZIP character encoding. If general purpose bit 11 is set, the
filename and comment MUST support The Unicode Standard, Version 4.1.0 or
greater using the character encoding form defined by the UTF-8 storage
specification. The Unicode Standard is published by the The Unicode
Consortium (www.unicode.org). UTF-8 encoded data stored within ZIP files
is expected to not include a byte order mark (BOM).

So basically: when the bit is unset use CP437, when set use UTF-8.

In case it helps, here are some sample ZIP archives whose names should decode correctly when passing null to the entryNameEncoding parameter of ZipArchive constructor (but Spanish doesn't in .NET 7 & 8):

Documentación.zip - the bit is unset
行げくと活断工ずル供高チオユ本作作こ.zip - the bit is set

Passing CP437 should also work in both cases, but this time it fails on the Japanese ZIP.

Here is a simple console project illustrating the problem:

UnzipNames.zip

malaterre · 2024-06-17T07:07:42Z

Here is a quick unit test:

    [Fact]
    public async Task encoding3()
    {
        string[] entryNames = {"finalePräsentation.pdf", "münchen.pdf", "Übersicht.pdf"};
        string comment = string.Join("-", entryNames);
        var latin1Stream = new MemoryStream();
        var utf8Stream = new MemoryStream();
        {
            using (var archiveOut = new ZipArchive(latin1Stream, ZipArchiveMode.Create, true, Encoding.Latin1))
            {
                archiveOut.Comment = comment;
                foreach (var entryName in entryNames)
                {
                    /*
                     * When you write to archive files and entryNameEncoding is set to a value other than null,
                     * the specified entryNameEncoding is used to encode the entry names into bytes.
                     * The language encoding flag (in the general-purpose bit flag of the local file header)
                     * is set only when the specified encoding is a UTF-8 encoding.
                     */
                    var entry = archiveOut.CreateEntry(entryName);
                    await using var writer = new StreamWriter(entry.Open());
                    await writer.WriteAsync("Hello World!");
                }
            }

            latin1Stream.Position = 0;

            using (var archiveOut = new ZipArchive(utf8Stream, ZipArchiveMode.Create, true, null))
            {
                archiveOut.Comment = comment;
                foreach (var entryName in entryNames)
                {
                    var containsOnlyAscii = entryName.All(char.IsAscii);
                    if (!containsOnlyAscii)
                        output.WriteLine("language encoding flag is set, and entry names are encoded by using UTF-8.");
                    else
                        output.WriteLine(
                            "the language encoding flag is not set, and entry names are encoded by using the current system default code page");
                    var entry = archiveOut.CreateEntry(entryName);
                    await using var writer = new StreamWriter(entry.Open());
                    await writer.WriteAsync("Hello World!");
                }
            }

            utf8Stream.Position = 0;
        }
        {
            output.WriteLine("ZipArchive.Latin1:");
            using (var archiveIn = new ZipArchive(latin1Stream, ZipArchiveMode.Read, true, Encoding.Latin1))
            {
                Assert.Equal(comment, archiveIn.Comment);
                foreach (var (entryName, entry) in entryNames.Zip(archiveIn.Entries))
                    Assert.Equal(entryName, entry.FullName);
            }

            output.WriteLine("ZipArchive.Default:");
            // When you open a zip archive file for reading and entryNameEncoding is set to a value other than null,
            using (var archiveIn = new ZipArchive(utf8Stream, ZipArchiveMode.Read, true, Encoding.Latin1))
            {
                // When the language encoding flag is set, UTF-8 is used to decode the entry name.
                Assert.Equal(comment, archiveIn.Comment);
                foreach (var (entryName, entry) in entryNames.Zip(archiveIn.Entries))
                    Assert.Equal(entryName, entry.FullName);
            }
        }
    }

ref:

https://stackoverflow.com/q/78621548/136285

dotnet-issue-labeler bot added the area-System.IO.Compression label Sep 19, 2023

ghost added the untriaged New issue has not been triaged by the area owner label Sep 19, 2023

carlossanlop added this to the 9.0.0 milestone Sep 19, 2023

carlossanlop removed the untriaged New issue has not been triaged by the area owner label Sep 19, 2023

stephentoub added the bug label May 14, 2024

edwardneal added a commit to edwardneal/runtime that referenced this issue Jun 11, 2024

Added test to verify behaviour of dotnet#92283

394415c

edwardneal mentioned this issue Jun 11, 2024

Respect general-purpose bit flags when decoding ZipArchiveEntry names and comments #103271

Merged

edwardneal mentioned this issue Aug 2, 2024

[Breaking change]: ZipArchiveEntry names and comments now respect UTF8 flag when decoded dotnet/docs#42003

Closed

3 tasks

ericstj added the in-pr There is an active PR which will close this issue when it is merged label Aug 6, 2024

carlossanlop closed this as completed in #103271 Aug 6, 2024

github-actions bot locked and limited conversation to collaborators Sep 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regression: ZipFile/ZipArchive entry name decoding not working correctly when entryNameEncoding is specified #92283

Regression: ZipFile/ZipArchive entry name decoding not working correctly when entryNameEncoding is specified #92283

deng0 commented Sep 19, 2023

ghost commented Sep 19, 2023

Description

Reproduction Steps

Expected behavior

Actual behavior

Regression?

Known Workarounds

Configuration

Other information

deng0 commented Sep 20, 2023

neelabo commented Dec 19, 2023

mikaelwallen commented Mar 5, 2024

danmoseley commented Mar 5, 2024

branko-d commented May 14, 2024 •

edited

Loading

APPENDIX D - Language Encoding (EFS)

malaterre commented Jun 17, 2024

Regression: ZipFile/ZipArchive entry name decoding not working correctly when entryNameEncoding is specified #92283

Regression: ZipFile/ZipArchive entry name decoding not working correctly when entryNameEncoding is specified #92283

Comments

deng0 commented Sep 19, 2023

Description

Reproduction Steps

Expected behavior

Actual behavior

Regression?

Known Workarounds

Configuration

Other information

ghost commented Sep 19, 2023

Description

Reproduction Steps

Expected behavior

Actual behavior

Regression?

Known Workarounds

Configuration

Other information

deng0 commented Sep 20, 2023

neelabo commented Dec 19, 2023

mikaelwallen commented Mar 5, 2024

danmoseley commented Mar 5, 2024

branko-d commented May 14, 2024 • edited Loading

APPENDIX D - Language Encoding (EFS)

malaterre commented Jun 17, 2024

branko-d commented May 14, 2024 •

edited

Loading