Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Respect general-purpose bit flags when decoding ZipArchiveEntry names and comments #103271

Merged
merged 4 commits into from
Aug 6, 2024

Conversation

edwardneal
Copy link
Contributor

Resolves #92283.

This specifically only changes the decoding of ZipArchiveEntry.Name and Comment, not the encoding. I think this is functioning as documented.

I've also updated the public-facing XML documentation for ZipFile and ZipArchive. Both parts of this documentation suggested that the bit only changed how the entry name was decoded; this was incorrect, as it also changes the decoding of the entry comment (in compliance with the ZIP specification.)

If bit 11 in the general purpose bit flags is set, forces the use of UTF-8 instead of the encoding specified in the ZipArchive constructor.
Previous documentation stated that the encoding was only used to decode ZIP entry names. This is incorrect; .NET complies with the standard (which says that it's used to decode both names and entry comments.)
Copy link
Member

@carlossanlop carlossanlop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good so far. Left some comments for you to consider.

: _archive?.EntryNameAndCommentEncoding ?? Encoding.UTF8;

return readEntryStringEncoding.GetString(entryStringBytes);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you're bringing back a condition similar to the one we used to have in .NET 6, as proposed in the issue description. I can see that it fixes the problem. I'd like to provide some context.

Before this change:

The archive's encoding was set either via the ZipArchive public constructors or the ZipFile Open or Create methods that take an entryNameEncoding argument. This encoding was then passed to the ZipArchiveEntry constructors that take an encoding argument as well. This is important because we need to respect that encoding when the uses passes it as an argument.

We would only read the encoding from each entry's general purpose bit when the user did not pass an entryNameEncoding argument or when the value was passed as null, in which case the archive entries were constructed using the internal ZipArchiveEntry constructor that take an archive and a central directory header as arguments.

What we were not doing was read this general purpose bit flag value when the public ZipArchiveEntry constructors were called, meaning that when the time came to check the encoding of the entry, we would check if the archive's EntryNameAndCommentEncoding had a value, but it would always be unset (because of the constructor used to create this entry). So the next step was to default to UTF8 and read the file comment using that encoding.

After this change:

We will always first read the entry's general purpose bit flag to see if the value was set. If not, then we will do what we were doing before: check the EntryNameAndCommentEncoding, and if that is unset, we default to UTF8 to read the file comment.

The bit flag will only be set when the internal entry constructor is called, which was the bug. This should not affect entries created with the public constructors, meaning the behavior remains unmodified for them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the context @carlossanlop. I think we've both read the pre-PR code for decoding existing ZipArchiveEntrys in the same way.

I'll re-check and test for the correct behaviour when creating a new ZipArchiveEntry - I can see that L136 of ZipArchiveEntry sets FullName, and this property's setter uses the existing behaviour you've described to convert the value to a byte array. This looks correct to me, but I'll check.

Was prioritising the supplied encoding over the general-purpose bit flags deliberate? The ZIP file spec. seems to indicate that it's the other way around: if bit 11 of the flags is set, the filename must always be UTF-8. I'm looking at the latest version, sections 4.4.4 and appendix D.

Section 4.4.4:

Bit 11: Language encoding flag (EFS). If this bit is set,
the filename and comment fields for this file
MUST be encoded using UTF-8. (see APPENDIX D)

Appendix D:

D.1 The ZIP format has historically supported only the original IBM PC character
encoding set, commonly referred to as IBM Code Page 437. This limits storing
file name characters to only those within the original MS-DOS range of values
and does not properly support file names in other character encodings, or
languages. To address this limitation, this specification will support the
following change.

D.2 If general purpose bit 11 is unset, the file name and comment SHOULD conform
to the original ZIP character encoding. If general purpose bit 11 is set, the
filename and comment MUST support The Unicode Standard, Version 4.1.0 or
greater using the character encoding form defined by the UTF-8 storage
specification. The Unicode Standard is published by the The Unicode
Consortium (www.unicode.org). UTF-8 encoded data stored within ZIP files
is expected to not include a byte order mark (BOM).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm inclined to sign-off. I looked at the change history in this area, I wanted to double check that we were not regressing this: #87883 Note that I added the tests for that fix in a follow-up PR: #88978

I can see that we did not break them in this PR.

@carlossanlop
Copy link
Member

@ericstj @jeffhandley I think this meets the bar for merging to main so it gets included in 9.0: It's a reliability problem from a previous release, we introduced comments in zip files but we were not handling encoding correctly. Would you agree?

@carlossanlop
Copy link
Member

Talked to @ericstj and this meets the bar for .NET 9 but we need to mark it as a breaking change, as we're going to be generating different names and comments (even though the previous behavior was wrong).

@carlossanlop carlossanlop added breaking-change Issue or PR that represents a breaking API or functional change over a prerelease. needs-breaking-change-doc-created Breaking changes need an issue opened with https://github.com/dotnet/docs/issues/new?template=dotnet labels Jul 30, 2024
@edwardneal
Copy link
Contributor Author

Thanks @carlossanlop. I've raised both PRs - the documentation update is dotnet/dotnet-api-docs#10193, and the breaking change is dotnet/docs#42003.

@carlossanlop
Copy link
Member

Much appreciated, @edwardneal!

@carlossanlop carlossanlop added breaking-change Issue or PR that represents a breaking API or functional change over a prerelease. and removed breaking-change Issue or PR that represents a breaking API or functional change over a prerelease. needs-breaking-change-doc-created Breaking changes need an issue opened with https://github.com/dotnet/docs/issues/new?template=dotnet labels Aug 6, 2024
@dotnet-policy-service dotnet-policy-service bot added the needs-breaking-change-doc-created Breaking changes need an issue opened with https://github.com/dotnet/docs/issues/new?template=dotnet label Aug 6, 2024
@carlossanlop carlossanlop merged commit 6cdc448 into dotnet:main Aug 6, 2024
81 of 84 checks passed
@carlossanlop carlossanlop removed the needs-breaking-change-doc-created Breaking changes need an issue opened with https://github.com/dotnet/docs/issues/new?template=dotnet label Aug 6, 2024
@edwardneal edwardneal deleted the issue-92283 branch August 31, 2024 16:25
@github-actions github-actions bot locked and limited conversation to collaborators Oct 2, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.IO.Compression breaking-change Issue or PR that represents a breaking API or functional change over a prerelease.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regression: ZipFile/ZipArchive entry name decoding not working correctly when entryNameEncoding is specified
2 participants