-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Respect general-purpose bit flags when decoding ZipArchiveEntry names and comments #103271
Conversation
If bit 11 in the general purpose bit flags is set, forces the use of UTF-8 instead of the encoding specified in the ZipArchive constructor.
Previous documentation stated that the encoding was only used to decode ZIP entry names. This is incorrect; .NET complies with the standard (which says that it's used to decode both names and entry comments.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good so far. Left some comments for you to consider.
src/libraries/System.IO.Compression.ZipFile/src/System/IO/Compression/ZipFile.Create.cs
Show resolved
Hide resolved
src/libraries/System.IO.Compression/tests/ZipArchive/zip_CreateTests.Comments.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.IO.Compression/tests/ZipArchive/zip_CreateTests.Comments.cs
Outdated
Show resolved
Hide resolved
: _archive?.EntryNameAndCommentEncoding ?? Encoding.UTF8; | ||
|
||
return readEntryStringEncoding.GetString(entryStringBytes); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you're bringing back a condition similar to the one we used to have in .NET 6, as proposed in the issue description. I can see that it fixes the problem. I'd like to provide some context.
Before this change:
The archive's encoding was set either via the ZipArchive public constructors or the ZipFile Open or Create methods that take an entryNameEncoding argument. This encoding was then passed to the ZipArchiveEntry constructors that take an encoding argument as well. This is important because we need to respect that encoding when the uses passes it as an argument.
We would only read the encoding from each entry's general purpose bit when the user did not pass an entryNameEncoding argument or when the value was passed as null, in which case the archive entries were constructed using the internal ZipArchiveEntry constructor that take an archive and a central directory header as arguments.
What we were not doing was read this general purpose bit flag value when the public ZipArchiveEntry constructors were called, meaning that when the time came to check the encoding of the entry, we would check if the archive's EntryNameAndCommentEncoding had a value, but it would always be unset (because of the constructor used to create this entry). So the next step was to default to UTF8 and read the file comment using that encoding.
After this change:
We will always first read the entry's general purpose bit flag to see if the value was set. If not, then we will do what we were doing before: check the EntryNameAndCommentEncoding, and if that is unset, we default to UTF8 to read the file comment.
The bit flag will only be set when the internal entry constructor is called, which was the bug. This should not affect entries created with the public constructors, meaning the behavior remains unmodified for them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the context @carlossanlop. I think we've both read the pre-PR code for decoding existing ZipArchiveEntry
s in the same way.
I'll re-check and test for the correct behaviour when creating a new ZipArchiveEntry - I can see that L136 of ZipArchiveEntry sets FullName, and this property's setter uses the existing behaviour you've described to convert the value to a byte array. This looks correct to me, but I'll check.
Was prioritising the supplied encoding over the general-purpose bit flags deliberate? The ZIP file spec. seems to indicate that it's the other way around: if bit 11 of the flags is set, the filename must always be UTF-8. I'm looking at the latest version, sections 4.4.4 and appendix D.
Section 4.4.4:
Bit 11: Language encoding flag (EFS). If this bit is set,
the filename and comment fields for this file
MUST be encoded using UTF-8. (see APPENDIX D)
Appendix D:
D.1 The ZIP format has historically supported only the original IBM PC character
encoding set, commonly referred to as IBM Code Page 437. This limits storing
file name characters to only those within the original MS-DOS range of values
and does not properly support file names in other character encodings, or
languages. To address this limitation, this specification will support the
following change.D.2 If general purpose bit 11 is unset, the file name and comment SHOULD conform
to the original ZIP character encoding. If general purpose bit 11 is set, the
filename and comment MUST support The Unicode Standard, Version 4.1.0 or
greater using the character encoding form defined by the UTF-8 storage
specification. The Unicode Standard is published by the The Unicode
Consortium (www.unicode.org). UTF-8 encoded data stored within ZIP files
is expected to not include a byte order mark (BOM).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/libraries/System.IO.Compression/tests/ZipArchive/zip_CreateTests.Comments.cs
Show resolved
Hide resolved
@ericstj @jeffhandley I think this meets the bar for merging to main so it gets included in 9.0: It's a reliability problem from a previous release, we introduced comments in zip files but we were not handling encoding correctly. Would you agree? |
Talked to @ericstj and this meets the bar for .NET 9 but we need to mark it as a breaking change, as we're going to be generating different names and comments (even though the previous behavior was wrong). |
Thanks @carlossanlop. I've raised both PRs - the documentation update is dotnet/dotnet-api-docs#10193, and the breaking change is dotnet/docs#42003. |
Much appreciated, @edwardneal! |
Resolves #92283.
This specifically only changes the decoding of
ZipArchiveEntry.Name
andComment
, not the encoding. I think this is functioning as documented.I've also updated the public-facing XML documentation for ZipFile and ZipArchive. Both parts of this documentation suggested that the bit only changed how the entry name was decoded; this was incorrect, as it also changes the decoding of the entry comment (in compliance with the ZIP specification.)