Remove incorrect UTF decode assert (#5028)
The assert assumed that after removing a BOM and "deflating"
UTF* to UTF8, the decoded (UTF8) size would be no larger than
the raw size (UTF8 or UTF16). However, UTF8 is not always
smaller than UTF16. Specifically, code points U+0800..U+FFFF
take 2 bytes in UTF16 but 3 bytes in UTF8.

The assert mostly holds for source code files, but is
easily violated for binary files, whose values are more random.

Wikipedia clarifies why:

https://en.wikipedia.org/wiki/UTF-8#UTF-16

"Text encoded in UTF-8 will be smaller than the same text encoded
in UTF-16 if there are more code points below U+0080 than in the range
U+0800..U+FFFF. This is true for all modern European languages. It is
often true even for languages like Chinese, due to the large number of
spaces, newlines, digits, and HTML markup in typical files."
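
The size relationship can be checked with a minimal sketch like the one below. The helper names `utf8SizeOfUtf16` and `utf16SizeInBytes` are hypothetical and not part of the Slang code base, and the sketch assumes well-formed UTF16 input. It only counts bytes and does not perform the actual conversion.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from Slang): number of bytes the given UTF-16
// text would occupy once encoded as UTF-8.
// Assumption: the input is well-formed, i.e. surrogates are properly paired.
std::size_t utf8SizeOfUtf16(const std::u16string& text)
{
    std::size_t bytes = 0;
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        const char16_t unit = text[i];
        if (unit < 0x80)
        {
            bytes += 1; // U+0000..U+007F: 2 bytes in UTF-16, 1 in UTF-8
        }
        else if (unit < 0x800)
        {
            bytes += 2; // U+0080..U+07FF: 2 bytes in both encodings
        }
        else if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < text.size())
        {
            bytes += 4; // surrogate pair: 4 bytes in both encodings
            ++i;        // skip the low surrogate
        }
        else
        {
            bytes += 3; // U+0800..U+FFFF: 2 bytes in UTF-16, 3 in UTF-8
        }
    }
    return bytes;
}

// Hypothetical helper: size of the same text in UTF-16, in bytes.
std::size_t utf16SizeInBytes(const std::u16string& text)
{
    return text.size() * sizeof(char16_t);
}
```

For a string of four U+0800 code points, the UTF16 form occupies 8 bytes while the UTF8 form needs 12, so a check of the form `decodedContentSize <= rawContentSize` fails on such input.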
cheneym2 authored Sep 6, 2024
1 parent 8662375 commit dcd6c24
Showing 1 changed file with 0 additions and 1 deletion.
1 change: 0 additions & 1 deletion source/compiler-core/slang-source-loc.cpp
@@ -597,7 +597,6 @@ void SourceFile::setContents(ISlangBlob* blob)

char const* decodedContentBegin = (char const*)m_contentBlob->getBufferPointer();
const UInt decodedContentSize = m_contentBlob->getBufferSize();
- assert(decodedContentSize <= rawContentSize);
char const* decodedContentEnd = decodedContentBegin + decodedContentSize;

m_content = UnownedStringSlice(decodedContentBegin, decodedContentEnd);