From dd757e90349cb869d60700cb54b95b1dad3a9113 Mon Sep 17 00:00:00 2001
From: Jason Yundt
Date: Sun, 7 Jul 2024 08:51:54 -0400
Subject: [PATCH 1/3] Explicitly declare that source files are UTF-8
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Before this change, there was a chance that a text editor or compiler
would use the wrong character encoding.

For text editors, this commit adds “charset = utf-8” to .editorconfig.
That will cause editors that support EditorConfig files [1] to
automatically use UTF-8 when reading and writing files.

For compilers, this commit adds compiler options that guarantee that
compilers will decode source code files using UTF-8. The compiler
options are known to work on MSVC, GCC and Clang. If we ever want to
support additional compilers, then we might have to edit the if
statement that this commit adds.

This commit does not eliminate the chance that a wrong character
encoding will be used. Someone could always use a text editor that
doesn’t support EditorConfig files or a compiler that doesn’t support
the compiler options that we use. This commit does, however, make an
encoding mismatch much less likely.
[1]:
---
 .editorconfig  | 1 +
 CMakeLists.txt | 6 ++++++
 2 files changed, 7 insertions(+)

diff --git a/.editorconfig b/.editorconfig
index c7a5f17e9..acd10aaf8 100644
--- a/.editorconfig
+++ b/.editorconfig
@@ -5,6 +5,7 @@ indent_style = space
 indent_size = 2
 tab_width = 8
 end_of_line = lf
+charset = utf-8
 spelling_language = en-US
 
 [*.md]
diff --git a/CMakeLists.txt b/CMakeLists.txt
index a1ee6fd5f..bf5e8fd1a 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -28,6 +28,12 @@ set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
 
 set_property(GLOBAL PROPERTY USE_FOLDERS ON)
 
+if(MSVC)
+  add_compile_options(/source-charset:UTF-8)
+else()
+  add_compile_options(-finput-charset=UTF-8)
+endif()
+
 if(FORCE_COLORED_OUTPUT)
   if(CMAKE_VERSION VERSION_GREATER_EQUAL 3.24)
     set(CMAKE_COLOR_DIAGNOSTICS ON)

From adf58eca8104c014fac147298b3ade158add0184 Mon Sep 17 00:00:00 2001
From: Jason Yundt
Date: Sun, 7 Jul 2024 12:54:52 -0400
Subject: [PATCH 2/3] Explicitly declare execution character set
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Consider this program:

    #include <stdio.h>

    int main(void) {
        const char *example_string = "ディセント3";
        for (size_t i = 0; example_string[i] != '\0'; ++i) {
            printf("%02hhx ", example_string[i]);
        }
        puts("");
        return 0;
    }

What will that program output? The answer is: it depends. If that
program is compiled with a UTF-8 execution character set, then it will
print this:

    e3 83 87 e3 82 a3 e3 82 bb e3 83 b3 e3 83 88 33

If that program is compiled with a Shift JIS execution character set,
then it will print this:

    83 66 83 42 83 5a 83 93 83 67 33

This is especially a problem when using MSVC. MSVC doesn’t necessarily
default to using UTF-8 as a program’s execution character set [1].

---

Before this change, Descent 3 would use whatever the default execution
character set was. This commit ensures that the execution character set
is UTF-8 as long as Descent 3 gets compiled with MSVC, GCC or Clang.
If Descent 3 is compiled with a different compiler, then a different
execution character set might get used, but as far as I know, we only
support MSVC, GCC and Clang.

I’m not sure whether or not this change has any noticeable effects. If
using different execution character sets does have noticeable effects,
then this change will hopefully ensure that those effects are the same
for everyone.

[1]:
---
 CMakeLists.txt | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/CMakeLists.txt b/CMakeLists.txt
index bf5e8fd1a..d616b5799 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -29,9 +29,18 @@ set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
 set_property(GLOBAL PROPERTY USE_FOLDERS ON)
 
 if(MSVC)
-  add_compile_options(/source-charset:UTF-8)
+  add_compile_options(/source-charset:UTF-8 /execution-charset:UTF-8)
 else()
   add_compile_options(-finput-charset=UTF-8)
+  # Unfortunately, Clang doesn’t support -fexec-charset yet so this next part
+  # is GCC only. Luckily, Clang defaults to using UTF-8 for the execution
+  # character set [1], so we’re fine. Once Clang gets support for
+  # -fexec-charset, we should probably start using it.
+  #
+  # [1]:
+  if(CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
+    add_compile_options(-fexec-charset=UTF-8)
+  endif()
 endif()
 
 if(FORCE_COLORED_OUTPUT)

From 5a199429b6dbd9aa6c1d8943b48a69986eb11fa9 Mon Sep 17 00:00:00 2001
From: Jason Yundt
Date: Tue, 9 Jul 2024 20:19:41 -0400
Subject: [PATCH 3/3] Set activeCodePage to UTF-8
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Consider this program:

    #include <cstdio>

    int main(void) {
        const char *filename = u8"ディセント3.txt";
        auto fp = std::fopen(filename, "r");
        if (fp) {
            std::fclose(fp);
            return 0;
        } else {
            return 1;
        }
    }

If a file named ディセント3.txt exists, then will that program
successfully open it? The answer is: it depends.
filename is going to point to these bytes:

    Raw bytes:  e3 83 87 e3 82 a3 e3 82 bb e3 83 b3 e3 83 88 33 2e 74 78 74 00
    Characters: ディセント3.txt␀

Internally, Windows uses UTF-16. When you call fopen(), Windows will
convert the filename parameter into UTF-16 [1]. If the program is run
with a UTF-8 Windows code page, then the above bytes will be correctly
interpreted as UTF-8 when being converted into UTF-16 [2]. The final
UTF-16 string will be this*:

    Raw bytes:  ff fe c7 30 a3 30 bb 30 f3 30 c8 30 33 00 2e 00 74 00 78 00 74 00
    Characters: ディセント3.txt

On the other hand, if the program is run with code page 932, then the
original bytes will be incorrectly interpreted as code page 932 when
being converted into UTF-16. The final UTF-16 string will be this*:

    Raw bytes:  ff fe 5d 7e fd ff 67 7e 63 ff 67 7e 7b ff 5d 7e 73 ff 5d 7e fd ff 33 00 2e 00 74 00 78 00 74 00
    Characters: 繝�繧」繧サ繝ウ繝�3.txt

In other words, if that program gets compiled on Windows with a UTF-8
execution character set, then it needs to be run with a UTF-8 Windows
code page. Otherwise, mojibake might happen.

*Unlike the first string, this one does not have a null terminator.
This is because the Windows kernel doesn’t use null terminated strings
for paths [3][4].

---

Before this commit, Descent 3 would pass UTF-8 to fopen(), even if
Descent 3 is run with a non-UTF-8 Windows code page [5]. This commit
makes sure that Descent 3 gets run with a UTF-8 Windows code page.

The Windows code page isn’t just used by fopen(). It also gets used by
many other functions in the Windows API [6]. I don’t know if Descent 3
uses any of those other functions, but if it does, then this commit will
also help make sure that those functions receive strings with the
correct character encoding.

Descent 3 uses UTF-8 for strings by default [7]. Making sure that
Descent 3 uses UTF-8 everywhere will make encoding-related mistakes less
likely in the future.

Fixes #483.
[1]:
[2]:
[3]:
[4]:
[5]:
[6]:
[7]: adf58eca (Explicitly declare execution character set, 2024-07-07)
---
 Descent3/CMakeLists.txt           | 9 ++++++++-
 Descent3/Descent3.exe.manifest.in | 9 +++++++++
 2 files changed, 17 insertions(+), 1 deletion(-)
 create mode 100644 Descent3/Descent3.exe.manifest.in

diff --git a/Descent3/CMakeLists.txt b/Descent3/CMakeLists.txt
index 8bbc15e9f..e6f4e86d1 100644
--- a/Descent3/CMakeLists.txt
+++ b/Descent3/CMakeLists.txt
@@ -274,6 +274,13 @@ set(CPPS
 if(WIN32)
   set(PLATFORM_LIBS wsock32.lib winmm.lib)
   set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} /SAFESEH:NO /NODEFAULTLIB:LIBC")
+  set(MANIFEST ${CMAKE_CURRENT_BINARY_DIR}/Descent3.exe.manifest)
+  configure_file(
+    ${CMAKE_CURRENT_SOURCE_DIR}/Descent3.exe.manifest.in
+    ${MANIFEST}
+    @ONLY
+    NEWLINE_STYLE WIN32
+  )
 endif()
 
 if(UNIX AND NOT APPLE)
@@ -287,7 +294,7 @@ endif()
 
 file(GLOB_RECURSE INCS "../lib/*.h")
 
-add_executable(Descent3 WIN32 ${HEADERS} ${CPPS} ${INCS})
+add_executable(Descent3 WIN32 ${HEADERS} ${CPPS} ${INCS} ${MANIFEST})
 target_link_libraries(Descent3 PRIVATE
   2dlib AudioEncode bitmap cfile czip d3music dd_video ddebug ddio libmve libacm
   fix grtext manage mem misc model module movie stream_audio linux SDL2::SDL2
diff --git a/Descent3/Descent3.exe.manifest.in b/Descent3/Descent3.exe.manifest.in
new file mode 100644
index 000000000..4deb933d5
--- /dev/null
+++ b/Descent3/Descent3.exe.manifest.in
@@ -0,0 +1,9 @@
+
+
+
+
+
+      UTF-8
+
+
+