Unexpected Crash When Archiving Files with Special Characters in Names #241

Open
Zpovednice-adm opened this issue Oct 5, 2024 · 26 comments
Labels: bug, fixready

@Zpovednice-adm

Zpovednice-adm commented Oct 5, 2024

Description: When attempting to archive a complex directory structure with a large number of files using Dwarfs, the program crashes unexpectedly without any error message. After some investigation and testing, it appears that the crash is caused by files with special characters in their names.

An example of a problematic file name is:

2022-06-07 16-55-41 URGENT_ 🚨Someone have Run a Background-check on Your (Public Records)🔎 Tape Your Name & See Results Yourself!⭐ N°7267.eml

It seems that characters such as emojis (e.g., 🚨, 🔎, ⭐), special symbols, or other non-standard characters in file names are not handled properly by Dwarfs, resulting in the crash.

Steps to Reproduce:

  1. Create a directory structure with files containing special characters and emojis in their names.
  2. Attempt to archive the directory using Dwarfs.
  3. The program crashes without an error message.

Expected Behavior: Dwarfs should either handle the special characters gracefully or provide a clear error message indicating which file name is causing the issue.

Environment:

Dwarfs version: 0.10.1
OS: Windows 10

Please let me know if more information is needed to address this issue. Thank you!

EDIT:
After further investigation, it seems that the issue might not be solely caused by emoji characters. There are indications that the Mathematical Alphanumeric Symbols could be contributing to the problem as well. These symbols are part of the Unicode block ranging from U+1D400 to U+1D7FF and include various styled versions of Latin letters and digits (e.g., bold, italic, double-struck). They are typically used in mathematical notations but can appear in file names and text. Since they have different Unicode representations, they may not be handled properly by dwarfs.

Additionally, there is a confirmed issue with the replacement character (�), which is represented by U+FFFD in Unicode. This character often appears when there is an invalid or undecodable sequence in the text. It seems that dwarfs is having difficulty processing file names containing this character, which is likely causing it to crash.

Apologies for any confusion caused, as this issue has proven to be quite difficult to diagnose. One major challenge is that dwarfs doesn't provide a clear error message indicating which specific file or character caused the failure, making it tough to isolate the root cause.

@mhx
Owner

mhx commented Oct 5, 2024

Can you attach a zip archive with a set of files that will reproduce the issue? I'll take a look when I'm back from vacation. The test suite explicitly covers files with Unicode characters (including emojis) and runs fine on Windows.

@Zpovednice-adm
Author

Okay, here's something. At least one of those files is causing the crash.
dwarfs issue files.zip

@mhx
Owner

mhx commented Oct 8, 2024

Thanks, that helped a lot!

The root cause is that the non-empty file in the archive does indeed contain a character in the file name that cannot be represented in Unicode (this is what you see as �). This triggers an error when converting the file name to UTF-8 (which DwarFS uses internally).

This error causes mkdwarfs to try to log a message about the error that just happened. However, the log statement assumes that it's possible to output the file name, which triggers the same error again. So it fails to produce a log message and that empty log message makes the logging code fall over, triggering an assertion (the logging code didn't expect empty log messages). In the release build, this assertion is compiled out and the code performs some invalid memory access, resulting in the crash you're seeing.
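The failure chain described above (the error path tries to print the very name that caused the error) can be sketched in a few lines. This is only an illustration in Python, not the DwarFS logging code: `render_log_line` is a hypothetical helper, and a string holding a lone surrogate stands in for a file name that cannot be converted to UTF-8.

```python
def render_log_line(template: str, name: str) -> str:
    """Render a log line, escaping the name if it cannot be encoded as UTF-8.

    Hypothetical helper for illustration only.
    """
    try:
        line = template.format(name)
        line.encode("utf-8")  # raises UnicodeEncodeError on lone surrogates
        return line
    except UnicodeEncodeError:
        # Do not try to print the raw name again: escape it, so the error
        # path cannot fail in the same way as the original conversion did.
        escaped = name.encode("utf-8", "backslashreplace").decode("utf-8")
        return template.format(escaped) + " [name escaped]"


print(render_log_line("error reading {}", "bad\ud835.eml"))
```

The point of the design is that the fallback never touches the unconvertible data directly, so the logger always receives a non-empty, printable message.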

This is certainly fixable, but I'm not entirely sure yet exactly how to fix it.

The logging code crashing is obviously a bug, and for that bit I already have a fix.

Producing a useful log message is going to be more challenging, but is likely doable with some Windows-specific code to convert the "raw" file name to Unicode without throwing an exception.

But given we can replace invalid file name characters by �, what should the overall behavior be? I'm leaning towards adding the file with the modified file name, although this would mean the stored file name is different from the original file name. Alternatively, files with invalid file names could be skipped.
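A minimal sketch of the "replace invalid characters by �" option, written in Python for brevity rather than as Windows-specific C++. It assumes the invalid names contain unpaired UTF-16 surrogates, which Windows permits in file names; `utf16_to_utf8_lossy` is a hypothetical name, not a DwarFS function.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def utf16_to_utf8_lossy(units):
    """Convert a sequence of 16-bit code units to UTF-8 without raising.

    Valid surrogate pairs are combined; unpaired surrogates become U+FFFD.
    """
    out = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:  # high surrogate
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
                out.append(chr(cp))
                i += 2
                continue
            out.append(REPLACEMENT)  # high surrogate without a partner
        elif 0xDC00 <= u <= 0xDFFF:
            out.append(REPLACEMENT)  # stray low surrogate
        else:
            out.append(chr(u))
        i += 1
    return "".join(out).encode("utf-8")


# A name cut off in the middle of a surrogate pair: "a", lone 0xD835, ".eml"
print(utf16_to_utf8_lossy([0x61, 0xD835, 0x2E, 0x65, 0x6D, 0x6C]).decode("utf-8"))
```

With this approach the stored name differs from the original, which is exactly the trade-off raised above: the file is kept, but its name can no longer be mapped back to the on-disk name byte for byte.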

@Zpovednice-adm
Author

Thanks for the detailed explanation!

I agree with your suggestion to add the file with the modified file name. While it means the stored file name would be different from the original, it seems like the more practical solution compared to skipping the file entirely. This way, at least the file will still be included in the archive, even if the name has been altered slightly.

I'm also encountering another issue. The process seems to work by first searching for all files, then scanning and hashing them. In my case, after completing this phase, the process stopped accessing the disk and started using around 20% of the CPU. It stayed in this state for about 3 minutes without showing any progress, and then it crashed again without any error message.

Looking forward to hearing your thoughts.

@mhx
Owner

mhx commented Oct 10, 2024

The CI workflow produces debug build artifacts for Windows, e.g. here, look for dwarfs-0.10.1-Windows-AMD64-debug.7z.

Instead of a silent crash, these will (hopefully) show an assertion / more details on why the program crashed. Maybe you can give this a try?

mhx added a commit that referenced this issue Oct 10, 2024
For some reason, Windows allows invalid UTF-16 characters in file names.
Try to handle these gracefully when converting to UTF-8.
@mhx
Owner

mhx commented Oct 10, 2024

@Zpovednice-adm
Author

Thanks a lot.

I’ll definitely try it, but probably sometime next week. My use case is quite time-consuming.

This got me thinking that I’d like to ask for some advice, if you don’t mind it being a bit off-topic, on how to set the command line parameters for the best possible result. To give you an idea of what I’m trying to do – I have a directory that I want to archive. These are various backups of different things over several years. But mostly, they are full copies of system disks of several (Windows) PCs. Therefore, I’m counting a lot on deduplication because these disks are surely full of duplicate files (or parts of files). The whole thing is 7.5TB and contains 17 million files. The goal is to reduce the disk space it occupies while still being able to access the files relatively quickly and conveniently. I’ve already applied KopiaUI to it, which reduced it to 3.7TB, and it took about 26 hours. I was curious how DwarFS would handle it. (Of course, I’d be happy to share the results if you’re interested.) So, could you advise me on how to set the parameters in the best way to achieve the best result, while also making sure it doesn’t take forever? (It might be important to note that I have about 16GB of free RAM available.)

Thanks in advance!

@mhx
Owner

mhx commented Oct 12, 2024

The holy grail of archiving: producing the best possible result regardless of input. :)

In all seriousness, if there was a set of "best possible parameters", I'd make it the default.

Most options involve trade-offs. If you want to maximize compression, it's likely going to be (much) slower and use (much) more memory. If you want to maximize access speed, you won't be able to maximize compression. If you have a good understanding of the data you're going to compress, this can guide you in tweaking different options. But that's going to be hard with a very heterogeneous set of data.

Here's what I'd do:

  • First, think about how you want to access the data later. This should affect the block size of the DwarFS image (-S). A large block size means that if you want to access a small file (say, 10KB), one or more file system blocks have to be decompressed. These blocks are typically several MB in size. If the chosen compression algorithm is slow to decompress (e.g. lzma), or if you want to access lots of small files, chances are a large block size will slow down access significantly (whether or not this is noticeable in your scenario is a different question, but it will be measurable). Also, fewer large blocks will fit in the block cache, which again can be detrimental for access performance. However, larger block sizes also mean (much) better compression. So if you want maximum compression, go for a larger block size (64 MB, or -S 26). If you want fast access, pick a smaller one (8/16 MB, or -S 23/-S 24).
  • If you know that there's a significant amount of already-compressed files (MP3, JPG, ZIP, all sorts of video) in the input, you can turn on the incompressible categorizer. This will significantly speed up the compression stage by not trying to compress the already compressed data.
  • If you know that there are uncompressed audio files (WAV, ...) in the input, turn on the pcmaudio categorizer.
  • While you can play with the -W / -w options to tweak the segmenter, this can become quite tedious and I don't recommend it unless you have massive amounts of time to spare or know that this will help compression based on knowing your data.
  • If you have enough temporary disk space, don't worry about the compression algorithm and its configuration yet. You can always change the algorithm later by re-writing the DwarFS image (see --recompress). I'd probably start with something like -C zstd:level=9 and once I have a working image, recompress it using different algorithms to see which one performs best. That way, you can skip the whole scanning / de-duplication stage and only focus on selecting the compression algorithm.
  • If you don't have the temporary disk space, choosing the compression algorithm, again, is a trade-off. lzma will, most of the time, give you the best compression, but at the expense of being about 10 times slower to decompress compared to zstd. Whether that slower decompression speed is something you'd notice is a different story and again, it's a question of how you're planning to access the data. If you're occasionally opening a couple of files, this choice won't matter. If you regularly want to search through the contents of thousands of files, it will.
  • There isn't much point in tweaking the "memory limit" (-L) unless you have very large blocks or many CPU cores. As a rule of thumb, if the memory limit is twice as large as the block size multiplied by the number of CPU cores, you're most likely fine.
  • Last but not least, once you have a DwarFS image, make sure you protect it from bit rot. A single bit flip in the metadata block of the DwarFS image can render your whole image useless.
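The arithmetic behind the block size and memory limit advice in the list above can be sketched as follows. The mapping from `-S` to bytes is inferred from the values quoted there (`-S 26` for 64 MB, `-S 23`/`-S 24` for 8/16 MB); the core count of 8 is purely an assumption for the example, and the function names are hypothetical.

```python
MIB = 1024 * 1024


def block_size_bytes(size_bits: int) -> int:
    """Block size implied by -S: 2**size_bits bytes (e.g. -S 26 -> 64 MiB)."""
    return 1 << size_bits


def suggested_memory_limit(size_bits: int, cpu_cores: int) -> int:
    """Rule of thumb from the list above: 2 x block size x number of CPU cores."""
    return 2 * block_size_bytes(size_bits) * cpu_cores


for bits in (23, 24, 26):
    print(
        f"-S {bits}: {block_size_bytes(bits) // MIB} MiB blocks, "
        f"memory limit of about {suggested_memory_limit(bits, 8) // MIB} MiB on 8 cores"
    )
```

Under these assumptions, even the largest block size fits comfortably within the 16 GB of RAM mentioned earlier in the thread.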

Definitely let me know your results!

mhx added a commit that referenced this issue Oct 12, 2024
For some reason, Windows allows invalid UTF-16 characters in file names.
Try to handle these gracefully when converting to UTF-8.
mhx added a commit that referenced this issue Oct 13, 2024
For some reason, Windows allows invalid UTF-16 characters in file names.
Try to handle these gracefully when converting to UTF-8.
@Zpovednice-adm
Author

Thanks a lot for the valuable advice.
I'll try ZSTD compression to start with – it was also used in the case of KopiaUI. I’ll see what comes out of it, and if necessary, I’ll try --recompress afterwards. That's a very good idea; disk space is not a problem for me. I just have one question about the --max-lookback-blocks option. I don’t think I fully understand how it works exactly, and mainly which phase it affects – whether it can also be changed during recompression, or if it already impacts the scanning phase.

Thanks again for your help, and I'll definitely let you know how it turns out.

@mhx
Owner

mhx commented Oct 15, 2024

It's only the block compression that can be changed after the fact.

I'm thinking about the possibility of metadata manipulation, but there's currently no clear plan. It should be possible to change the block size, too, once the metadata can be manipulated.

Everything else (window size/shift, lookback, categorizers, ...) cannot be changed later.

--max-lookback-blocks sets the number of blocks the segmenter can use to find matches. The segmenter receives a stream of data and checks if it has seen some of that data before. If a match is found by the segmenter, it will insert a reference to the earlier data instead of the data itself. The -W option determines the minimum size of a match, and --max-lookback-blocks determines how far back the segmenter will look. The output of the segmenter is the set of final file system blocks plus a list of "chunks" for each file, referencing the blocks. The blocks produced by the segmenter are passed on to the compression stage, but will be kept in memory for the segmenter for matching. Up to the last --max-lookback-blocks blocks are kept in memory. So that means the total lookback in bytes is the product of the block size and the number of lookback blocks. Hope that clears things up a bit.
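To make the effect of a bounded lookback concrete, here is a deliberately simplified toy in Python. It is not the DwarFS segmenter: the real one matches at a much finer granularity (governed by -W), whereas this sketch only compares whole units and keeps the last few in memory. The function names are hypothetical, and the block size is assumed to be 2**S bytes for a given -S value.

```python
from collections import deque


def total_lookback_bytes(size_bits: int, max_lookback_blocks: int) -> int:
    """Total lookback = block size x number of lookback blocks."""
    return (1 << size_bits) * max_lookback_blocks


def toy_segmenter(blocks, max_lookback_blocks):
    """Emit ("data", i) for new content and ("ref", j) for content that
    matches a block still held in the lookback window."""
    recent = deque(maxlen=max_lookback_blocks)  # (index, data) of retained blocks
    output = []
    for index, data in enumerate(blocks):
        match = next((i for i, seen in recent if seen == data), None)
        if match is not None:
            output.append(("ref", match))
        else:
            output.append(("data", index))
            recent.append((index, data))  # oldest entry is dropped when full
    return output


stream = [b"A", b"B", b"C", b"A"]
print(toy_segmenter(stream, 2))  # the repeated b"A" is out of reach
print(toy_segmenter(stream, 3))  # the repeated b"A" is found
print(total_lookback_bytes(24, 4) // (1024 * 1024), "MiB of lookback")
```

The toy shows why a larger lookback can find more duplicates at the cost of keeping more blocks in memory.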

@Zpovednice-adm
Author

Hello,
I apologize for the bad news.
I tried dwarfs-0.10.1-71-ga5b71e2cb3-Windows-AMD64-debug.7z on my original data, and once again, it crashed on the file "2019-12-23 13-50-24 𝗘𝗹𝗶𝗺𝗶𝗻𝗮𝘁𝗲 𝗕𝗮𝘁𝗵𝗿𝗼𝗼𝗺 𝗕𝗮𝘁𝗵𝗶𝗻𝗴 𝗛𝗮𝘇𝗮𝗿𝗱𝘀—𝗚𝗲𝘁 𝗮 𝗪𝗮𝗹𝗸 𝗜𝗻 𝗕𝗮𝘁𝗵𝘁𝘂𝗯 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶.eml". I got a Microsoft Visual C++ Runtime Library error message, as shown in the screenshot. Nothing was printed to the console. I had the log set to trace, which allowed me to see in the log the last successfully processed file "... [file_scanner.cpp:273] scanning file ...".
But it neither processed nor logged the problematic file.

[Screenshot: Microsoft Visual C++ Runtime Library error dialog]

@mhx
Owner

mhx commented Oct 17, 2024

Okay, time to get out some slightly bigger guns...

That "Debug Error!" dialog isn't very useful and it's been annoying me ever since I started porting DwarFS to Windows.

I've made a change to:

  1. disable that useless dialog and
  2. print a stack trace when abort() is called. This will also cover assertions, as they ultimately call abort() as well.

Please try again using dwarfs-0.10.1-76-g9e6ed1fec6-Windows-AMD64-debug.7z.

@Zpovednice-adm
Author

Zpovednice-adm commented Oct 17, 2024

So here's the stacktrace. But I don't know if it will help anything.

Caught signal 22
Stack trace (most recent call first):
#0  0x00007ff70dff091a in  ??
#1  0x00007ff70e9e5174 in  ??
#2  0x00007ff70e9638a8 in  ??
#3  0x00007ff70e9c7a4f in  ??
#4  0x00007ff70e95c1c7 in  ??
#5  0x00007ff70e95dbc7 in  ??
#6  0x00007ff70e95ddf9 in  ??
#7  0x00007ff70e941e5a in  ??
#8  0x00007ff70e93af2f in  ??
#9  0x00007ff8a2d528be in _chkstk
#10 0x00007ff8a2d02553 in RtlRaiseException
#11 0x00007ff8a2d022a6 in RtlRaiseException
#12 0x00007ff8a072b698 in RaiseException
#13 0x00007ff70e93ffe1 in  ??
#14 0x00007ff70da5131a in  ??
#15 0x00007ff70da4bd53 in  ??
#16 0x00007ff70da4bf31 in  ??
#17 0x00007ff70da55dc8 in  ??
#18 0x00007ff70da4b137 in  ??
#19 0x00007ff70da4af17 in  ??
#20 0x00007ff70da4b097 in  ??
#21 0x00007ff70da4ae67 in  ??
#22 0x00007ff70e016465 in  ??
#23 0x00007ff70e00e5f7 in  ??
#24 0x00007ff70e00e3d4 in  ??
#25 0x00007ff70d9b8f50 in  ??
#26 0x00007ff70eb1257b in  ??
#27 0x00007ff70e96183f in  ??
#28 0x00007ff70e95ee2d in  ??
#29 0x00007ff8a2d51c25 in RtlCaptureContext2
#30 0x00007ff70dcda892 in  ??
#31 0x00007ff70dcf097a in  ??
#32 0x00007ff70dce4ee5 in  ??
#33 0x00007ff70d9f91a0 in  ??
#34 0x00007ff70d949a4c in  ??
#35 0x00007ff70d9fee43 in  ??
#36 0x00007ff70d9ff076 in  ??
#37 0x00007ff70da00cd2 in  ??
#38 0x00007ff70d9ff2f9 in  ??
#39 0x00007ff70da01c5e in  ??
#40 0x00007ff70da01b13 in  ??
#41 0x00007ff70d9fef4c in  ??
#42 0x00007ff70da4a925 in  ??
#43 0x00007ff70e93b5b8 in  ??
#44 0x00007ff70e93b461 in  ??
#45 0x00007ff70e93b31d in  ??
#46 0x00007ff70e93b64d in  ??
#47 0x00007ff8a1547373 in BaseThreadInitThunk
#48 0x00007ff8a2cfcc90 in RtlUserThreadStart

@mhx
Owner

mhx commented Oct 18, 2024

Sorry, that's not what this is supposed to look like. :/

I have to admit that I haven't used Windows seriously for more than 20 years, so obviously my experience here is limited.

As it turns out, for the stack traces to work, you need a .pdb file alongside the .exe file and you need to run the .exe from within its directory, as otherwise it doesn't seem to find the .pdb.

I've made a new build and tried really hard to make sure it actually works this time. Please get dwarfs-0.10.1-78-g4a0dba5aec-Windows-AMD64-debug-stacktrace.7z and make sure you're running mkdwarfs from within its directory.

mhx self-assigned this on Oct 18, 2024
mhx added the bug label on Oct 18, 2024
@Zpovednice-adm
Author

Zpovednice-adm commented Oct 20, 2024

OK, here is a slightly more informative stack trace. I hope it helps.

Caught signal SIGABRT
Stack trace (most recent call first):
#0  0x00007ff6bdc00adf in dwarfs::`anonymous namespace'::fatal_signal_handler_win(int) at C:\actions-runner\_work\dwarfs\dwarfs\src\util.cpp:450
#1  0x00007ff6be62dfb4 in raise(int) at minkernel\crts\ucrt\src\appcrt\misc\signal.cpp:541
#2  0x00007ff6be5ac6e8 in abort() at minkernel\crts\ucrt\src\appcrt\startup\abort.cpp:64
#3  0x00007ff6be61088f in terminate() at minkernel\crts\ucrt\src\appcrt\misc\terminate.cpp:58
#4  0x00007ff6be5a5007 in FindHandler<__FrameHandler4>(EHExceptionRecord*, unsigned int*, _CONTEXT*, _xDISPATCHER_CONTEXT*, FH4::FuncInfo4*, unsigned int, int, unsigned int*) at D:\a\_work\1\s\src\vctools\crt\vcruntime\src\eh\frame.cpp:739
#5  0x00007ff6be5a6a07 in __InternalCxxFrameHandler<__FrameHandler4>(EHExceptionRecord*, unsigned int*, _CONTEXT*, _xDISPATCHER_CONTEXT*, FH4::FuncInfo4*, int, unsigned int*, unsigned int) at D:\a\_work\1\s\src\vctools\crt\vcruntime\src\eh\frame.cpp:396
#6  0x00007ff6be5a6c39 in __InternalCxxFrameHandlerWrapper<__FrameHandler4>(EHExceptionRecord*, unsigned int*, _CONTEXT*, _xDISPATCHER_CONTEXT*, FH4::FuncInfo4*, int, unsigned int*, unsigned int) at D:\a\_work\1\s\src\vctools\crt\vcruntime\src\eh\frame.cpp:236
#7  0x00007ff6be58ac9a in __CxxFrameHandler4(EHExceptionRecord*, unsigned int, _CONTEXT*, _xDISPATCHER_CONTEXT*) at D:\a\_work\1\s\src\vctools\crt\vcruntime\src\eh\risctrnsctrl.cpp:304
#8  0x00007ff6be583d6f in __GSHandlerCheck_EH4(_EXCEPTION_RECORD*, void*, _CONTEXT*, _DISPATCHER_CONTEXT*) at D:\a\_work\1\s\src\vctools\crt\vcstartup\src\gs\amd64\gshandlereh4.cpp:73
#9  0x00007ff9920d28be in _chkstk
#10 0x00007ff992082553 in RtlRaiseException
#11 0x00007ff9920822a6 in RtlRaiseException
#12 0x00007ff98f7eb698 in RaiseException
#13 0x00007ff6be588e21 in _CxxThrowException(void*, _s__ThrowInfo*) at D:\a\_work\1\s\src\vctools\crt\vcruntime\src\eh\throw.cpp:81
#14 0x00007ff6bd66132a in fmt::v10::detail::do_throw<std::runtime_error>(std::runtime_error&) at C:\vcpkg\buildtrees\fmt\src\10.2.1-a991065f88.clean\include\fmt\format.h:124
#15 0x00007ff6bd65bd63 in <lambda_2476853628ce5e4d38026ca625155c1d>::operator()(unsigned int, fmt::v10::basic_string_view<char>) at C:\vcpkg\buildtrees\fmt\src\10.2.1-a991065f88.clean\include\fmt\format-inl.h:1397
#16 0x00007ff6bd65bf41 in <lambda_22829297f0ddcbf21f33e8075ba356b8>::operator()(char*, char*) at C:\vcpkg\buildtrees\fmt\src\10.2.1-a991065f88.clean\include\fmt\format.h:672
#17 0x00007ff6bd665dd8 in fmt::v10::detail::for_each_codepoint<<lambda_2476853628ce5e4d38026ca625155c1d> >(fmt::v10::basic_string_view<char>, fmt::v10::detail::utf8_to_utf16::{ctor}::__l2::<lambda_2476853628ce5e4d38026ca625155c1d>) at C:\vcpkg\buildtrees\fmt\src\10.2.1-a991065f88.clean\include\fmt\format.h:680
#18 0x00007ff6bd65b147 in fmt::v10::detail::utf8_to_utf16::utf8_to_utf16(fmt::v10::basic_string_view<char>) at C:\vcpkg\buildtrees\fmt\src\10.2.1-a991065f88.clean\include\fmt\format-inl.h:1396
#19 0x00007ff6bd65af27 in fmt::v10::detail::write_console(int, fmt::v10::basic_string_view<char>) at C:\vcpkg\buildtrees\fmt\src\10.2.1-a991065f88.clean\include\fmt\format-inl.h:1444
#20 0x00007ff6bd65b0a7 in fmt::v10::detail::print(_iobuf*, fmt::v10::basic_string_view<char>) at C:\vcpkg\buildtrees\fmt\src\10.2.1-a991065f88.clean\include\fmt\format-inl.h:1468
#21 0x00007ff6bd65ae77 in fmt::v10::vprint(_iobuf*, fmt::v10::basic_string_view<char>, fmt::v10::basic_format_args<fmt::v10::basic_format_context<fmt::v10::appender, char> >) at C:\vcpkg\buildtrees\fmt\src\10.2.1-a991065f88.clean\include\fmt\format-inl.h:1478
#22 0x00007ff6bdc27955 in fmt::v10::print<std::basic_string_view<char, std::char_traits<char> > &>(_iobuf*, fmt::v10::basic_format_string<char, std::basic_string_view<char, std::char_traits<char> > &>, std::basic_string_view<char, std::char_traits<char> >&) at C:\actions-runner\_work\dwarfs\vcpkg-install-dwarfs\x64-windows-static\include\fmt\core.h:2941
#23 0x00007ff6bdc1ee07 in dwarfs::stream_logger::write_nolock(std::basic_string_view<char, std::char_traits<char> >) at C:\actions-runner\_work\dwarfs\dwarfs\src\logger.cpp:127
#24 0x00007ff6bdc1ebcc in dwarfs::stream_logger::write(dwarfs::logger::level_type, std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, char*, int) at C:\actions-runner\_work\dwarfs\dwarfs\src\logger.cpp:249
#25 0x00007ff6bd5c8f60 in dwarfs::level_logger::~level_logger() at C:\actions-runner\_work\dwarfs\dwarfs\include\dwarfs\logger.h:154
#26 0x00007ff6be72521b in `dwarfs::writer::internal::scanner_<dwarfs::prod_logger_policy>::add_entry'::`1'::catch$26() at C:\actions-runner\_work\dwarfs\dwarfs\src\writer\scanner.cpp:372
#27 0x00007ff6be5aa67f in _CallSettingFrame_LookupContinuationIndex() at D:\a\_work\1\s\src\vctools\crt\vcruntime\src\eh\amd64\handlers.asm:97
#28 0x00007ff6be5a7c6d in __FrameHandler4::CxxCallCatchBlock() at D:\a\_work\1\s\src\vctools\crt\vcruntime\src\eh\frame.cpp:1439
#29 0x00007ff9920d1c25 in RtlCaptureContext2
#30 0x00007ff6bd8ebad2 in dwarfs::writer::internal::scanner_<dwarfs::prod_logger_policy>::add_entry(std::filesystem::path&, std::shared_ptr<dwarfs::writer::internal::dir>*, dwarfs::writer::internal::progress&, dwarfs::writer::internal::file_scanner&, bool) at C:\actions-runner\_work\dwarfs\dwarfs\src\writer\scanner.cpp:370
#31 0x00007ff6bd9013aa in dwarfs::writer::internal::scanner_<dwarfs::prod_logger_policy>::scan_tree(std::filesystem::path&, dwarfs::writer::internal::progress&, dwarfs::writer::internal::file_scanner&) at C:\actions-runner\_work\dwarfs\dwarfs\src\writer\scanner.cpp:545
#32 0x00007ff6bd8fa0f5 in dwarfs::writer::internal::scanner_<dwarfs::prod_logger_policy>::scan(dwarfs::writer::filesystem_writer&, std::filesystem::path&, dwarfs::writer::writer_progress&, std::optional<std::span<std::filesystem::path const , -1> >, std::shared_ptr<dwarfs::file_access const >*) at C:\actions-runner\_work\dwarfs\dwarfs\src\writer\scanner.cpp:667
#33 0x00007ff6bd6091b0 in dwarfs::writer::scanner::scan(dwarfs::writer::filesystem_writer&, std::filesystem::path&, dwarfs::writer::writer_progress&, std::optional<std::span<std::filesystem::path const , -1> >, std::shared_ptr<dwarfs::file_access const >*) at C:\actions-runner\_work\dwarfs\dwarfs\include\dwarfs\writer\scanner.h:67
#34 0x00007ff6bd559a5c in dwarfs::tool::mkdwarfs_main(int, wchar_t**, dwarfs::tool::iolayer&) at C:\actions-runner\_work\dwarfs\dwarfs\tools\src\mkdwarfs_main.cpp:1366
#35 0x00007ff6bd60ee53 in dwarfs::tool::main_adapter::operator()(int, wchar_t**) at C:\actions-runner\_work\dwarfs\dwarfs\src\tool\main_adapter.cpp:49
#36 0x00007ff6bd60f086 in `dwarfs::tool::main_adapter::safe'::`2'::<lambda_1>::operator()() at C:\actions-runner\_work\dwarfs\dwarfs\src\tool\main_adapter.cpp:63
#37 0x00007ff6bd610ce2 in std::invoke<`dwarfs::tool::main_adapter::safe'::`2'::<lambda_1> &>(dwarfs::tool::main_adapter::safe::__l2::<lambda_1>&) at C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.40.33807\include\type_traits:1705
#38 0x00007ff6bd60f309 in std::_Func_impl_no_alloc<`dwarfs::tool::main_adapter::safe'::`2'::<lambda_1>, int>::_Do_call() at C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.40.33807\include\functional:876
#39 0x00007ff6bd611c6e in std::_Func_class<int>::operator()() at C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.40.33807\include\functional:920
#40 0x00007ff6bd611b23 in dwarfs::tool::safe_main(std::function<int __cdecl(void)>*) at C:\actions-runner\_work\dwarfs\dwarfs\src\tool\safe_main.cpp:40
#41 0x00007ff6bd60ef5c in dwarfs::tool::main_adapter::safe(int, wchar_t**) at C:\actions-runner\_work\dwarfs\dwarfs\src\tool\main_adapter.cpp:63
#42 0x00007ff6bd65a935 in wmain(int, wchar_t**) at C:\actions-runner\_work\dwarfs\dwarfs\tools\src\mkdwarfs.cpp:28
#43 0x00007ff6be5843f8 in invoke_main() at D:\a\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl:90
#44 0x00007ff6be5842a1 in __scrt_common_main_seh() at D:\a\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl:288
#45 0x00007ff6be58415d in __scrt_common_main() at D:\a\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl:330
#46 0x00007ff6be58448d in wmainCRTStartup(void*) at D:\a\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_wmain.cpp:16
#47 0x00007ff9900e7373 in BaseThreadInitThunk
#48 0x00007ff99207cc90 in RtlUserThreadStart

@mhx
Owner

mhx commented Oct 20, 2024

Thanks so much, that was helpful. Not quite to the extent that I had hoped, but definitely better than nothing. :)

I've actually managed to get the code to crash on my machine under some circumstances, but never with the same stack trace as on your machine.

I've made two changes:

  • Replaced (hopefully) all file path conversions with the "safe" conversions I introduced with the previous change. This got rid of the crashes on my machine.
  • Added code to (hopefully) catch the error I see happening on your machine and print a hexdump of the string that it is unable to write to the console.

Here's a new version including these changes: dwarfs-0.10.1-115-g53ac77f237-Windows-AMD64-debug-stacktrace.7z

It should (hopefully) run without crashing, but in either case I'd be very interested in the full output.

@Zpovednice-adm
Author

Hi,
I tested the latest version on a sample dataset where that problematic file name was.
From my point of view, everything went more or less exactly as it should. No error occurred. In the resulting image file, the file name was corrected to "2019-12-23 13-50-24 𝗘𝗹𝗶𝗺𝗶𝗻𝗮𝘁𝗲 𝗕𝗮𝘁𝗵𝗿𝗼𝗼𝗺 𝗕𝗮𝘁𝗵𝗶𝗻𝗴 𝗛𝗮𝘇𝗮𝗿𝗱𝘀—𝗚𝗲𝘁 𝗮 𝗪𝗮𝗹𝗸 𝗜𝗻 𝗕𝗮𝘁𝗵𝘁𝘂𝗯 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶�.eml"
However, I had logging set to trace, so I’m not sure if any potential warning might have been lost within the tracing, but I didn’t get any visible alert about an error or filename change (at least, I didn’t see anything like that).
But the entire process completed successfully, and the file was found, with its contents matching. So, from my perspective, it looks good.

As a final note for this issue, I’d like to ask something somewhat unrelated. – I admit I didn’t read through all the discussions, so I might be touching on a topic that has already been mentioned, but still – would it really be so technically challenging to implement some option to save an intermediate state of the application, either periodically or phase by phase, or perhaps a “pre-crash” state of the app or an "on-demand" state triggered by a key? This way, it would be possible to resume from that point rather than starting completely from the beginning.
Sorry if I’m already annoying you with my requests ;-) – I really appreciate all the effort you’ve put into actively trying to help me. On the other hand, it’s true that I’ve gotten quite attached to your project, and my computer has spent hundreds of hours running your code... (For example, after 20 hours of running your script, my (fucking) MS Windows decided to restart due to a required update... Which, of course, I understand is not your problem. But still, it would be nice if I didn’t have to start the process all over again. Please don’t take this as an unreasonable request.)

Thank you very much for your interest and help, regardless.

Please also keep in mind that English is not my native language, so everything you're reading from me has gone through a translator, and it might not come across exactly as I intended. Especially if something sounds offensive in any way, it's definitely a translation mistake, because my main intention is, in fact, gratitude.

@Zpovednice-adm
Author

And a small note regarding logging:
scanning: ?\S:\b\ZALOHA FOTEK\fotky_2016\novy fotak\6\jpg\DSC_2008.JPG
2,100,023 dirs, 0/0 soft/hard links, 12,142,778/17,931,612 files, 0 other
original size: 6.84 TiB, hashed: 2.266 TiB (14,134,318 files, 50.47 MiB/s)
scanned: 2.793 TiB (2,918,309 files, 42.42 MiB/s), categorizing: 0 B/s
saved by deduplication: 1.424 TiB (9,224,469 files), saved by segmenting: 0 B
filesystem: 0 B in 0 blocks (0 chunks, 2,918,308 fragments, 2,918,309/8,707,143 inodes)
compressed filesystem: 0 blocks/0 B written

shouldn’t that 50.47 MiB/s be multiplied by the number of CPU cores? Because then it would roughly correspond to what I see in the Task Manager in Windows, where I can see how many MiB/s were read from the HDD...

@mhx
Owner

mhx commented Oct 21, 2024

I wouldn't have been able to tell that you were using a translator. I really appreciate the time and effort you've put not only into reproducing the errors, but also into writing this up properly.

> So, from my perspective, it looks good.

That's great! :)

> However, I had logging set to trace, so I’m not sure if any potential warning might have been lost within the tracing,

Probably not lost, but more likely very hard to spot.

> but I didn’t get any visible alert about an error or filename change (at least, I didn’t see anything like that).

There should have been at least errors for the file(s) with changes to their file name. If you could at some point run this again with the default log level and check if there's anything suspicious; in particular, I'm looking for anything that would look like this:

Unexpected error writing string:
00000000  0d 1b 5b 41 1b 5b 41 1b  5b 41 1b 5b 41 1b 5b 41  |..[A.[A.[A.[A.[A|
00000010  1b 5b 41 1b 5b 41 1b 5b  41 1b 5b 41 49 20 31 38  |.[A.[A.[A.[AI 18|
00000020  3a 35 35 3a 35 33 2e 34  31 35 31 30 38 20 73 61  |:55:53.415108 sa|
00000030  76 69 6e 67 20 73 79 6d  6c 69 6e 6b 73 20 74 61  |ving symlinks ta|
00000040  62 6c 65 2e 2e 2e 20 5b  32 33 2e 32 75 73 5d 1b  |ble... [23.2us].|
00000050  5b 4b 0a e2 8e af e2 8e  af e2 8e af e2 8e af e2  |[K..............|
00000060  8e af e2 8e af e2 8e af  e2 8e af e2 8e af e2 8e  |................|
[...]

If you don't see anything like this, that means I've somehow managed to fix the issue you ran into. :)
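For reference, a dump in the same layout as the one quoted above can be produced with a few lines of Python. This is a sketch of the output format only, not the code DwarFS uses; `hexdump` is a hypothetical helper.

```python
def hexdump(data: bytes) -> str:
    """Format bytes as offset, two groups of 8 hex bytes, and an ASCII column."""
    lines = []
    for offset in range(0, len(data), 16):
        row = data[offset:offset + 16]
        left = " ".join(f"{b:02x}" for b in row[:8])
        right = " ".join(f"{b:02x}" for b in row[8:])
        # Non-printable bytes are shown as "." in the text column.
        text = "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in row)
        lines.append(f"{offset:08x}  {left:<23}  {right:<23}  |{text}|")
    return "\n".join(lines)


# First row of the example dump: a carriage return followed by ESC [ A sequences.
print(hexdump(bytes.fromhex("0d1b5b411b5b411b5b411b5b411b5b41")))
```

Reading such a dump is mostly a matter of spotting patterns: in the quoted example, `1b 5b 41` is the ANSI "cursor up" escape sequence, and the repeated `e2 8e af` is the UTF-8 encoding of U+23AF, a horizontal line character.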

> would it really be so technically challenging to implement some option to save an intermediate state of the application, either periodically or phase by phase, or perhaps a “pre-crash” state of the app or an "on-demand" state triggered by a key? This way, it would be possible to resume from that point rather than starting completely from the beginning.

It would be technically challenging. A "pre-crash" version would be even more challenging.

Here's the problem: There aren't that many "phases" with a singular transition state.

What do I mean by that? mkdwarfs is highly asynchronous. Many tasks are performed in parallel. Processing isn't strictly linear, meaning that you cannot stop at an arbitrary point and know exactly what the state of the program is.

To remind myself of what's going on, I've made this sequence diagram. There are 3 "phases":

  1. The "startup" phase is just setup; no input has been processed by the time it finishes. Having a resumable "checkpoint" after this phase is useless.
  2. The "scanning" phase: this is where mkdwarfs builds its internal view of the input data, and performs tasks like categorization, file-level deduplication, and similarity hashing.
  3. The "segmentation" phase: this includes similarity ordering, block-level deduplication (segmentation), metadata packing, and block compression.

So the only somewhat feasible "checkpoint" would be between phases 2 and 3, because this is the only time when mkdwarfs is running synchronously. Saving/resuming the state at this point would still be non-trivial, but at least feasible. Trying to save/resume in the middle of phase 2 or phase 3 is not impossible, but it would make the code much more complex and nearly impossible to test.

I think having the possibility to "checkpoint" after the scanning phase would be nice, although you'd have to be aware that when you resume (which could be two years later in theory), the input data may have significantly changed. I'll add it to my huge bucket of ideas.
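To make the "input may have changed" caveat concrete, here is a minimal sketch of a checkpoint taken after a scanning phase. It is not how mkdwarfs works or would implement this: `fingerprint`, `save_checkpoint`, and `load_checkpoint` are hypothetical names, the scan result is treated as an opaque JSON-serializable value, and comparing file sizes and modification times is only one possible (and imperfect) staleness check.

```python
import json
from pathlib import Path


def fingerprint(root: Path) -> dict[str, list[int]]:
    """Cheap description of the input tree: size and mtime of every file."""
    result = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            st = path.stat()
            result[path.relative_to(root).as_posix()] = [st.st_size, st.st_mtime_ns]
    return result


def save_checkpoint(root: Path, scan_result, checkpoint: Path) -> None:
    """Store the scan result together with a fingerprint of the input it describes."""
    payload = {"fingerprint": fingerprint(root), "scan_result": scan_result}
    checkpoint.write_text(json.dumps(payload), encoding="utf-8")


def load_checkpoint(root: Path, checkpoint: Path):
    """Return the saved scan result, or None if the input no longer matches."""
    payload = json.loads(checkpoint.read_text(encoding="utf-8"))
    if payload["fingerprint"] != fingerprint(root):
        return None  # input changed since the checkpoint was written: rescan
    return payload["scan_result"]
```

The checkpoint file should live outside the input tree, otherwise it would change the fingerprint it is compared against.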

Any more granular "checkpointing" is completely off the table. It would make the code much more complex, likely slower, and almost impossible to test.

Please don’t take this as an unreasonable request.

The request is definitely reasonable! But the options are really limited.

In the meantime, the best thing you can do to limit the amount of time it takes to create an image, as mentioned earlier, is to use a fast compression algorithm and recompress the image later.

shouldn’t that 50.47 MiB/s be multiplied by the number of CPU cores? Because then it would roughly correspond to what I see in the Task Manager in Windows, where I can see how many MiB/s were read from the HDD...

Yes, the speed is per-core. Hashing is the first thing that happens, so this one is typically limited by I/O speed. Both categorization and similarity hashing can show much higher speeds because they happen after hashing and benefit from the fact that the files are already cached. Take these numbers with a grain of salt: the way they're currently defined (per-core wallclock speed) is probably not what you'd intuitively expect.
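As a rough worked example of the multiplication discussed above: the 50.47 MiB/s figure comes from the question, while the worker count of 8 is a made-up value, and the result is only an approximation under the assumption that every worker sustains the reported rate.

```python
def aggregate_throughput(per_core_mib_s: float, workers: int) -> float:
    """Approximate total throughput if every worker sustains the per-core rate."""
    return per_core_mib_s * workers


# 50.47 MiB/s is the per-core figure quoted above; 8 workers is an assumed value.
total = aggregate_throughput(50.47, 8)
print(f"{total:.2f} MiB/s")  # prints "403.76 MiB/s"
```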

mhx added a commit that referenced this issue Nov 18, 2024
For some reason, Windows allows invalid UTF-16 characters in file names.
Try to handle these gracefully when converting to UTF-8.
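To illustrate what "invalid UTF-16" means here: a file name can contain an unpaired surrogate code unit, which has no valid UTF-8 encoding. The sketch below shows one common way to handle this, replacing each unpaired surrogate with U+FFFD. It is an assumption for illustration only; the commit itself may handle these names differently, and `utf16_units_to_utf8` is a hypothetical helper.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def utf16_units_to_utf8(units: list[int]) -> bytes:
    """Convert UTF-16 code units to UTF-8, replacing unpaired surrogates with U+FFFD."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:  # high surrogate: needs a low surrogate next
            nxt = units[i + 1] if i + 1 < len(units) else None
            if nxt is not None and 0xDC00 <= nxt <= 0xDFFF:
                # Valid pair: combine into one supplementary-plane code point.
                code_points.append(0x10000 + ((unit - 0xD800) << 10) + (nxt - 0xDC00))
                i += 2
                continue
            code_points.append(REPLACEMENT)  # high surrogate without a partner
        elif 0xDC00 <= unit <= 0xDFFF:
            code_points.append(REPLACEMENT)  # low surrogate without a preceding high
        else:
            code_points.append(unit)  # ordinary BMP code unit
        i += 1
    return "".join(map(chr, code_points)).encode("utf-8")
```

A valid pair such as `[0xD83D, 0xDEA8]` (the 🚨 from the reported file name) round-trips unchanged, while `[0x61, 0xD83D, 0x62]` becomes `a`, U+FFFD, `b` instead of causing a conversion error.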
mhx added a commit that referenced this issue Nov 18, 2024
mhx added a commit that referenced this issue Nov 19, 2024
For some reason, Windows allows invalid UTF-16 characters in file names.
Try to handle these gracefully when converting to UTF-8.
mhx added a commit that referenced this issue Nov 19, 2024
mhx added a commit that referenced this issue Nov 20, 2024
For some reason, Windows allows invalid UTF-16 characters in file names.
Try to handle these gracefully when converting to UTF-8.
mhx added a commit that referenced this issue Nov 20, 2024
mhx added a commit that referenced this issue Nov 20, 2024
For some reason, Windows allows invalid UTF-16 characters in file names.
Try to handle these gracefully when converting to UTF-8.
mhx added a commit that referenced this issue Nov 20, 2024
mhx added a commit that referenced this issue Nov 20, 2024
For some reason, Windows allows invalid UTF-16 characters in file names.
Try to handle these gracefully when converting to UTF-8.
mhx added a commit that referenced this issue Nov 20, 2024
@mhx
Owner

mhx commented Nov 21, 2024

It's been a while (I've had an unscheduled hospital visit), but I was finally able to reproduce this and come up with a (hopefully) proper fix: converting the localized error message from the code page character set to UTF-8. I'm planning to push a bugfix release in the next couple of days. If you could confirm that the current binaries from the main branch fix the issue for you, that would be highly appreciated.
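The idea behind the fix can be sketched as follows. This is not the dwarfs code: the helper `to_utf8`, the sample message, and the choice of cp1250 as the code page are all assumptions made for illustration.

```python
def to_utf8(message: bytes, code_page: str) -> bytes:
    """Re-encode a message from a legacy code page to UTF-8.

    Bytes that are undefined in the code page become U+FFFD instead of raising.
    """
    return message.decode(code_page, errors="replace").encode("utf-8")


# Made-up sample: "Přístup odepřen" encoded in cp1250 (an assumed code page).
raw = b"P\xf8\xedstup odep\xf8en"
converted = to_utf8(raw, "cp1250")
```

The raw bytes are not valid UTF-8 (the byte `0xF8` can never appear in UTF-8), so any code that treats them as UTF-8 without converting first will fail, whereas `converted` decodes cleanly.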

mhx added the fixready label Nov 21, 2024
mhx added this to the v0.10.2 milestone Nov 21, 2024