Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TSAN theLangBindHelper_ImplicitTransactions_InterProcess failure #6739

Closed
nicola-cab opened this issue Jun 22, 2023 · 6 comments · Fixed by #6959
Closed

TSAN theLangBindHelper_ImplicitTransactions_InterProcess failure #6739

nicola-cab opened this issue Jun 22, 2023 · 6 comments · Fixed by #6959
Assignees

Comments

@nicola-cab
Copy link
Member

Some data race has been detected while running the test.

thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
[2023/04/06 15:08:43.482]   * frame #0: 0x00007fff205e1342 libsystem_kernel.dylib`flock + 10
[2023/04/06 15:08:43.482]     frame #1: 0x00000001041d6de8 realm-tests`realm::util::File::lock(bool, bool) + 56
[2023/04/06 15:08:43.482]     frame #2: 0x00000001040874de realm-tests`realm::DB::open(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool, realm::DBOptions const&) + 2046
[2023/04/06 15:08:43.483]     frame #3: 0x000000010408f72b realm-tests`realm::DB::create(realm::Replication&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, realm::DBOptions const&) + 123
[2023/04/06 15:08:43.483]     frame #4: 0x0000000103886c15 realm-tests`Realm_UnitTest__LangBindHelper_ImplicitTransactions_InterProcess::test_run() + 4149
[2023/04/06 15:08:43.483]     frame #5: 0x00000001038d0f15 realm-tests`realm::test_util::unit_test::RegisterTest<Realm_UnitTest__LangBindHelper_ImplicitTransactions_InterProcess>::run_test(realm::test_util::unit_test::TestContext&) + 21
[2023/04/06 15:08:43.483]     frame #6: 0x0000000103ffbca8 realm-tests`realm::test_util::unit_test::TestList::ThreadContextImpl::run(realm::test_util::unit_test::TestList::SharedContextImpl::Entry, realm::util::UniqueLock&) + 232
[2023/04/06 15:08:43.483]     frame #7: 0x0000000103ffba3b realm-tests`realm::test_util::unit_test::TestList::ThreadContextImpl::nonconcur_run() + 139
[2023/04/06 15:08:43.483]     frame #8: 0x0000000103ffaaad realm-tests`realm::test_util::unit_test::TestList::run(realm::test_util::unit_test::TestList::Config) + 5149
[2023/04/06 15:08:43.483]     frame #9: 0x000000010370507a realm-tests`test_all(std::__1::shared_ptr<realm::util::Logger> const&) + 8026

https://spruce.mongodb.com/task/realm_core_stable_macos_release_test_on_exfat_b2ed3201a306b1b00f5fb1cc4d64d84e7e603c3f_23_04_06_14_22_32/logs?execution=0

@ironage
Copy link
Contributor

ironage commented Jun 28, 2023

I looked into the logs and this doesn't appear to be a data race. The LangBindHelper_ImplicitTransactions_InterProcess test is timing out and triggering the hang analyzer. The main problem is an exception: [2023/04/06 15:08:43.531] /System/Volumes/Data/data/mci/0f260d4d7448c8c40d56c8486459ba64/realm-core/test/test_lang_bind_helper.cpp:3341: ERROR in LangBindHelper_ImplicitTransactions_InterProcess: Unhandled exception after line 3341 realm::SystemError: flock() failed
It is unclear why acquiring the lock fails because the actual error code is not included, but it could possibly be because the disk was out of space? The consequence of the writer failing is that the other spawned processes wait forever which looks like this:

[2023/04/06 15:08:41.718] * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
[2023/04/06 15:08:41.718]   * frame #0: 0x00007fff205e1342 libsystem_kernel.dylib`flock + 10
[2023/04/06 15:08:41.718]     frame #1: 0x000000010eaf2de8 realm-tests`realm::util::File::lock(bool, bool) + 56
[2023/04/06 15:08:41.718]     frame #2: 0x000000010e9a8be4 realm-tests`realm::DB::do_begin_write() + 84
[2023/04/06 15:08:41.718]     frame #3: 0x000000010e9a67f7 realm-tests`realm::DB::start_write(bool) + 119
[2023/04/06 15:08:41.718]     frame #4: 0x000000010e1a3d25 realm-tests`(anonymous namespace)::multiple_trackers_writer_thread(std::__1::shared_ptr<realm::DB>) + 213
[2023/04/06 15:08:41.718]     frame #5: 0x000000010e1a2e08 realm-tests`Realm_UnitTest__LangBindHelper_ImplicitTransactions_InterProcess::test_run() + 4648
[2023/04/06 15:08:41.718]     frame #6: 0x000000010e1ecf15 realm-tests`realm::test_util::unit_test::RegisterTest<Realm_UnitTest__LangBindHelper_ImplicitTransactions_InterProcess>::run_test(realm::test_util::unit_test::TestContext&) + 21
[2023/04/06 15:08:41.718]     frame #7: 0x000000010e917ca8 realm-tests`realm::test_util::unit_test::TestList::ThreadContextImpl::run(realm::test_util::unit_test::TestList::SharedContextImpl::Entry, realm::util::UniqueLock&) + 232
[2023/04/06 15:08:41.718]     frame #8: 0x000000010e917a3b realm-tests`realm::test_util::unit_test::TestList::ThreadContextImpl::nonconcur_run() + 139
[2023/04/06 15:08:41.718]     frame #9: 0x000000010e916aad realm-tests`realm::test_util::unit_test::TestList::run(realm::test_util::unit_test::TestList::Config) + 5149
[2023/04/06 15:08:41.718]     frame #10: 0x000000010e02107a realm-tests`test_all(std::__1::shared_ptr<realm::util::Logger> const&) + 8026
[2023/04/06 15:08:41.718]     frame #11: 0x000000010e01efe9 realm-tests`main + 41
[2023/04/06 15:08:41.718]     frame #12: 0x00007fff2062ef3d libdyld.dylib`start + 1

I'll try adding the error code to the exception string so that if this happens again we can get more info about the original error.

@kiburtse
Copy link
Contributor

kiburtse commented Aug 3, 2023

Still happening: ci logs

@ironage the timestamp for "flock() failed" is after the hang analyser was reported to start:

[2023/04/06 15:08:26.833] Command 'shell.exec' in function 'run hang analyzer' failed: shell script encountered problem: exit code 1.

so it's just reported error after every process was forcefully terminated.

The real problem seems to be an assertion (what linked ci run log shows), which is hanging the whole test and triggers analyser later:

[2023/04/06 14:55:36.194] /System/Volumes/Data/data/mci/0f260d4d7448c8c40d56c8486459ba64/realm-core/test/test_lang_bind_helper.cpp:3264: [realm-core-13.9.0] Assertion failed: reader_process_timed_out
[2023/04/06 14:55:36.194] 0   realm-tests                         0x0000000103643ffc _ZN5realm4utilL18terminate_internalERNSt3__118basic_stringstreamIcNS1_11char_traitsIcEENS1_9allocatorIcEEEE + 28
[2023/04/06 14:55:36.194] 1   realm-tests                         0x0000000103643f01 _ZN5realm4util9terminateEPKcS2_lOSt16initializer_listINS0_9PrintableEE + 385
[2023/04/06 14:55:36.194] 2   realm-tests                         0x0000000102cde553 _ZN12_GLOBAL__N_131multiple_trackers_reader_threadERN5realm9test_util9unit_test11TestContextENSt3__110shared_ptrINS0_2DBEEE + 819
[2023/04/06 14:55:36.194] 3   realm-tests                         0x0000000102cdcf4c _ZN64Realm_UnitTest__LangBindHelper_ImplicitTransactions_InterProcess8test_runEv + 4972
[2023/04/06 14:55:36.194] 4   realm-tests                         0x0000000102d26f15 _ZN5realm9test_util9unit_test12RegisterTestI64Realm_UnitTest__LangBindHelper_ImplicitTransactions_InterProcessE8run_testERNS1_11TestContextE + 21
[2023/04/06 14:55:36.194] 5   realm-tests                         0x0000000103451ca8 _ZN5realm9test_util9unit_test8TestList17ThreadContextImpl3runENS2_17SharedContextImpl5EntryERNS_4util10UniqueLockE + 232
[2023/04/06 14:55:36.194] 6   realm-tests                         0x0000000103451a3b _ZN5realm9test_util9unit_test8TestList17ThreadContextImpl13nonconcur_runEv + 139
[2023/04/06 14:55:36.194] 7   realm-tests                         0x0000000103450aad _ZN5realm9test_util9unit_test8TestList3runENS2_6ConfigE + 5149
[2023/04/06 14:55:36.194] 8   realm-tests                         0x0000000102b5b07a _Z8test_allRKNSt3__110shared_ptrIN5realm4util6LoggerEEE + 8026
[2023/04/06 14:55:36.194] 9   realm-tests                         0x0000000102b58fe9 main + 41
[2023/04/06 14:55:36.194] 10  libdyld.dylib                       0x00007fff2062ef3d start + 1
[2023/04/06 14:55:36.198] signal handler: 6

Although there is no assertion in latest failure. Weird.

@ironage
Copy link
Contributor

ironage commented Sep 1, 2023

Another failure. The reader processes time out after their allowed 17 minutes of waiting and finally the hang analyzer runs, showing that the writer process is blocked waiting for a semaphore during commit:

[2023/08/31 21:19:22.298]   * frame #0: 0x00007fff209092e6 libsystem_kernel.dylib`semaphore_wait_trap + 10
[2023/08/31 21:19:22.298]     frame #1: 0x00007fff20794c9b libdispatch.dylib`_dispatch_sema4_wait + 16
[2023/08/31 21:19:22.298]     frame #2: 0x00007fff2079516d libdispatch.dylib`_dispatch_semaphore_wait_slow + 98
[2023/08/31 21:19:22.298]     frame #3: 0x000000010f8e1778 realm-tests`realm::DB::low_level_commit(unsigned long long, realm::Transaction&, bool) + 552
[2023/08/31 21:19:22.298]     frame #4: 0x000000010f8e14c2 realm-tests`realm::DB::do_commit(realm::Transaction&, bool) + 242
[2023/08/31 21:19:22.298]     frame #5: 0x000000010fa21855 realm-tests`realm::Transaction::commit() + 101
[2023/08/31 21:19:22.298]     frame #6: 0x000000010f0d3037 realm-tests`(anonymous namespace)::multiple_trackers_writer_thread(std::__1::shared_ptr<realm::DB>) + 951
[2023/08/31 21:19:22.298]     frame #7: 0x000000010f0d1e38 realm-tests`Realm_UnitTest__LangBindHelper_ImplicitTransactions_InterProcess::test_run() + 4648
[2023/08/31 21:19:22.298]     frame #8: 0x000000010f11b1c5 realm-tests`realm::test_util::unit_test::RegisterTest<Realm_UnitTest__LangBindHelper_ImplicitTransactions_InterProcess>::run_test(realm::test_util::unit_test::TestContext&) + 21
[2023/08/31 21:19:22.298]     frame #9: 0x000000010f84b948 realm-tests`realm::test_util::unit_test::TestList::ThreadContextImpl::run(realm::test_util::unit_test::TestList::SharedContextImpl::Entry, realm::util::UniqueLock&) + 232
[2023/08/31 21:19:22.298]     frame #10: 0x000000010f84b6cd realm-tests`realm::test_util::unit_test::TestList::ThreadContextImpl::nonconcur_run() + 141
[2023/08/31 21:19:22.298]     frame #11: 0x000000010f84a72d realm-tests`realm::test_util::unit_test::TestList::run(realm::test_util::unit_test::TestList::Config) + 5149
[2023/08/31 21:19:22.298]     frame #12: 0x000000010ef5174a realm-tests`test_all(std::__1::shared_ptr<realm::util::Logger> const&) + 8026
[2023/08/31 21:19:22.298]     frame #13: 0x000000010ef4f6b9 realm-tests`main + 41
[2023/08/31 21:19:22.298]     frame #14: 0x00007fff20959f3d libdyld.dylib`start + 1

@kiburtse
Copy link
Contributor

kiburtse commented Sep 4, 2023

I can reproduce similar issue with hanging reader processes. Essentially on macos i plug simple sdcard formated with exFAT, and run this test on it like this:

$ >> UNITTEST_FILTER=*InterProcess ttime ./test/realm-tests.app/Contents/MacOS/realm-tests /Volumes/NO\ NAME/_r/                                                                                                                                     
                                                                                                                          
Realm version: 13.20.0 with Debug Enabled                                                                                                                                                                                                            
Platform: macOS Darwin 22.6.0 Darwin Kernel Version 22.6.0: Wed Jul  5 22:22:05 PDT 2023; root:xnu-8796.141.3~6/RELEASE_ARM64_T6000 arm64                                                                                                            
Encryption: Enabled at compile-time (always encrypt = no)                                                                                                                                                                                            
                                                                                                                                                                                                                                                     
REALM_MAX_BPNODE_SIZE = 1000                                                                                                                                                                                                                         
REALM_MEMDEBUG = Disabled                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                     
sizeof (size_t) * 8 = 64                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                     
Compiler supported SSE (auto detect):       No                                                                                                                                                                                                       
This CPU supports SSE (auto detect):        None                                                                                                                                                                                                     
Compiler supported AVX (auto detect):       No                                                                                                                                                                                                       
This CPU supports AVX (AVX1) (auto detect): No                                                                            
                                                                                                                          
UNITTEST_RANDOM_SEED:                       1860404304                                                                                                                                                                                               
Test path prefix:                           /Volumes/NO NAME/_r/                                                                                                                                                                                     
Test resource path:                         /Users/user/r/core/.bdd/test/realm-tests.app/Contents/Resources/                                                                                                                                         
                                                                                                                          
Number of test threads: 10 (default)                                                                                      
(Use UNITTEST_THREADS=1 to serialize testing)                                                                                                                                                                                                        
                                                                                                                                                                                                                                                     
/Users/user/r/core/test/test_lang_bind_helper.cpp:3024: [realm-core-13.20.0] Assertion failed: reader_process_timed_out                                                                                                                              
0   realm-tests                         0x0000000103190e64 _ZN5realm4utilL18terminate_internalERNSt3__118basic_stringstreamIcNS1_11char_traitsIcEENS1_9allocatorIcEEEE + 28                                                                          
1   realm-tests                         0x0000000103190e44 _ZN5realm4util19terminate_with_infoEPKcS2_lS2_OSt16initializer_listINS0_9PrintableEE + 308                                                                                                
2   realm-tests                         0x0000000103190d10 _ZN5realm4util19terminate_with_infoEPKcS2_lS2_OSt16initializer_listINS0_9PrintableEE + 0                                                                                                  
3   realm-tests                         0x000000010268e5b0 _ZN12_GLOBAL__N_131multiple_trackers_reader_threadERN5realm9test_util9unit_test11TestContextENSt3__110shared_ptrINS0_2DBEEE + 940                
4   realm-tests                         0x000000010268d870 _ZN64Realm_UnitTest__LangBindHelper_ImplicitTransactions_InterProcess8test_runEv + 2604                                                                                                   
5   realm-tests                         0x00000001026d5488 _ZN5realm9test_util9unit_test12RegisterTestI64Realm_UnitTest__LangBindHelper_ImplicitTransactions_InterProcessE8run_testERNS1_11TestContextE + 40                                         
6   realm-tests                         0x0000000102d4739c _ZN5realm9test_util9unit_test8TestList17ThreadContextImpl3runENS2_17SharedContextImpl5EntryERNS_4util10UniqueLockE + 172                                                                  
7   realm-tests                         0x0000000102d46e40 _ZN5realm9test_util9unit_test8TestList17ThreadContextImpl13nonconcur_runEv + 172                                                                                                          
8   realm-tests                         0x0000000102d45f94 _ZN5realm9test_util9unit_test8TestList3runENS2_6ConfigE + 3240                                                                                                                            
9   realm-tests                         0x00000001028371a4 _ZN12_GLOBAL__N_19run_testsERKNSt3__110shared_ptrIN5realm4util6LoggerEEE + 4300                                                                                                           
10  realm-tests                         0x0000000102835518 _Z8test_allRKNSt3__110shared_ptrIN5realm4util6LoggerEEE + 244                                                                                                                             
11  realm-tests                         0x00000001022738fc main + 80                                                      
12  dyld                                0x00000001888e3f28 start + 2236                                                   
!!! IMPORTANT: Please report this at https://github.com/realm/realm-core/issues/new/choosesignal handler: 6                                                                                                                                          
trace: 0   realm-tests                         0x0000000102d42be8 _ZN5realm9test_utilL14signal_handlerEi + 68                                                                                                                                        
1   libsystem_platform.dylib            0x0000000188c6aa24 _sigtramp + 56                                                 
2   libsystem_pthread.dylib             0x0000000188c3bc28 pthread_kill + 288                                             
3   libsystem_c.dylib                   0x0000000188b49ae8 abort + 180                                                    
4   realm-tests                         0x0000000103190cdc _ZN5realm4util9terminateEPKcS2_lOSt16initializer_listINS0_9PrintableEE + 0                                                                                                                
5   realm-tests                         0x0000000103190f6c _ZN5realm4utilL18terminate_internalERNSt3__118basic_stringstreamIcNS1_11char_traitsIcEENS1_9allocatorIcEEEE + 292
6   realm-tests                         0x0000000103190e44 _ZN5realm4util19terminate_with_infoEPKcS2_lS2_OSt16initializer_listINS0_9PrintableEE + 308                                                                                                
7   realm-tests                         0x0000000103190d10 _ZN5realm4util19terminate_with_infoEPKcS2_lS2_OSt16initializer_listINS0_9PrintableEE + 0                                                                                                  
8   realm-tests                         0x000000010268e5b0 _ZN12_GLOBAL__N_131multiple_trackers_reader_threadERN5realm9test_util9unit_test11TestContextENSt3__110shared_ptrINS0_2DBEEE + 940
9   realm-tests                         0x000000010268d870 _ZN64Realm_UnitTest__LangBindHelper_ImplicitTransactions_InterProcess8test_runEv + 2604                                                                                                   
10  realm-tests                         0x00000001026d5488 _ZN5realm9test_util9unit_test12RegisterTestI64Realm_UnitTest__LangBindHelper_ImplicitTransactions_InterProcessE8run_testERNS1_11TestContextE + 40
11  realm-tests                         0x0000000102d4739c _ZN5realm9test_util9unit_test8TestList17ThreadContextImpl3runENS2_17SharedContextImpl5EntryERNS_4util10UniqueLockE + 172
12  realm-tests                         0x0000000102d46e40 _ZN5realm9test_util9unit_test8TestList17ThreadContextImpl13nonconcur_runEv + 172                                                                                                          
13  realm-tests                         0x0000000102d45f94 _ZN5realm9test_util9unit_test8TestList3runENS2_6ConfigE + 3240                                                                                                                            
14  realm-tests                         0x00000001028371a4 _ZN12_GLOBAL__N_19run_testsERKNSt3__110shared_ptrIN5realm4util6LoggerEEE + 4300                                                                                                           
15  realm-tests                         0x0000000102835518 _Z8test_allRKNSt3__110shared_ptrIN5realm4util6LoggerEEE + 244                                                                                                                             
16  realm-tests                         0x00000001022738fc main + 80                                                      
17  dyld                                0x00000001888e3f28 start + 2236                                                   
/Users/user/r/core/test/test_lang_bind_helper.cpp:3024: [realm-core-13.20.0] Assertion failed: reader_process_timed_out 

every process stays in limbo after those assertions (looks similar to how CI reports this failure):

88883 user        40   0  389G   336 ?   0.0  0.0  0:00.00 │  │  └─ /usr/bin/time ./test/realm-tests.app/Contents/MacOS/realm-tests /Volumes/NO NAME/_r/
88884 user        32   0  389G 13408 ?   0.0  0.0  0:00.00 │  │     └─ ./test/realm-tests.app/Contents/MacOS/realm-tests /Volumes/NO NAME/_r/
88886 user        40   0  389G 73536 ?   0.0  0.2  0:00.00 │  │        ├─ ./test/realm-tests.app/Contents/MacOS/realm-tests /Volumes/NO NAME/_r/
88887 user        40   0  389G 74000 ?   0.0  0.2  0:00.00 │  │        ├─ ./test/realm-tests.app/Contents/MacOS/realm-tests /Volumes/NO NAME/_r/
88888 user        40   0  389G 13712 ?   0.0  0.0  0:00.00 │  │        ├─ ./test/realm-tests.app/Contents/MacOS/realm-tests /Volumes/NO NAME/_r/
88889 user        40   0  389G 13648 ?   0.0  0.0  0:00.00 │  │        ├─ ./test/realm-tests.app/Contents/MacOS/realm-tests /Volumes/NO NAME/_r/
88890 user        40   0  389G 13648 ?   0.0  0.0  0:00.00 │  │        ├─ ./test/realm-tests.app/Contents/MacOS/realm-tests /Volumes/NO NAME/_r/
88891 user        40   0  389G 15152 ?   0.0  0.0  0:00.00 │  │        ├─ ./test/realm-tests.app/Contents/MacOS/realm-tests /Volumes/NO NAME/_r/
88892 user        40   0  389G 15072 ?   0.0  0.0  0:00.00 │  │        ├+ ./test/realm-tests.app/Contents/MacOS/realm-tests /Volumes/NO NAME/_r/
88893 user        26   0  389G 18352 Z  98.6  0.1 17:24.00 │  │        ├─ ./test/realm-tests.app/Contents/MacOS/realm-tests /Volumes/NO NAME/_r/
88894 user        26   0  389G 17712 Z 100.1  0.1 17:24.00 │  │        ├─ ./test/realm-tests.app/Contents/MacOS/realm-tests /Volumes/NO NAME/_r/
88895 user        26   0  389G 18080 Z 101.7  0.1 17:24.00 │  │        └─ ./test/realm-tests.app/Contents/MacOS/realm-tests /Volumes/NO NAME/_r/

The test itself passes from time to time in my setup, and it takes normally around 1 minute on this slow sdcard with debug build. But when it hangs, essentially only initial write succedess - the one which is done in setup code from one child process.
If i attach lldb to every write process - they all stuck at very first iteration waiting on flock to open realm, except one process, which is stuck at flock in FileVersionManager init.

The problem it seems to be that inode for m_writemutex and m_versionlist_mutex are identical, so the same process tries to lock the same file twise? How could that be i don't get it?..

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x0000000188bfff1c libsystem_kernel.dylib`flock + 8
    frame #1: 0x0000000103475320 realm-tests`realm::util::File::lock(this=0x0000600002bcc018, exclusive=true, non_blocking=false) at file.cpp:1253:13
    frame #2: 0x0000000102e7df70 realm-tests`realm::util::File::lock(this=0x0000600002bcc018) at file.hpp:1093:5
    frame #3: 0x0000000102e76c54 realm-tests`realm::util::InterprocessMutex::lock(this=0x000000015d8085a0) at interprocess_mutex.hpp:334:25
    frame #4: 0x0000000102e7dfec realm-tests`std::__1::lock_guard<realm::util::InterprocessMutex>::lock_guard[abi:v15006](this=0x000000016d874630, __m=0x000000015d8085a0) at __mutex_base:94:27
    frame #5: 0x0000000102e7591c realm-tests`std::__1::lock_guard<realm::util::InterprocessMutex>::lock_guard[abi:v15006](this=0x000000016d874630, __m=0x000000015d8085a0) at __mutex_base:94:21
    frame #6: 0x00000001032f16ec realm-tests`realm::DB::FileVersionManager::FileVersionManager(this=0x00006000037c4000, file=0x000000015d808490, mutex=0x000000015d8085a0) at db.cpp:783:29
    frame #7: 0x00000001032f1554 realm-tests`realm::DB::FileVersionManager::FileVersionManager(this=0x00006000037c4000, file=0x000000015d808490, mutex=0x000000015d8085a0) at db.cpp:770:5
    frame #8: 0x00000001032df784 realm-tests`std::__1::__unique_if<realm::DB::FileVersionManager>::__unique_single std::__1::make_unique[abi:v15006]<realm::DB::FileVersionManager, realm::util::File&, realm::util::InterprocessMutex&>(__args=0x000000015d808490, __args=0x000000015d8085a0) at unique_ptr.h:714:32
  * frame #9: 0x00000001032dd8a0 realm-tests`realm::DB::open(this=0x000000015d808218, path="/Volumes/NO NAME/_r/LangBindHelper_ImplicitTransactions_InterProcess.1.path.realm", no_create_file=false, options=0x000000016d875938) at db.cpp:1134:36
    frame #10: 0x00000001032e0a10 realm-tests`realm::DB::open(this=0x000000015d808218, repl=0x00006000039c0000, file="/Volumes/NO NAME/_r/LangBindHelper_ImplicitTransactions_InterProcess.1.path.realm", options=0x000000016d875938) at db.cpp:1474:5
    frame #11: 0x00000001032e7300 realm-tests`realm::DB::create(repl=0x00006000039c0000, file="/Volumes/NO NAME/_r/LangBindHelper_ImplicitTransactions_InterProcess.1.path.realm", options=0x000000016d875938) at db.cpp:2767:13
    frame #12: 0x00000001029a5610 realm-tests`Realm_UnitTest__LangBindHelper_ImplicitTransactions_InterProcess::test_run(this=0x000000016d875ad0) at test_lang_bind_helper.cpp:3147:28
    frame #13: 0x00000001029ed488 realm-tests`realm::test_util::unit_test::RegisterTest<Realm_UnitTest__LangBindHelper_ImplicitTransactions_InterProcess>::run_test(test_context=0x000000016d875c00) at unit_test.hpp:643:14
    frame #14: 0x000000010305f39c realm-tests`realm::test_util::unit_test::TestList::ThreadContextImpl::run(this=0x000000016d876ce0, entry=<unavailable>, lock=0x000000016d875ce8) at unit_test.cpp:728:9
    frame #15: 0x000000010305ee40 realm-tests`realm::test_util::unit_test::TestList::ThreadContextImpl::nonconcur_run(this=0x000000016d876ce0) at unit_test.cpp:711:9
    frame #16: 0x000000010305df94 realm-tests`realm::test_util::unit_test::TestList::run(this=0x000000010367afe8, config=Config @ 0x000000016d8770c0) at unit_test.cpp:625:24
    frame #17: 0x0000000102b4f1a4 realm-tests`(anonymous namespace)::run_tests(logger=nullptr) at test_all.cpp:511:25
    frame #18: 0x0000000102b4d518 realm-tests`test_all(logger=nullptr) at test_all.cpp:579:20
    frame #19: 0x000000010258b8fc realm-tests`main(argc=2, argv=0x000000016d877cc0) at main.cpp:26:12
    frame #20: 0x00000001888e3f28 dyld`start + 2236
(lldb) p m_writemutex
(realm::util::InterprocessMutex) $43 = {
  m_filename = "/Volumes/NO NAME/_r/LangBindHelper_ImplicitTransactions_InterProcess.1.path.realm.management/access_control.write.mx"
  m_fileuid = (device = 16777237, inode = 159087)
  m_lock_info = std::__1::shared_ptr<realm::util::InterprocessMutex::LockInfo>::element_type @ 0x0000600002bcc018 strong=2 weak=2 {
    __ptr_ = 0x0000600002bcc018
  }
}
(lldb) p m_controlmutex
(realm::util::InterprocessMutex) $44 = {
  m_filename = "/Volumes/NO NAME/_r/LangBindHelper_ImplicitTransactions_InterProcess.1.path.realm.management/access_control.control.mx"
  m_fileuid = (device = 16777237, inode = 159088)
  m_lock_info = std::__1::shared_ptr<realm::util::InterprocessMutex::LockInfo>::element_type @ 0x0000600002bcc0b8 strong=1 weak=2 {
    __ptr_ = 0x0000600002bcc0b8
  }
}
(lldb) p m_versionlist_mutex
(realm::util::InterprocessMutex) $42 = {
  m_filename = "/Volumes/NO NAME/_r/LangBindHelper_ImplicitTransactions_InterProcess.1.path.realm.management/access_control.versions.mx"
  m_fileuid = (device = 16777237, inode = 159087)
  m_lock_info = std::__1::shared_ptr<realm::util::InterprocessMutex::LockInfo>::element_type @ 0x0000600002bcc018 strong=2 weak=2 {
    __ptr_ = 0x0000600002bcc018
  }
}

UP: so in my case even readers are not needed. Writters hang by themselves. It's File::get_unique_id which return different inode values on two calls, and when lock info for versions files (which is initialized last) gets the same File::UniqueID which is in use by either by lock info for write or control files, then it hits global cache value, and later hangs on trying to flock the same file.. how is it possible with 'stat' call for different files

@finnschiermer
Copy link
Contributor

@kiburtse Thanks for going deep on this one :-D. Very nice.

@kiburtse
Copy link
Contributor

kiburtse commented Sep 6, 2023

So essentially after further investigation the problem is really just about peculiar FAT32 and exFAT and multiprocess realm access.

"If you truncate or resize a file to zero bytes, the inode number may be reused for another file. This is because the FAT32 and exFAT filesystems do not have a concept of deleted files. When a file is deleted, its data is simply marked as free space. The inode number for the file is not reused until another file is created."

We do open file with O_TRUNC and then resize unconditionally.
So with multiple processes initializing the same interprocess mutexes with the right timing leads to the situation where inodes are scrambled within one process, so the lock later tries to do flock the same file twice.

This is an example how system behaves with FAT32 and exFAT to see what's the problem:

$ >> echo 1 > wri.cx
user@M-GCXVVT07DY (tmux) [/Volumes/NO NAME/_r]
$ >> ll
total 192
drwxrwxrwx  1 user  staff    32K Sep  6 14:26 .
drwxrwxrwx@ 1 user  staff    32K Aug 31 19:53 ..
-rwxrwxrwx  1 user  staff     2B Sep  6 14:26 wri.cx
user@M-GCXVVT07DY (tmux) [/Volumes/NO NAME/_r]
$ >> stat -x wri.cx
  File: "wri.cx"
  Size: 2            FileType: Regular File
  Mode: (0777/-rwxrwxrwx)         Uid: (  501/    user)  Gid: (   20/   staff)
Device: 1,21   Inode: 159164    Links: 1
Access: Wed Sep  6 00:00:00 2023
Modify: Wed Sep  6 14:26:50 2023
Change: Wed Sep  6 14:26:50 2023
 Birth: Wed Sep  6 14:26:50 2023
user@M-GCXVVT07DY (tmux) [/Volumes/NO NAME/_r]
$ >> tru
true           truncate       trust          trustcachectl
user@M-GCXVVT07DY (tmux) [/Volumes/NO NAME/_r]
$ >> truncate -s0 wri.cx
user@M-GCXVVT07DY (tmux) [/Volumes/NO NAME/_r]
$ >> stat -x wri.cx
  File: "wri.cx"
  Size: 0            FileType: Regular File
  Mode: (0777/-rwxrwxrwx)         Uid: (  501/    user)  Gid: (   20/   staff)
Device: 1,21   Inode: 999999999    Links: 1
Access: Wed Sep  6 00:00:00 2023
Modify: Wed Sep  6 14:27:06 2023
Change: Wed Sep  6 14:27:06 2023
 Birth: Wed Sep  6 14:26:50 2023
user@M-GCXVVT07DY (tmux) [/Volumes/NO NAME/_r]
$ >> echo 1 > con.cx
user@M-GCXVVT07DY (tmux) [/Volumes/NO NAME/_r]
$ >> stat -x con.cx
  File: "con.cx"
  Size: 2            FileType: Regular File
  Mode: (0777/-rwxrwxrwx)         Uid: (  501/    user)  Gid: (   20/   staff)
Device: 1,21   Inode: 159164    Links: 1
Access: Wed Sep  6 00:00:00 2023
Modify: Wed Sep  6 14:27:22 2023
Change: Wed Sep  6 14:27:22 2023
 Birth: Wed Sep  6 14:27:22 2023
user@M-GCXVVT07DY (tmux) [/Volumes/NO NAME/_r]
$ >> truncate -s +1 wri.cx
user@M-GCXVVT07DY (tmux) [/Volumes/NO NAME/_r]
$ >> stat -x wri.cx
  File: "wri.cx"
  Size: 1            FileType: Regular File
  Mode: (0777/-rwxrwxrwx)         Uid: (  501/    user)  Gid: (   20/   staff)
Device: 1,21   Inode: 159165    Links: 1
Access: Wed Sep  6 00:00:00 2023
Modify: Wed Sep  6 14:28:30 2023
Change: Wed Sep  6 14:28:30 2023
 Birth: Wed Sep  6 14:26:50 2023

@kiburtse kiburtse assigned kiburtse and unassigned ironage Sep 29, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 21, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants