TSAN theLangBindHelper_ImplicitTransactions_InterProcess failure #6739
Comments
I looked into the logs and this doesn't appear to be a data race. The
I'll try adding the error code to the exception string so that if this happens again we can get more info about the original error. |
Still happening: ci logs. @ironage, the timestamp for "flock() failed" is after the point where the hang analyser was reported to start:
so that error is only reported after every process was forcefully terminated. The real problem seems to be an assertion (shown in the linked CI run log), which hangs the whole test and triggers the analyser later:
Although there is no assertion in the latest failure. Weird. |
Another failure. The reader processes time out after their allowed 17 minutes of waiting and finally the hang analyzer runs, showing that the writer process is blocked waiting for a semaphore during commit:
|
I can reproduce a similar issue with hanging reader processes. Essentially, on macOS I plug in a simple SD card formatted with exFAT and run this test on it like this:
Every process stays in limbo after those assertions (this looks similar to how CI reports the failure):
The test itself passes from time to time in my setup, and normally takes around 1 minute on this slow SD card with a debug build. But when it hangs, essentially only the initial write succeeds, the one done in the setup code from one child process. The problem seems to be that
UP: so in my case even readers are not needed. Writers hang by themselves. It's File::get_unique_id, which returns different inode values on two calls. When the lock info for the versions file (which is initialized last) gets the same File::UniqueID that is already in use by the lock info for either the write or control file, it hits the global cache value and later hangs trying to flock the same file. How is that possible with a 'stat' call for different files? |
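To make the collision described above concrete, here is a minimal sketch of a process-wide cache keyed by a (device, inode) pair. It is not realm-core's actual code: `UniqueId`, `LockInfo`, and `LockInfoCache` are hypothetical names used only for illustration. It only shows that if two different files report the same id, the second lookup silently returns the first file's entry.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <tuple>

// Hypothetical stand-in for a file identity built from stat() results.
struct UniqueId {
    std::uint64_t device;
    std::uint64_t inode;
    bool operator<(const UniqueId& other) const
    {
        return std::tie(device, inode) < std::tie(other.device, other.inode);
    }
};

// Hypothetical per-file lock state; only the path is tracked here.
struct LockInfo {
    std::string path;
};

// Hypothetical global cache: one LockInfo per file identity.
class LockInfoCache {
public:
    // Returns the cached entry for `id`, creating it from `path` on a miss.
    std::shared_ptr<LockInfo> get(const UniqueId& id, const std::string& path)
    {
        auto it = m_entries.find(id);
        if (it != m_entries.end())
            return it->second; // cache hit: `path` is ignored
        auto info = std::make_shared<LockInfo>(LockInfo{path});
        m_entries.emplace(id, info);
        return info;
    }

private:
    std::map<UniqueId, std::shared_ptr<LockInfo>> m_entries;
};
```

In this sketch, calling `get({1, 42}, "write.lock")` and then `get({1, 42}, "versions.lock")` returns the same object both times, so the second caller ends up sharing the first file's lock state. That matches the shape of the hang described above, under the assumption that the cache trusts the id alone.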
@kiburtse Thanks for going deep on this one :-D. Very nice. |
So after further investigation, the problem really comes down to the peculiarities of FAT32 and exFAT combined with multiprocess realm access. "If you truncate or resize a file to zero bytes, the inode number may be reused for another file. This is because the FAT32 and exFAT filesystems do not have a concept of deleted files. When a file is deleted, its data is simply marked as free space. The inode number for the file is not reused until another file is created." We do open the file with O_TRUNC and then resize it unconditionally. Here is an example of how the system behaves with FAT32 and exFAT, which shows the problem:
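The original output is not preserved in this thread. As a rough way to check the behaviour on a given volume, the following sketch (assuming a POSIX system; `FileId`, `read_file_id`, and `probe_directory` are hypothetical helpers, not realm-core APIs) opens two files with O_TRUNC, records the (st_dev, st_ino) pair from fstat(), resizes both files, and records the pair again.

```cpp
#include <string>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper type: the identity that stat() reports for a file.
struct FileId {
    dev_t device;
    ino_t inode;
    bool operator==(const FileId& o) const
    {
        return device == o.device && inode == o.inode;
    }
};

// Reads the (device, inode) pair for an open descriptor via fstat().
inline bool read_file_id(int fd, FileId& out)
{
    struct stat st;
    if (::fstat(fd, &st) != 0)
        return false;
    out = FileId{st.st_dev, st.st_ino};
    return true;
}

struct ProbeResult {
    bool ok = false;                    // all system calls succeeded
    bool stable_across_resize = false;  // each file kept its id after ftruncate()
    bool distinct_files_differ = false; // the two files never shared an id
};

// Creates two scratch files in `dir`, compares their ids before and after
// a resize, then removes the files again.
inline ProbeResult probe_directory(const std::string& dir)
{
    ProbeResult r;
    const std::string a = dir + "/probe_a.tmp";
    const std::string b = dir + "/probe_b.tmp";
    int fa = ::open(a.c_str(), O_RDWR | O_CREAT | O_TRUNC, 0644);
    int fb = ::open(b.c_str(), O_RDWR | O_CREAT | O_TRUNC, 0644);
    FileId a0{}, a1{}, b0{}, b1{};
    if (fa >= 0 && fb >= 0 && read_file_id(fa, a0) && read_file_id(fb, b0) &&
        ::ftruncate(fa, 4096) == 0 && ::ftruncate(fb, 4096) == 0 &&
        read_file_id(fa, a1) && read_file_id(fb, b1)) {
        r.ok = true;
        r.stable_across_resize = (a0 == a1) && (b0 == b1);
        r.distinct_files_differ = !(a0 == b0) && !(a1 == b1);
    }
    if (fa >= 0)
        ::close(fa);
    if (fb >= 0)
        ::close(fb);
    ::unlink(a.c_str());
    ::unlink(b.c_str());
    return r;
}
```

On a filesystem with stable inode numbers, both `stable_across_resize` and `distinct_files_differ` should come back true. If the comments above are right, pointing `probe_directory` at a directory on a FAT32 or exFAT volume may show one of them as false, although the exact behaviour there is an assumption this sketch does not verify.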
|
Some data race has been detected while running the test.
https://spruce.mongodb.com/task/realm_core_stable_macos_release_test_on_exfat_b2ed3201a306b1b00f5fb1cc4d64d84e7e603c3f_23_04_06_14_22_32/logs?execution=0