Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simultaneous pwsh instance start Mutex.CreateMutexCore IOException race condition #2658

Open
darthwalsh opened this issue Jul 8, 2021 · 20 comments
Labels
Issue-Bug It either shouldn't be doing this or needs an investigation. issue-mutex The mutex creation issue on macOS Needs-More Investigation Initial investigation is done, but need to follow up.

Comments

@darthwalsh
Copy link

Environment

PSReadLine: 2.1.0
PowerShell: 7.1.3
OS: Darwin 20.5.0 Darwin Kernel Version 20.5.0: Sat May  8 05:10:33 PDT 2021; root:xnu-7195.121.3~9/RELEASE_X86_64
BufferWidth: 176
BufferHeight: 91

Exception report

iTerm session 3:

System.IO.IOException: The system cannot open the device or file specified. : 'PSReadLineHistoryFile_2886463743'
   at System.Threading.Mutex.CreateMutexCore(Boolean initiallyOwned, String name, Boolean& createdNew)
   at System.Threading.Mutex..ctor(Boolean initiallyOwned, String name)
   at Microsoft.PowerShell.PSConsoleReadLine.DelayedOneTimeInitialize()
   at Microsoft.PowerShell.PSConsoleReadLine.Initialize(Runspace runspace, EngineIntrinsics engineIntrinsics)
   at Microsoft.PowerShell.PSConsoleReadLine.ReadLine(Runspace runspace, EngineIntrinsics engineIntrinsics, CancellationToken cancellationToken)

iTerm session 4:

System.IO.IOException: The system cannot open the device or file specified. : 'PSReadLineHistoryFile_2886463743'
   at System.Threading.Mutex.CreateMutexCore(Boolean initiallyOwned, String name, Boolean& createdNew)
   at System.Threading.Mutex..ctor(Boolean initiallyOwned, String name)
   at Microsoft.PowerShell.PSConsoleReadLine.DelayedOneTimeInitialize()
   at Microsoft.PowerShell.PSConsoleReadLine.Initialize(Runspace runspace, EngineIntrinsics engineIntrinsics)
   at Microsoft.PowerShell.PSConsoleReadLine.ReadLine(Runspace runspace, EngineIntrinsics engineIntrinsics, CancellationToken cancellationToken)

Steps to reproduce

(Steps that shouldn't matter: My macbook was asleep in clamshell mode, I unplugged the display connections, and the OS crashed. I logged in and submitted the bug report to Apple, again.)

I saw this message on two tabs when iTerm2 was trying to restore a terminal window with four tab. The first two tabs did not have any error.

iTerm session restore caused 4 pwsh instances to launch at the same time, and got these two crashes from different pwsh instances.

Expected behavior

Launching multiple pwsh instances at the exact same time should not show error.

Actual behavior

Tabs 3 and 4 each showed the exception message above.


Running some debug commands, there are a lot of PSReadLineHistoryFile_2886463743 files created at nearly the exact same millisecond:

➜ ~ gci -r /tmp/.dotnet/shm

    Directory: /tmp/.dotnet/shm

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
d----            7/8/2021 10:59 AM                session1999
d----            7/8/2021 12:55 PM                session5595
d----            7/8/2021 10:15 AM                session705
d----            7/8/2021 10:15 AM                session710
d----            7/8/2021 10:15 AM                session715
d----            7/8/2021 10:15 AM                session721

    Directory: /tmp/.dotnet/shm/session715

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-----            7/8/2021 10:15 AM           4096 PSReadLineHistoryFile_2886463743

    Directory: /tmp/.dotnet/shm/session705

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-----            7/8/2021 10:15 AM           4096 PSReadLineHistoryFile_2886463743

    Directory: /tmp/.dotnet/shm/session5595

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-----            7/8/2021 12:44 PM           4096 PSReadLineHistoryFile_2886463743
-----            7/8/2021 12:55 PM           4096 PSReadLineHistoryFile_3946503603

    Directory: /tmp/.dotnet/shm/session1999

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-----            7/8/2021 10:59 AM           4096 PSReadLineHistoryFile_2886463743

    Directory: /tmp/.dotnet/shm/session721

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-----            7/8/2021 10:15 AM           4096 PSReadLineHistoryFile_2886463743

    Directory: /tmp/.dotnet/shm/session710

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-----            7/8/2021 10:15 AM           4096 PSReadLineHistoryFile_2886463743
➜ ~ gci -r -file /tmp/.dotnet/shm | % { $_.LastWriteTime.ToString("MM/dd/yyyy hh:mm:ss.ffffff tt") } | sort-object
07/08/2021 10:15:57.403067 AM
07/08/2021 10:15:57.403165 AM
07/08/2021 10:15:57.425071 AM
07/08/2021 10:15:57.425950 AM
07/08/2021 10:59:23.698350 AM
07/08/2021 12:44:43.666927 PM
07/08/2021 12:55:35.784985 PM

My 2 cents:

@ghost ghost added the Needs-Triage 🔍 It's a new issue that core contributor team needs to triage. label Jul 8, 2021
@darthwalsh
Copy link
Author

(Issue happened again, after upgrading macOS to 11.5. Had 4 terminal windows that started pwsh simultaneously, and two threw the same way IOException on PSReadLineHistoryFile_2886463743.)

I tried to reproduce the failure by spawning several pwsh from zsh

for i in {1..39}; do; pwsh -nop -c '([datetime]::now).TimeOfDay.TotalMilliseconds' >> b.txt &; done

but even though a few pairs of processes started within the same millisecond, I don't get any IOException. I'm not sure that test loads PSReadLine though. Or, maybe the problem is related to trying to acquire a mutex soon after system startup.

@iSazonov
Copy link

It is dup #1464 (comment)

@darthwalsh
Copy link
Author

@iSazonov sorry I don't understand why you think it is a duplicate. Here's why I said it is not a duplicate:

@iSazonov
Copy link

My bad.

Perhaps it is "by design" in .Net. https://github.com/dotnet/runtime/blob/d019e70d2b7c2f7cd1137fac084dbcdc3d2e05f5/src/libraries/System.Private.CoreLib/src/System/Threading/WaitSubsystem.Unix.cs#L169-L174
The comment says they addresses only thread race conditions but not inter process ones.

I believe it makes sense to open new issue in .Net Runtime repo and ask there.

@darthwalsh
Copy link
Author

Oh, good find. Do you have a sense of whether dotnet runtime might fix that bug? Agree it seems likely "by design," but I could open the same bug there.

(I was having some trouble understanding which resource the mutex is guarding. Before I was thinking the resource is i.e. the file /tmp/.dotnet/shm/session5595/PSReadLineHistoryFile_2886463743 -- then the mutex name hash could include the session ID in the name of the mutex, and that should further decrease contention and avoid this error. But I don't see any PSReadLine code interacting with this file, I'm guessing that's literally the dotnet implementation of the named mutex library. Instead 2886463743 looks to be the hash of /Users/walshca/.local/share/powershell/PSReadLine/ConsoleHost_history.txt so my suggestion about making the mutex use a more unique name is wrong.)

I see the other code interacting with the mutex has retry loops. If we understand mutex creation has by-design some chance of throwing, should PSReadLine also use a retry loop when creating the mutex?

@iSazonov
Copy link

As for PSRL @daxian-dbw can add more info and make a conclusion about using a loop in the code.
I think it makes sense to open request in .Net Runtime repository. They could add a loop too.

@daxian-dbw
Copy link
Member

@darthwalsh and @iSazonov Thank you both for the investigation so far. I will need to get a macOS to try reproducing the issue.
A couple clarification:

  1. the mutex is to guard the read/write access of the history file, which is shared by all pwsh instances for a given user.
  2. the re-try we have is for abandoned mutex, which we understand why a retry is needed in that case -- the thread owns the mutex, however, it was abandoned by the previous owner thread, so releasing and then retry is needed. But for this IOException, we have no idea what causes it and thus I don't want to blindly add the retry logic.

I'd love to understand the root cause before making any changes in the code.

@daxian-dbw daxian-dbw added Issue-Bug It either shouldn't be doing this or needs an investigation. Needs-More Investigation Initial investigation is done, but need to follow up. and removed Needs-Triage 🔍 It's a new issue that core contributor team needs to triage. labels Sep 20, 2021
@SteveL-MSFT
Copy link
Member

SteveL-MSFT commented Jul 12, 2022

This issue is quite annoying for dev testing powershell builds on macOS. Workaround is to use -noninteractive so that PSReadLine doesn't automatically get loaded.

It seems the problem is the way a mutex is implemented is different across Windows, Linux, and macOS. On macOS it's a file lock: dotnet/runtime#5211 (comment). So if access to that file can't be retrieved, you get the IOException.

@SteveL-MSFT
Copy link
Member

I think an alternative solution is if the mutex can't be created, just have that instance not save to history and put out a warning message.

@SteveL-MSFT
Copy link
Member

I have a potential fix, but no longer able to repro the issue. Probably because it's a race condition and my changes have changed the timing, but until I can verify it improves it, I'll hold off on a PR

@darthwalsh
Copy link
Author

darthwalsh commented Jul 27, 2022

@SteveL-MSFT Printing a warning message that this powershell instance will not save history isn't much of an improvement (in my opinion) -- if I saw that, I would always manually close and reopen that pwsh instance -- which is what I did after seeing the IOException.

But yeah, being able to reproduce the problem on demand is something I wish was easier. What the steps seem to be:

  • having iTerm2 set to reopen tabs after system restart, and 4 tabs open (it could help to open dozens of tabs)
  • either crash the operating system, OR do an operating system upgrade
  • sometimes this triggers the IOException

(I retried my earlier "tight loop" but with a C# app just calling new Mutex( -- launched 4000 times at once, but didn't have any errors.)

@SteveL-MSFT
Copy link
Member

SteveL-MSFT commented Jul 28, 2022

@darthwalsh PSReadLine currently already has some retry logic. So if that fails, I'm not sure what we can do other than increasing the number of retries, but that's not guaranteed to succeed. Currently if the file lock cannot be obtained, PSReadLine just repeatedly informs you and is not usable. I thought I was more easily repro'ing it with a nested process, but it no longer repros for me. I don't believe I've seen the issue with opening a extra tab in iTerm2 either. If we can figure out a consistent repro, the proper fix would need to be in .NET.

@darthwalsh
Copy link
Author

darthwalsh commented Jul 29, 2022

@SteveL-MSFT I agree a proper fix in .NET seems much better, but to be sure I want to know what happens in a retry loop. Something like: if retrying after 1 ms brings the failure chance from 1% to 0.01%, then to me it makes sense to put the retry in Mutex. But if macOS just refuses to allow new mutex creation for a full 2 second period after system upgrade, I wouldn't want my framework code to block in the Mutex ctor waiting for that.

daxian-dbw pointed out the existing PSReadLine retry logic is for abandoned mutex, not for IOException.

@SteveL-MSFT
Copy link
Member

@darthwalsh that is true, perhaps we could try adding to the retry logic with an IOException, but I can't verify such a change right now since it's no longer repro'ing on my system.

@darthwalsh
Copy link
Author

After I upgraded macOS to 12.6, only one out of 6 of my iTerm pwsh instances failed. Basically identical:

OS: Darwin 21.6.0 Darwin Kernel Version 21.6.0: Mon Aug 22 20:17:10 PDT 2022; root:xnu-8020.140.49~2/RELEASE_X86_64

System.IO.IOException: The system cannot open the device or file specified. : 'PSReadLineHistoryFile_2886463743'

I think the best way to get a consistent repro would be to snapshot an older macOS in a VM, upgrade the guest OS, with iTerm reopening dozens of pwsh tabs. But I'm not familiar the tools to do this.

@darthwalsh
Copy link
Author

Ok, I ran another experiment with the upgrade to macOS 12.6.1 and I was able to repro this failure without pwsh or PSReadLine; if we ask the owner of dotnet BCL what the Muetx exception policy should be I hope this added context would help.


Repo Program.cs with a net6.0 console app:

try {
  using (var m = new Mutex(false, "darthwalsh_PSReadLine_issues_2658")) {
    Console.Write("created but didn't take Mutex darthwalsh_PSReadLine_issues_2658   ...");
    var line = Console.ReadLine();
    Console.WriteLine("QUITTING: " + line);
  }
} catch (Exception e) {
  Console.WriteLine(e.ToString());

  var line = Console.ReadLine();
  Console.WriteLine("QUITTING: " + line);
}

Built with dotnet publish -c Release --self-contained -a x64

I created a new iTerm profile that just runs command /Users/walshca/code/temp/MutexThrow/bin/Release/net6.0/osx-x64/publish/MutexThrow and opened 30 tabs with this. Then I upgraded the macOS version to 12.6.1, and after reboot only 1 tab had the exception:

System.IO.IOException: The system cannot open the device or file specified. : 'darthwalsh_PSReadLine_issues_2658'
   at System.Threading.Mutex.CreateMutexCore(Boolean initiallyOwned, String name, Boolean& createdNew)
   at System.Threading.Mutex..ctor(Boolean initiallyOwned, String name)
   at Program.<Main>$(String[] args) in /Users/walshca/code/temp/MutexThrow/Program.cs:line 2

(I expected about a quarter, so something about this minimal program doesn't match what pwsh is doing).

I'll open some more tabs of this, and leave it running for a few more OS reboots to see if I can repro it without OS upgrade.

@SteveL-MSFT
Copy link
Member

@darthwalsh did you already open an issue in https://github.com/dotnet/runtime?

@darthwalsh
Copy link
Author

@SteveL-MSFT thanks for the nudge, I had been putting off rebooting but I just updated parallels today...

Created dotnet/runtime#79375

@Fastdruid
Copy link

I was going to raise a new bug but I think it's the same issue. In which case I have an easier way to reproduce it (on linux).

Log on to system as a user
su to root
Run pwsh as root
su to another user
run pwsh again

Example given below:

[fastdruid@localhost ~]$ sudo su - root
[sudo] password for fastdruid:
[root@localhost ~]# pwsh
PowerShell 7.3.4
PS /root> su - fastdruid
[fastdruid@localhost ~]$ pwsh
PowerShell 7.3.4
PS /home/fastdruid>
Oops, something went wrong.
Please report this bug with ALL the details below, including both the 'Environment' and 'Exception' sections.
Please report on GitHub: https://github.com/PowerShell/PSReadLine/issues/new?template=Bug_Report.yaml
Thank you!

Environment

PSReadLine: 2.2.6
PowerShell: 7.3.4
OS: Linux 4.18.0-425.3.1.el8.x86_64 #1 SMP Wed Nov 9 20:13:27 UTC 2022
BufferWidth: 153
BufferHeight: 46

Last 0 Keys:

Exception

System.IO.IOException: The system cannot open the device or file specified. : 'PSReadLineHistoryFile_1202693972'
at System.Threading.Mutex.CreateMutexCore(Boolean initiallyOwned, String name, Boolean& createdNew)
at System.Threading.Mutex..ctor(Boolean initiallyOwned, String name)
at Microsoft.PowerShell.PSConsoleReadLine.DelayedOneTimeInitialize()
at Microsoft.PowerShell.PSConsoleReadLine.Initialize(Runspace runspace, EngineIntrinsics engineIntrinsics)
at Microsoft.PowerShell.PSConsoleReadLine.ReadLine(Runspace runspace, EngineIntrinsics engineIntrinsics, CancellationToken cancellationToken, Nullable`1 lastRunStatus)

@daxian-dbw daxian-dbw added the issue-mutex The mutex creation issue on macOS label Sep 6, 2023
@darthwalsh
Copy link
Author

@Fastdruid that looks like it's a different issue: #3692

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue-Bug It either shouldn't be doing this or needs an investigation. issue-mutex The mutex creation issue on macOS Needs-More Investigation Initial investigation is done, but need to follow up.
Projects
None yet
Development

No branches or pull requests

5 participants