Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix possible failure to create file/directory #1253

Merged
merged 6 commits into from
Dec 14, 2024

Conversation

pomaroff
Copy link
Contributor

@pomaroff pomaroff commented Nov 28, 2024

If there are 2 (or more) concurrent requests to access a file, any 1 request with a CREATE disposition will be rejected with STATUS_OBJECT_NAME_COLLISION. The driver should not return this error code based on the fact that there other open requests before calling userland code to confirm that the file handle is actually open and that the file or directory exists.

We also do not need to run file sharing logic for the first of many concurrent requests on the same file.

@Liryna
Copy link
Member

Liryna commented Dec 1, 2024

Hi @pomaroff ,

This change seems to make sense! Thank you for providing a fix.

To be tested.

May I ask how did you discover this ?

@pomaroff pomaroff force-pushed the FixConcurrentCreateFile branch from d75ec7a to 88eb96a Compare December 1, 2024 21:30
@pomaroff
Copy link
Contributor Author

pomaroff commented Dec 1, 2024

Hi @Liryna,

Apologies for the scant details, I was hoping to use the build from AppVeyor to test the changes, but I've been unable to install the artifact from there. I'm not sure if it's related to the build failures that are currently happening on master? I may need some help getting an installer to verify the fix, or if you're happy to verify it for me then that'd be great.

May I ask how did you discover this ?

I discovered this issue because I'm trying to use MsBuild for some C# solutions inside a Dokan mirror, and this is failing on some solutions. I found that the CreateFile callback in userland code wasn't getting called after looking at procmon, where I found that dotnet.exe can rarely make concurrent requests to create the same directory which both receive STATUS_OBJECT_NAME_COLLISION responses that it is seemingly unable to recover from. See highlighted rows from procmon here:

image

The only relevant place in the Dokan driver where this status is returned without calling back to the userland code is the area I've been looking at in this PR. I'm not entirely convinced this code is necessary at all because we could handle the concurrent requests in the userland code, but nevertheless, I think the changes I'm proposing here would be sufficient to get things working?

This is relatively easy to reproduce in a simple test. See this C# example below, which for me tends to fail after a few loops.

for (var i = 0; i < 100; i++)
{
    var path = Path.Combine(mount.Path, $"Directory{i}");
    var task1 = Task.Run(() => Directory.CreateDirectory(path));
    var task2 = Task.Run(() => Directory.CreateDirectory(path));
    Task.WaitAll(task1, task2);
    Assert.That(Directory.Exists(path), Is.True, () => $"Failed to create Directory{i}");
}

@pomaroff pomaroff marked this pull request as ready for review December 1, 2024 22:42
@pomaroff pomaroff marked this pull request as draft December 4, 2024 03:51
@pomaroff
Copy link
Contributor Author

pomaroff commented Dec 4, 2024

I've managed to install this locally, and whilst I think it has improved it it hasn't fixed all scenarios. I will look closer into this area of the code.

… concurrent request to create a file to proceed"

This reverts commit 88eb96a.
@Liryna
Copy link
Member

Liryna commented Dec 10, 2024

I'm not entirely convinced this code is necessary at all because we could handle the concurrent requests in the userland code, but nevertheless,

I tend to agree with you. It is such checks that need to be there for correctness and could be in userland or in the driver. Having it in userland means having all filesystem implementation take care of it (count active handles etc).

Thanks for the update @pomaroff . Please let me know if you find the other reason MsBuild fails.

…e the correct value of FileCount and use this to permit the first caller to proceed.
@pomaroff
Copy link
Contributor Author

pomaroff commented Dec 11, 2024

I tend to agree with you. It is such checks that need to be there for correctness

Are we sure this check is correct?

dokany/sys/create.c

Lines 701 to 705 in 255e4a5

// Cannot create a file already open
if (fcb->FileCount > 1 && disposition == FILE_CREATE) {
status = STATUS_OBJECT_NAME_COLLISION;
__leave;
}

My understanding is that STATUS_OBJECT_NAME_COLLISION informs the caller that the file/directory exists at the provided path, which isn't always going to be true. It certainly isn't true just because Fcb->FileCount > 1, since this only represents the fact that we have at least one other active request on the same path. Addressing the race condition with FileCount to permit the first caller to continue doesn't solve the problem entirely because we'd still be returning the wrong information to the second caller.

Please let me know if you find the other reason MsBuild fails.

Looking at procmon, I suspect it is simply because there are many concurrent requests on the same (non-existent) directory path. Some of the requests have a Create disposition, but many others have Open dispositions. It's possible to have multiple concurrent active requests on the same path where at least 1 of them has a Create disposition, and by returning STATUS_OBJECT_NAME_COLLISION to the caller we are saying that "yes the directory exists" even though it does not.

I've been experimenting on this PR with removing this check, and so far I have not encountered any problems with MsBuild after making this change, but I'll continue testing. I have also been trying to fix the race condition on FileCount because the FileSharing logic does not need to apply to the first request on the file.

@Liryna
Copy link
Member

Liryna commented Dec 12, 2024

STATUS_OBJECT_NAME_COLLISION must be returned when the caller tries to create an item that already exists.
As you perfectly explained it, we use fcb->FileCount as an optimization to know that a handle is open (and therefore the item exists) BUT using fcb->FileCount is indeed wrong because it is incremented before userland gets the request and confirm whether an item was actually opened successfully (= exist).

We could introduce a new fcb->OpenHandleCount that is incremented on create completion success but that seems complex for just this check that can be handled by userland. And anyway userland could fail createfile for X Y Z reasons while the item actually does exist.

What I am trying to say is that we should drop this if and let the userland implementation take care of it (which is the contrary to what I said previously I know 😄 ). Feel free to change your PR to remove this if, I can look to add it to the mirror / memfs samples.

@pomaroff
Copy link
Contributor Author

pomaroff commented Dec 12, 2024

Thanks Liryna.

Regarding this:

STATUS_OBJECT_NAME_COLLISION must be returned when the caller tries to create an item that already exists.

I think we already cover this in the documentation:

@Liryna
Copy link
Member

Liryna commented Dec 12, 2024

Correct! 😎 good eye

@pomaroff
Copy link
Contributor Author

What do you think of the other change in this PR, which is to move the VCB Unlock until after the FCB is Locked? In doing so, we ensure that:

  1. Requests on the same file are processed in the order received.
  2. We read FileCount before releasing the VCB lock, so that even if there's another request queued on the same file, the first request will bypass any "file sharing" logic.

@pomaroff pomaroff marked this pull request as ready for review December 12, 2024 04:24
@Liryna
Copy link
Member

Liryna commented Dec 13, 2024

Holding the Vcb lock does resolve that issue but is usually not safe unless made very carefully. If any of the called function triggers a reentry (it easily happens) that also end up requiring the Vcb lock, we will deadlock.

It is safer to just move the condition to userland.

@pomaroff
Copy link
Contributor Author

Thanks @Liryna. I'm not sure if it would have caused a deadlock, but I've reverted it to be on the safe side as it's not strictly relevant to the root problem I'm trying to solve here. I can always raise another PR in the future if needed. Thanks.

@Liryna Liryna merged commit 2165b60 into dokan-dev:master Dec 14, 2024
3 checks passed
@Liryna
Copy link
Member

Liryna commented Dec 14, 2024

Thanks @pomaroff ! I will take care of the possible sample changes 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants