
Some very old ICDs in the wild crash when the loader tries to look up vkEnumerateInstanceVersion #182

Closed
hg-ctangora opened this issue Apr 24, 2019 · 28 comments

@hg-ctangora commented Apr 24, 2019

I've been looking at a crash on startup we're seeing among end users, and I've traced it to some very old ICDs that crash when vkGetInstanceProcAddr(NULL, "vkEnumerateInstanceVersion") is called.

Since the 1.1 loader tries to do version enumeration in vkCreateInstance, if this ICD happens to be on a user's machine, it crashes without warning.

We've seen cases where a user switched GPU vendors long ago and is now crashing because of a long-forgotten old ICD on their machine -- updating the driver for the new GPU doesn't help, and it's quite difficult for these end users to diagnose the problem.

A quick search of the Steam forums shows this issue is hitting more than one Vulkan game, so we'd like to work around it, and unfortunately the loader seems like the least bad place to do so.

What would be a reasonable way to get the loader to avoid attempting the crashing vkEnumerateInstanceVersion lookup on certain ICDs?

@Plagman commented Apr 24, 2019

Do you have any details on the ICDs that are crashing? It would be helpful to try to reproduce this on our end and inform options.

@hg-ctangora (Author) commented Apr 24, 2019

I initially reproduced this with an install of AMD driver version 17.1.1 (a rather grey-haired Vulkan 1.0.36 API), available from https://www.amd.com/en/support/kb/release-notes/rn-rad-win-17-1-1

I've been walking forward through driver versions today, and it seems it was resolved by 18.1.1, if not sooner.

It seems like one telltale is the driver JSON having the string "abi_version" or "abi_versions" for the ICD rather than "api_version." But who knows, there might be other drivers out there with that string.

@hrydgard

We went as far as to detect Vulkan support in a separate process to avoid this:

hrydgard/ppsspp#11802

I'd be very interested in detailed instructions for a better workaround, such as how to find the file to check for "abi_version". Is the Windows registry location of the driver JSON mentioned in https://www.khronos.org/assets/uploads/developers/library/2017-vulkan-loader-webinar/VulkanLoaderDeepDive_Khronos_Mar17.pdf the right one?

@lenny-lunarg (Contributor)

Back when Vulkan 1.1 was being developed, we had some discussions about how the loader should check if a driver supports Vulkan 1.1. The loader needs to know this because in Vulkan 1.0, passing a version other than 1.0 into the VkApplicationInfo will cause the driver to return an error. Since we want the loader to be capable of using 1.0 drivers, the loader has to change that value for all 1.0 drivers present.

The process we settled on is what the loader does now:

  • Call vkGetInstanceProcAddr to see if vkEnumerateInstanceVersion is present. If it isn't present, you have a 1.0 driver.
  • Call vkEnumerateInstanceVersion if present, to get the supported version.

This method should be perfectly safe if all drivers are following proper Vulkan standards, but obviously we have to live with the drivers we have.
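
In code, the query order is roughly the following (a simplified sketch assuming a valid vkGetInstanceProcAddr obtained from the driver; names and structure are illustrative, not the loader's actual implementation):

```c
#include <vulkan/vulkan.h>

/* Sketch of the loader's version probe: if the driver does not export
 * vkEnumerateInstanceVersion, treat it as Vulkan 1.0; otherwise ask it. */
static uint32_t query_driver_instance_version(PFN_vkGetInstanceProcAddr gipa)
{
    uint32_t version = VK_MAKE_VERSION(1, 0, 0);

    PFN_vkEnumerateInstanceVersion enumerate =
        (PFN_vkEnumerateInstanceVersion)gipa(NULL, "vkEnumerateInstanceVersion");

    /* A compliant 1.0 driver returns NULL for the unknown name; the ICDs
     * discussed in this issue crash inside this lookup instead. */
    if (enumerate == NULL)
        return version;

    if (enumerate(&version) != VK_SUCCESS)
        version = VK_MAKE_VERSION(1, 0, 0);

    return version;
}
```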

As far as fixing the issue goes, I see three ways of avoiding this crash:

  • Change to getting the driver version from the json file. I had a prototype that did this back before Vulkan 1.1 released (I believe this is an easy change). I can't remember why we settled on the current method.
  • Compile a list of drivers that crash in this case. We might be able to make the loader know those are 1.0 drivers without the query, so the crash never happens.
  • Try creating an instance with 1.1, and fall back to 1.0 if that fails

The first option sounds best to me (assuming it works), but I want to try to find why we settled on our current mechanism. I'm going to have to go through the old meeting notes to try and remember why. If there's a compelling reason to keep checking through vkEnumerateInstanceVersion we could always try the second option. The last one doesn't seem viable to me because instance creation can be a very slow operation on some drivers, so doing it twice seems like a bad idea.
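
For context, the last option at the application level would look roughly like this (a hedged sketch, not proposed loader code); note that a 1.0 driver would pay for instance creation twice:

```c
#include <vulkan/vulkan.h>
#include <stddef.h>

/* Request a 1.1 instance first and retry as 1.0 if the driver rejects it. */
static VkResult create_instance_with_fallback(VkInstance *out)
{
    VkApplicationInfo app = { 0 };
    app.sType = VK_STRUCTURE_TYPE_APPLICATION_INFO;
    app.apiVersion = VK_API_VERSION_1_1;

    VkInstanceCreateInfo info = { 0 };
    info.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    info.pApplicationInfo = &app;

    VkResult res = vkCreateInstance(&info, NULL, out);
    if (res == VK_ERROR_INCOMPATIBLE_DRIVER) {
        /* A 1.0 driver may refuse apiVersion 1.1; retry as 1.0. */
        app.apiVersion = VK_API_VERSION_1_0;
        res = vkCreateInstance(&info, NULL, out);
    }
    return res;
}
```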

I'd be very interested in detailed instructions for a better workaround, such as how to find the file to check for "abi_version". Is the Windows registry location of the driver JSON mentioned in https://www.khronos.org/assets/uploads/developers/library/2017-vulkan-loader-webinar/VulkanLoaderDeepDive_Khronos_Mar17.pdf the right one?

That was correct when it was made. If the driver that's crashing is more than about 2 years old, it will be there. I'm guessing most drivers that crash are old and will be in that location. But if the driver is newer, it could be located in any location described here. I'll also mention that I'm working on some changes that are going to make it even harder to track down exactly where a driver is located for future Windows drivers, so it may be tricky to find drivers in the future.

I'm going to look at old meeting notes and also check if anyone else remembers why we use vkEnumerateInstanceVersion instead of the ICD json file for version. It seems to me that switching to the json file would be the most natural solution, because I don't see a good work-around for this. But I'd also want to make sure we don't have any drivers out there that give an incorrect version in the json field. Using a separate process will work as a workaround, but I don't think we want to force people to do that.

lenny-lunarg self-assigned this Apr 24, 2019
@hg-ctangora (Author)

Incidentally, we are running on 64-bit Windows; I don't know if it makes a difference, but I noticed PPSSPP was hitting the same crash as us, also on the 64-bit driver.

@hg-ctangora (Author)

We're looking at a possible workaround for now, which is a weaker version of always getting the driver version from the json file - we've made a modified loader that assumes the driver is 1.0 if it has no api_version field in the json.

This definitely catches the problem ICDs that I personally know about, but it obviously might net other drivers I don't know about as well.

In our case, our application uses very few 1.1 features and has guards around those it does, so if it gets an erroneous 1.0 instance it'll trundle along happily enough. So we're hoping this will stop the bleeding for us while discussion of a better workaround can continue.

(We can also keep an eye out for any more crashy ICDs in the wild once we have a dodge for this specific case.)

The commit is here in case this is useful as a temporary workaround for anyone else:
hg-ctangora@74a45b5
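
In rough outline, the guard looks something like this (an illustrative sketch only, not the linked commit; the helper name and the parsing are assumptions):

```c
#include <stdio.h>
#include <vulkan/vulkan.h>

/* Hypothetical helper: decide what version to assume for an ICD whose
 * manifest may lack "api_version" (some very old manifests only carry
 * "abi_version"/"abi_versions"). */
static uint32_t assumed_icd_version(const char *api_version_str)
{
    unsigned major, minor, patch;

    /* No "api_version" field at all: assume a 1.0 ICD and skip the
     * vkEnumerateInstanceVersion lookup entirely. */
    if (api_version_str == NULL)
        return VK_MAKE_VERSION(1, 0, 0);

    if (sscanf(api_version_str, "%u.%u.%u", &major, &minor, &patch) != 3)
        return VK_MAKE_VERSION(1, 0, 0);

    return VK_MAKE_VERSION(major, minor, patch);
}
```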

@Xeticus commented Aug 16, 2019

Hello, I'm a frustrated user who found this issue by looking at the Vulkan license in my copy of the game No Man's Sky. The game just had a huge update and went from OpenGL to Vulkan. Some users like myself cannot load the game at all; it crashes. I noticed in this issue that switching GPU vendors might be a factor. I used to use an AMD Radeon R7 260X and I switched to an NVIDIA GeForce 1050 Ti. Could that be why my game crashes on loading? I'm just grasping for answers at this point.

@MarkY-LunarG (Collaborator)

@Xeticus, if you installed the Vulkan SDK, run VIA (the Vulkan Installation Analyzer) and it will generate an HTML file which might hint at what problems you have. There are many reasons you could be seeing a crash (driver, app, etc) and this will at least help you know if Vulkan in general works on your system.

@Xeticus commented Aug 17, 2019 via email

@SephiRok

Has there been any progress on this? We recently switched to Vulkan and are experiencing the same issue on some of our players' machines.

@Xeticus commented Aug 17, 2019 via email

@hg-ctangora (Author) commented Aug 18, 2019

Manually deleting the old driver entry certainly works, but it's hardly reasonable to ask players to dig through their registries before they can run any Vulkan application.

@Xeticus Could you possibly figure out what version of the AMD driver you had installed? Perhaps from your amd-vulkan64.json file (though as I recall there's not much there) or some other old files? It seems like there may be more than one way that old drivers are crashing during enumeration.

@SephiRok We ended up shipping our own vulkan-1.dll (with the patch above) to be able to include workarounds for this sort of thing. If it would help you, I can provide you the binary.

@SephiRok commented Aug 18, 2019

@hg-ctangora Thanks. For now I've compiled your patch from source and will report back on whether it eliminates the issues once it's been deployed.

This will also happen when the user has an integrated card with outdated drivers and a discrete card with up-to-date drivers, which does not appear to be very uncommon.

Here are two drivers on which we've seen Vulkan instance creation crash:

Card name: AMD Radeon(TM) R7 Graphics
Driver Version: 22.19.677.257
3221225477 EXCEPTION_ACCESS_VIOLATION 
(0) [0x00007FF82304B461] ?? (C:\WINDOWS\System32\amdvlk64.dll)
(1) [0x00007FF82304B641] ?? (C:\WINDOWS\System32\amdvlk64.dll)
(2) [0x00007FF8238C6618] ?? (C:\WINDOWS\System32\amdvlk64.dll)
Card name: Intel(R) HD Graphics 530
Driver Version: 20.19.15.4549
3221225477 EXCEPTION_ACCESS_VIOLATION 
(0) [0x00007FFB27DB4989] TpSetWait + 0x00000000000001C9 (C:\WINDOWS\SYSTEM32\ntdll.dll)
(1) [0x00007FFB27DAEFB7] RtlEnterCriticalSection + 0x0000000000000127 (C:\WINDOWS\SYSTEM32\ntdll.dll)
(2) [0x00007FFB27DAEED0] RtlEnterCriticalSection + 0x0000000000000040 (C:\WINDOWS\SYSTEM32\ntdll.dll)
(3) [0x00007FFAA137A284] ?? (C:\WINDOWS\System32\DriverStore\FileRepository\igdlh64.inf_amd64_d8ca5f86add535ef\igvk64.dll)
(4) [0x0000008658B2C0F0] ?? (C:\WINDOWS\System32\DriverStore\FileRepository\igdlh64.inf_amd64_d8ca5f86add535ef\igvk64.dll)
(5) [0x00007FFAE1E9A0E0] VulkanSteamOverlayProcessCapturedFrame + 0x0000000000025130 (D:\Program Files (x86)\Steam\gameoverlayrenderer64.dll)
(6) [0x00007FFAA1AAE1F8] ?? (C:\WINDOWS\System32\DriverStore\FileRepository\igdlh64.inf_amd64_d8ca5f86add535ef\igvk64.dll)

The Intel one is an outlier; it has predominantly been the AMD occurrence.

If it's critical, I can likely dig up more instances from support reports.

@hg-ctangora (Author)

@SephiRok The AMD one is the same one we've been seeing in our dumps, so I'm afraid it's unlikely my patch will work on this. :(

Looks like the old driver is spinning up a thread for some reason, which then crashes, so it may be necessary to blacklist that DLL entirely. It may have that weirdly-formatted json as well, which I previously used for workaround logic.

Thanks for the driver version; I'll see if I can't get my PC into this state.

@SephiRok commented Aug 18, 2019

@hg-ctangora Ah, that's unfortunate. I'll post any further info that comes my way.

I'd also like to point out that using a separate process is not a great workaround, since there could be an ICD with working Vulkan support while an erroneous ICD crashes the detection process. It may be a viable workaround for programs with multiple renderers, but not for Vulkan-only apps.

@hg-ctangora (Author)

Looks like this driver (version 1.6.0) doesn't have a weirdly-formed JSON anymore like 1.5.0 did, but it still crashes in the same place trying to retrieve vkEnumerateInstanceVersion.

We're going to try a further hack that skips those calls for any "amdvlk" driver that does not declare API 1.1 in its json -- the new commit is linked below.

hg-ctangora@3e917fe

The main concern I can think of with this sort of workaround is that some ICD might exist that was previously reporting 1.1 API, but would now be seen as 1.0 with this change. (That would be some ICD that had vkEnumerateInstanceVersion implemented, but reported a 1.0 API or no API in its JSON.)

So a Vulkan app that requires 1.1 might see new issues somewhere.
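
For reference, the shape of that check is roughly the following (illustrative only, not the linked commit; function and parameter names are made up):

```c
#include <string.h>
#include <stdbool.h>
#include <vulkan/vulkan.h>

/* Skip the vkEnumerateInstanceVersion lookup for an "amdvlk" ICD whose
 * manifest does not declare API 1.1 or newer, and trust the manifest
 * version instead. */
static bool skip_version_query(const char *library_path, uint32_t manifest_api_version)
{
    bool is_amdvlk = (library_path != NULL) && (strstr(library_path, "amdvlk") != NULL);
    return is_amdvlk && manifest_api_version < VK_MAKE_VERSION(1, 1, 0);
}
```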

@Xeticus commented Aug 18, 2019 via email

@lenny-lunarg (Contributor) commented Aug 19, 2019

I think I've seen two distinct crashes (both during instance creation) in the comments on this issue, and I'd like to avoid confusing the two. But before I describe them, I'd like to make a couple of points:

  • As far as I can tell, the crashes are all happening in the driver
  • The root cause of the problems seem to be driver bugs
  • Simply updating the loader is not enough -- the updated loader needs to get onto the system that is crashing
  • Given that loader updates are generally distributed by packaging the loader with the drivers, the normal way to get the new loader would be to update your drivers
  • If the loader fix requires a driver update, I think generally it would make more sense to fix the problems in the drivers, where the actual bug exists

So the distinct issues I see are:

  1. The original crash here occurs when the loader tries to get the instance version from the driver. The driver crashes when the loader calls vkGetInstanceProcAddr with a name the driver doesn't recognize; the driver is supposed to return NULL (a sketch of the expected behavior follows this list). This is what @hg-ctangora reported.
  2. The other crash happens when the loader tries to create an instance within the driver. The driver crashes when there is no device that uses that driver. While this can work if a driver uses the PnP registries, it is completely illegal when not using the PnP registries. This is what @Xeticus reported.
  3. I'm not entirely clear on what @SephiRok is reporting. It could be either of the other two issues or even another issue.
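
For the first issue, the expected driver-side behavior is roughly the following (a sketch with made-up names, not any vendor's code):

```c
#include <string.h>
#include <vulkan/vulkan.h>

/* Hypothetical 1.0 driver entry point (names are illustrative). */
static VKAPI_ATTR VkResult VKAPI_CALL icd_vkCreateInstance(
    const VkInstanceCreateInfo *pCreateInfo,
    const VkAllocationCallbacks *pAllocator,
    VkInstance *pInstance)
{
    (void)pCreateInfo; (void)pAllocator; (void)pInstance;
    return VK_ERROR_INITIALIZATION_FAILED;
}

/* A compliant vkGetInstanceProcAddr must return NULL for names it does
 * not recognize -- e.g. "vkEnumerateInstanceVersion" on a 1.0 driver --
 * rather than crash. */
static VKAPI_ATTR PFN_vkVoidFunction VKAPI_CALL icd_vkGetInstanceProcAddr(
    VkInstance instance, const char *pName)
{
    (void)instance;
    if (strcmp(pName, "vkCreateInstance") == 0)
        return (PFN_vkVoidFunction)icd_vkCreateInstance;
    /* ...other known 1.0 entry points... */
    return NULL;
}
```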

That being said, trying to fix issue 2 through a driver update may be a problem. The issue is that an AMD driver was left behind, and that driver crashes without an AMD GPU. Solving this through a driver update probably isn't practical because if we tell people to update their drivers, they'll update the drivers for the card they actually have.

If we need to create a fix for that issue, the only way I see is roughly what @hrydgard described, except doing it when trying to create an instance. Splitting off another process adds complexity, on top of which it sounds like something that people would suspect of being a virus (which we've had people fearful of in the past). And it wouldn't even solve anything without a driver update.

@SephiRok commented Aug 19, 2019

I think it's easy for us to agree that it would be preferable to update or remove the buggy drivers. The problem is that this is not under an application's reasonable control, while distributing an updated loader is trivial enough -- unless I'm missing something.

If the fix is done in the loader, would updating any graphics card driver also install a new version of the loader, essentially working around all the other crashing drivers?

@hg-ctangora (Author)

We've been investigating further and I think these are largely still the issue of vkGetInstanceProcAddr crashing.

We had crash dumps in our reporting that matched the callstack that @SephiRok reported in the AMD driver. On installing that driver version, I did see a crash, but it was the familiar vkGetInstanceProcAddr issue.

Yesterday we pushed a new vulkan-1.dll with the more aggressive workaround (string-comparing the driver filename against "amdvlk" and trusting the api_version field in the JSON if it matches) and we are no longer seeing that crash in the automated reports.

So I don't know why the minidumps were showing that odd crash with no client code in the stack, but I think it was a red herring. Given that this is our most common driver-related issue at the moment I would guess it was the issue @Xeticus encountered as well.

The exception is the callstack in the Intel driver that @SephiRok posted, which does seem to be a different issue.

And yeah, I personally will be making sure to cleanly uninstall any old drivers from now on whenever I change graphics cards 😃 but it's an awfully hard thing to diagnose if you're just trying to play your new game. Getting some kind of workaround in the loader would make "update your drivers" work to fix it, and eventually propagate the workaround out via Windows Update.

@hrydgard

I don't disagree that a vkGetInstanceProcAddr crash might be common, but I have gotten crash logs from users that definitely indicate a crash within vkCreateInstance, in this particular case of old AMD drivers lingering on a machine that no longer has an AMD card (this is when creating a 1.0 instance; I'm not attempting to create a 1.1 instance).

@hg-ctangora (Author) commented Aug 20, 2019

Apologies, the first call to vkGetInstanceProcAddr("vkEnumerateInstanceVersion") does actually happen in vkCreateInstance.

So a crash in that call matches the vkGetInstanceProcAddr issue I've been seeing. (The mystery on my side is why the crash dumps I had for this issue don't show that.)

Do you have any more information about the crashes you've been seeing that might differentiate them?

@hrydgard

Here's one report: hrydgard/ppsspp#11719 (comment)

Seems to be the same one yeah.

@lenny-lunarg (Contributor)

The problem is that this is not under an application's reasonable control, while distributing an updated loader is trivial enough -- unless I'm missing something.

This is easy enough to do, but there's one thing you have to be careful of. If you do distribute a loader, you should really make sure it gets installed system-wide (generally through the runtime installer). The problem with not installing system-wide is that driver vendors don't generally test their drivers against older loader versions. Instead, they rely on the fact that they're shipping a loader with the driver to establish a minimum version. This means that as future driver versions come out, a loader that used to work may stop working as a result of a driver update. For that reason, I'd strongly recommend against shipping a vulkan-1.dll binary that only a single application will pick up.

@lenny-lunarg (Contributor) commented Aug 20, 2019

Based on this issue and some outside discussion, I think I'm going to get two changes ready that should address these issues:

  • I want to do roughly what @hg-ctangora described and stop checking for vkEnumerateInstanceVersion if we have a 1.0 driver. I think I'm probably going to use slightly different logic from what was described, but I'll work that out in a PR. This change should fix what appears to be the more common error, which is the crash on vkGetInstanceProcAddr.
  • I think we should add a change so that if a driver is found through the PnP registries, the loader will not search the system-wide registry (a control-flow sketch follows this list). This should solve the case where an old driver is left behind without an installed GPU. This doesn't seem to be as common a case, but it would still be good to fix.
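
A control-flow sketch of that second change (the types, helpers, and names here are hypothetical stand-ins for the real registry walks):

```c
#include <stddef.h>

/* Hypothetical types and stubs standing in for the real registry walks. */
typedef struct {
    const char *paths[16];
    size_t count;
} manifest_list;

static size_t manifests_from_pnp_registry(manifest_list *out)    { (void)out; return 0; }
static size_t manifests_from_system_registry(manifest_list *out) { (void)out; return 0; }

/* Proposed ordering: if any ICD manifest was registered per-device via
 * PnP, skip the legacy system-wide key so stale entries left behind by
 * a removed GPU are never loaded. */
static size_t discover_icd_manifests(manifest_list *out)
{
    size_t found = manifests_from_pnp_registry(out);
    if (found > 0)
        return found;
    return manifests_from_system_registry(out);
}
```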

I'll try to post PRs that implement these fixes shortly, and then we can evaluate the exact changes.

@MarkY-LunarG (Collaborator)

I'm a little concerned about your first bullet point. I think it's a good idea, but my concern is how you are determining that the ICD is 1.0-only: via the JSON file, or via exposed entry points? I think the ICD JSON files for some drivers that do support Vulkan 1.1 incorrectly list the version as 1.0.x, so we need to be very careful about this change. We should first make sure it's not going to break them by running this by people from the various ICD companies.

@hg-ctangora (Author)

From my point of view as an application developer I'd only be concerned if there were some 1.1 ICDs that actually crashed if the loader attempted to init them as 1.0.

Remember, right now we live in a world where a small but significant number of Vulkan installs just crash without warning! This was our number-four crash on PC at launch, and if you google "amd-vulkan64.json" you will find reports of many other games hitting this same issue.

If we instead have some installs that erroneously report only 1.0 support when they could in fact support 1.1, that is a much less serious problem, especially if upgrading to the latest drivers for your current card fixes it -- the natural thing for an app to do if it wants 1.1 but the loader reports 1.0 is to nag the user to upgrade.

@lenny-lunarg (Contributor)

This issue was resolved in #228 and #237. As far as I'm aware, those two PRs resolve the complete issue. If someone does find something else, I'd recommend creating a new issue.

shannon-lunarg added this to the sdk-1.1.126.0 milestone Oct 30, 2019