The process cannot access the file NuCache.Content.db because it is being used by another process #5035
Comments
Is this on Umbraco Cloud, or on pure Azure? Where is |
This is in standard Azure on a basic plan with single instance only, there’s no special configuration at all, pretty vanilla Umbraco 8 with hacked up starter kit |
The same config principles apply to v8 as v7 for running on Azure. We definitely need to get the docs updated (PRs are welcome!). The current Azure docs are here: https://our.umbraco.com/Documentation/Getting-Started/Setup/Server-Setup/azure-web-apps - for v8 you will need these in appSettings:
|
Interestingly enough, there are no Examine config files in this website - I hadn't noticed that before... is that normal? Out of the box, shouldn't this just work? Do I need to add the files? |
There are no Examine config files in v8. OOTB, just like v7, you need to adjust some config to make Umbraco work with Azure. So the above 2 config values are needed. These are equivalent to the v7 appSetting |
The 2 config values are just in your web.config, these are just |
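For readers following along, here is a minimal sketch of how two such entries could sit in the `appSettings` section of web.config. It is not a quote of the snippet that was lost from the comment above. The key names and values are assumptions: `LocalTempStorage` / `EnvironmentTemp` and the `LuceneDirectoryFactory` setting are both mentioned later in this thread, and the full key names and the factory type name follow the Umbraco 8 Azure documentation as best recalled, so check them against the current docs before deploying.

```xml
<!-- web.config fragment (sketch; key names and values are assumptions) -->
<appSettings>
  <!-- store NuCache and other temp files under the process's %TEMP% folder -->
  <add key="Umbraco.Core.LocalTempStorage" value="EnvironmentTemp" />
  <!-- have Examine keep a working copy of its Lucene index in the local temp folder -->
  <add key="Umbraco.Examine.LuceneDirectoryFactory"
       value="Examine.LuceneEngine.Directories.SyncTempEnvDirectoryFactory, Examine" />
</appSettings>
```

If the site already has an `appSettings` section, the two `add` elements would be merged into it rather than added as a second section.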
ah, right - good to know :) |
so is it right to assume these are going to be automatically added in future, or are they only in certain situations? |
No, you cannot assume that. Just like deploying v7 to Azure, you need to set up specific config for that; it's no different in v8. If, however, you create an Umbraco website from the Azure portal, then yes, these should be pre-configured for you, but I don't think we have a v8 build on the Azure portal yet. |
Thanks for explaining the required config settings. But in v8, even with the above settings, the exact same error happens when you enable Slot Swap in the Azure App Service: before the swap both slots are working OK, but after the swap the new "Production" slot throws the error every time. A full App Service restart is needed to get things working again. If you work with a single slot everything is OK, |
That's interesting, will assume the same problem exists in v7 too since it's the same paradigm. I think the only way around such behavior is to have an option to not have a persisted cache file, or name the cache file based on the AppDomainAppId + MachineName (should be unique among processes). @zpqrtbnk what do you think here? |
Note: We have a site on Umbraco 7.13.2 running on Azure with slot swapping (and the "LocalTempStorage" ="EnvironmentTemp" app setting) and we never run into this problem with the XML cache file. |
NuCache stores its file in I don't know much about "slot swaps" in Azure. But, if they are running on the same machine, same application id, same site name, same temp dir then... there might be a collision? That would require some troubleshooting to get it right. But then, as you mention, it's not only NuCache but other things too. Now indeed, the difference is that v8 locks the NuCache files for as long as it's running where v7 was "using" the Xml file from time to time = the two sites may cohabit (but that was a bad idea). So... to make it short, I'd like to hear more about Slot Swaps and, on all slots, get the value of |
@zpqrtbnk thanks for looking into this. I will post today the requested information about the Slot Swaps and those values. |
This is what I got on each slot, before and after the swap:
And this time the error was not thrown.
I am positively sure that this setting was in effect when the error was thrown:
but this one was not (may this be the issue?)
In my latest test (no error) both settings were in effect. |
The Lucene setting has no impact on NuCache. The fact that you are seeing an error such as NuCache uses the built-in MainDom mechanism to ensure that only one app domain at a time can own the cache (and the associated files). MainDom uses a machine-wide named lock; the name is built by combining the application id and the application physical path. (so obviously I should also have asked you the A site cannot, in theory, even try to access the NuCache file until it has acquired the MainDom lock. Therefore, for the error to happen... the site must own the machine-wide lock on (app.id, app.path) and yet someone else must lock the files at (app.id, site.name), meaning
Any chance you can get the And... I don't know enough about "slots" to figure out the physical server thing. Ideas? |
Just tested it. The Regarding the physical server... it is not fully documented, but many posts assume it is the same because all slots share the same App Service Plan = the same set of resources (i.e. 3.75 GB RAM). |
So... assuming that all sites run on the same physical server (for now), that leaves us with app.path.
Did not realize you posted a log file with the original issue - now looking at that file. |
@zpqrtbnk - the log file is not mine, so it does not apply to the slot issue. Thanks. |
In any case, for some mysterious reason the "process cannot access the file NuCache.Content.db" error has not appeared lately, in any of the slot swaps during the last three days. |
Thanks for the update. Even though it's not your log... the significant lines are:
Where we see that a new process (9472) starts while the previous one (9488) is running, and acquires the MainDom lock before the previous process releases it, thus hitting an exception when trying to read the cache. That should not be possible... If anything happens again, thanks for reporting. Meanwhile, I've made a few changes so that in 8.0.2 we log all the important infos (app.path, app.id, etc) when the site boots. |
Hi @zpqrtbnk Note that in this case:
UmbracoTraceLog.RD2818784FF824.20190409.zip I'm running v8.0.1 installed via the nuget.org package. Hope it helps. |
@zpqrtbnk - I could further isolate the problem. I performed the following steps:
So it looks like this is related to the Azure Slot swap - although it is the same machine, for some reason the process starts over without having shut down the previous one. |
One last comment about the slot swap process: |
Hey - thanks for the details - just FYI I am away at the Barcelona meetup, with little time for this issue - will resume work on it at the end of the week (so don't feel bad if I don't reply). |
Or maybe not - further experimenting. |
I have created an Azure App Service hosting a simple MVC web app, with two slots. I had to have one per-slot appSetting in order to force slot swaps to trigger a restart. Then I added our MainDom system and some temp file lock to the app... and I can swap slots, and see the MainDom lock being properly released, and swaps happening without issues. I have to accept that something is wrong, considering your log, but for now I am running out of ideas. During the swap, on my test, the old process terminates and then a new process starts. In your log, the new process starts before the old process terminates, but that should not be a problem: this is precisely why we have the MainDom lock. Is this happening every time you swap, or only from time to time? Is this on a production / live site? To what extent would it be possible for you to run some custom DLLs which would include more tracing / logging? |
@zpqrtbnk, I have not changed anything in Umbraco since then - the same package 8.0.1 is used. I'm almost sure the difference is that, while trying to work around the error, I completely disabled Application Insights on Azure - it was enabled by default when I first created the App Service, and I did not opt out of it as I usually do. To verify whether this was the cause, I enabled Application Insights again in both slots and something strange happened:
So looking in retrospect, now my thoughts are:
Current situation:
Thanks for your help! |
Thanks for keeping up with the detective work ;-) I have tried to experiment with enabling/disabling Application Insights on my test app, but still cannot reproduce the out-of-order lock management that you see. Happy that it works for you, but it annoys me. Will lower the priority of this issue, but still, will try to send you a DLL to test, that would log way more details about what is going on. Stay tuned! |
This: Debug.zip contains a patched DLL that just logs infos when acquiring the main lock (process id, but also user name, app id and maindom hash) - if you have a moment to try it... |
There will always be more than one app domain in .NET Framework (since very early days) when an appdomain 'restarts', because when you bump the web.config or restart, another appdomain starts while the current one is unwinding. An appdomain cannot be 'restarted'; it can only be started or shut down. .NET Framework makes us think it's 'restarting', but it's just another appdomain starting while another is ending. You can in fact have several appdomains running at once if you have a constant shutdown loop. IIS also plays a part in this, depending on how it's configured. This behavior is baked into .NET Framework. This is why it's difficult to synchronize access to files that maintain locks (i.e. Lucene, NuCache, etc...) in .NET Framework. This is why MainDom exists. The default MainDom uses a system-wide named semaphore lock. This type of lock doesn't work in Azure or Linux, which is why we use the DB as the distributed lock with SqlMainDomLock. Which MainDom to use will depend on your setup, but it sounds like perhaps SqlMainDomLock should be used in your scenario. Keep in mind, you cannot 'share' (i.e. network share) files between 2 IIS instances without some very particular configurations. If you switch the underlying physical path of a site while it's running (not sure how that can be done without a restart), then it won't have access to the original files that are locked in order to unlock them. File locks are always done by the OS. Once they are locked, the OS will keep them locked until the original thing that locked them unlocks them. If you use the
like you would on Azure, this will store these types of files in the current process's %TEMP% directory. This is set using environment variables so it's possible you can redirect this anyway. I don't really understand why making individual folders for logs, Lucene and NuCache would help in any way. This locking issue sounds to me like a hosting configuration issue due to how deployments are being done. |
I do see the scenario where one app domain is in the process of shutting down while another starts, but are you saying that it's normal for the startup to run in two different app domains at the same time, like my logs are showing? Or is this the shutdown loop that you are referring to? The logs that you see are from one website on one machine, nothing shared over the network etc. But the "closing" app domain (which is the same website) of course logs to the same folder. The change of the "Physical Folder" path of the website in IIS will trigger a restart, and I can see in the logs that the MainDom lock is released (the log says something like "MainDom lock released"). But during startup the boot process throws because the NuCache files are still in use by something. Looking at the implementation in the Umbraco source, I can see that the release code does dispose the BTree classes, so given that, the files should be "free/released" when the site starts. The thing that makes me wonder is that it seems from the logs as if two different app domains are starting at the same time. This could potentially be the issue: one of these keeps the lock on the files while the other fails. But I don't understand why two app domains would start at the same time. One option for us is to rebuild the NuCache from the db after each release. This would work fine for smaller sites, but the current site we're migrating has lots of content (NuCache files around 25 MB), so it would be very nice to find a way to avoid having to rebuild this during startup. But given that this issue is hard to nail down, if we rebuild the NuCache it's useful to be able to put e.g. the Examine indexes in another folder to avoid having to rebuild these as well (I know that the LuceneDirectoryFactory can solve this).
An interesting thing here is that in the setup we have problems with, both NuCache and Examine files are shared between the releases - but it's always the NuCache files that are used by something else and that make the newly started website throw. But at the end of the day, this does not look like a MainDom issue, more an issue with the NuCache files and the locks on them, either because they are not released or because something in the startup locks them. Edit: |
Looking more at this now. I actually created a deploy script that copies the NuCache folder and the DistCache folder in TEMP from the old release folder to the new release folder, to avoid the lock issues on the files. Still, the startup will run on 3-4 different AppDomains simultaneously; some of the AppDomains will say that the local db does not exist (but I'm 100% sure that the file is there), and the last AppDomain to start will find the cache, yet the first 3 AppDomains will try to build the cache again. This happens both when deploying via Octopus and when just recycling the App Pool. It only happens when the site is under load; the stage environment works as I would expect, using the copied temp files. I'm actually not sure if this is an Umbraco issue. It feels strange that the AppDomain's Init method is executed multiple times simultaneously, as according to the docs it should only run for the first request to the app after a restart. Another interesting thing is that we get about 300-400 log files output in the log directory during the 60-70 seconds it takes for the site to start again after the deploy. The problem with multiple log files is something that I have not seen in previous versions, e.g. we run a similar site on 8.8.1 with quite some load and that site only outputs one log file. I will admit that this is on a different server, different IIS etc. |
Could you maybe take a step back and simplify your release? - as your approach is definitely non-standard - doing anything non-standard is likely to get you into a death spiral situation. My approach is just - stop the website, copy new release into previous release folder overwriting existing files (so you automatically get re-use of the existing TEMP files for nucache and examine - which are not included in the new release), and start the website back up. That's it. No messing about with recycling app pools or copying folders of temp files or re-pointing IIS to a new folder. Is there a good reason you cannot take my approach? |
@John-Blair The approach is the default behavior in Octopus Deploy: it basically creates a new folder with the new release and changes the website's "Physical Folder" property to point to the new folder. The thing is that we would like to avoid stopping the server beforehand, to avoid ugly errors for visitors. Our goal is to make the folder switch and have the site restart; a visitor would just interpret this as a slow page load. It's not a manual recycle but something that happens as a side effect of the folder switch - as you might know, a change to settings on AppPools or websites can result in restarts of the web app. @Shazwazza So anyway, I continued my research around this. From what I understand, the deployment process with Octopus makes multiple changes to different config files after the "Physical Folder" change. This results in multiple very rapid restarts of the website, and since the site is under load, after each config change a small number of requests might start to be processed. Since IIS will continue to serve these requests in the AppDomain that received them, it starts to boot Umbraco in multiple AppDomains. The booting of Umbraco does not lock between AppDomains, so each AppDomain (resulting from the numerous config changes by Octopus) will continue to boot Umbraco. I guess this is why I see a lot of log files during startup. I've tried to create a custom build with a lock around this code:
    internal void HandleApplicationStart(object sender, EventArgs evargs)
    {
        // ******** THIS IS WHERE EVERYTHING BEGINS ********
        lock (_startupLock)
        {
            // create the register for the application, and boot
            // the boot manager is responsible for registrations
            var register = GetRegister();
            _runtime = GetRuntime();
            _runtime.Boot(register);
        }
    }
But this did not work, neither with a MarshalByRefObject nor with a Mutex - I really don't get why it's booting multiple times. At the end of the day, the way that we are deploying is kind of special, and the fact that the site starts many times in a short period of time is probably something that is very hard to handle in the Umbraco core code. As of now, I'll stick with my copy of the cache and let the boot rebuild it at random times - the lock issue is gone now that we copy the cache before each release. |
ok....fyi we used octopus deploy - 5 years ago not now - and we did not create a new folder for each release we re-used the same folder - so maybe checkout the config options on the version of octopus deploy you are using. |
wrt not stopping the server...the way we approached that was have the site running on 2 servers and we changed the in-house dns config to switch to the standby server (which ran in read only mode) during an upgrade and then switch back once the upgrade was done and tested. |
@John-Blair Each release gets its own folder so that there is always an easy way to roll back to a working state if something goes wrong. Octopus would basically switch the "Physical Path" of the website back to the old release if we roll back. This is the standard Octopus setup with IIS, and we will not deploy to the same folder - that is not an option for us. We could stop the website before the release, but this would mean that the site starts returning "Service Unavailable" to visitors, and the total downtime would be longer using this approach. It would be an option, but a bad one. The root problem is that the locks on the files do not get released when the app pool recycles. We have plenty of ways to work around that, but I guess that the main idea of this issue here on GitHub is to figure out and solve the root cause, not to provide workarounds. The DNS switch sounds like a nice setup, but in our case we have around 15-20 sites and I would like to keep the deploy process as simple as possible. |
Hi, The errors occur on app restarts, deploys and on reindexing. We are seeing a couple of errors in the logs.
|
I'm also running into this NuCache error with a load-balanced Azure site. In my case, I suspect it's a latency issue, as the database is based in the US, but the only front-end server that has had this issue is in France. If that is the cause, one potential solution would be to add support for Azure Cosmos. Not sure that is feasible to implement, since it is a NoSQL database, but it should improve database read speeds for multi-region sites. |
@jaandrews which MainDom lock implementation are you using? For Azure this should be configured to SqlMainDomLock |
@p-m-j
in the web config. Site is running umbraco 8.17.1 with cloudflare acting as the load balancer in front of azure. |
@jaandrews presumably you also have EnvironmentTemp for LocalTempStorageLocation? Any other details you can provide about the issue: is this on startup, or does it just happen with no obvious trigger? |
@p-m-j Yeah, LocalTempStorage is already set to EnvironmentTemp. As for the error I got, it's the same one as at the beginning of this thread. After taking a second look at that log, I confirmed the error I encounter stems from a timeout error. The error message is "The thread has been aborted, because the request has timed out." which is then followed by the nucache error (see |
We received this error yesterday as well with our Azure App Service. App was running, then failed to boot. A complete restart of the App Service was necessary to fix this.
Are both within the web.config, and Umbraco.Core.MainDom.Lock = SqlMainDomLock |
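As a sketch of what the MainDom setting named in the comment above could look like in web.config: the key `Umbraco.Core.MainDom.Lock` and the value `SqlMainDomLock` are taken from that comment, while placing it in the `appSettings` section is an assumption based on how the other Umbraco 8 settings in this thread are described.

```xml
<!-- web.config fragment (sketch; placement in appSettings is an assumption) -->
<appSettings>
  <!-- use the database as the MainDom lock instead of the machine-wide named lock -->
  <add key="Umbraco.Core.MainDom.Lock" value="SqlMainDomLock" />
</appSettings>
```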
@chriskarkowsky Is there a timeout error in front of that error in your logs (it would look something like the logs I linked to in my previous comment)? If so, increasing the executionTimeout of the httpRuntime in the web.config might help. I've been in touch with Umbraco support and that's one of the solutions they've suggested. Makes sense, but I still need to try it myself. So it would be something like
which would have it timeout after 30 minutes. Not ideal, but would avoid the site crashing if it works as expected. |
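A minimal sketch of such a timeout setting, assuming a standard ASP.NET (.NET Framework) web.config; this is not the snippet from the comment above, which was lost. The `executionTimeout` attribute of `httpRuntime` is expressed in seconds, so 30 minutes corresponds to 30 × 60 = 1800.

```xml
<!-- web.config fragment (sketch; not the commenter's original snippet) -->
<system.web>
  <!-- executionTimeout is in seconds: 30 min * 60 = 1800 -->
  <httpRuntime executionTimeout="1800" />
</system.web>
```

If the site already has an `httpRuntime` element, the attribute would be added to it rather than declaring a second element. ASP.NET only enforces this timeout when `debug="false"` is set on the `compilation` element.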
@jaandrews I have this timeout exception: However, the timestamp was about an hour before the Boot Failure messages. It seems in your case it was right before the Boot Failure. What I find strange is that it has a Boot Failure log on the same ProcessID as it was currently running on. Does this mean Umbraco is trying to boot itself again in the same process? Restarting the App Service would spin up a new process and boot, which I'm guessing is what fixed my case. |
Just wanted to note that we have been experiencing this issue consistently over the past week with v9.3.0 not only on an Azure AppService, but also in local development (VS2022). The only thing that made any difference was disabling the Disk Cache databases as per the documentation on Umbraco. (https://our.umbraco.com/Documentation/Reference/V9-Config/NuCacheSettings/) |
Thank you! I have been having the same problem with Version 9.3.0! |
I can't help feeling that something changed in 9.3.0 that has exacerbated the issue - we only started seeing this behaviour when we upgraded from the previous version... |
I just had this re-occur in Umbraco Cloud on Umbraco 8.17.2:
reading through the logs, I see this a little further down right before I get the NuCache errors above (~ every 4 seconds):
|
@p-m-j I'm asking because this issue is re-opened: does that mean that Umbraco are actively investigating again? |
Apologies, re-opened on a whim because it's clear from the activity there are still problems in this area, I shouldn't have re-opened as this issue already has an associated fix with a version number. If you have steps to reproduce please open a new issue. In this case, we fixed #5924 for version 8.1.1 - which wasn't enough, but that's why this issue is closed and included in the release notes. Sorry again for any confusion. |
@p-m-j we are using Umbraco version 8.12.2. The site had been running for 8 months until this morning, when we got this issue:
We have these settings in Azure:
I restarted the Azure web app to make it work. There is only one server. We fear that this will happen again. What can we do? |
@iqb-dawn if it’s a small site the easiest thing that guarantees you will never get exceptions thrown relating to the nucache physical files is to just turn off the nucache physical files see example here https://our.umbraco.com/documentation/Reference/V9-Config/NuCacheSettings/#additional-settings, the only down side being for a large site it may take longer to boot. |
@p-m-j as per above
Switching off the file creation is a v9 feature AFAIK. |
The same setting was around in v8, the documentation for it is hidden away on this page https://our.umbraco.com/Documentation/Fundamentals/Setup/Server-Setup/Load-Balancing/azure-web-apps-v8 (search this page for PublishedSnapshotServiceOptions) I'm going to lock conversation here, please open a new issue if you have steps to reproduce or detailed logs. |
PR: #5924
Something is happening in Azure WebApps where the NuCache.Content.db file is locked causing the site to hang. I've attached the log file for reference.
The exception is as follows:
This is on Umbraco 8.0.1, which was upgraded from Umbraco 8.0.0. (The attachment's extension will need to be changed from .txt to .json.)
UmbracoTraceLog.RD2818786B7D96.20190319.txt