Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adjustments for SqlMainDomLock and others to make azure operations more resilient #8398

Merged
merged 17 commits into from
Jul 9, 2020

Conversation

Shazwazza
Copy link
Contributor

@Shazwazza Shazwazza commented Jul 8, 2020

Details are here: #8215

This should also resolve #8392

  • Improvements to SqlMainDomLock
    • Better management of db instances and transactions - previously this was trying to manage a single database instances which isn't what should be done, db instances should exist for a short period of time and disposed of. Changed to more explicit code for managing transactions
    • Increase polling times - though according to logs this wasn't causing problems but is still too aggressive for what it's doing
    • Don't stop listening (shut down maindom) if there are SQL exceptions, only if the appdomain has actually triggered a shutdown. We cannot just stop listening even if there are errors since it will lead to premature MainDom shutdown when the app is not actually being shutdown. This means it will just retry on the next poll.
  • Investigation into scheduled publishing lock request timeouts (also reported here Scheduled content leads to SQL error when rebuilding cache  #8392) which occurs around the same time as when SqlMainDomLock fails
    • The PerformScheduledPublishInternal was using yield returns with a wrapped Scope/IDisposable which is a bit hard to follow along. One issue is that we were yield returning when the Saving event was canceled for an individual item but not continuing so the logic would have proceeded anyways for that item even if that event was canceled and found other similar issues where it would actually break out of the entire process instead of skip to the next record
    • I believe one of the main issues seen with the lock request timeouts like those in Scheduled content leads to SQL error when rebuilding cache  #8392 and in the logs for 8.6.1 - Azure Web App - Indexes disappear completely #8215 is because the PerformScheduledPublishInternal takes a WriteLock at the very beginning of the method. This is the write lock that times out according to the stack traces. This is probably because scheduled publishing runs every minute which is actually a bit crazy to think we lock the whole content tree every minute even if we aren't writing anything. And because of that I think this may also be one of the other main suspects as to why we saw SqlMainDomLock error around the same time we see the sql lock timeout error for scheduled publishing. To fix this:
      • We need a new and much faster method: DocumentRepository.HasContentForRelease instead of only relying on the potentially expensive DocumentRepository.GetContentForRelease, similarly for GetContentForExpiration we should also have HasContentForExpiration. Then we separate the job of PerformScheduledPublishInternal into 2 parts: First, without any read or write locks just check if either HasContentForRelease or GetContentForExpiration is true, then we can take a write lock and proceed as we are today. This will mean that there isn't a writelock on all content attempted to be made every minute.
  • Don't short circuit index rebuilding if one of the populators throws an exception, continue with the other populators
  • Brings our sql transient fault handling in-line with the latest specs from MS

@Shazwazza Shazwazza changed the base branch from v8/contrib to v8/dev July 8, 2020 03:10
@Shazwazza Shazwazza force-pushed the v8/bugfix/sqlmaindom-updates branch from 56477d9 to 651756d Compare July 8, 2020 03:44
@Shazwazza Shazwazza changed the base branch from v8/dev to v8/8.6 July 8, 2020 03:44
@Shazwazza Shazwazza marked this pull request as ready for review July 8, 2020 08:17
Copy link
Member

@bergmania bergmania left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes looks good, and makes sense due to your description..

Only thing I don't understand/think is intentional, is the delete of the ssl-port

src/Umbraco.Web.UI/Umbraco.Web.UI.csproj Outdated Show resolved Hide resolved
@nul800sebastiaan nul800sebastiaan added this to the sprint140 milestone Jul 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants