Add alert banner for all MR blocks, using info from both repair task and MR jobs #850

Open
chensation opened this issue May 11, 2023 · 0 comments

Collaborator

chensation commented May 11, 2023

Blocks

  1. Infrastructure Service (IS) throttles updates to ensure safety, which prevents job approvals
    Impact: MR jobs do not get approved if the number of already-executing updates exceeds the predefined safe threshold.
    Indication Today: None.
    Banner Message:

    • "The number of active updates has exceeded the maximum allowed for a safe rollout of updates. Once the existing updates complete, the pending updates will start automatically. Track the status of Active: List<repairTask/MRJob>. Pending: List. To forcefully allow an MR job to go through, connect to the SF cluster and execute this command: Invoke-ServiceFabricInfrastructureCommand -ServiceName <InfrastructureService Name like fabric:/System/InfrastructureService/nt1> -Command AllowAction:<MR_job-id_guid>:*:Prepare"
  2. Long-running MR jobs
    Impact: Long-running jobs cause other jobs to be throttled, so they do not get approved.
    Indication Today: Repair task executing for a very long time (more than 2 hours).
    Banner Message:

    • "Repair task X has been executing for Y amount of time, which doesn’t seem normal. This update can prevent other updates from going through. Please reach out to the Azure Compute teams (“Compute Manager/Blackbird”) to figure out why the platform updates are not completing."
  3. Safety checks blocking approvals (due to service health or min replica configuration issues)
    Impact: Repairs are stuck in preparing and MR jobs are not approved.
    Indication Today:
    a. Repair tasks get stuck in the preparing state while disabling the nodes.
    b. The node lists the reason for the failing safety check, which prevents disabling.
    Banner Message:

    • "Repair task X has been stuck in the preparing state for Y amount of time. This usually happens for one of the following reasons:
      a. Service health issues. Please check the health of the service on the node: List and fix the service to unblock the updates.
      b. Service replica configuration for max/min replica count. Updates will not go through if the min replica configuration can’t be ensured."
    • Link to the public doc that covers these cases.
  4. Safety checks blocking approvals (Seed node removal)
    Impact: Repairs are stuck in preparing and MR jobs are not approved.
    Indication Today: Seed node in disabling state and repair task in preparing state.
    Banner Message:

    • “Repair task X has been stuck in the preparing state while trying to disable seed node Y for removal. This is blocked by design to prevent any risk to cluster availability. There are multiple options for getting out of this state. Please follow the doc for details:
    • {link to the public doc}”
  5. Health checks blocking updates
    Impact: Repairs are stuck in the preparing/restoring health checks and MR jobs are not approved.
    Indication Today: Repair tasks stuck in the preparing/restoring health check state.
    Banner Message:

    • “Repair task X has been stuck in the preparing/restoring health check state due to cluster health issues. This is expected when the preparing/restoring health checks are enabled in this cluster and any entity is unhealthy. Please ensure that all entities in the cluster, such as nodes and services, are healthy so that this check passes and the updates can proceed.”
  6. Customers deploying fewer than 5 VMs with MR
    Impact: MR jobs can’t execute reliably.
    Indication Today: None.
    Banner Message:

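The detection rules for blocks 1 and 2 above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `RepairTask` and `Banner` shapes, the `buildBanners` and `formatDuration` helpers, and the `maxActiveUpdates` parameter are hypothetical names, not part of any existing API, and the real threshold used by the Infrastructure Service is not given in this issue. Only the 2-hour limit for long-running tasks comes from the text above.

```typescript
// Hypothetical, simplified shapes for illustration only.
interface RepairTask {
  id: string;
  // Simplified subset of repair task states.
  state: "Preparing" | "Approved" | "Executing" | "Restoring" | "Completed";
  // Milliseconds the task has spent in its current state.
  msInState: number;
}

interface Banner {
  block: "throttled" | "longRunning";
  message: string;
}

// Block 2 threshold from the issue: "more than 2hrs".
const LONG_RUNNING_MS = 2 * 60 * 60 * 1000;

// Formats a duration in milliseconds as "<hours>h <minutes>m".
function formatDuration(ms: number): string {
  const totalMinutes = Math.floor(ms / 60000);
  const hours = Math.floor(totalMinutes / 60);
  const minutes = totalMinutes % 60;
  return `${hours}h ${minutes}m`;
}

// Returns the banners to show for blocks 1 and 2.
// maxActiveUpdates is an assumed input; the actual safe threshold is not specified here.
function buildBanners(tasks: RepairTask[], maxActiveUpdates: number): Banner[] {
  const banners: Banner[] = [];
  const active = tasks.filter((t) => t.state === "Executing");
  const pending = tasks.filter((t) => t.state === "Preparing");

  // Block 1: the number of executing updates exceeds the safe threshold.
  if (active.length > maxActiveUpdates) {
    const activeIds = active.map((t) => t.id).join(", ");
    const pendingIds = pending.map((t) => t.id).join(", ");
    banners.push({
      block: "throttled",
      message:
        "The number of active updates has exceeded the maximum allowed for a safe rollout of updates. " +
        `Active: ${activeIds}. Pending: ${pendingIds}.`,
    });
  }

  // Block 2: one banner per task that has been executing for more than 2 hours.
  for (const task of active) {
    if (task.msInState > LONG_RUNNING_MS) {
      banners.push({
        block: "longRunning",
        message:
          `Repair task ${task.id} has been executing for ${formatDuration(task.msInState)}, ` +
          "which doesn't seem normal. This update can prevent other updates from going through.",
      });
    }
  }

  return banners;
}
```

Calling `buildBanners(tasks, maxActiveUpdates)` with the current task list would return zero or more banners. Blocks 3 to 5 would need additional health and safety-check information that this sketch does not model.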