Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race on eventually consistent operations of provision/deprovision of Project's k8s resources [e.g. service, ingress] #114

Closed
elenaviter opened this issue Sep 21, 2023 · 4 comments · Fixed by #115
Labels
needs-triage Needs looking at to decide what to do

Comments

@elenaviter
Copy link
Contributor

Current Behavior

Symptoms

After Project is created/started: project is indicated as up an running but it is unreachable via Editor link.
For ops its very obvious because they can see the ingresses and simply detect the problem. But this is not obvious given with FlowFuse interface only where the project is stated as running and the link to the Editor is available but never get to work.

Problems description

There are 2 problems with the way how k8s resources are provisioned and deprovisioned, for the Project:

  1. Dependent provisioning operations are done concurrently instead of sequentially.
  2. In deprovision scenario (e.g. "suspend"), the state of the project is changed optimistically w/o guarantee (and handling) of the actual state of the resources it is bound to, which allows the clients to change the state of the project further while the previous operations are still in progress.

Problems detailed description

  1. In "createProject", it's a race condition between the operations that are almost simultaneously pushed onto concurrent promises list which allows them to execute in any order and priority. Whilst creation of a service depends on creation of deployment, and creation of ingress depends on creation of a service. This can lead to situation when the ingress won't be created in first place because service creation is yet not started or in flight, and k8s API will return, in attempt to create the ingress for service, "Not found" error.
    E..g instead of dealing with a list of concurrent promises

    return Promise.all(promises).then(async () => {...}

    must be smth like

    await this._k8sAppApi.createNamespacedDeployment(...);
    // poll, verify it's there
    await this._k8sApi.createNamespacedService(...);
    // poll, verify it's there
    await this._k8sNetApi.createNamespacedIngress(...);
    // poll, verify it's there
    // Done with resources
    this._app.log.debug(`[k8s] Container ${project.id} started`)
    project.state = 'running'
    await project.save()
    this._projects[project.id].state = 'starting'
  2. In "suspend" project handler, the operations of deprovisioning service and ingress are not guaranteed to be accomplished as a result of the handler execution (however the project's state is moved to "suspended" which unlocks consequential state change actions on it, by the clients).
    This problem is introduced by this commit which aims to deprovision the networking resources associated with the project when the project is suspended.
    k8s operations that provision/deprovision/modify resource (e.g. 'delete service', 'delete ingress') are "promises" for the client of this API. They just send the request onto queue of requests for the relevant k8s service. Thus in order to guarantee the operation was completed, corresponding resources should be polled in order to ensure they are left in desired state.
    This polling is already done in "suspend" handler for deployments but not for service and ingress resources.

Additional approach when concurrently work with [k8s] resources is the "[Kubernetes] Resource Versioning" so that the originator of the change would know if the change it conducted was applied ('no-lock write attempt').

Expected Behavior

  • Operations of provisioning of k8s resources are done with respect to the dependency between the resources
  • Whenever Project's operation can assume concurrent modification of k8s resources, such operations side effects should be guaranteed, e.g. they have to be "mutex-ed".
  • Project's state reflects the actual status of the deployment, including the bound [k8s] resources.

Steps To Reproduce

This is especially simple to reproduce when you try to change the stack for the project.
Starting from recently, changing the stack automatically suspends and then starts the project.

Above steps 100% lead to the Project unreachability via ingress after attempt to change the stack.

To recover, one has to suspend the instance, wait for a bit and then start it again.

Environment

  • FlowForge version: 1.11.3 (kuber driver 1.11.0)
  • Node.js version: any (in our case its 18.16.0)
  • npm version: any (9.5.1)
  • Platform/OS: any OS.
  • Browser: any

k8s setup

@elenaviter elenaviter added the needs-triage Needs looking at to decide what to do label Sep 21, 2023
@hardillb
Copy link
Contributor

Thanks for raising this. I'll have a look.

I've not seen this on my local deployment or on FlowFuse Cloud. I assume you are running on a K8s cluster with a lot more going on which is causing slight delays in provisioning artefacts?

hardillb added a commit that referenced this issue Sep 25, 2023
@hardillb hardillb mentioned this issue Sep 25, 2023
11 tasks
@elenaviter
Copy link
Contributor Author

Thanks for raising this. I'll have a look.

I've not seen this on my local deployment or on FlowFuse Cloud. I assume you are running on a K8s cluster with a lot more going on which is causing slight delays in provisioning artefacts?

True about delays.
Actually, I have 5 clusters now, they are very different in their nature and scale.
The problem is 100% reproducible on each of them. it's good case to cover.

@hardillb
Copy link
Contributor

@elenaviter can you have a look at the linked PR, I think that should fix the problem, but as I can't reproduce the issue it's hard to test. If you can build locally and test that would be even better.

Thanks.

@hardillb hardillb linked a pull request Sep 29, 2023 that will close this issue
11 tasks
@elenaviter
Copy link
Contributor Author

Hey @hardillb

I hope this message finds you well. My apologies for the delayed response due to a recent illness. I'm now back and have had the opportunity to review the PR in question.

Observations on the PR

Upon review and subsequent adjustments to the PR, the initial issue - "after suspension, the project cannot be started normally" - has been resolved.
However, a subsequent challenge has emerged related to the discrepancy between refreshing auth tokens for the Project and the actual application of these new tokens in Project deployment.

Identified Issue: Token Refresh and Deployment Discrepancy

The refreshed tokens are optimistically saved to the database during the assembly of metadata for project deployment within the createDeployment() routine:

Copy code
const authTokens = await project.refreshAuthTokens()
...
await project.save()

However, the deployment, where these tokens are intended to be applied, does not necessarily succeed. The attempt to create a k8s deployment may fail if, for some reason, the deployment existed previously.

The provisioning of k8s deployment will assuredly fail during FF upgrade if running projects exist due to the following sequence:

  • FF starts, and the driver "init" command executes.
  • This triggers a probe for registered projects, expecting those running as containers to be reachable with the "readNamespacedDeployment" k8s API.

Here there's a bug, according to which we searched for deployments in the incorrect namespace giving the "project does not exist" result, and consequently leading to createProject execution for all projects registered in FF (except suspended ones), thereby invalidating the auth tokens for those projects without reapplying refreshed tokens to deployments.

Consequently:

Existing deployments continue running with old, now unusable auth tokens. This prevents FF from communicating with the Editor using new auth tokens and client id, and the launcher (if needed) from communicating with FF using the old auth token, according to the one it has in its deployment.

I will make PR to your branch that solve both issues

  • projects auth unsync after upgrade
  • projects deprovision/reprovision problem

elenaviter added a commit to elenaviter/flowforge-driver-k8s that referenced this issue Oct 11, 2023
elenaviter added a commit to elenaviter/flowforge-driver-k8s that referenced this issue Oct 11, 2023
@elenaviter elenaviter mentioned this issue Oct 11, 2023
11 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-triage Needs looking at to decide what to do
Projects
Status: No status
Development

Successfully merging a pull request may close this issue.

2 participants