Race on eventually consistent operations of provision/deprovision of Project's k8s resources [e.g. service, ingress] #114
Comments
Thanks for raising this. I'll have a look. I've not seen this on my local deployment or on FlowFuse Cloud. I assume you are running on a K8s cluster with a lot more going on, which is causing slight delays in provisioning artefacts?
True about delays.
@elenaviter can you have a look at the linked PR? I think that should fix the problem, but as I can't reproduce the issue it's hard to test. If you can build locally and test, that would be even better. Thanks.
Hey @hardillb, I hope this message finds you well. My apologies for the delayed response due to a recent illness. I'm now back and have had the opportunity to review the PR in question.
Observations on the PR
Upon review and subsequent adjustments to the PR, the initial issue ("after suspension, the project cannot be started normally") has been resolved.
Identified Issue: Token Refresh and Deployment Discrepancy
The refreshed tokens are optimistically saved to the database during the assembly of metadata for project deployment:
const authTokens = await project.refreshAuthTokens()
...
await project.save()
However, the deployment, where these tokens are intended to be applied, does not necessarily succeed. The attempt to create a k8s deployment may fail if, for some reason, the deployment existed previously. The provisioning of the k8s deployment will assuredly fail during an FF upgrade if running projects exist, due to the following sequence:
Consequently:
Existing deployments continue running with old, now unusable auth tokens. This prevents FF from communicating with the Editor using the new auth tokens and client id, and the launcher (if needed) from communicating with FF using the old auth token it has in its deployment.
I will make a PR to your branch that solves both issues.
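One way to picture the ordering problem described in this comment is the minimal sketch below. It is not the code from the linked PR. It assumes that refreshAuthTokens() only generates tokens and that save() is what persists them, which may not match the real model; applyDeployment is a hypothetical callback standing in for whatever creates or replaces the k8s deployment.

```javascript
// Hypothetical helper (not FlowFuse code): persist the refreshed tokens only
// after the deployment that carries them has been applied successfully.
async function deployThenPersistTokens (project, applyDeployment) {
    // Assumption: this call generates new tokens but does not persist them yet
    const authTokens = await project.refreshAuthTokens()
    // Throws if the cluster rejects the deployment, so save() below is skipped
    await applyDeployment(authTokens)
    // Reached only when the running deployment has been given the new tokens
    await project.save()
    return authTokens
}
```

With this ordering, a failed deployment leaves the previously stored tokens in place, so the database and the running deployment stay in agreement.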
…ost upgrade project auth sync fix
Current Behavior
Symptoms
After a Project is created/started, it is indicated as up and running but is unreachable via the Editor link.
For ops it's very obvious, because they can see the ingresses and simply detect the problem. But it is not obvious from the FlowFuse interface alone, where the project is shown as running and the link to the Editor is available but never works.
Problems description
There are 2 problems with the way k8s resources are provisioned and deprovisioned for the Project:
Problems detailed description
In "createProject", there is a race condition between operations that are pushed almost simultaneously onto a list of concurrent promises, which allows them to execute in any order. However, creation of a service depends on creation of the deployment, and creation of the ingress depends on creation of the service. This can lead to a situation where the ingress is never created because service creation has not yet started or is still in flight, and the k8s API returns a "Not found" error in response to the attempt to create the ingress for the service.
E.g. instead of dealing with a list of concurrent promises
it must be something like
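The snippets originally shown here are not available, so the following is only a minimal sketch of the sequential ordering being asked for, not the driver's actual code. makeFakeK8s is a hypothetical in-memory stand-in for the k8s client that enforces the dependencies described above; createDeployment, createService and createIngress are illustrative names, not real client methods.

```javascript
// Hypothetical in-memory stand-in for a k8s API client. It only records what
// has been created and rejects a resource whose prerequisite is missing.
function makeFakeK8s () {
    const created = []
    const create = (kind, requires) => async (name) => {
        // Yield to the event loop, as a network round trip would
        await new Promise((resolve) => setTimeout(resolve, 0))
        if (requires && !created.includes(`${requires}/${name}`)) {
            throw new Error(`${requires} "${name}" not found`)
        }
        created.push(`${kind}/${name}`)
    }
    return {
        created,
        createDeployment: create('deployment', null),
        createService: create('service', 'deployment'),
        createIngress: create('ingress', 'service')
    }
}

// Provision in dependency order: each step starts only after the previous
// one has resolved, instead of all three being started at once.
async function provisionInOrder (k8s, name) {
    await k8s.createDeployment(name)
    await k8s.createService(name)
    await k8s.createIngress(name)
}
```

Awaiting each call in turn trades a little latency for a guaranteed order, which is the property a list of concurrent promises cannot provide.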
In the "suspend" project handler, the operations that deprovision the service and ingress are not guaranteed to have completed by the time the handler returns (yet the project's state is moved to "suspended", which unlocks subsequent state-change actions on it by clients).
This problem was introduced by this commit, which aims to deprovision the networking resources associated with the project when the project is suspended.
k8s operations that provision, deprovision, or modify a resource (e.g. 'delete service', 'delete ingress') are only "promises" to the client of this API: they just place the request on the queue of requests for the relevant k8s service. Thus, to guarantee that an operation has completed, the corresponding resources should be polled to ensure they have reached the desired state.
This polling is already done in the "suspend" handler for deployments, but not for service and ingress resources.
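The polling described above could look roughly like the sketch below. waitUntilGone is a hypothetical helper, not part of the driver; the exists callback, the interval and the timeout values are all assumptions chosen for illustration.

```javascript
// Hypothetical helper: poll `exists()` until it reports false, or give up
// once `timeoutMs` has elapsed. `exists` is an async function supplied by the caller.
async function waitUntilGone (exists, { intervalMs = 500, timeoutMs = 30000 } = {}) {
    const deadline = Date.now() + timeoutMs
    while (await exists()) {
        if (Date.now() >= deadline) {
            throw new Error('Timed out waiting for resource deletion')
        }
        // Sleep before asking again, to avoid hammering the API
        await new Promise((resolve) => setTimeout(resolve, intervalMs))
    }
}
```

In a suspend handler, exists would wrap a read of the service or ingress and report false once the API answers "Not found"; the project state would be moved to "suspended" only after both waits have resolved.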
Expected Behavior
Steps To Reproduce
This is especially simple to reproduce when you try to change the stack for the project. As of recently, changing the stack automatically suspends and then starts the project.
The above steps lead, 100% of the time, to the Project being unreachable via ingress after an attempt to change the stack.
To recover, one has to suspend the instance, wait for a bit and then start it again.
Environment
k8s setup