
[BUG] nifikop fails to scale down nifi cluster due to a crash in the middle of reconcileNifiPod() #79

Open
srteam2020 opened this issue Mar 30, 2022 · 2 comments
Labels
bug · community · priority:1

Comments

@srteam2020

Bug Report

We find that nifikop will never be able to scale down the nificluster successfully if it crashes in the middle of reconcileNifiPod() and later restarts.

More concretely, inside reconcileNifiPod(), nifikop does the following (see the sketch below):

  1. check whether the desired pod exists. If it does not, it does the following
  2. create the pod
  3. set status.nodesState[nodeId].configurationState of the nificluster cr to ConfigInSync
  4. set status.nodesState[nodeId].gracefulActionState.actionState of the nificluster cr to GracefulUpscaleSucceeded

If nifikop crashes between steps 3 and 4 and later restarts, the cluster is left in an intermediate state: the nifi pod has been created (with ConfigInSync), but the corresponding actionState is never set. Note that since the pod already exists, nifikop will not run steps 2, 3, and 4 again.
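The following is a minimal Go sketch of that ordering and of where the crash window sits. The types (NodeState, GracefulActionState) and helpers (createPod, updateClusterStatus, reconcileNifiPodSketch) are hypothetical stand-ins, not the actual nifikop code:

```go
package sketch

// Hypothetical types mirroring the relevant parts of the NifiCluster status;
// they are illustrative only, not the actual nifikop definitions.
type GracefulActionState struct {
	ActionState string // e.g. "GracefulUpscaleSucceeded"
}

type NodeState struct {
	ConfigurationState  string // e.g. "ConfigInSync"
	GracefulActionState GracefulActionState
}

// Stubs standing in for the real pod creation and CR status update calls.
var (
	createPod           = func(nodeId string) error { return nil }
	updateClusterStatus = func(nodesState map[string]NodeState) error { return nil }
)

// reconcileNifiPodSketch illustrates the update order described in the issue.
func reconcileNifiPodSketch(nodeId string, nodesState map[string]NodeState, podExists bool) error {
	// Step 1: if the desired pod already exists, none of the steps below run again.
	if podExists {
		return nil
	}

	// Step 2: create the pod.
	if err := createPod(nodeId); err != nil {
		return err
	}

	// Step 3: persist ConfigInSync for this node.
	state := nodesState[nodeId]
	state.ConfigurationState = "ConfigInSync"
	nodesState[nodeId] = state
	if err := updateClusterStatus(nodesState); err != nil {
		return err
	}

	// A crash here leaves the pod created and ConfigInSync persisted,
	// while the graceful action state below is never written.

	// Step 4: persist GracefulUpscaleSucceeded for this node.
	state = nodesState[nodeId]
	state.GracefulActionState.ActionState = "GracefulUpscaleSucceeded"
	nodesState[nodeId] = state
	return updateClusterStatus(nodesState)
}
```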

Later, if the user wants to scale down the nificluster, this pod is supposed to be offloaded and deleted gracefully. Inside reconcileNifiPodDelete, nifikop checks whether the pod's actionState is GracefulUpscaleSucceeded or GracefulUpscaleRequired; if so, it adds the pod to nodesPendingGracefulDownscale and later offloads and deletes the nifi node (pod). However, because the actionState was never set due to the earlier crash, the graceful downscale will never happen.
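For reference, here is a hedged sketch of that downscale gate, reusing the hypothetical NodeState type from the sketch above (collectNodesPendingGracefulDownscale is an illustrative name, not the actual nifikop function):

```go
// collectNodesPendingGracefulDownscale mirrors the check in reconcileNifiPodDelete:
// only nodes whose action state is GracefulUpscaleSucceeded or GracefulUpscaleRequired
// are queued for graceful offload and deletion. A node whose action state was never
// written (because of the earlier crash) never passes this check.
func collectNodesPendingGracefulDownscale(nodesState map[string]NodeState, nodeIds []string) []string {
	var pending []string
	for _, id := range nodeIds {
		action := nodesState[id].GracefulActionState.ActionState
		if action == "GracefulUpscaleSucceeded" || action == "GracefulUpscaleRequired" {
			pending = append(pending, id)
		}
	}
	return pending
}
```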

What did you do?
Scale down a nificluster from 2 nodes to 1 node.

What did you expect to see?
The second nifi pod should be deleted successfully.

What did you see instead? Under which circumstances?
The second nifi pod never gets deleted.

Environment

  • go version: go1.13.9 linux/amd64
  • Kubernetes version information: v1.18.9

Possible Solution
One potential solution is to switch the order of 3 (set configurationState to ConfigInSync) and 4 (set actionState to GracefulUpscaleSucceeded). If nifikop then crashes before ConfigInSync is set, reconcileNifiPod() will later delete and recreate the pod.
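A minimal sketch of that reordering, again using the hypothetical names from the earlier sketches rather than the real nifikop code:

```go
// reconcileNifiPodReordered writes the graceful action state before the
// configuration state. A crash between the two updates now leaves the node
// without ConfigInSync, so a later reconcile deletes and recreates the pod
// and re-runs every step, instead of getting stuck in the intermediate state.
func reconcileNifiPodReordered(nodeId string, nodesState map[string]NodeState, podExists bool) error {
	if podExists {
		return nil
	}
	if err := createPod(nodeId); err != nil {
		return err
	}

	// Former step 4, now first: persist GracefulUpscaleSucceeded.
	state := nodesState[nodeId]
	state.GracefulActionState.ActionState = "GracefulUpscaleSucceeded"
	nodesState[nodeId] = state
	if err := updateClusterStatus(nodesState); err != nil {
		return err
	}

	// A crash here leaves the configuration state unset (not ConfigInSync),
	// so the pod is eligible to be deleted and recreated on the next reconcile.

	// Former step 3, now last: persist ConfigInSync.
	state = nodesState[nodeId]
	state.ConfigurationState = "ConfigInSync"
	nodesState[nodeId] = state
	return updateClusterStatus(nodesState)
}
```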

Additional context
We are willing to help fix the bug.
This bug was found automatically by our tool Sieve: https://github.com/sieve-project/sieve

@juldrixx
Contributor

Duplicate of #49.

@srteam2020
Author

Hi @juldrixx, thanks for the reply.

This issue is somewhat similar to #49, as both are triggered by a crash at a particular point. However, we believe they are different issues and should be handled in different ways.

First, the triggering conditions are different. #49 happens when a crash occurs between (1) updating the config object and (2) setting ConfigOutOfSync in Reconcile() in resource.go, while this issue is triggered when a crash occurs between (1) setting ConfigInSync and (2) setting GracefulUpscaleSucceeded in reconcileNifiPod() in nifi.go.

Second, the consequences are different. For #49, once the issue is triggered, nifikop cannot successfully restart the pod to load the new configuration. For this issue, once triggered, nifikop cannot scale down the nifi cluster.

Regarding the fix, at a high level both could be addressed by carefully changing the order of certain status updates, but the concrete fixes would differ because the triggering conditions differ.
