Liveness/readiness probe killing community pod before it has a chance to download package #344

martinpovolny · 2020-09-21T08:29:17Z

Hello!

I have encountered a problem with the Operators Marketplace in CRC. However, it's not CRC specific.

When the marketplace container "community-operators-XXX" is deployed, it starts by downloading a list of packages. This is over 50MB of data downloaded in small chunks and it takes some time to download even on a decent internet connection.

While this is happening the readiness and/or liveness probes report failure.

What happens in my case is that the pods get repeatedly killed before they have collected the necessary data and can become ready. This happens several times before OpenShift gives up and the deployment stays failed.

This is a very well hidden problem because the Operator (list of operators) still shows items. There's no indication of the problem except for the failed Pods in the operator-marketplace project and a lower-than-expected number of items in the .....

$ oc get packagemanifests -n openshift-marketplace | wc -l
124

vs

$ oc get packagemanifests -n openshift-marketplace | wc -l
250

In my case, I solved the problem by increasing the initialDelaySeconds to 300 and the failureThreshold to 100. That gave the container enough time to download the data before it would get killed and redeployed.
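As a rough sanity check on those numbers, the time a container gets before repeated probe failures cause a restart can be approximated as initialDelaySeconds + periodSeconds * failureThreshold. The sketch below is only an estimate: `probe_budget` is a hypothetical helper name, the exact kubelet timing differs slightly, and it assumes periodSeconds is left at the Kubernetes default of 10 (the actual values used by the marketplace deployment are not stated in this issue).

```shell
# Approximate number of seconds a container has before repeated probe
# failures reach failureThreshold. This is an estimate, not exact kubelet timing.
# $1 = initialDelaySeconds, $2 = periodSeconds, $3 = failureThreshold
probe_budget() {
  initial_delay=$1
  period=$2
  failure_threshold=$3
  echo $(( initial_delay + period * failure_threshold ))
}

# Kubernetes defaults: initialDelaySeconds=0, periodSeconds=10, failureThreshold=3
probe_budget 0 10 3      # prints 30

# Values from this issue, assuming periodSeconds stays at 10
probe_budget 300 10 100  # prints 1300
```

With the defaults, the budget is about 30 seconds, which is far too short for a download of over 50MB on a slow connection.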

However, I am not sure what the correct place to make that change is.

I am surely not the only person having this issue. Especially with CRC, people might be testing OpenShift on not-that-great internet connectivity. The issue is well hidden. In the console (web interface), there's no indication of a problem, just that operators are missing from the list. I noticed only because I was missing a particular operator that I wanted to work with. Also, the wording in the CRC docs suggests that some things might be degraded or reporting issues due to memory limitations, so it's easy to miss the problem of a failed deployment.

A temporary solution to the problem could be increasing the initialDelaySeconds etc. in the right place. Fixing it properly might involve reimplementing the initialization with the "Init Container" pattern, or changing the readiness/liveness probes or something else?

Thanks and regards!

martinpovolny · 2020-11-18T14:32:17Z

Any chance of someone taking a look at this? Is more information needed?

martinpovolny · 2020-11-18T17:39:12Z

Something like this fixes the problem for me:

oc patch deployment/community-operators -n openshift-marketplace --patch '{"spec": {"template": {"spec": {"containers": [{"name": "community-operators", "readinessProbe": {"initialDelaySeconds": 300}, "livenessProbe": {"initialDelaySeconds": 300}}]}}}}'
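The one-liner above only raises initialDelaySeconds. A minimal sketch of a variant that also sets failureThreshold (as in the original workaround of 300 and 100) could look like the following. `probe_patch` is a hypothetical helper that only prints the patch JSON so it can be reviewed first; whether the marketplace operator later reverts a manual patch is not confirmed anywhere in this thread.

```shell
# Build a strategic merge patch that raises the probe settings of one container.
# $1 = container name, $2 = initialDelaySeconds, $3 = failureThreshold
probe_patch() {
  printf '{"spec":{"template":{"spec":{"containers":[{"name":"%s","readinessProbe":{"initialDelaySeconds":%d,"failureThreshold":%d},"livenessProbe":{"initialDelaySeconds":%d,"failureThreshold":%d}}]}}}}\n' \
    "$1" "$2" "$3" "$2" "$3"
}

# Print the patch for review
probe_patch community-operators 300 100

# Then apply it (requires a logged-in oc session):
# oc patch deployment/community-operators -n openshift-marketplace \
#   --patch "$(probe_patch community-operators 300 100)"
```

Keeping the JSON generation separate from the `oc patch` call makes it easy to reuse the same values for other catalog deployments by changing only the name.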

But I think it's a design flaw to count on the list of operators being downloaded in under 30s.

martinpovolny · 2020-11-18T18:15:34Z

Fixed in 4.6.

Related BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1873546

martinpovolny mentioned this issue Sep 21, 2020

add ArgoCD application of deploying the ODH operator to crc fork operate-first/continuous-deployment#27

Closed

martinpovolny closed this as completed Nov 18, 2020
