
Liveness/readiness probe killing community pod before it has a chance to download package #344

Closed
martinpovolny opened this issue Sep 21, 2020 · 3 comments


@martinpovolny

Hello!

I have encountered a problem with the Operators Marketplace in CRC. However, it's not CRC specific.

When the marketplace container "community-operators-XXX" is deployed, it starts by downloading a list of packages. This is over 50MB of data downloaded in small chunks and it takes some time to download even on a decent internet connection.

While this is happening the readiness and/or liveness probes report failure.

What happens in my case is that the pods get killed repeatedly before they have collected the necessary data and can become ready. This happens several times before OpenShift gives up and the deployment stays failed.

This is a very well hidden problem because the Operator (list of operators) still shows items. There's no indication of the problem except for the failed Pods in the operator-marketplace project and a lower than expected number of items in the .....

$ oc get packagemanifests -n openshift-marketplace | wc -l
124

vs

$ oc get packagemanifests -n openshift-marketplace | wc -l
250
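Note that `wc -l` also counts the header row that `oc get` prints. A minimal sketch of a count that skips it, using canned input in place of a live cluster; `count_rows` is a hypothetical helper name, not part of `oc`:

```shell
# Hypothetical helper: count data rows on stdin, skipping the header line
# that `oc get` prints by default.
count_rows() {
  tail -n +2 | wc -l | tr -d ' '
}

# Canned input standing in for `oc get packagemanifests` output.
printf 'NAME CATALOG AGE\nfoo Community 1d\nbar Community 1d\n' | count_rows

# Against a live cluster (not run here):
#   oc get packagemanifests -n openshift-marketplace | count_rows
#   oc get pods -n openshift-marketplace   # the RESTARTS column shows the kill cycle
```

Checking the RESTARTS column of the marketplace pods is probably the quickest way to see the kill cycle itself rather than its side effect on the package list.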

In my case, I solved the problem by increasing the initialDelaySeconds to 300 and the failureThreshold to 100. That gave the container enough time to download the data before it would get killed and redeployed.

However, I am not sure what the correct place to do that is.
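One possible place, as a sketch only, is a patch applied directly to the Deployment. This assumes the Deployment and its container are both named `community-operators`; `probe_patch` is a hypothetical helper that just prints the patch body. Whether the marketplace operator later overwrites such a manual change is not established here.

```shell
# Hypothetical helper: print a patch body that sets initialDelaySeconds and
# failureThreshold on both probes of the (assumed) community-operators container.
# $1 = initialDelaySeconds, $2 = failureThreshold
probe_patch() {
  printf '{"spec":{"template":{"spec":{"containers":[{"name":"community-operators","readinessProbe":{"initialDelaySeconds":%d,"failureThreshold":%d},"livenessProbe":{"initialDelaySeconds":%d,"failureThreshold":%d}}]}}}}\n' "$1" "$2" "$1" "$2"
}

# Print the patch for the values mentioned above.
probe_patch 300 100

# Against a live cluster (not run here):
#   oc patch deployment/community-operators -n openshift-marketplace \
#     --patch "$(probe_patch 300 100)"
```

Generating the patch from two parameters makes it easy to try smaller values once the download time on a given connection is known.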

I think that I am surely not the only person having this issue. Especially with CRC people might be testing OpenShift on not-that-great internet connectivity. The issue is well hidden. In the console (web interface), there's no indication of a problem, just that operators are missing in the list. I have noticed only because I was missing a particular operator that I wanted to work with. Also, the wording in CRC docs suggests that some things might be degraded or reporting issues due to memory limitations so it's easy to miss the problem of a failed deployment.

A temporary solution to the problem could be increasing the initialDelaySeconds etc. in the right place. Fixing it properly might involve reimplementing the initialization with the "Init Container" pattern, or changing the readiness/liveness probes or something else?

Thanks and regards!

@martinpovolny
Author

Any chance of someone taking a look at this? Is more information needed?

[Screenshot: kill-cycle-2020-11-18_15-31]

@martinpovolny
Author

Something like this fixes the problem for me:

oc patch deployment/community-operators -n openshift-marketplace --patch '{"spec": {"template": {"spec": {"containers": [{"name": "community-operators", "readinessProbe": {"initialDelaySeconds": 300}, "livenessProbe": {"initialDelaySeconds": 300}}]}}}}'

But I think it's a design flaw to count on the list of operators being downloaded in under 30s.
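For a rough feel of how much time the probe settings allow: the first probe runs about `initialDelaySeconds` after the container starts, and the container is restarted after `failureThreshold` consecutive liveness failures spaced `periodSeconds` apart. The sketch below is an estimate only, ignoring probe timeouts and kubelet jitter; `kill_deadline` is a hypothetical helper, and `periodSeconds=10` is an assumption (the Kubernetes default), not a value taken from the marketplace deployment.

```shell
# Rough time in seconds until a container whose liveness probe never succeeds
# gets restarted. Ignores probe timeouts and kubelet jitter.
# $1 = initialDelaySeconds, $2 = failureThreshold, $3 = periodSeconds
kill_deadline() {
  echo $(( $1 + ($2 - 1) * $3 ))
}

# Values from this thread; periodSeconds=10 is assumed.
kill_deadline 300 100 10
```

With those numbers the container gets on the order of twenty minutes to finish downloading, which is why the raised values were enough in my case.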

@martinpovolny
Author

Fixed in 4.6.

Related BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1873546
