Autoscaler changes broke deployment in non-default namespace #291
I tried setting the namespace to something other than `default`, and then it times out.
I found the problem. My autoscaler change introduced a new service account that the pods use to push metrics to the autoscaler, but the service account is created in the `default` namespace, so pods in other namespaces can't use it.

I wanted to get rid of that service account and use DNS anyway (https://github.com/elafros/elafros/blob/d7992e754237afae6468b41bb8ed304760f79108/cmd/ela-queue/main.go#L105), so I'll just do that.
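For illustration, here is a minimal Go sketch of what the DNS approach could look like: the queue dials the autoscaler's Service by its cluster DNS name instead of authenticating through a per-namespace service account. The service name, namespace, port, and websocket library here are assumptions for the sketch, not the actual elafros values:

```go
package main

import (
	"fmt"
	"log"

	"github.com/gorilla/websocket"
)

const (
	autoscalerService   = "ela-autoscaler" // assumed Service name
	autoscalerNamespace = "ela-system"     // assumed namespace
	autoscalerPort      = 8080             // assumed port
)

func main() {
	// Every Kubernetes Service gets a stable DNS name of the form
	// <service>.<namespace>.svc.cluster.local, resolvable from any
	// namespace, so no service account is needed just to reach it.
	addr := fmt.Sprintf("ws://%s.%s.svc.cluster.local:%d",
		autoscalerService, autoscalerNamespace, autoscalerPort)

	conn, _, err := websocket.DefaultDialer.Dial(addr, nil)
	if err != nil {
		log.Fatalf("failed to connect to autoscaler: %v", err)
	}
	defer conn.Close()
	// ... push concurrency metrics over conn ...
}
```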
It looks like things might still be a bit broken: when I try to run the tests now, they time out while waiting for the Configuration to be updated with the new revision after modifying the Configuration.

Sometimes the old behavior still happens. I also tried changing the namespace to a different value.
Okay, I think I have a better understanding of what's going on here.

The autoscaler change (#229) was broken in non-default namespaces because it created the queue's service account in the `default` namespace, and a queue running in any other namespace couldn't use it.

Additionally, the autoscaler change increased the Revision Pod's CPU request from 0.025 to 1.000. That increase was necessary for autoscaling, and for the scheduler to exert pressure on the cluster autoscaler through unschedulable pods. However, the default developer setup is three 1-core nodes, and each node has an overhead of 0.03 CPU for k8s system resources, so Revisions simply didn't fit anymore; the arithmetic is worked through after the results below. #335 increases the default node to a 4-core machine, which solves this problem.

Finally, initial pull latency on a new cluster (empty image cache) is very large, large enough to cause the conformance tests to fail even before the autoscaling change. Sometimes (it seems non-deterministic) the conformance tests will fail the first time on a new cluster. To solve this, I recommend either 1) deploying a Revision to warm the cache as part of the conformance tests, or 2) increasing the conformance test timeouts. I prefer warming the cache because I don't like slow tests.

Here is the data I based my analysis on. There were two failure modes I observed by running the conformance tests.
I tested two commits: the one before my autoscaler change and the one after it. Results:
Note: On 1-core clusters, failures were at the "Revision will be updated..." step, after deploying the first revision. On 2-core clusters, failures were at the "Configuration will be updated..." step, after deploying the second revision.
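To make the scheduling arithmetic from the analysis above concrete, here is a tiny self-contained check using those numbers (treating the 0.03 overhead as CPU reserved per node; a sketch of the fit check, not the scheduler's actual code):

```go
package main

import "fmt"

func main() {
	// Numbers from the analysis above: 1-core nodes, ~0.03 CPU
	// reserved on each node for k8s system components.
	nodeCPU := 1.0
	systemOverhead := 0.03
	allocatable := nodeCPU - systemOverhead // 0.97 cores schedulable

	oldRequest := 0.025 // previous Revision Pod CPU request
	newRequest := 1.0   // request after the autoscaler change

	fmt.Printf("old request fits: %v\n", oldRequest <= allocatable) // true
	fmt.Printf("new request fits: %v\n", newRequest <= allocatable) // false: pod stays Pending
}
```

With the 4-core nodes from #335, allocatable CPU is roughly 3.97, so the 1.000 request fits again.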
I've been trying to hack in deploying a Revision before the conformance tests to warm the node cache, but it fails in all kinds of interesting and unexpected ways. I think it would be cleaner to just install a warm image in the test setup (#286) to get a jump on the pull latency.
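One common way to bake a warm image onto every node is a pre-pull DaemonSet. Here is a rough client-go sketch of that idea; the image name and namespace are placeholders, and the `Create` signature assumes a recent client-go:

```go
package main

import (
	"context"
	"log"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig the same way kubectl does (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	labels := map[string]string{"app": "image-prepull"}
	ds := &appsv1.DaemonSet{
		ObjectMeta: metav1.ObjectMeta{Name: "image-prepull", Namespace: "default"},
		Spec: appsv1.DaemonSetSpec{
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:    "prepull",
						Image:   "gcr.io/example/helloworld:latest", // placeholder test image
						Command: []string{"sleep", "infinity"},      // hold the pod so the image stays cached
					}},
				},
			},
		},
	}

	// A DaemonSet schedules one pod per node, so every node pulls the
	// image once before the conformance tests start.
	if _, err := client.AppsV1().DaemonSets("default").Create(
		context.Background(), ds, metav1.CreateOptions{}); err != nil {
		log.Fatal(err)
	}
}
```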
@grantr confirms that bumping the timeout and using larger clusters make the conformance tests pass consistently.
Discovered via a bisect: the autoscaler changes broke deployment for Routes and Configurations unless they are deployed to the namespace `default`.

I discovered this by running the conformance tests, which currently time out at "The Revision will be updated when it is ready to serve traffic". The logs say something like the following repeatedly:

You can reproduce this with the helloworld sample by:

1. `kubectl create namespace spacedogs`
2. Changing the sample's namespace to `spacedogs`
3. `bazel run sample/helloworld:everything.create`
4. `kubectl -n ela-system logs -f $(kubectl -n ela-system get pods -l app=ela-controller -o name)`

Then you will see the following error in the logs:

I think one of the main causes is that the `ela-autoscaler` and `ela-revision` service accounts exist in the namespace `default`. When I remove them from `ela_pod.go` and `ela_autoscaler.go`, deployments go further; however, requests to an updated Configuration end up failing repeatedly with 503s. I have also tried changing their namespace to `ela-system`, but I suspect they need to be created in the same namespace as the Route and Revision, so the solution might be creating that service account dynamically.
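For what "creating that service account dynamically" could look like, here is a hedged client-go sketch that ensures the account exists in the same namespace as the Route and Revision before the pods need it. The `ela-revision` name comes from this issue; the reconcile wiring and everything else are assumptions:

```go
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// ensureServiceAccount creates the Revision's service account in the
// Revision's own namespace if it does not already exist, so pods in
// any namespace (not just `default`) can reference it.
func ensureServiceAccount(ctx context.Context, client kubernetes.Interface, namespace string) error {
	sa := &corev1.ServiceAccount{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "ela-revision",
			Namespace: namespace,
		},
	}
	_, err := client.CoreV1().ServiceAccounts(namespace).Create(ctx, sa, metav1.CreateOptions{})
	if apierrors.IsAlreadyExists(err) {
		return nil // an earlier reconcile already created it
	}
	return err
}

func main() {
	// Inside the controller this would use the in-cluster config.
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// E.g. called from the reconcile loop with the Revision's
	// namespace ("spacedogs" in the repro above).
	if err := ensureServiceAccount(context.Background(), client, "spacedogs"); err != nil {
		log.Fatal(err)
	}
}
```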