Autoscaler changes broke deployment in non-default namespace #291

bobcatfish · 2018-03-06T02:30:43Z

Discovered via a bisect, the autoscaler changes broke deployment for Routes and Configuration unless they are deployed to the namespace default.

I discovered this by running the conformance tests, which currently timeout at The Revision will be updated when it is ready to serve traffic. The logs say something like the following repeatedly:

I0306 00:58:21.769098       1 controller.go:395] Figuring out routes for Route: route-example
I0306 00:58:21.770977       1 controller.go:425] Failed to fetch Revision:  : resource name may not be empty
I0306 00:58:21.770993       1 controller.go:400] Failed to get a route for target {Name: Revision: Configuration:configuration-example Percent:100} : "resource name may not be empty"
I0306 00:58:21.771015       1 controller.go:447] Failed to get routes for route-example : "resource name may not be empty"
I0306 00:58:21.771022       1 controller.go:329] Failed to create Routes: resource name may not be empty

You can reproduce this with the helloworld sample by:

Create a brand new namespace such as kubectl create namespace spacedogs
Update the hello world example's configuration.yaml and route.yaml to use the namespace you created (spacedogs)
Run bazel run sample/helloworld:everything.create
Look at the logs with kubectl -n ela-system logs -f $(kubectl -n ela-system get pods -l app=ela-controller -o name)

Then you will see the following error in the logs:

I0306 02:29:51.856424       1 controller.go:395] Figuring out routes for Route: route-example
I0306 02:29:51.857035       1 controller.go:404] Revision "p-e09d9cb4-bfd1-49f3-98cd-5999acdfb5d9" is not ready
I0306 02:29:51.862241       1 controller.go:424] Failed to fetch Revision:  : resource name may not be empty
I0306 02:29:51.862533       1 controller.go:400] Failed to get a route for target {Name: Revision: Configuration:configuration-example Percent:100} : "resource name may not be empty"
I0306 02:29:51.862900       1 controller.go:446] Failed to get routes for route-example : "resource name may not be empty"
I0306 02:29:51.863126       1 controller.go:329] Failed to create Routes: resource name may not be empty
E0306 02:29:51.863353       1 controller.go:251] error syncing "spacedogs/route-example": resource name may not be empty

I think one of the main causes is that the ela-autoscaler and ela-revision service accounts exist in the namespace default. When I remove them from ela_pod.go and ela_autoscaler.go, deployments go further however requests to an updated Configuration end up failing repeatedly with 503's. I have also tried changing their namespace to ela-system but I suspect they need to be created in the same namespace as the Route and Revision, so the solution might be creating that service account dynamically.

The text was updated successfully, but these errors were encountered:

bobcatfish · 2018-03-07T18:44:50Z

I tried setting the namespace to default in the conformance test just to get it to pass, but it fails in the last phase (making sure it can hit the new revision):

Route Deploying an app with a pre-built container 
  Creates a route, serves traffic to it, and serves traffic to subsequent revisions
  /usr/local/google/home/christiewilson/Code/go/src/github.com/elafros/elafros/test/conformance/route_test.go:197
STEP: Creating a new Route
STEP: Creating Configuration for the Route using the first image
STEP: The Configuration will be updated with the Revision after it is created
STEP: The Revision will be updated when it is ready to serve traffic
STEP: The Configuration will be updated when the Revision is ready to serve traffic
STEP: Once the Configuration has been updated with the Revision, the Route will be updated to route traffic to the Revision
STEP: Wait for the ingress loadbalancer address to be set
STEP: Make a request to the Revision that is now deployed and serving traffic
STEP: Patch the Configuration to use a new image
STEP: A new Revision will be made and the Configuration will be updated with it
STEP: The new Revision will be updated when it is ready to serve traffic
STEP: The Configuration will be updated to indicate the new revision is ready
STEP: The Route will then immediately send all traffic to the new revision
STEP: Wait for the ingress to actually start serving traffic from the newly deployed Revision
Connected to previous revision, retryingRetrying for code 503
Retrying for code 503
Retrying for code 503
Retrying for code 503
Retrying for code 503
Retrying for code 503
Retrying for code 503
Retrying for code 503
Retrying for code 503
...

And then it times out.

josephburnett · 2018-03-07T21:21:21Z

I found the problem. My autoscaler change introduced a new service account for the pods to use to push metrics to the autoscaler. But the service account is created in the default namespace, so pods in other namespaces can't use it.

2018/03/07 21:14:32 Failed to create pod: pods "p-3b116aaa-879d-4756-8f82-3ee7f751b7bf-warmpod-d7a9f3b9-8c3a-4800-912c-f651949a50e0" is forbidden: error looking up service account joe/ela-revision: serviceaccount "ela-revision" not found

I wanted to get rid of that service account and use DNS anyway (https://github.com/elafros/elafros/blob/d7992e754237afae6468b41bb8ed304760f79108/cmd/ela-queue/main.go#L105). So I'll just do that.

bobcatfish · 2018-03-08T00:19:55Z

It looks like things might still be a bit broken - when I try to run the tests now, they timeout when waiting for the Configuration to be updated with the new revision after modifying the Configuration:

Route Deploying an app with a pre-built container 
  Creates a route, serves traffic to it, and serves traffic to subsequent revisions
  /usr/local/google/home/christiewilson/Code/go/src/github.com/elafros/elafros/test/conformance/route_test.go:197
STEP: Creating a new Route
STEP: Creating Configuration for the Route using the first image
STEP: The Configuration will be updated with the Revision after it is created
STEP: The Revision will be updated when it is ready to serve traffic
STEP: The Configuration will be updated when the Revision is ready to serve traffic
STEP: Once the Configuration has been updated with the Revision, the Route will be updated to route traffic to the Revision
STEP: Wait for the ingress loadbalancer address to be set
STEP: Make a request to the Revision that is now deployed and serving traffic
STEP: Patch the Configuration to use a new image
STEP: A new Revision will be made and the Configuration will be updated with it
STEP: The new Revision will be updated when it is ready to serve traffic
STEP: The Configuration will be updated to indicate the new revision is ready

• Failure [89.510 seconds]
Route
/usr/local/google/home/christiewilson/Code/go/src/github.com/elafros/elafros/test/conformance/route_test.go:152
  Deploying an app with a pre-built container
  /usr/local/google/home/christiewilson/Code/go/src/github.com/elafros/elafros/test/conformance/route_test.go:196
    Creates a route, serves traffic to it, and serves traffic to subsequent revisions [It]
    /usr/local/google/home/christiewilson/Code/go/src/github.com/elafros/elafros/test/conformance/route_test.go:197

    Expected error:
        <*errors.errorString | 0xc4204ce410>: {
            s: "timed out waiting for the condition",
        }
        timed out waiting for the condition
    not to have occurred

    /usr/local/google/home/christiewilson/Code/go/src/github.com/elafros/elafros/test/conformance/crd_polling.go:64

Sometimes the old behavior still happens:

Route Deploying an app with a pre-built container 
  Creates a route, serves traffic to it, and serves traffic to subsequent revisions
  /usr/local/google/home/christiewilson/Code/go/src/github.com/elafros/elafros/test/conformance/route_test.go:197
STEP: Creating a new Route
STEP: Creating Configuration for the Route using the first image
STEP: The Configuration will be updated with the Revision after it is created
STEP: The Revision will be updated when it is ready to serve traffic

• Failure [60.549 seconds]
Route
/usr/local/google/home/christiewilson/Code/go/src/github.com/elafros/elafros/test/conformance/route_test.go:152
  Deploying an app with a pre-built container
  /usr/local/google/home/christiewilson/Code/go/src/github.com/elafros/elafros/test/conformance/route_test.go:196
    Creates a route, serves traffic to it, and serves traffic to subsequent revisions [It]
    /usr/local/google/home/christiewilson/Code/go/src/github.com/elafros/elafros/test/conformance/route_test.go:197

    Expected error:
        <*errors.errorString | 0xc4203edd70>: {
            s: "timed out waiting for the condition",
        }
        timed out waiting for the condition
    not to have occurred

I tried changing the namespace to default at @grantr 's suggestion, which put me back in the state where I get endless 503's after updating an existing Configuration:

➜  elafros git:(master) ✗ go test -v ./test/conformance -ginkgo.v                
=== RUN   TestConformance
Running Suite: Conformance Suite
================================
Random Seed: 1520467880
Will run 1 of 1 specs

Route Deploying an app with a pre-built container 
  Creates a route, serves traffic to it, and serves traffic to subsequent revisions
  /usr/local/google/home/christiewilson/Code/go/src/github.com/elafros/elafros/test/conformance/route_test.go:197
STEP: Creating a new Route
STEP: Creating Configuration for the Route using the first image
STEP: The Configuration will be updated with the Revision after it is created
STEP: The Revision will be updated when it is ready to serve traffic
STEP: The Configuration will be updated when the Revision is ready to serve traffic
STEP: Once the Configuration has been updated with the Revision, the Route will be updated to route traffic to the Revision
STEP: Wait for the ingress loadbalancer address to be set
STEP: Make a request to the Revision that is now deployed and serving traffic
STEP: Patch the Configuration to use a new image
STEP: A new Revision will be made and the Configuration will be updated with it
STEP: The new Revision will be updated when it is ready to serve traffic
STEP: The Configuration will be updated to indicate the new revision is ready
STEP: The Route will then immediately send all traffic to the new revision
STEP: Wait for the ingress to actually start serving traffic from the newly deployed Revision
Connected to previous revision, retryingRetrying for code 503
Retrying for code 503

josephburnett · 2018-03-09T18:46:52Z

Okay, I think I have a better understanding about what's going on here.

The autoscaler change (#229) was broken in non-default namespaces because it created the queue's service account in the default namespace, and the queue in namespace pizzaplanet couldn't use it. That was fixed by finding the autoscaler via DNS (#309).

Additionally, the autoscaler increased the Revision Pod's CPU request from 0.025 to 1.000, which was necessary for autoscaling and for the scheduler to exert pressure on the cluster autoscaler by having unscheduled nodes. However the default developer setup is 3, 1-core nodes and each node has an overhead of 0.03 for k8s resources. Revisions simply didn't fit anymore. #335 increases the default node to a 4-core machine which solves this problem.

Finally, initial pull latency on a new cluster (empty cache) is very large. Large enough to cause the conformance tests to fail even before the autoscaling change. Sometimes (seems non-deterministic) the conformance tests will fail the first time on a new cluster.

To solve this problem, I recommend either 1) deploying a Revision to warm the cache as part of the conformance tests or 2) increasing the conformance test timeouts. I prefer warming the cache because I don't like slow tests.

Here is the data I based my analysis on:

There were two failure modes I observed by running watch kubectl get all -n pizzaplanet while running the conformance tests.

ContainerCreating until the test timed out--this was initial pull latency. The Pod was scheduled but it hasn't started yet.
Pending. The Pod was unscheduled. Didn't fit. Never would.

I tested two commits, the one before my autoscaler change (tagged lastknowngood and my autoscaler change with the DNS change (#309) cherry picked (tagged dnsfix). I also tested three cluster configurations--3 nodes with 1, 2 and 4 cores each. In each case, I created a brand new cluster, deployed Elafros, and then ran conformance tests right away. In cases where it failed on ContainerCreating, I retested with a new cluster but slept for 120 seconds after deploying Elafros.

Results:

lastknowngood with 1 core: FAILED with ContainerCreating
lastknowngood with 2 cores: PASSED after 108 seconds (second, warm run was 45 seconds)
lastknowngood with 2 cores, sleep 120: PASSED after 95 seconds
dnsfix with 1 core: FAILED with Pending
dnsfix with 2 cores: FAILED with ContainerCreating
dnsfix with 2 cores, sleep 120: FAILED with ContainerCreating once and Pending a second time
dnsfix with 4 cores: PASSED after 95 seconds

Note: On 1 core clusters, failures were at the "Revision will be updated..." step, after deploying the first revision. On 2 core clusters, failures were at the "Configuration will be updated..." step, after deploying the second revision.

josephburnett · 2018-03-09T19:58:42Z

I've been trying to hack in deploying a Revision before the conformance tests to warm the node cache, but it fails in all kinds of interesting and unexpected ways. I think it would be more clean to just install a warm image in the test setup (#286) to get a jump on the pull latency.

josephburnett · 2018-03-12T21:55:52Z

@grantr confirms that bumping the timeout and the larger clusters makes the conformance tests pass consistently.

🤖 Triggering CI on branch 'release-next' after synching to upstream/master

bobcatfish assigned josephburnett Mar 6, 2018

mattmoor added kind/bug Categorizes issue or PR as related to a bug. kind/feature Well-understood/specified features, ready for coding. labels Mar 6, 2018

josephburnett mentioned this issue Mar 7, 2018

Find autoscaler via DNS. #309

Merged

josephburnett mentioned this issue Mar 8, 2018

Disable autoscaling in non-default namespaces. #325

Closed

josephburnett mentioned this issue Mar 9, 2018

Increase default node cpu size from 1 to 4. #335

Merged

josephburnett mentioned this issue Mar 9, 2018

Bump polling timeout from 1 to 2 minutes. #338

Merged

josephburnett closed this as completed Mar 12, 2018

mgencur pushed a commit to mgencur/serving-1 that referenced this issue Oct 22, 2019

Merge pull request knative#291 from openshift/release-next-ci

0010555

🤖 Triggering CI on branch 'release-next' after synching to upstream/master

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Autoscaler changes broke deployment in non-default namespace #291

Autoscaler changes broke deployment in non-default namespace #291

bobcatfish commented Mar 6, 2018

bobcatfish commented Mar 7, 2018

josephburnett commented Mar 7, 2018

bobcatfish commented Mar 8, 2018

josephburnett commented Mar 9, 2018 •

edited

Loading

josephburnett commented Mar 9, 2018

josephburnett commented Mar 12, 2018

Autoscaler changes broke deployment in non-default namespace #291

Autoscaler changes broke deployment in non-default namespace #291

Comments

bobcatfish commented Mar 6, 2018

bobcatfish commented Mar 7, 2018

josephburnett commented Mar 7, 2018

bobcatfish commented Mar 8, 2018

josephburnett commented Mar 9, 2018 • edited Loading

josephburnett commented Mar 9, 2018

josephburnett commented Mar 12, 2018

josephburnett commented Mar 9, 2018 •

edited

Loading