Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Autoscaler changes broke deployment in non-default namespace #291

Closed
bobcatfish opened this issue Mar 6, 2018 · 6 comments
Closed

Autoscaler changes broke deployment in non-default namespace #291

bobcatfish opened this issue Mar 6, 2018 · 6 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/feature Well-understood/specified features, ready for coding.

Comments

@bobcatfish
Copy link
Contributor

Discovered via a bisect, the autoscaler changes broke deployment for Routes and Configuration unless they are deployed to the namespace default.

I discovered this by running the conformance tests, which currently timeout at The Revision will be updated when it is ready to serve traffic. The logs say something like the following repeatedly:

I0306 00:58:21.769098       1 controller.go:395] Figuring out routes for Route: route-example
I0306 00:58:21.770977       1 controller.go:425] Failed to fetch Revision:  : resource name may not be empty
I0306 00:58:21.770993       1 controller.go:400] Failed to get a route for target {Name: Revision: Configuration:configuration-example Percent:100} : "resource name may not be empty"
I0306 00:58:21.771015       1 controller.go:447] Failed to get routes for route-example : "resource name may not be empty"
I0306 00:58:21.771022       1 controller.go:329] Failed to create Routes: resource name may not be empty

You can reproduce this with the helloworld sample by:

  1. Create a brand new namespace such as kubectl create namespace spacedogs
  2. Update the hello world example's configuration.yaml and route.yaml to use the namespace you created (spacedogs)
  3. Run bazel run sample/helloworld:everything.create
  4. Look at the logs with kubectl -n ela-system logs -f $(kubectl -n ela-system get pods -l app=ela-controller -o name)

Then you will see the following error in the logs:

I0306 02:29:51.856424       1 controller.go:395] Figuring out routes for Route: route-example
I0306 02:29:51.857035       1 controller.go:404] Revision "p-e09d9cb4-bfd1-49f3-98cd-5999acdfb5d9" is not ready
I0306 02:29:51.862241       1 controller.go:424] Failed to fetch Revision:  : resource name may not be empty
I0306 02:29:51.862533       1 controller.go:400] Failed to get a route for target {Name: Revision: Configuration:configuration-example Percent:100} : "resource name may not be empty"
I0306 02:29:51.862900       1 controller.go:446] Failed to get routes for route-example : "resource name may not be empty"
I0306 02:29:51.863126       1 controller.go:329] Failed to create Routes: resource name may not be empty
E0306 02:29:51.863353       1 controller.go:251] error syncing "spacedogs/route-example": resource name may not be empty

I think one of the main causes is that the ela-autoscaler and ela-revision service accounts exist in the namespace default. When I remove them from ela_pod.go and ela_autoscaler.go, deployments go further however requests to an updated Configuration end up failing repeatedly with 503's. I have also tried changing their namespace to ela-system but I suspect they need to be created in the same namespace as the Route and Revision, so the solution might be creating that service account dynamically.

@mattmoor mattmoor added kind/bug Categorizes issue or PR as related to a bug. kind/feature Well-understood/specified features, ready for coding. labels Mar 6, 2018
@bobcatfish
Copy link
Contributor Author

I tried setting the namespace to default in the conformance test just to get it to pass, but it fails in the last phase (making sure it can hit the new revision):

Route Deploying an app with a pre-built container 
  Creates a route, serves traffic to it, and serves traffic to subsequent revisions
  /usr/local/google/home/christiewilson/Code/go/src/github.com/elafros/elafros/test/conformance/route_test.go:197
STEP: Creating a new Route
STEP: Creating Configuration for the Route using the first image
STEP: The Configuration will be updated with the Revision after it is created
STEP: The Revision will be updated when it is ready to serve traffic
STEP: The Configuration will be updated when the Revision is ready to serve traffic
STEP: Once the Configuration has been updated with the Revision, the Route will be updated to route traffic to the Revision
STEP: Wait for the ingress loadbalancer address to be set
STEP: Make a request to the Revision that is now deployed and serving traffic
STEP: Patch the Configuration to use a new image
STEP: A new Revision will be made and the Configuration will be updated with it
STEP: The new Revision will be updated when it is ready to serve traffic
STEP: The Configuration will be updated to indicate the new revision is ready
STEP: The Route will then immediately send all traffic to the new revision
STEP: Wait for the ingress to actually start serving traffic from the newly deployed Revision
Connected to previous revision, retryingRetrying for code 503
Retrying for code 503
Retrying for code 503
Retrying for code 503
Retrying for code 503
Retrying for code 503
Retrying for code 503
Retrying for code 503
Retrying for code 503
...

And then it times out.

@josephburnett
Copy link
Contributor

I found the problem. My autoscaler change introduced a new service account for the pods to use to push metrics to the autoscaler. But the service account is created in the default namespace, so pods in other namespaces can't use it.

2018/03/07 21:14:32 Failed to create pod: pods "p-3b116aaa-879d-4756-8f82-3ee7f751b7bf-warmpod-d7a9f3b9-8c3a-4800-912c-f651949a50e0" is forbidden: error looking up service account joe/ela-revision: serviceaccount "ela-revision" not found

I wanted to get rid of that service account and use DNS anyway (https://github.com/elafros/elafros/blob/d7992e754237afae6468b41bb8ed304760f79108/cmd/ela-queue/main.go#L105). So I'll just do that.

@bobcatfish
Copy link
Contributor Author

It looks like things might still be a bit broken - when I try to run the tests now, they timeout when waiting for the Configuration to be updated with the new revision after modifying the Configuration:

Route Deploying an app with a pre-built container 
  Creates a route, serves traffic to it, and serves traffic to subsequent revisions
  /usr/local/google/home/christiewilson/Code/go/src/github.com/elafros/elafros/test/conformance/route_test.go:197
STEP: Creating a new Route
STEP: Creating Configuration for the Route using the first image
STEP: The Configuration will be updated with the Revision after it is created
STEP: The Revision will be updated when it is ready to serve traffic
STEP: The Configuration will be updated when the Revision is ready to serve traffic
STEP: Once the Configuration has been updated with the Revision, the Route will be updated to route traffic to the Revision
STEP: Wait for the ingress loadbalancer address to be set
STEP: Make a request to the Revision that is now deployed and serving traffic
STEP: Patch the Configuration to use a new image
STEP: A new Revision will be made and the Configuration will be updated with it
STEP: The new Revision will be updated when it is ready to serve traffic
STEP: The Configuration will be updated to indicate the new revision is ready

• Failure [89.510 seconds]
Route
/usr/local/google/home/christiewilson/Code/go/src/github.com/elafros/elafros/test/conformance/route_test.go:152
  Deploying an app with a pre-built container
  /usr/local/google/home/christiewilson/Code/go/src/github.com/elafros/elafros/test/conformance/route_test.go:196
    Creates a route, serves traffic to it, and serves traffic to subsequent revisions [It]
    /usr/local/google/home/christiewilson/Code/go/src/github.com/elafros/elafros/test/conformance/route_test.go:197

    Expected error:
        <*errors.errorString | 0xc4204ce410>: {
            s: "timed out waiting for the condition",
        }
        timed out waiting for the condition
    not to have occurred

    /usr/local/google/home/christiewilson/Code/go/src/github.com/elafros/elafros/test/conformance/crd_polling.go:64

Sometimes the old behavior still happens:

Route Deploying an app with a pre-built container 
  Creates a route, serves traffic to it, and serves traffic to subsequent revisions
  /usr/local/google/home/christiewilson/Code/go/src/github.com/elafros/elafros/test/conformance/route_test.go:197
STEP: Creating a new Route
STEP: Creating Configuration for the Route using the first image
STEP: The Configuration will be updated with the Revision after it is created
STEP: The Revision will be updated when it is ready to serve traffic

• Failure [60.549 seconds]
Route
/usr/local/google/home/christiewilson/Code/go/src/github.com/elafros/elafros/test/conformance/route_test.go:152
  Deploying an app with a pre-built container
  /usr/local/google/home/christiewilson/Code/go/src/github.com/elafros/elafros/test/conformance/route_test.go:196
    Creates a route, serves traffic to it, and serves traffic to subsequent revisions [It]
    /usr/local/google/home/christiewilson/Code/go/src/github.com/elafros/elafros/test/conformance/route_test.go:197

    Expected error:
        <*errors.errorString | 0xc4203edd70>: {
            s: "timed out waiting for the condition",
        }
        timed out waiting for the condition
    not to have occurred

I tried changing the namespace to default at @grantr 's suggestion, which put me back in the state where I get endless 503's after updating an existing Configuration:

➜  elafros git:(master) ✗ go test -v ./test/conformance -ginkgo.v                
=== RUN   TestConformance
Running Suite: Conformance Suite
================================
Random Seed: 1520467880
Will run 1 of 1 specs

Route Deploying an app with a pre-built container 
  Creates a route, serves traffic to it, and serves traffic to subsequent revisions
  /usr/local/google/home/christiewilson/Code/go/src/github.com/elafros/elafros/test/conformance/route_test.go:197
STEP: Creating a new Route
STEP: Creating Configuration for the Route using the first image
STEP: The Configuration will be updated with the Revision after it is created
STEP: The Revision will be updated when it is ready to serve traffic
STEP: The Configuration will be updated when the Revision is ready to serve traffic
STEP: Once the Configuration has been updated with the Revision, the Route will be updated to route traffic to the Revision
STEP: Wait for the ingress loadbalancer address to be set
STEP: Make a request to the Revision that is now deployed and serving traffic
STEP: Patch the Configuration to use a new image
STEP: A new Revision will be made and the Configuration will be updated with it
STEP: The new Revision will be updated when it is ready to serve traffic
STEP: The Configuration will be updated to indicate the new revision is ready
STEP: The Route will then immediately send all traffic to the new revision
STEP: Wait for the ingress to actually start serving traffic from the newly deployed Revision
Connected to previous revision, retryingRetrying for code 503
Retrying for code 503

@josephburnett
Copy link
Contributor

josephburnett commented Mar 9, 2018

Okay, I think I have a better understanding about what's going on here.

The autoscaler change (#229) was broken in non-default namespaces because it created the queue's service account in the default namespace, and the queue in namespace pizzaplanet couldn't use it. That was fixed by finding the autoscaler via DNS (#309).

Additionally, the autoscaler increased the Revision Pod's CPU request from 0.025 to 1.000, which was necessary for autoscaling and for the scheduler to exert pressure on the cluster autoscaler by having unscheduled nodes. However the default developer setup is 3, 1-core nodes and each node has an overhead of 0.03 for k8s resources. Revisions simply didn't fit anymore. #335 increases the default node to a 4-core machine which solves this problem.

Finally, initial pull latency on a new cluster (empty cache) is very large. Large enough to cause the conformance tests to fail even before the autoscaling change. Sometimes (seems non-deterministic) the conformance tests will fail the first time on a new cluster.

To solve this problem, I recommend either 1) deploying a Revision to warm the cache as part of the conformance tests or 2) increasing the conformance test timeouts. I prefer warming the cache because I don't like slow tests.

Here is the data I based my analysis on:

There were two failure modes I observed by running watch kubectl get all -n pizzaplanet while running the conformance tests.

  1. ContainerCreating until the test timed out--this was initial pull latency. The Pod was scheduled but it hasn't started yet.
  2. Pending. The Pod was unscheduled. Didn't fit. Never would.

I tested two commits, the one before my autoscaler change (tagged lastknowngood and my autoscaler change with the DNS change (#309) cherry picked (tagged dnsfix). I also tested three cluster configurations--3 nodes with 1, 2 and 4 cores each. In each case, I created a brand new cluster, deployed Elafros, and then ran conformance tests right away. In cases where it failed on ContainerCreating, I retested with a new cluster but slept for 120 seconds after deploying Elafros.

Results:

  • lastknowngood with 1 core: FAILED with ContainerCreating
  • lastknowngood with 2 cores: PASSED after 108 seconds (second, warm run was 45 seconds)
  • lastknowngood with 2 cores, sleep 120: PASSED after 95 seconds
  • dnsfix with 1 core: FAILED with Pending
  • dnsfix with 2 cores: FAILED with ContainerCreating
  • dnsfix with 2 cores, sleep 120: FAILED with ContainerCreating once and Pending a second time
  • dnsfix with 4 cores: PASSED after 95 seconds

Note: On 1 core clusters, failures were at the "Revision will be updated..." step, after deploying the first revision. On 2 core clusters, failures were at the "Configuration will be updated..." step, after deploying the second revision.

@josephburnett
Copy link
Contributor

I've been trying to hack in deploying a Revision before the conformance tests to warm the node cache, but it fails in all kinds of interesting and unexpected ways. I think it would be more clean to just install a warm image in the test setup (#286) to get a jump on the pull latency.

@josephburnett
Copy link
Contributor

@grantr confirms that bumping the timeout and the larger clusters makes the conformance tests pass consistently.

mgencur pushed a commit to mgencur/serving-1 that referenced this issue Oct 22, 2019
🤖 Triggering CI on branch 'release-next' after synching to upstream/master
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/feature Well-understood/specified features, ready for coding.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants