This repository has been archived by the owner on Sep 4, 2021. It is now read-only.

kube-aws: decrypt-tls-assets.service failed in a controller node #675

Closed
mumoshu opened this issue Sep 16, 2016 · 20 comments

Comments

@mumoshu
Contributor

mumoshu commented Sep 16, 2016

With the latest master (066d888caa85c00112d891f3c69a8416a59aaee4), decrypt-tls-assets.service failed and kubelet.service didn't come up.

systemctl status decrypt-tls-assets showed:

● decrypt-tls-assets.service - decrypt kubelet tls assets using amazon kms
   Loaded: loaded (/etc/systemd/system/decrypt-tls-assets.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Fri 2016-09-16 13:02:33 UTC; 3min 34s ago
  Process: 884 ExecStart=/opt/bin/decrypt-tls-assets (code=exited, status=1/FAILURE)
 Main PID: 884 (code=exited, status=1/FAILURE)

journalctl -u decrypt-tls-assets showed:

Sep 16 13:02:03 ip-10-0-0-50.ap-northeast-1.compute.internal systemd[1]: Starting decrypt kubelet tls assets using amazon kms...
Sep 16 13:02:04 ip-10-0-0-50.ap-northeast-1.compute.internal sudo[890]:     root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/rkt run --volume=ssl,kind=host,source=/etc/kubernetes/ssl,readOnly=fal
Sep 16 13:02:04 ip-10-0-0-50.ap-northeast-1.compute.internal sudo[890]: pam_unix(sudo:session): session opened for user root by (uid=0)
Sep 16 13:02:04 ip-10-0-0-50.ap-northeast-1.compute.internal decrypt-tls-assets[884]: image: using image from file /usr/lib64/rkt/stage1-images/stage1-coreos.aci
Sep 16 13:02:18 ip-10-0-0-50.ap-northeast-1.compute.internal decrypt-tls-assets[884]: image: searching for app image quay.io/coreos/awscli
Sep 16 13:02:33 ip-10-0-0-50.ap-northeast-1.compute.internal decrypt-tls-assets[884]: run: discovery failed
Sep 16 13:02:33 ip-10-0-0-50.ap-northeast-1.compute.internal systemd[1]: decrypt-tls-assets.service: Main process exited, code=exited, status=1/FAILURE
Sep 16 13:02:33 ip-10-0-0-50.ap-northeast-1.compute.internal systemd[1]: Failed to start decrypt kubelet tls assets using amazon kms.
Sep 16 13:02:33 ip-10-0-0-50.ap-northeast-1.compute.internal systemd[1]: decrypt-tls-assets.service: Unit entered failed state.
Sep 16 13:02:33 ip-10-0-0-50.ap-northeast-1.compute.internal systemd[1]: decrypt-tls-assets.service: Failed with result 'exit-code'.

A work-around for me was to run

sudo systemctl restart decrypt-tls-assets
sudo systemctl restart kubelet

and then wait several minutes until curl localhost:8080 succeeded.
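For reference, a minimal sketch of that wait step, assuming the apiserver is listening on the controller's default insecure port 8080:

# Poll the local apiserver until it responds.
until curl -sf http://localhost:8080/version > /dev/null; do
  echo "waiting for the apiserver..."
  sleep 10
done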
I'm not really sure, but did the fix for #666 cause another issue?

@cgag
Contributor

cgag commented Sep 20, 2016

Regardless of how we improve decrypt-tls-assets, I think we should make starting the kubelet more robust by using the workaround from the upstream systemd issue for failed dependencies, which @dghubble used here: https://github.com/coreos-inc/tectonic/pull/250. This would mean adding something like ExecStartPre=/usr/bin/systemctl is-active decrypt-tls-assets.service.

Alternatively, we could ExecStartPre the decrypt-tls-assets script directly and drop the oneshot service. This would probably mean adding a check to avoid re-decrypting the assets if the kubelet restarts.
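For illustration, a minimal sketch of the is-active variant as a drop-in for the kubelet unit; the drop-in path and the Restart values are assumptions, not the actual kube-aws cloud-config:

# /etc/systemd/system/kubelet.service.d/10-wait-for-tls.conf (hypothetical drop-in path)
[Service]
# Fail the kubelet with an exit code (rather than a dependency failure) until the
# decrypt unit has become active; Restart= below keeps retrying the kubelet.
ExecStartPre=/usr/bin/systemctl is-active decrypt-tls-assets.service
Restart=always
RestartSec=10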

@robszumski
Member

This would mean
adding something like ExecStartPre=/usr/bin/systemctl is-active decrypt-tls-assets.service.

I have attempted to test out that ExecStartPre without success, but I am far from a systemd expert.

This would probably mean adding a check to avoid re-decrypting the assets if the kubelet restarts.

Is that actually a problem?

@mumoshu
Contributor Author

mumoshu commented Sep 21, 2016

@cgag @robszumski Thanks for your response. So the problems are:

(1) decrypt-tls-assets.service doesn't restart/retry on failure

and/or

(2) kubelet.service fails when (1) has failed one or more times, and it has no way to recover from that

right?

@mumoshu
Contributor Author

mumoshu commented Sep 21, 2016

If so, for (1), I'd

  • add Restart=on-failure to the decrypt-tls-assets systemd unit (though I don't know if it works as I expect), or
  • add retry logic, written in shell script, directly to /opt/bin/decrypt-tls-assets

so that decrypt-tls-assets.service would eventually succeed and become active.

If going with the latter option, I'd also add TimeoutStartSec=T, where T is enough time for decrypt-tls-assets to succeed after several retries (a rough sketch follows below).
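A rough sketch of that retry logic, where decrypt_once is a placeholder for the rkt/awscli invocation already in /opt/bin/decrypt-tls-assets, and the attempt count and interval are assumptions:

#!/bin/bash
# decrypt_once stands in for the existing rkt/awscli decryption command in the script.
decrypt_once() {
  echo "run the existing rkt/awscli decryption command here"
  false  # placeholder
}

# Retry so that transient image-discovery or network failures don't fail the unit permanently.
for attempt in $(seq 1 30); do
  if decrypt_once; then
    exit 0
  fi
  echo "decrypt-tls-assets: attempt ${attempt}/30 failed; retrying in 10s" >&2
  sleep 10
done
exit 1

With 30 attempts at 10-second intervals, TimeoutStartSec would need to be comfortably above 300 seconds.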

Then, for (2), I'd add ExecStartPre=/usr/bin/systemctl is-active decrypt-tls-assets.service to the kubelet unit, as @cgag pointed out, so that kubelet would try to start only after decrypt-tls-assets.service is active, i.e. has succeeded.

Do these changes look good to you? @cgag @robszumski

@robszumski
Member

I think the underlying restart issue was solved in #666 but hasn't made it into a release yet. But @cgag is right that we need to restructure things to make the kubelet restart properly overall.

@mumoshu
Contributor Author

mumoshu commented Sep 21, 2016

@robszumski Thanks, but as I reported earlier in this issue, I'm using kube-aws built against the latest master, including the fix for #666, and unfortunately found decrypt-tls-assets.service not active.

Since it is not active, i.e. it eventually failed without ever succeeding, just changing kubelet.service so that it tries to start only after decrypt-tls-assets is active wouldn't work, right?

@cgag
Contributor

cgag commented Sep 21, 2016

@robszumski I actually only tested the alternative I proposed of just directly ExecStartPre'ing the TLS asset decryption. I'm validating the is-active version now and realizing it might not actually solve things, which doesn't make much sense to me. Perhaps it's something related to it being a oneshot. I'll need to play with it more.

@mumoshu:
It's slightly more complicated than that. Even if decrypt-tls-assets did have retry logic, it wouldn't solve the problem. What happens is this:

Say service B depends on service A. If you attempt to start B, it will then try to start A. If A fails, B will be marked as "failed due to dependency", and then systemd will never attempt to restart it. Things marked as failing due to dependency are dead for good.

If A has restart logic, systemd will try to restart it, but it won't go back and start B once A has come up.

The idea behind the is-active workaround is to force B to fail due to exit-code rather than dependency. Systemd will attempt to restart things that fail due to bad exit codes. I thought that these restarts would cause it to re-run its dependencies if they were inactive, but my initial tests seem to show I'm wrong. I need to validate further. I now think that the is-active check needs to be used in concert with restart logic on the dependency.

Another, smaller problem is that Type=oneshot services can't have any Restart= value other than no. We could do what you said and add the retry logic to the shell script, but if possible I'd prefer to keep any restart logic in the service files, which pushes me towards using the ExecStartPre option to directly run decrypt-tls-assets.

@mumoshu
Contributor Author

mumoshu commented Sep 21, 2016

@cgag Thanks for your detailed explanation. It was very helpful to me.

Say service B depends on service A. If you attempt to start B, it will then try to start A. If A fails, B will be marked as "failed due to dependency", and then systemd will never attempt to restart it. Things marked as failing due to dependency are dead for good.

Ack. That's also why I've proposed (1) in #675 (comment)

If A has restart logic, systemd will try to restart it, but it won't go back and start B once A has come up.

Thanks. That's what I wasn't sure about at the time of writing (1) above.

The idea behind the is-active workaround is to force B to fail due to exit-code rather than dependency. Systemd will attempt to restart things that fail due to bad exit codes.

Sounds nice. I've seen someone on the internet come up with that approach, too.

I thought that these restarts would cause it to re-run its dependencies if they were inactive, but my initial tests seem to show I'm wrong. I need to validate further.

Thanks for sharing the great findings.
Sorry for speculating without real experiments on my side, but my assumption about that behavior is that once a oneshot dependency has failed, the failure is permanent. I.e., starting the dependant (e.g. kubelet) does trigger the initial start of the oneshot dependency (e.g. decrypt-tls-assets), but won't retry it on failure.

So:

We could do what you said and add the retry logic to the shell script, but if possible I'd prefer to keep any restart logic in the service files, which pushes me towards using the ExecStartPre option to directly run decrypt-tls-assets

I'd also prefer the ExecStartPre option in that regard, if it works.
Sorry for the back and forth, but your initial comment now makes total sense to me. Thanks a lot 👍

@dghubble
Member

dghubble commented Oct 19, 2016

Yep, I use the ExecStartPre=/usr/bin/systemctl is-active flanneld.service trick across open source and Tectonic bare-metal clusters to fail and restart the kubelet when flannel hasn't finished setup. Here is a public example:

https://github.com/coreos/coreos-baremetal/blob/master/examples/ignition/k8s-controller.yaml#L68

For AWS configurations, on the other hand, I've been seeing the same decrypt-tls-assets failure a lot. The is-active approach doesn't work well there because the decrypt-tls-assets service won't retry; systemd oneshot services can't use Restart=on-failure.

Adding ExecStartPre=/opt/bin/decrypt-tls-assets to the kubelet does guarantee a serial ordering, and the kubelet itself will retry. That's been working well, and it looks like that's what @cgag suggested too. I'm not using the oneshot unit anymore; perhaps kube-aws would like to take that approach?
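A minimal sketch of that shape of kubelet unit, where the ExecStart command and the Restart values are illustrative assumptions rather than the exact kube-aws configuration:

[Unit]
Description=Kubernetes Kubelet

[Service]
# Decrypt the TLS assets inline instead of via a separate oneshot unit; if this
# fails, the kubelet fails with an exit code and Restart= retries the whole unit.
ExecStartPre=/opt/bin/decrypt-tls-assets
ExecStart=/usr/lib/coreos/kubelet-wrapper --api-servers=http://127.0.0.1:8080
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target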

@mumoshu
Contributor Author

mumoshu commented Oct 19, 2016

@dghubble Thanks for sharing your valuable insights!

May I ask whether #697 looks good to you in that regard?

I've started to believe that the content of the PR:

  • replacing every occurrence (or at least the ones with a real chance of failing) of Requires+After with Wants+ExecStartPre, and
  • converting the oneshot service into an ExecStartPre

is the way to go (for now), according to what you and @cgag said (a rough sketch of that change follows below).
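In unit terms, a minimal sketch of that change, where the unit names are the ones discussed in this issue and the remaining directives are illustrative assumptions:

# Before: the kubelet hard-requires the oneshot unit, so one failure is permanent.
[Unit]
Requires=decrypt-tls-assets.service
After=decrypt-tls-assets.service

# After: a soft dependency plus an ExecStartPre check; the kubelet fails with an
# exit code instead, and its own Restart= keeps retrying until the dependency is active.
[Unit]
Wants=decrypt-tls-assets.service
After=decrypt-tls-assets.service

[Service]
ExecStartPre=/usr/bin/systemctl is-active decrypt-tls-assets.service
Restart=always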

However, I'm not completely sure, and I'd greatly appreciate feedback from experts like you to move towards fixing this long-standing issue 🙇

@cmcconnell1

Thanks for your documentation here, @mumoshu / all. I am trying to get kube-aws to deploy correctly inside an existing VPC and am perhaps having related issues due to this requirement and my configuration in cluster.yaml, etc. (kube-aws seems to currently have issues with existing subnets; see #538). Unfortunately, I'm stuck with "decrypt-tls-assets.service loaded failed failed decrypt kubelet tls assets using amazon kms" failures a few minutes after [re]booting the controller or a fresh worker node. @mumoshu's workaround noted above unfortunately does not work for me. Are there any pending updates/changes coming soon?

Here's what I see when trying to restart the systemd services on the controller/worker nodes:

core@ip-10-1-100-100 ~ $ sudo systemctl restart decrypt-tls-assets
Job for decrypt-tls-assets.service failed because the control process exited with error code. See "systemctl status decrypt-tls-assets.service" and "journalctl -xe" for details.

core@ip-10-1-100-100 ~ $ sudo systemctl restart kubelet
A dependency job for kubelet.service failed. See 'journalctl -xe' for details.

Excerpt from journalctl -xe output showing flannel issues:

-- The start-up result is done.
Nov 08 01:41:50 ip-10-1-100-100.foo.com rkt[1672]: run: discovery failed
Nov 08 01:41:50 ip-10-1-100-100.foo.com systemd[1]: flanneld.service: Main process exited, code=exited, status=1/FAILURE
Nov 08 01:41:50 ip-10-1-100-100.foo.com systemd[1]: Failed to start Network fabric for containers.
-- Subject: Unit flanneld.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit flanneld.service has failed.
--
-- The result is failed.
Nov 08 01:41:50 ip-10-1-100-100.foo.com systemd[1]: flanneld.service: Unit entered failed state.
Nov 08 01:41:50 ip-10-1-100-100.foo.com systemd[1]: flanneld.service: Failed with result 'exit-code'.
Nov 08 01:41:55 ip-10-1-100-100.foo.com systemd[1]: flanneld.service: Service hold-off time over, scheduling restart.
Nov 08 01:41:55 ip-10-1-100-100.foo.com systemd[1]: Stopped Network fabric for containers.
-- Subject: Unit flanneld.service has finished shutting down

Thanks

@mumoshu
Contributor Author

mumoshu commented Nov 8, 2016

@cmcconnell1 hi, thanks for your feedback here!

Does restarting flanneld.service and then kubelet.service work for you?

If you were already using an external etcd, flanneld.service might have failed due to e.g. an ACL forbidding communication between etcd and the workers/controllers, slow startup of etcd, etc.

If restarting flanneld.service and then kubelet.service does work for you, it may be the latter, which is already addressed in kubernetes-retired/kube-aws@7576939 (btw, we've moved our repo!)

If restarting doesn't work for you, it may be related to an ACL or something I'm not aware of yet.

@cmcconnell1

Thanks @mumoshu for the quick response.

Regarding ACLs (at least on the EC2 side), I added both the kube-aws controller and worker nodes to another (development) security group which allows all traffic in/out (all protocols and port ranges). Sorry, I should have included this info last night.

The terminal output below is from the controller node of a fresh kube-aws cluster deploy (with the above-mentioned controller/worker nodes also included in an "Allow All" security group in addition to their respective controller/worker security groups).

ssh [email protected]
Last login: Tue Nov  8 17:10:12 UTC 2016 from 10.1.0.178 on pts/0
CoreOS stable (1185.3.0)
Update Strategy: No Reboots
Failed Units: 1
  decrypt-tls-assets.service

core@ip-10-1-100-100 ~ $ sudo systemctl restart flanneld.service ; sudo systemctl restart kubelet.service
Job for flanneld.service failed because the control process exited with error code.
See "systemctl status flanneld.service" and "journalctl -xe" for details.
A dependency job for kubelet.service failed. See 'journalctl -xe' for details.

core@ip-10-1-100-100 ~ $ journalctl -xe
--
-- Unit flanneld.service has begun starting up.
Nov 08 17:21:42 ip-10-1-100-100.foo.com curl[2262]: {"errorCode":105,"message":"Key already exists","cause":"/coreos.com/network/config","index":5}
Nov 08 17:21:42 ip-10-1-100-100.foo.com rkt[2267]: image: using image from file /usr/lib/rkt/stage1-images/stage1-fly.aci
Nov 08 17:21:42 ip-10-1-100-100.foo.com rkt[2267]: image: searching for app image quay.io/coreos/flannel
Nov 08 17:21:47 ip-10-1-100-100.foo.com systemd-timesyncd[657]: Timed out waiting for reply from 108.61.73.244:123 (2.coreos.pool.ntp.org).
Nov 08 17:21:57 ip-10-1-100-100.foo.com systemd-timesyncd[657]: Timed out waiting for reply from 152.2.133.52:123 (2.coreos.pool.ntp.org).
Nov 08 17:22:07 ip-10-1-100-100.foo.com systemd-timesyncd[657]: Timed out waiting for reply from 104.232.3.3:123 (3.coreos.pool.ntp.org).
Nov 08 17:22:18 ip-10-1-100-100.foo.com systemd-timesyncd[657]: Timed out waiting for reply from 132.163.4.102:123 (3.coreos.pool.ntp.org).
Nov 08 17:22:28 ip-10-1-100-100.foo.com systemd-timesyncd[657]: Timed out waiting for reply from 204.9.54.119:123 (3.coreos.pool.ntp.org).
Nov 08 17:22:38 ip-10-1-100-100.foo.com systemd-timesyncd[657]: Timed out waiting for reply from 204.11.201.12:123 (3.coreos.pool.ntp.org).
Nov 08 17:22:42 ip-10-1-100-100.foo.com rkt[2267]: run: discovery failed
Nov 08 17:22:42 ip-10-1-100-100.foo.com systemd[1]: flanneld.service: Main process exited, code=exited, status=1/FAILURE
Nov 08 17:22:42 ip-10-1-100-100.foo.com systemd[1]: Failed to start Network fabric for containers.
-- Subject: Unit flanneld.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit flanneld.service has failed.
--
-- The result is failed.
Nov 08 17:22:42 ip-10-1-100-100.foo.com systemd[1]: flanneld.service: Unit entered failed state.
Nov 08 17:22:42 ip-10-1-100-100.foo.com systemd[1]: flanneld.service: Failed with result 'exit-code'.
Nov 08 17:22:47 ip-10-1-100-100.foo.com systemd[1]: flanneld.service: Service hold-off time over, scheduling restart.
Nov 08 17:22:47 ip-10-1-100-100.foo.com systemd[1]: Stopped Network fabric for containers.
-- Subject: Unit flanneld.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit flanneld.service has finished shutting down.
Nov 08 17:22:47 ip-10-1-100-100.foo.com systemd[1]: Starting Network fabric for containers...
-- Subject: Unit flanneld.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit flanneld.service has begun starting up.
Nov 08 17:22:47 ip-10-1-100-100.foo.com curl[2316]: {"errorCode":105,"message":"Key already exists","cause":"/coreos.com/network/config","index":5}
Nov 08 17:22:47 ip-10-1-100-100.foo.com rkt[2320]: image: using image from file /usr/lib/rkt/stage1-images/stage1-fly.aci
Nov 08 17:22:48 ip-10-1-100-100.foo.com rkt[2320]: image: searching for app image quay.io/coreos/flannel
Nov 08 17:23:48 ip-10-1-100-100.foo.com rkt[2320]: run: discovery failed
Nov 08 17:23:48 ip-10-1-100-100.foo.com systemd[1]: flanneld.service: Main process exited, code=exited, status=1/FAILURE
Nov 08 17:23:48 ip-10-1-100-100.foo.com systemd[1]: Failed to start Network fabric for containers.
-- Subject: Unit flanneld.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit flanneld.service has failed.
--
-- The result is failed.
Nov 08 17:23:48 ip-10-1-100-100.foo.com systemd[1]: flanneld.service: Unit entered failed state.
Nov 08 17:23:48 ip-10-1-100-100.foo.com systemd[1]: flanneld.service: Failed with result 'exit-code'.
core@ip-10-1-100-100 ~ $ systemctl --failed
  UNIT                       LOAD   ACTIVE SUB    DESCRIPTION
● decrypt-tls-assets.service loaded failed failed decrypt kubelet tls assets using amazon kms

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

1 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.

I've also tried posting on the #kubernetes-users Slack channel, but haven't gotten any useful responses. Is there another, more appropriate IRC channel someone could suggest for kube-aws related issues?

Thanks for your help.

@mumoshu
Contributor Author

mumoshu commented Nov 9, 2016

@cmcconnell1 Thanks for the info 👍

The part of the log messages you've provided:

Nov 08 17:21:42 ip-10-1-100-100.foo.com rkt[2267]: image: searching for app image quay.io/coreos/flannel
Nov 08 17:21:47 ip-10-1-100-100.foo.com systemd-timesyncd[657]: Timed out waiting for reply from 108.61.73.244:123 (2.coreos.pool.ntp.org).
Nov 08 17:21:57 ip-10-1-100-100.foo.com systemd-timesyncd[657]: Timed out waiting for reply from 152.2.133.52:123 (2.coreos.pool.ntp.org).
Nov 08 17:22:07 ip-10-1-100-100.foo.com systemd-timesyncd[657]: Timed out waiting for reply from 104.232.3.3:123 (3.coreos.pool.ntp.org).
Nov 08 17:22:18 ip-10-1-100-100.foo.com systemd-timesyncd[657]: Timed out waiting for reply from 132.163.4.102:123 (3.coreos.pool.ntp.org).
Nov 08 17:22:28 ip-10-1-100-100.foo.com systemd-timesyncd[657]: Timed out waiting for reply from 204.9.54.119:123 (3.coreos.pool.ntp.org).
Nov 08 17:22:38 ip-10-1-100-100.foo.com systemd-timesyncd[657]: Timed out waiting for reply from 204.11.201.12:123 (3.coreos.pool.ntp.org).
Nov 08 17:22:42 ip-10-1-100-100.foo.com rkt[2267]: run: discovery failed

seems to indicate that your node doesn't have a connection to the internet.

Excuse me if I'm just revisiting what you've already tried, but let me cover all the possibilities I have in mind:

  • Does your VPC have an internet gateway attached? If not, attaching one would help (the CLI checks sketched after this list can help verify the first three points).
  • Which route table is used for your cluster? (You may have multiple route tables in a VPC if you've configured it that way; if so, which one is actually used?)
  • Does your route table have a rule routing packets to the internet through the internet gateway? If not, adding such a rule would help.
  • Does your network ACL (not security groups) have the necessary inbound/outbound rules? An ACL can block communication even if it is allowed in the SGs. You don't need to worry about this if you've never customized the network ACL, because, if I remember correctly, it allows all inbound/outbound communication by default.
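If it helps, the first three points can be checked with the AWS CLI; <vpc-id> here is a placeholder, not a value from your setup:

# Is an internet gateway attached to the VPC?
aws ec2 describe-internet-gateways --filters Name=attachment.vpc-id,Values=<vpc-id>

# Do the VPC's route tables contain a 0.0.0.0/0 route via that gateway?
aws ec2 describe-route-tables --filters Name=vpc-id,Values=<vpc-id> \
  --query 'RouteTables[].Routes[?DestinationCidrBlock==`0.0.0.0/0`]'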

@cmcconnell1

Hey @mumoshu, in all deployments I was using the default route table for the VPC (and specified it in the cluster.yaml file just to be sure). There is access to an internet gateway via NAT on that subnet, and both the controller and worker nodes were able to access the internet, ping Google and Yahoo, had DNS resolution, etc. AFAIK we decided not to use ACLs for this reason and just use security groups, as it's a bit tough to troubleshoot if you are trying to use both.

I tried downloading your latest RC-1 candidate earlier today and seemed to get a bit further: only the docker service would not start, and I had no issues with the previously failing services. I still couldn't connect remotely with kubectl, etc. So I've jumped over to trying to get going with kops, but am able/willing to work on both kube-aws and kops if/as needed. Thanks for your responses.

@mumoshu
Contributor Author

mumoshu commented Nov 17, 2016

Closing, as this is already fixed in https://github.com/coreos/kube-aws/

@mumoshu mumoshu closed this as completed Nov 17, 2016
@mumoshu
Contributor Author

mumoshu commented Nov 17, 2016

@cmcconnell1 Did you have any luck with kops and/or reach the root cause of your issue? I'd appreciate it if you could share your experience 🙏

@cmcconnell1

Hey @mumoshu, thanks for reaching out. It seems that so far we have issues with both kube-aws and kops when trying to deploy into an existing VPC. I've tried specifying internal NAT subnets/routing tables and external subnets/routing tables with routes to the internet gateway. With both approaches, the controller and workers are able to ping and resolve external sites, etc., but the services aren't happy, and currently, on the latest RC candidate, the docker service, install-kube-system.service, etc. are failing.
I was trying to follow the procedure here: https://coreos.com/kubernetes/docs/latest/network-troubleshooting.html and get the toolbox container on my CoreOS boxes, but the toolbox command hangs forever at the root terminal and unfortunately I'm not able to get it running.

Note that with kops and an existing VPC, I had connection-refused errors/issues with the kubelet service:

journalctl -f -u kubelet
-- Logs begin at Wed 2016-11-09 01:13:05 UTC. --
Nov 10 00:51:25 ip-10-1-42-34 kubelet[30857]: E1110 00:51:25.285598   30857 pod_workers.go:184] Error syncing pod a13427cc5e58ab27cce8f1272417c039, skipping: failed to "StartContainer" for "etcd-container" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=etcd-container pod=etcd-server-events-ip-10-1-42-34.us-west-1.compute.internal_kube-system(a13427cc5e58ab27cce8f1272417c039)"
Nov 10 00:51:25 ip-10-1-42-34 kubelet[30857]: E1110 00:51:25.586058   30857 reflector.go:203] pkg/kubelet/kubelet.go:384: Failed to list *api.Service: Get http://127.0.0.1:8080/api/v1/services?resourceVersion=0: dial tcp 127.0.0.1:8080: getsockopt: connection refused
Nov 10 00:51:25 ip-10-1-42-34 kubelet[30857]: E1110 00:51:25.587029   30857 reflector.go:203] pkg/kubelet/kubelet.go:403: Failed to list *api.Node: Get http://127.0.0.1:8080/api/v1/nodes?fieldSelector=metadata.name%3Dip-10-1-42-34.us-west-1.compute.internal&resourceVersion=0: dial tcp 127.0.0.1:8080: getsockopt: connection refused

I've tried posting in both the kubernetes-users and sig-aws Slack channels with no responses, other than one person simply stating that "kops works in an existing VPC"; after I posted the errors I get with kubelet, I didn't get any other responses. I was asking whether anyone has actually seen it working in an existing VPC.
I've read posts where folks claim to have gotten it working, e.g. #340 (comment), but when people followed up and asked for working cluster.yaml files and the basic details of how they hacked it to make it work, there was never a response. Pretty much everything we see is geared towards using a separate VPC, which would require VPC peering, networking complexity, etc., and that essentially prevents us from being able to use it as a solution. If anyone could point me to a GitHub repo or any location with the working files required to get kube-aws running in an existing VPC, that would be very helpful.
I've also looked at kube-up, but that project is clearly deprecated and also doesn't support VPCs.
As always, thanks for your help, info, etc.

@mumoshu
Contributor Author

mumoshu commented Nov 17, 2016

Hey @cmcconnell1!

Excuse me if I'm missing the point again, but could you ensure that the etcd cluster is up?

If I recall correctly, docker.service and flanneld.service on the controller and worker nodes rely on an etcd cluster being up beforehand.
kubelet.service won't be able to schedule pods if docker.service and/or flanneld.service aren't up yet, hence install-kube-system.service would have failed.

If your etcd cluster isn't up, would you mind checking out kubernetes-retired/kube-aws#62?
In short, kube-aws today doesn't support etcd nodes with hostnames coming from a non-default private DNS (i.e. your own Route53 private DNS) out of the box. As you can see in that issue, you can work around it by modifying cloud-config-etcd to use hostnames from AWS's default private DNS.
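To check whether the etcd cluster is up from a controller node, something like this should work, assuming the default client port 2379; <etcd-endpoint> is a placeholder:

# A healthy etcd2 member answers {"health": "true"} on its /health endpoint.
curl -s http://<etcd-endpoint>:2379/health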

@mumoshu
Contributor Author

mumoshu commented Nov 17, 2016

Sorry, I submitted my comment too early by mistake! Edited.
