Machine api mvp #119
Conversation
/retest
Still not completely familiar with the tf/ign parts of the code, but I left some comments and questions.
steps/assets/base/tectonic.tf
Outdated
@@ -50,6 +50,7 @@ module "bootkube" {
  service_account_private_key_pem = "${local.service_account_private_key_pem}"

  etcd_endpoints    = "${data.template_file.etcd_hostname_list.*.rendered}"
  worker_ign_config = "${file("worker.ign")}"
Who/what creates the worker.ign file? What is inside the file?
I see, most likely generateIgnConfigStep is responsible for that, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
correct :)
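
For readers unfamiliar with the file: a worker Ignition config is a JSON document following the Ignition v2 spec. A minimal sketch of what a generated worker.ign might contain is below, shown as YAML for readability; the specific units and files are assumptions for illustration, not the installer's actual output.

# Hypothetical shape of a generated worker.ign (Ignition v2 schema);
# the real file is JSON and its exact contents come from generateIgnConfigStep.
ignition:
  version: "2.2.0"
systemd:
  units:
    - name: kubelet.service            # assumed unit
      enabled: true
storage:
  files:
    - filesystem: root
      path: /etc/kubernetes/kubeconfig # assumed path
      mode: 0600
      contents:
        source: "data:,..."            # contents elided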
- name: machine-api-operator
  image: quay.io/alberto_lamela/machine-api-operator:a661677 # TODO: move this to openshift org
  command:
  - "/machine-api-operator"
Is the rule to put all the binaries under /? I noticed this pattern is used in multiple places (even in the cluster-api upstream repo). IMHO, whenever possible I prefer /usr/bin/ or /bin. It's more intuitive.
The convention is that all Tectonic operator binaries are built under the image's root folder.
modules/bootkube/manifests.tf
Outdated
@@ -17,6 +17,9 @@ variable "manifest_names" {
  "tectonic-node-controller-operator.yaml",
  "tnc-tls-secret.yaml",
  "ign-config.yaml",
  "app-version-mao.yaml",
mao is not so far from Mao. What about renaming mao -> machine-api-operator? It's not so long even with the additional chars.
/retest

2 similar comments

/retest

/retest
Getting:
@yifan-gu @smarterclayton would you be able to help get this green? Is there any way we can look at the kube logs to troubleshoot?
You'd probably have to SSH in to the pod or host while it's running. We
need to capture cluster logs in failure cases but we're trying to use the
cluster APIs to get all those.
However, in the case you're looking at it looks like the control plane came
up but then some of the control plane components failed. Let me try to get
that scenario handled (we have kubeconfig but the cluster isn't healthy).
…On Tue, Aug 14, 2018 at 9:34 AM OpenShift CI Robot wrote:
> @enxebre: The following test *failed*, say /retest to rerun them all:
>
> Test: ci/prow/e2e-aws, Commit: d4738f2, Rerun command: /test e2e-aws
> Log: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/119/pull-ci-origin-installer-e2e-aws/447/
> Full PR test history: https://openshift-gce-devel.appspot.com/pr/openshift_installer/119
I added openshift/release#1179 which should capture these logs. Will kick your job as soon as I've merged and updated.
/retest
@smarterclayton thanks for the help! We need to rerun the tests to include @trawler's fix openshift/machine-api-operator@d84ba81. This PR runs only the workers as a machineSet. The machine-api-operator and the cluster-api stack run on masters, so when workers do not come up, it can be debugged by getting the logs for the relevant pods.
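
For context, the worker machineSet this comment refers to would look roughly like the sketch below, using the 2018-era cluster-api v1alpha1 types; the name, replica count, and provider fields are illustrative assumptions, not the exact manifest from this PR.

apiVersion: cluster.k8s.io/v1alpha1
kind: MachineSet
metadata:
  name: worker                # hypothetical name
  namespace: default
spec:
  replicas: 3                 # assumed count
  selector:
    matchLabels:
      machineset: worker
  template:
    metadata:
      labels:
        machineset: worker
    spec:
      providerConfig:
        value:
          apiVersion: aws.cluster.k8s.io/v1alpha1   # assumed provider group
          kind: AWSMachineProviderConfig
          instanceType: m4.large                    # assumed
          # the worker ignition config ends up as instance user data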
The teardown scripts use the Kubernetes API to find and retrieve logs - we grab all pods, all node logs, all events, and a number of other things without looking at AWS info.
/test e2e-aws

Failed due to something about reuse.
@@ -8,6 +8,7 @@ data:
  clusterName: ${cluster_name}
  clusterDomain: ${cluster_domain}
  region: ${region}
  image: ${image}
@enxebre Is there a path to upgrade the image when the cluster updates?
Looking better now. The Jenkins test actually passed; however, it fails to destroy as workers are not under tf control now.
/test build-tarball
All tests are passing now. The Jenkins ones show red because Terraform is failing to destroy, as it does not know how to handle the machines created by the machineSet. We could add kubectl here https://github.com/openshift/installer/blob/master/images/tectonic-smoke-test-env/Dockerfile and …
I wonder why …
Jenkins fails with account limit error:
retest this please

1 similar comment

retest this please
@enxebre I found that this PR (as it is now) breaks the behavior of the …
/test unit

/test e2e-aws
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: crawford, enxebre. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
/hold cancel
@@ -160,3 +160,9 @@ variable "pull_secret" {
  type        = "string"
  description = "Your pull secret. Obtain this from your Tectonic Account: https://account.coreos.com."
}

variable "worker_ign_config" {
  description = "Worker ignition config"
Is this temporary? I'd have expected the machine API operator to manage worker ignition configs on its own.
Not temporary to my knowledge. I'm not sure why the machine-api-operator would manage the contents of the ignition configs. The configs are just an input to be passed along to the MachineSets. If anything, I would expect the machine-config-operator (this is getting confusing...) to manage the ignition configs and provide them in some way to the machine-api-operator.
> If anything, I would expect the machine-config-operator (this is getting confusing...) to manage the ignition configs and provide them in some way to the machine-api-operator.

That makes sense to me. @abhinavdahiya?
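
To make the proposed handoff concrete, one possible shape, with entirely hypothetical names, is for whichever component owns the worker ignition config to publish it as a ConfigMap that the machine-api-operator reads and injects into new machines as user data:

# Hypothetical handoff object; neither the name, the namespace,
# nor the mechanism is prescribed by this PR.
apiVersion: v1
kind: ConfigMap
metadata:
  name: worker-ignition
  namespace: openshift-cluster-api
data:
  worker.ign: |
    {"ignition": {"version": "2.2.0"}}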
…cluster-api

Generated with:

  $ glide update --strip-vendor
  $ glide-vc --use-lock-file --no-tests --only-code
  $ bazel run //:gazelle

using:

  $ glide --version
  (cd $GOPATH/src/github.com/Masterminds/glide && git describe)
  v0.13.1-7-g3e13fd1
  $ (cd $GOPATH/src/github.com/sgotti/glide-vc && git describe)
  v0.1.0-2-g6ddf6ee
  $ bazel version
  Build label: 0.16.1- (@Non-Git)
  Build target: bazel-out/k8-opt/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
  Build time: Mon Aug 13 16:42:29 2018 (1534178549)
  Build timestamp: 1534178549
  Build timestamp as int: 1534178549

The tectonic-node-controller removal catches us up with 596591b (.*: replace tectonic node controller with machine config operator, 2018-09-10, openshift#232). The cluster-api trim adjusts the content from b00e40e (vendor: Add client from sigs.k8s.io/cluster-api, 2018-09-04, openshift#119). Because cluster-api wasn't in glide.lock, I suspect neither glide nor glide-vc were run before that commit.
The last consumers for these were removed by 124ac35 (*: Use machine-api-operator to deploy worker nodes, 2018-09-04, openshift#119).
And the related, libvirt-specific, tectonic_libvirt_worker_ips. This simplifies the Terraform logic for AWS and OpenStack, and consistently pushes most worker setup into the cluster-API providers who are creating the workers since 124ac35 (*: Use machine-api-operator to deploy worker nodes, 2018-09-04, openshift#119).
This PR provides a minimum functional integration for the installer with the machine API and machineSets. It removes the terraform step for creating worker machines on AWS, so after bootstrapping a cluster with openshift install, the worker machines are managed by a machineSet object and we can have an early e2e test-driven workflow.

Follow ups: