
Openshift 4.2 installation on vSphere/ESXI 6.7 #2537

Closed
stefanskotte opened this issue Oct 20, 2019 · 28 comments

Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@stefanskotte

stefanskotte commented Oct 20, 2019

Version

$ openshift-install version
openshift-install v4.2.0
built from commit 90ccb37ac1f85ae811c50a29f9bb7e779c5045fb
release image quay.io/openshift-release-dev/ocp-release@sha256:c5337afd85b94c93ec513f21c8545e3f9e36a227f55d41bc1dfb8fcc3f2be129

Platform: vSphere

What happened?

Installation on vSphere fails for OpenShift 4.2, even when following the documentation exactly.

vSphere (6.7.0 Build 14368073)
VMware ESXi, 6.7.0, 13006603

For some reason the ovfEnv variables for Ignition are not picked up. I booted a RHEL 8 VM, and from it I could successfully read the vApp variables using the vmtoolsd command.
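
(As an aside for anyone debugging the same thing: a minimal check from inside the booted guest, assuming open-vm-tools/vmtoolsd is available, is to query the OVF environment and the Ignition guestinfo keys directly; empty output means the hypervisor never passed the data.)

# dump the whole OVF environment handed to the guest
vmtoolsd --cmd "info-get guestinfo.ovfEnv"
# check the Ignition-specific keys directly
vmtoolsd --cmd "info-get guestinfo.ignition.config.data"
vmtoolsd --cmd "info-get guestinfo.ignition.config.data.encoding"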

I have tried numerous times to reimport the CoreOS OVA (at this time 4.2) as a template and clone it exactly as described in the instructions. The only thing I see is that CoreOS gets the correct IP+DNS from my DHCP server, but then it is just stuck at the login screen (without my ssh key provisioned into it).

(I also tried setting the kernel argument "core.first_boot=detected", but it doesn't make Ignition trigger the installation.)

In the meantime, I have booted and installed a complete OCP 4.2 cluster using the bare-metal instructions here (https://blog.openshift.com/deploying-a-user-provisioned-infrastructure-environment-for-openshift-4-1-on-vsphere/), together with the latest 4.2 documentation.

What you expected to happen?

Installation on vSphere should work, with CoreOS picking up the OVF environment.

How to reproduce it (as minimally and precisely as possible)?

Follow the current OpenShift 4.2 vSphere documentation: import the OVA and clone it to bootstrap-0.

Insert vApp variables as described:
(screenshot of the vApp properties omitted)

Boot the cloned VM: it stalls and just boots to the login screen, where the SSH keys are not deployed and the installation doesn't start.

@rayabueg

rayabueg commented Nov 1, 2019

I may have the same issue: a DHCP address is assigned, then no more boot progress.

@dav1x
Contributor

dav1x commented Nov 8, 2019

I'm going to try and recreate this. I'll let you know what I find out.

@crissonpl

I"m hitting the same issue. Any updates @dav1x ?

@dav1x
Contributor

dav1x commented Nov 12, 2019

I just attempted to recreate this with rhcos-4.2 and VMware ESXi, 6.7.0, 10764712 with VCSA 10244857, and I was not able to recreate the issue. I imported the OVA and left it on vCenter as a VM.

[core@bootstrap-0 ~]$ rpm-ostree status
...omitted...
        Version: 42.80.20191002.0

I followed these steps exactly:
https://blog.openshift.com/deploying-a-user-provisioned-infrastructure-environment-for-openshift-4-1-on-vsphere/

[root@rhel-d ocp-42]# ./openshift-install wait-for bootstrap-complete
INFO Waiting up to 30m0s for the Kubernetes API at https://api.example.com:6443... 
INFO API v1.14.6+868bc38 up                       
INFO Waiting up to 30m0s for bootstrapping to complete... 
INFO It is now safe to remove the bootstrap resources 

Does anyone want to share their ignition config files or install-config.yaml here or via a DM?

@rayabueg

rayabueg commented Nov 21, 2019

@dav1x thanks for attempting to repro...I will attempt to do so again and provide you details

@crissonpl

I think this issue is the same as #2552 (comment)
@dav1x ?

I will try the install one more time and update the status.

@bortek
Contributor

bortek commented Nov 28, 2019

That would be great since I am stuck on this.

If I manually add the base64-encoded Ignition data into the guestinfo.ignition.config.data variable in the advanced properties, then the nodes do boot up and fetch data from the bootstrap URL. But then what's the point of the Terraform automation? Even after that manual step is performed, the static IPs provided in the config are not being set in the ens192 interface file under /etc/sysconfig/network-scripts. Something else is broken there too.
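
For reference, the manual step above can also be scripted; a rough sketch using the govc CLI (the VM inventory path and the bootstrap.ign filename here are placeholders, not something this repo's Terraform produces for you) might look like:

# base64-encode the bootstrap Ignition config
IGN_B64=$(base64 -w0 bootstrap.ign)
# attach it as guestinfo ExtraConfig keys on the (powered-off) VM
govc vm.change -vm /dc1/vm/bootstrap-0 \
  -e "guestinfo.ignition.config.data=${IGN_B64}" \
  -e "guestinfo.ignition.config.data.encoding=base64"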

@bortek
Contributor

bortek commented Nov 28, 2019

I have managed to get past this and set the base64 variables on the VMs using

extra_config {
  "guestinfo.ignition.config.data"          = "${base64encode(data.ignition_config.ign.*.rendered[count.index])}"
  "guestinfo.ignition.config.data.encoding" = "base64"
}

which was suggested at hashicorp/terraform-provider-vsphere#243

"vapp properties" might not be working dues to some missing license vCenter/vSphere as suggested in the same link. Perhaps someone can update the code to use extra_config instead.

Still need to get static IPs to work...

@crissonpl

Maybe a stupid question: does Terraform still require DHCP for the initial boot phase?

@bortek
Contributor

bortek commented Nov 28, 2019

I am still trying to figure out how DHCP/static IP is supposed to work, which is why I opened #2733 .

With DHCP disabled, the bootstrap node cannot get an IP address and therefore cannot boot properly. But to get an IP via DHCP, the DHCP server has to be pre-provisioned with the MAC address, so it is a manual step. Static IP provisioning should be working, since there is a setting for it in the TF config file, but it does not seem to work.

@joaquinfll

@bortek There is no need to have the MAC address pre-provisioned. You can assign an IP from a range, and after the initial Ignition download the node will reboot and use the static IP.

@mostafahussein

mostafahussein commented Jan 8, 2020

@bortek, have you solved this issue? I am in the same boat but I have not solved it yet. Can you tell me what you have done in order to fix it?

@bortek
Contributor

bortek commented Jan 8, 2020

Nope. Right now I am using a half-manual process for IP/MAC provisioning. I hope to have time soon to look into automating it too.

@joaquinfll

joaquinfll commented Jan 9, 2020

For the static IP install, I'm using a DHCP server with an IP range in a specific subnet. That is sufficient for the DHCP requirements of the OCP install process. On the first RHCOS boot, the server catches a temporary IP, downloads the Ignition file and reboots with the fixed IP.

Example config for the DHCP server (/etc/dhcp/dhcpd.conf):

option domain-name "example.cluster.local";
option domain-name-servers 10.1.1.1, 10.1.1.2;

default-lease-time 600;
max-lease-time 7200;

log-facility local7;

subnet 10.1.1.0 netmask 255.255.255.0 {
    range 10.1.1.210 10.1.1.220;
    option routers 10.1.1.254;
}

@rayabueg

@nodanero Thanks for this. I'm revisiting this issue. Would you mind sharing the RHCOS and vSphere versions used?

@joaquinfll

I have tested this static IP procedure with the DHCP server on most of the versions from 4.1.x to 4.3.5, and the procedure works for me.

For RHCOS I'm currently using the template rhcos-4.3.0-x86_64-vmware.ova, but I can't tell you at the moment which version it is (43?).

For vSphere I've only used version 6.5.

@rayabueg

rayabueg commented Mar 19, 2020 via email

@rayabueg

rayabueg commented Apr 1, 2020

@nodanero, would you mind sharing your ign files? I'd like to attempt to repro manually by feeding your working ign files into the vApp properties and then booting. We're still stuck: the machine boots, gets a DHCP address, but makes no progress afterward.

@joaquinfll

@rayabueg Sorry, I can't share the working files, but I can give you an example source file to be ingested by the openshift-install binary.

I would focus on the bootstrap server. Aside from the variable in the vApp properties, it needs to boot with the Ignition config pulled from a web server.
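
To illustrate (the URL and filenames below are placeholders, not the actual working files): the small pointer config that goes into the bootstrap VM's vApp property typically just appends the full bootstrap.ign from that web server, using the Ignition 2.2 spec that OCP 4.2/4.3 RHCOS expects, and is then base64-encoded:

cat > append-bootstrap.ign <<'EOF'
{
  "ignition": {
    "version": "2.2.0",
    "config": {
      "append": [
        { "source": "http://webserver.example.com:8080/bootstrap.ign" }
      ]
    }
  }
}
EOF
# base64-encode it before pasting into guestinfo.ignition.config.data
base64 -w0 append-bootstrap.ign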

I'm in the #openshift channel on freenode.
https://webchat.freenode.net/#openshift

Example install-config.yaml (be careful with the quote marks):

apiVersion: v1
baseDomain: example.com
metadata:
  name: dev1
networking:
  machineCIDR: "10.10.10.0/24"
platform:
  vsphere:
    vCenter: vcenter.example.com
    username: "vcenter-user"
    password: "vcenter-password"
    datacenter: datacenter
    defaultDatastore: datastore
pullSecret: 'pull-secret-content'
sshKey: 'ssh-rsa AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA'

@rayabueg

rayabueg commented Apr 1, 2020

Thanks for the detail @nodanero and also for the freenode webchat link. Hope to find you there!

Your install config is no different from what we've customized for our environment, so we're still scratching our heads over why the CoreOS VM is behaving differently from yours. I have a few more questions if you don't mind:

Are you using the terraform process as generally prescribed in this project? We've customized it for our IPAM but follow the process for the most part.

Regarding the coreos boot process:

  • Is the bootstrap node getting a DHCP address first, pulling down its ign from an http server, then setting static ip and rebooting?

  • Are the master and worker nodes using DHCP first, or simply booting with the provided ign data then assigning a static ip and rebooting?

  • As a coreos newb, I've got to ask...why even use DHCP if we can assign a static IP via ign in the first place?

It feels to me like we're running into an environment issue, perhaps with DHCP, maybe even vSphere itself, but we're hoping it's simply us not configuring CoreOS properly. Thanks for any feedback on your particular boot process that allows static IPs. According to RH, to do static IPs in vSphere we need to apply them as kernel IP arguments (manually interrupting the boot process to input the IP) or through a boot ISO (Edit: I've been informed this can be automated!), which we obviously won't be doing, since the goal is to automate the OCP cluster build via Terraform.
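
For reference, the kernel-argument variant mentioned above uses the dracut ip= syntax, typed at the boot prompt or baked into a boot ISO/PXE entry; the values below are placeholders, roughly ip=<client-ip>::<gateway>:<netmask>:<hostname>:<interface>:none:

ip=10.1.1.21::10.1.1.254:255.255.255.0:bootstrap-0.example.com:ens192:none nameserver=10.1.1.1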

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 1, 2020
@dwagelaar

I've noticed that Fedora CoreOS (OKD 4.5) and Flatcar Container Linux (manual install) both fail to pick up any VMware guestinfo data from our vSphere 6.7. It still worked on our previous vSphere 4.5 cluster.

@dwagelaar

/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 17, 2020
@frenchtoasters

I can also confirm that on vSphere 6.7 we are no longer able to have Fedora CoreOS or Flatcar Container Linux pick up the guestinfo data.

vSphere: 6.7.0
Build: 15679289
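
One way to narrow down where the data is being lost (a sketch assuming the govc CLI; the inventory path is a placeholder) is to dump the VM's ExtraConfig as vCenter stores it and compare it with what vmtoolsd reports inside the guest. If the guestinfo keys are present in ExtraConfig but not visible in the guest, the problem is on the guest/tools side rather than in Terraform or vCenter:

# list ExtraConfig entries (including guestinfo.*) for the VM
govc vm.info -e /dc1/vm/bootstrap-0 | grep guestinfo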

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 26, 2020
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 26, 2020
@openshift-ci-robot openshift-ci-robot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Dec 26, 2020
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot
Contributor

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
