[OCPCLOUD-812] Enable support for instances with GPUs on GCP #172

SamuelStuchly · 2021-08-31T19:51:47Z

Supports instances of the "A2" family as well as all five GPUs available with "N1" instances, for both regular and preemptible instances.

Does not support GPUs for graphics workloads.

JoelSpeed

This looks really good, I've left a few comments inline but for the most part they are stylistic or naming nits, good work!

JoelSpeed · 2021-09-01T09:18:35Z

pkg/apis/gcpprovider/v1beta1/gcpmachineproviderconfig_types.go

+	AcceleratorCount int64  `json:"acceleratorCount,omitempty"`
+	AcceleratorType  string `json:"acceleratorType,omitempty"`


Please add descriptions to these fields as these will end up in oc describe

JoelSpeed · 2021-09-01T09:22:15Z

pkg/cloud/gcp/actuators/machine/reconciler.go

+	return false
+}
+
+func (r *Reconciler) checkQuota(machineFamily int64) error {


Maybe this would be a more obvious name for this variable, making it clear what this number really is?

Suggested change

func (r *Reconciler) checkQuota(machineFamily int64) error {

func (r *Reconciler) checkQuota(machineTypeAcceleratorCount int64) error {

JoelSpeed · 2021-09-01T09:24:00Z

pkg/cloud/gcp/actuators/machine/reconciler.go

+	if machineFamily != 0 {
+		guestAccelerators = append(guestAccelerators, &v1beta1.GCPAcceleratorConfig{AcceleratorType: "nvidia-tesla-a100", AcceleratorCount: machineFamily})
+	} else {


This could use a comment, something along the lines of: "When the machine type has associated accelerator instances, these will be A100s. Additional guest accelerators are not allowed, so ignore the providerSpec GuestAccelerators."
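
A minimal sketch of how the suggested comment could sit next to the branch it explains. The `AcceleratorConfig` type and the `effectiveAccelerators` helper are hypothetical stand-ins for illustration, not the PR's actual code; only the `nvidia-tesla-a100` type name and the "non-zero count means A2" rule come from the diff above.

```go
package main

import "fmt"

// AcceleratorConfig is a simplified, hypothetical stand-in for the
// provider's GCPAcceleratorConfig type.
type AcceleratorConfig struct {
	AcceleratorType  string
	AcceleratorCount int64
}

// effectiveAccelerators is a hypothetical helper. builtInCount is the number
// of accelerators that come with the machine type (non-zero for A2 types).
func effectiveAccelerators(builtInCount int64, fromSpec []AcceleratorConfig) []AcceleratorConfig {
	if builtInCount != 0 {
		// When the machine type has associated accelerator instances, these
		// will be A100s. Additional guest accelerators are not allowed, so
		// ignore the providerSpec GuestAccelerators.
		return []AcceleratorConfig{{AcceleratorType: "nvidia-tesla-a100", AcceleratorCount: builtInCount}}
	}
	// No built-in accelerators: use whatever the providerSpec requested.
	return fromSpec
}

func main() {
	spec := []AcceleratorConfig{{AcceleratorType: "nvidia-tesla-t4", AcceleratorCount: 1}}
	fmt.Println(effectiveAccelerators(2, spec)) // A2-style machine type
	fmt.Println(effectiveAccelerators(0, spec)) // machine type without built-in GPUs
}
```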

JoelSpeed · 2021-09-01T09:26:24Z

pkg/cloud/gcp/actuators/machine/reconciler.go

+
+func (r *Reconciler) validateGuestAccelerators() error {
+
+	a2MachineFamily, n1MachineFamily := r.computeService.MachineTypesList(r.providerSpec.ProjectID, r.providerSpec.Zone, r.Context)


Should we make sure we only make this API call when it's definitely required, ie guest accelerators are listed or it's an A2 type?

JoelSpeed · 2021-09-01T09:27:09Z

pkg/cloud/gcp/actuators/machine/reconciler.go

+	var guestAccelerators = []*compute.AcceleratorConfig{}
+	for index, ga := range r.providerSpec.GuestAccelerators {
+		guestAccelerators = append(guestAccelerators, &compute.AcceleratorConfig{
+			AcceleratorType:  fmt.Sprintf("zones/%s/acceleratorTypes/%s", zone, r.providerSpec.GuestAccelerators[index].AcceleratorType),


Please move the format here to a const at the top of the file
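
A small sketch of what this could look like. The constant name `acceleratorTypeFmt` matches the one that appears in a later revision of the diff in this conversation; the `acceleratorTypeURL` helper and the example zone are assumptions added only for illustration.

```go
package main

import "fmt"

// acceleratorTypeFmt is the partial URL format for a zonal accelerator type.
// The first verb is the zone, the second is the accelerator type name.
const acceleratorTypeFmt = "zones/%s/acceleratorTypes/%s"

// acceleratorTypeURL is a hypothetical helper that applies the format.
func acceleratorTypeURL(zone, acceleratorType string) string {
	return fmt.Sprintf(acceleratorTypeFmt, zone, acceleratorType)
}

func main() {
	fmt.Println(acceleratorTypeURL("us-central1-a", "nvidia-tesla-t4"))
}
```

Keeping the format in one place means the string is not repeated if another call site needs to build the same path.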

JoelSpeed · 2021-09-01T09:28:25Z

pkg/cloud/gcp/actuators/services/compute/computeservice.go

@@ -101,3 +108,32 @@ func (c *computeService) TargetPoolsRemoveInstance(project string, region string
 func (c *computeService) MachineTypesGet(project string, zone string, machineType string) (*compute.MachineType, error) {
 	return c.service.MachineTypes.Get(project, zone, machineType).Do()
 }
+
+func (c *computeService) MachineTypesList(project string, zone string, ctx context.Context) (map[string]int64, []string) {


Perhaps we should name this to identify that it is for MachineTypes that support guest accelerators?

JoelSpeed · 2021-09-01T16:33:22Z

pkg/apis/gcpprovider/v1beta1/gcpmachineproviderconfig_types.go

+type GCPAcceleratorConfig struct {
+	// AcceleratorCount is number of AcceleratorType accelerators (GPUs) to be attached to an instance
+	AcceleratorCount int64 `json:"acceleratorCount,omitempty"`
+	// AcceleratorType is the type of accelerator (GPU) to be attached to an instance


Perhaps we should expand this with a list of supported accelerators?

Makes sense since it is for the oc describe

Btw supported accelerators are listed in a structure supportedGPUTypes.

JoelSpeed · 2021-09-01T16:40:46Z

pkg/cloud/gcp/actuators/machine/reconciler.go

+		guestAccelerators = r.providerSpec.GuestAccelerators
+	}
+	// validate zone and then quota
+	for _, elem := range guestAccelerators {


Stylistic nit: I would typically use a more meaningful name here, e.g. accelerator or guestAccelerator. Normally it's like machine from machines, for example.

JoelSpeed · 2021-09-01T16:43:48Z

pkg/cloud/gcp/actuators/machine/reconciler.go

+}
+
+func (r *Reconciler) validateGuestAccelerators() error {
+	if len(r.providerSpec.GuestAccelerators) != 0 || strings.HasPrefix(r.providerSpec.MachineType, "a2-") {


To reduce indentation and make this a bit easier to follow, what about inverting this?

Suggested change

if len(r.providerSpec.GuestAccelerators) != 0 || strings.HasPrefix(r.providerSpec.MachineType, "a2-") {

if len(r.providerSpec.GuestAccelerators) == 0 && !strings.HasPrefix(r.providerSpec.MachineType, "a2-") {

// no accelerators to validate so return nil

return nil

}

Not exactly. The inversion you proposed would let any regular machineType with 0 accelerators go through and make the pointless API call, which we want to avoid.

Logically, that means the current version also allows a regular machineType with 0 accelerators to go through and make the API call. I think you should double-check how it behaves today and update the if statement here. Perhaps it should be...

Suggested change

if len(r.providerSpec.GuestAccelerators) != 0 || strings.HasPrefix(r.providerSpec.MachineType, "a2-") {

if (len(r.providerSpec.GuestAccelerators) != 0 && strings.HasPrefix(r.providerSpec.MachineType, "n1-")) || strings.HasPrefix(r.providerSpec.MachineType, "a2-") {
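
One way to settle the disagreement in this thread is to evaluate both conditions side by side. The sketch below is an illustration only: `needsValidation` and `canSkipValidation` are hypothetical helper names, and the machine type strings are example inputs. It compares the original condition with the first proposed early-return condition, which by De Morgan's law should be exact complements.

```go
package main

import (
	"fmt"
	"strings"
)

// needsValidation mirrors the original condition from the diff:
// validate when accelerators are listed or the machine type is A2.
func needsValidation(acceleratorCount int, machineType string) bool {
	return acceleratorCount != 0 || strings.HasPrefix(machineType, "a2-")
}

// canSkipValidation mirrors the inverted early-return condition:
// return early when no accelerators are listed and the type is not A2.
func canSkipValidation(acceleratorCount int, machineType string) bool {
	return acceleratorCount == 0 && !strings.HasPrefix(machineType, "a2-")
}

func main() {
	cases := []struct {
		acceleratorCount int
		machineType      string
	}{
		{0, "e2-standard-4"},
		{0, "a2-highgpu-1g"},
		{1, "n1-standard-4"},
		{1, "e2-standard-4"},
	}
	for _, c := range cases {
		fmt.Printf("count=%d type=%s validate=%t skip=%t\n",
			c.acceleratorCount, c.machineType,
			needsValidation(c.acceleratorCount, c.machineType),
			canSkipValidation(c.acceleratorCount, c.machineType))
	}
}
```

For every input, exactly one of the two functions returns true, so a regular machine type with 0 accelerators takes the early return and never reaches the API call under either form.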

JoelSpeed · 2021-09-01T16:46:13Z

pkg/cloud/gcp/actuators/machine/reconciler.go

+		if a2MachineFamily[machineType] != 0 {
+			// a2 family machine - has fixed type and count of GPUs
+			if err := r.checkQuota(a2MachineFamily[machineType]); err != nil {
+				return err
+			} else {
+				return nil
+			}
+		} else if containsString(n1MachineFamily, machineType) {
+			// n1 family machine
+			if err := r.checkQuota(0); err != nil {
+				return err
+			} else {
+				return nil
+			}
+		} else {
+			// any other machine type
+			return machinecontroller.InvalidMachineConfiguration(fmt.Sprintf("MachineType %s does not support accelerators. Only A2 and N1 machine type families support guest acceleartors.", machineType))
+		}


This might be simpler if it were

Suggested change

if a2MachineFamily[machineType] != 0 {
	// a2 family machine - has fixed type and count of GPUs
	if err := r.checkQuota(a2MachineFamily[machineType]); err != nil {
		return err
	} else {
		return nil
	}
} else if containsString(n1MachineFamily, machineType) {
	// n1 family machine
	if err := r.checkQuota(0); err != nil {
		return err
	} else {
		return nil
	}
} else {
	// any other machine type
	return machinecontroller.InvalidMachineConfiguration(fmt.Sprintf("MachineType %s does not support accelerators. Only A2 and N1 machine type families support guest acceleartors.", machineType))
}

switch {
case a2MachineFamily[machineType] != 0:
	// a2 family machine - has fixed type and count of GPUs
	return r.checkQuota(a2MachineFamily[machineType])
case containsString(n1MachineFamily, machineType):
	// n1 family machine
	return r.checkQuota(0)
default:
	// any other machine type
	return machinecontroller.InvalidMachineConfiguration(fmt.Sprintf("MachineType %s does not support accelerators. Only A2 and N1 machine type families support guest acceleartors.", machineType))
}
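
A self-contained sketch of the switch form, runnable outside the controller. Everything except the shape of the switch is an assumption made for illustration: `checkQuota` is stubbed to always succeed, `validateMachineType` is a hypothetical free function instead of a `Reconciler` method, and a plain `fmt.Errorf` stands in for `machinecontroller.InvalidMachineConfiguration`.

```go
package main

import "fmt"

// containsString reports whether list contains s.
func containsString(list []string, s string) bool {
	for _, item := range list {
		if item == s {
			return true
		}
	}
	return false
}

// checkQuota is a stub for illustration; the real method queries GCP quotas.
func checkQuota(machineTypeAcceleratorCount int64) error {
	return nil
}

// validateMachineType returns directly from each case, so no else branches
// or intermediate err variables are needed.
func validateMachineType(a2MachineFamily map[string]int64, n1MachineFamily []string, machineType string) error {
	switch {
	case a2MachineFamily[machineType] != 0:
		// a2 family machine: fixed type and count of GPUs
		return checkQuota(a2MachineFamily[machineType])
	case containsString(n1MachineFamily, machineType):
		// n1 family machine
		return checkQuota(0)
	default:
		// any other machine type
		return fmt.Errorf("MachineType %s does not support accelerators", machineType)
	}
}

func main() {
	a2 := map[string]int64{"a2-highgpu-1g": 1, "a2-highgpu-2g": 2}
	n1 := []string{"n1-standard-4"}
	for _, mt := range []string{"a2-highgpu-2g", "n1-standard-4", "e2-medium"} {
		fmt.Printf("%s: %v\n", mt, validateMachineType(a2, n1, mt))
	}
}
```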

JoelSpeed · 2021-09-01T16:48:12Z

pkg/cloud/gcp/actuators/machine/reconciler.go

+	var guestAccelerators = []*compute.AcceleratorConfig{}
+	for index, ga := range r.providerSpec.GuestAccelerators {
+		guestAccelerators = append(guestAccelerators, &compute.AcceleratorConfig{
+			AcceleratorType:  fmt.Sprintf(acceleratorTypeFmt, zone, r.providerSpec.GuestAccelerators[index].AcceleratorType),


Do you need to use the index to look up the type here? You have ga in scope? I would have expected ga.AcceleratorType to suffice?

My mistake. I assumed that since r.providerSpec.GuestAccelerators is a slice, it might have more than one AcceleratorConfig in it. However, it cannot have more than one.

Same logic applied here.
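
For reference, a short sketch of the point about the loop variable: inside a `range` loop, the value variable already holds the current element, so indexing back into the slice is redundant regardless of how many elements there are. The `AcceleratorConfig` type and `acceleratorTypeURLs` helper are hypothetical stand-ins, not the PR's code.

```go
package main

import "fmt"

// AcceleratorConfig is a simplified, hypothetical stand-in for the
// provider's GCPAcceleratorConfig type.
type AcceleratorConfig struct {
	AcceleratorType  string
	AcceleratorCount int64
}

// acceleratorTypeURLs uses the range value directly instead of the index.
func acceleratorTypeURLs(zone string, accelerators []AcceleratorConfig) []string {
	urls := make([]string, 0, len(accelerators))
	for _, ga := range accelerators {
		// ga is the current element, equivalent to accelerators[index].
		urls = append(urls, fmt.Sprintf("zones/%s/acceleratorTypes/%s", zone, ga.AcceleratorType))
	}
	return urls
}

func main() {
	accelerators := []AcceleratorConfig{
		{AcceleratorType: "nvidia-tesla-t4", AcceleratorCount: 1},
		{AcceleratorType: "nvidia-tesla-v100", AcceleratorCount: 2},
	}
	for _, url := range acceleratorTypeURLs("us-central1-a", accelerators) {
		fmt.Println(url)
	}
}
```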

JoelSpeed · 2021-09-03T15:08:12Z

pkg/cloud/gcp/actuators/machine/reconciler.go

+		guestAccelerators = r.providerSpec.GuestAccelerators
+	}
+	// validate zone and then quota
+	accelerator := guestAccelerators[0] // guestAccelerators slice cannot be longer than 1


Why is this limited to 1? Is this a GCP limitation, if so, where is this written in the docs?

I did not find it in the docs; however, the GCP web console does not even have an option for attaching more than 1 type of GPU. I also tried to create an instance with 2 different guest accelerators through the API and got a googleapi error: googleapi: Error 413: Value for field 'resource.guestAccelerators' is too large: maximum size 1 element(s); actual size 2., fieldSizeTooLarge
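
Since the limit was observed from the API response rather than found in documentation, one option is to check the slice length explicitly before indexing element 0, so a bad providerSpec fails early with a clear message. This is only a sketch under that assumption; `validateSingleAccelerator`, `errTooManyAccelerators`, and the simplified `AcceleratorConfig` type are hypothetical and not part of the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// AcceleratorConfig is a simplified, hypothetical stand-in for the
// provider's GCPAcceleratorConfig type.
type AcceleratorConfig struct {
	AcceleratorType  string
	AcceleratorCount int64
}

// errTooManyAccelerators reflects the limit observed from the API response
// quoted above (maximum size 1 element); it is not taken from GCP docs.
var errTooManyAccelerators = errors.New("at most one guest accelerator configuration is allowed per instance")

// validateSingleAccelerator rejects specs with more than one entry.
func validateSingleAccelerator(accelerators []AcceleratorConfig) error {
	if len(accelerators) > 1 {
		return errTooManyAccelerators
	}
	return nil
}

func main() {
	two := []AcceleratorConfig{
		{AcceleratorType: "nvidia-tesla-t4", AcceleratorCount: 1},
		{AcceleratorType: "nvidia-tesla-v100", AcceleratorCount: 1},
	}
	fmt.Println(validateSingleAccelerator(two[:1]))
	fmt.Println(validateSingleAccelerator(two))
}
```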

JoelSpeed · 2021-09-03T15:12:05Z

pkg/cloud/gcp/actuators/machine/reconciler.go

+}
+
+func (r *Reconciler) validateGuestAccelerators() error {
+	if len(r.providerSpec.GuestAccelerators) != 0 || strings.HasPrefix(r.providerSpec.MachineType, "a2-") {


Logically, that means the current version also allows a regular machineType with 0 accelerators to go through and make the API call. I think you should double-check how it behaves today and update the if statement here. Perhaps it should be...

Suggested change

if len(r.providerSpec.GuestAccelerators) != 0 || strings.HasPrefix(r.providerSpec.MachineType, "a2-") {

if (len(r.providerSpec.GuestAccelerators) != 0 && strings.HasPrefix(r.providerSpec.MachineType, "n1-")) || strings.HasPrefix(r.providerSpec.MachineType, "a2-") {

JoelSpeed

/approve

Thanks Sam!

openshift-ci · 2021-09-15T17:02:47Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [JoelSpeed]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

elmiko

nice work Sam
/lgtm

openshift-bot · 2021-09-16T14:24:42Z

/retest-required

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2021-09-16T17:58:00Z

/retest-required

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2021-09-16T21:56:59Z

/retest-required

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2021-09-17T00:21:00Z

/retest-required

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2021-09-17T02:09:23Z

/retest-required

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2021-09-17T06:09:20Z

/retest-required

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2021-09-17T09:00:35Z

/retest-required

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2021-09-17T09:13:20Z

/retest-required

Please review the full test history for this PR and help us cut down flakes.

SamuelStuchly changed the title ~~Enable support for instance with GPUs on GCP~~ Enable support for instances with GPUs on GCP Sep 1, 2021

SamuelStuchly force-pushed the gpu-support branch from 965906b to e6c995f Compare September 1, 2021 08:52

JoelSpeed reviewed Sep 1, 2021

SamuelStuchly force-pushed the gpu-support branch from e6c995f to 3299a62 Compare September 1, 2021 14:41

JoelSpeed changed the title ~~Enable support for instances with GPUs on GCP~~ [OCPCLOUD-812] Enable support for instances with GPUs on GCP Sep 1, 2021

SamuelStuchly force-pushed the gpu-support branch 2 times, most recently from 4646493 to b8198b8 Compare September 1, 2021 16:17

JoelSpeed reviewed Sep 1, 2021

SamuelStuchly force-pushed the gpu-support branch 2 times, most recently from de31e37 to 94c2e39 Compare September 2, 2021 21:30

JoelSpeed reviewed Sep 3, 2021

SamuelStuchly force-pushed the gpu-support branch 4 times, most recently from f2cc121 to 35bbec3 Compare September 10, 2021 16:28

openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 10, 2021

SamuelStuchly force-pushed the gpu-support branch from 35bbec3 to a43aa80 Compare September 15, 2021 13:43

openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 15, 2021

SamuelStuchly force-pushed the gpu-support branch from a43aa80 to 1bf062b Compare September 15, 2021 14:34

add support for GPUs on GCP

fe366ad

SamuelStuchly force-pushed the gpu-support branch from 1bf062b to fe366ad Compare September 15, 2021 14:42

JoelSpeed reviewed Sep 15, 2021

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 15, 2021

elmiko reviewed Sep 16, 2021

openshift-ci bot assigned elmiko Sep 16, 2021

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 16, 2021

openshift-merge-robot merged commit 3ef3e2b into openshift:master Sep 17, 2021

SamuelStuchly mentioned this pull request Nov 11, 2021

GPU support kubernetes-sigs/cluster-api-provider-gcp#289

Open
