Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[OCPCLOUD-812] Enable support for instances with GPUs on GCP #172

Merged
merged 1 commit into from
Sep 17, 2021

Conversation

SamuelStuchly
Copy link
Contributor

@SamuelStuchly SamuelStuchly commented Aug 31, 2021

Support instances of "A2" family as well as all five GPUs available with "N1" instances. Regular and preemptible instances.

Does not support GPUs for graphics workloads.

@SamuelStuchly SamuelStuchly changed the title Enable support for instance with GPUs on GCP Enable support for instances with GPUs on GCP Sep 1, 2021
Copy link
Contributor

@JoelSpeed JoelSpeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good, I've left a few comments inline but for the most part they are stylistic or naming nits, good work!

Comment on lines 114 to 115
AcceleratorCount int64 `json:"acceleratorCount,omitempty"`
AcceleratorType string `json:"acceleratorType,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add descriptions to these fields as these will end up in oc describe

return false
}

func (r *Reconciler) checkQuota(machineFamily int64) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this would be a more obvious as a name for this variable as to what this number really is?

Suggested change
func (r *Reconciler) checkQuota(machineFamily int64) error {
func (r *Reconciler) checkQuota(machineTypeAcceleratorCount int64) error {

Comment on lines 68 to 75
if machineFamily != 0 {
guestAccelerators = append(guestAccelerators, &v1beta1.GCPAcceleratorConfig{AcceleratorType: "nvidia-tesla-a100", AcceleratorCount: machineFamily})
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could use a comment, something along the lines of When the machine type has associated accelerator instances, these will be A100s. Additional guest accelerators are not allowed so ignore the providerSpec GuestAccelerators.

pkg/cloud/gcp/actuators/machine/reconciler.go Show resolved Hide resolved

func (r *Reconciler) validateGuestAccelerators() error {

a2MachineFamily, n1MachineFamily := r.computeService.MachineTypesList(r.providerSpec.ProjectID, r.providerSpec.Zone, r.Context)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make sure we only make this API call when it's definitely required, ie guest accelerators are listed or it's an A2 type?

var guestAccelerators = []*compute.AcceleratorConfig{}
for index, ga := range r.providerSpec.GuestAccelerators {
guestAccelerators = append(guestAccelerators, &compute.AcceleratorConfig{
AcceleratorType: fmt.Sprintf("zones/%s/acceleratorTypes/%s", zone, r.providerSpec.GuestAccelerators[index].AcceleratorType),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move the format here to a const at the top of the file

@@ -101,3 +108,32 @@ func (c *computeService) TargetPoolsRemoveInstance(project string, region string
func (c *computeService) MachineTypesGet(project string, zone string, machineType string) (*compute.MachineType, error) {
return c.service.MachineTypes.Get(project, zone, machineType).Do()
}

func (c *computeService) MachineTypesList(project string, zone string, ctx context.Context) (map[string]int64, []string) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should name this to identify that it is for MachineTypes that support guest accelerators?

@JoelSpeed JoelSpeed changed the title Enable support for instances with GPUs on GCP [OCPCLOUD-812] Enable support for instances with GPUs on GCP Sep 1, 2021
@SamuelStuchly SamuelStuchly force-pushed the gpu-support branch 2 times, most recently from 4646493 to b8198b8 Compare September 1, 2021 16:17
type GCPAcceleratorConfig struct {
// AcceleratorCount is number of AcceleratorType accelerators (GPUs) to be attached to an instance
AcceleratorCount int64 `json:"acceleratorCount,omitempty"`
// AcceleratorType is the type of accelerator (GPU) to be attached to an instance
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should expand this with a list of supported acccelerators?

Copy link
Contributor Author

@SamuelStuchly SamuelStuchly Sep 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense since it is for the oc describe

Btw supported accelerators are listed in a structure supportedGPUTypes.

guestAccelerators = r.providerSpec.GuestAccelerators
}
// validate zone and then quota
for _, elem := range guestAccelerators {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stylistic nit: I would typically use a more meaningful name here, eg accelerator or guestAccelerator, normally it's like machine from machines for example

}

func (r *Reconciler) validateGuestAccelerators() error {
if len(r.providerSpec.GuestAccelerators) != 0 || strings.HasPrefix(r.providerSpec.MachineType, "a2-") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To reduce indentation and make this a bit easier to follow, what about inverting this

Suggested change
if len(r.providerSpec.GuestAccelerators) != 0 || strings.HasPrefix(r.providerSpec.MachineType, "a2-") {
if len(r.providerSpec.GuestAccelerators) == 0 && 1strings.HasPrefix(r.providerSpec.MachineType, "a2-") {
// no accelerators to validate so return nil
return nil
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not exactly. The inversion you proposed would leave the case of any regular machineType with 0 accelerators to go through and make the pointless api call, which we want to avoid.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logically that means that with the current version, a regular machineType with 0 accelerators is allowed to go through and make the API call, I think you should double check that as it is today, and update the if statement here, perhaps it should be...

Suggested change
if len(r.providerSpec.GuestAccelerators) != 0 || strings.HasPrefix(r.providerSpec.MachineType, "a2-") {
if (len(r.providerSpec.GuestAccelerators) != 0 && strings.HasPrefix(r.providerSpec.MachineType, "n1-") || strings.HasPrefix(r.providerSpec.MachineType, "a2-") {

Comment on lines 112 to 127
if a2MachineFamily[machineType] != 0 {
// a2 family machine - has fixed type and count of GPUs
if err := r.checkQuota(a2MachineFamily[machineType]); err != nil {
return err
} else {
return nil
}
} else if containsString(n1MachineFamily, machineType) {
// n1 family machine
if err := r.checkQuota(0); err != nil {
return err
} else {
return nil
}
} else {
// any other machine type
return machinecontroller.InvalidMachineConfiguration(fmt.Sprintf("MachineType %s does not support accelerators. Only A2 and N1 machine type families support guest acceleartors.", machineType))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be simpler if it were

Suggested change
if a2MachineFamily[machineType] != 0 {
// a2 family machine - has fixed type and count of GPUs
if err := r.checkQuota(a2MachineFamily[machineType]); err != nil {
return err
} else {
return nil
}
} else if containsString(n1MachineFamily, machineType) {
// n1 family machine
if err := r.checkQuota(0); err != nil {
return err
} else {
return nil
}
} else {
// any other machine type
return machinecontroller.InvalidMachineConfiguration(fmt.Sprintf("MachineType %s does not support accelerators. Only A2 and N1 machine type families support guest acceleartors.", machineType))
}
switch {
case a2MachineFamily[machineType] != 0:
// a2 family machine - has fixed type and count of GPUs
return r.checkQuota(a2MachineFamily[machineType])
case containsString(n1MachineFamily, machineType):
// n1 family machine
return r.checkQuota(0)
default:
// any other machine type
return machinecontroller.InvalidMachineConfiguration(fmt.Sprintf("MachineType %s does not support accelerators. Only A2 and N1 machine type families support guest acceleartors.", machineType))
}

var guestAccelerators = []*compute.AcceleratorConfig{}
for index, ga := range r.providerSpec.GuestAccelerators {
guestAccelerators = append(guestAccelerators, &compute.AcceleratorConfig{
AcceleratorType: fmt.Sprintf(acceleratorTypeFmt, zone, r.providerSpec.GuestAccelerators[index].AcceleratorType),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to use the index to look up the type here? You have ga in scope? I would have expected ga.AcceleratorType to suffice?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My mistake. I assumed since r.providerSpec.GuestAccelerators is slice, it might have more then one AcceleratorConfigs in it. However it cannot have more than one.

Same logic applied here.

@SamuelStuchly SamuelStuchly force-pushed the gpu-support branch 2 times, most recently from de31e37 to 94c2e39 Compare September 2, 2021 21:30
guestAccelerators = r.providerSpec.GuestAccelerators
}
// validate zone and then quota
accelerator := guestAccelerators[0] // guestAccelerators slice cannot be longer than 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this limited to 1? Is this a GCP limitation, if so, where is this written in the docs?

Copy link
Contributor Author

@SamuelStuchly SamuelStuchly Sep 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not find it the docs, however gcp web console does not even have an option for attaching more then 1 type of GPU. And I tried to create an instance with 2 different guest accelerators through the api, and got an googleapi error googleapi: Error 413: Value for field 'resource.guestAccelerators' is too large: maximum size 1 element(s); actual size 2., fieldSizeTooLarge

}

func (r *Reconciler) validateGuestAccelerators() error {
if len(r.providerSpec.GuestAccelerators) != 0 || strings.HasPrefix(r.providerSpec.MachineType, "a2-") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logically that means that with the current version, a regular machineType with 0 accelerators is allowed to go through and make the API call, I think you should double check that as it is today, and update the if statement here, perhaps it should be...

Suggested change
if len(r.providerSpec.GuestAccelerators) != 0 || strings.HasPrefix(r.providerSpec.MachineType, "a2-") {
if (len(r.providerSpec.GuestAccelerators) != 0 && strings.HasPrefix(r.providerSpec.MachineType, "n1-") || strings.HasPrefix(r.providerSpec.MachineType, "a2-") {

@SamuelStuchly SamuelStuchly force-pushed the gpu-support branch 4 times, most recently from f2cc121 to 35bbec3 Compare September 10, 2021 16:28
@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 10, 2021
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 15, 2021
Copy link
Contributor

@JoelSpeed JoelSpeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

Thanks Sam!

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Sep 15, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 15, 2021
Copy link
Contributor

@elmiko elmiko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work Sam
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 16, 2021
@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

2 similar comments
@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

4 similar comments
@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants