[OCPCLOUD-812] Enable support for instances with GPUs on GCP #172
965906b
to
e6c995f
Compare
This looks really good, I've left a few comments inline but for the most part they are stylistic or naming nits, good work!
AcceleratorCount int64  `json:"acceleratorCount,omitempty"`
AcceleratorType  string `json:"acceleratorType,omitempty"`
Please add descriptions to these fields, as they will end up in the oc describe output.
	return false
}

func (r *Reconciler) checkQuota(machineFamily int64) error {
Maybe this would be a more obvious name for this variable, making clear what this number really is?
- func (r *Reconciler) checkQuota(machineFamily int64) error {
+ func (r *Reconciler) checkQuota(machineTypeAcceleratorCount int64) error {
if machineFamily != 0 {
	guestAccelerators = append(guestAccelerators, &v1beta1.GCPAcceleratorConfig{AcceleratorType: "nvidia-tesla-a100", AcceleratorCount: machineFamily})
} else {
This could use a comment, something along the lines of: "When the machine type has associated accelerator instances, these will be A100s. Additional guest accelerators are not allowed, so ignore the providerSpec GuestAccelerators."
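As a rough illustration of the branch this comment would document, here is a minimal, self-contained sketch. The acceleratorConfig type and the effectiveAccelerators helper are hypothetical stand-ins (not names from the PR), and the "nvidia-tesla-t4" entry is only sample input.

```go
package main

import "fmt"

// acceleratorConfig is a hypothetical stand-in for v1beta1.GCPAcceleratorConfig.
type acceleratorConfig struct {
	AcceleratorType  string
	AcceleratorCount int64
}

// effectiveAccelerators returns the accelerators that should be validated.
// When the machine type has associated accelerators (count != 0), these will
// be A100s. Additional guest accelerators are not allowed in that case, so the
// providerSpec entries are ignored.
func effectiveAccelerators(machineTypeAcceleratorCount int64, fromSpec []acceleratorConfig) []acceleratorConfig {
	if machineTypeAcceleratorCount != 0 {
		return []acceleratorConfig{{
			AcceleratorType:  "nvidia-tesla-a100",
			AcceleratorCount: machineTypeAcceleratorCount,
		}}
	}
	return fromSpec
}

func main() {
	spec := []acceleratorConfig{{AcceleratorType: "nvidia-tesla-t4", AcceleratorCount: 1}}
	fmt.Println(effectiveAccelerators(2, spec)) // machine type with built-in accelerators
	fmt.Println(effectiveAccelerators(0, spec)) // machine type without built-in accelerators
}
```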
func (r *Reconciler) validateGuestAccelerators() error {

	a2MachineFamily, n1MachineFamily := r.computeService.MachineTypesList(r.providerSpec.ProjectID, r.providerSpec.Zone, r.Context)
Should we make sure we only make this API call when it's definitely required, i.e. when guest accelerators are listed or it's an A2 type?
var guestAccelerators = []*compute.AcceleratorConfig{}
for index, ga := range r.providerSpec.GuestAccelerators {
	guestAccelerators = append(guestAccelerators, &compute.AcceleratorConfig{
		AcceleratorType: fmt.Sprintf("zones/%s/acceleratorTypes/%s", zone, r.providerSpec.GuestAccelerators[index].AcceleratorType),
Please move the format string here to a const at the top of the file.
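A minimal sketch of what that could look like. The constant name acceleratorTypeFmt matches the one used in the later revision of this PR; the acceleratorTypePath helper and the sample zone and accelerator values are assumptions added only to make the example runnable.

```go
package main

import "fmt"

// acceleratorTypeFmt is the zone-scoped accelerator type path.
// Arguments: zone, then accelerator type name.
const acceleratorTypeFmt = "zones/%s/acceleratorTypes/%s"

// acceleratorTypePath is a hypothetical helper that applies the format.
func acceleratorTypePath(zone, acceleratorType string) string {
	return fmt.Sprintf(acceleratorTypeFmt, zone, acceleratorType)
}

func main() {
	fmt.Println(acceleratorTypePath("us-central1-a", "nvidia-tesla-t4"))
}
```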
@@ -101,3 +108,32 @@ func (c *computeService) TargetPoolsRemoveInstance(project string, region string
 func (c *computeService) MachineTypesGet(project string, zone string, machineType string) (*compute.MachineType, error) {
 	return c.service.MachineTypes.Get(project, zone, machineType).Do()
 }

+func (c *computeService) MachineTypesList(project string, zone string, ctx context.Context) (map[string]int64, []string) {
Perhaps we should rename this to indicate that it is for MachineTypes that support guest accelerators?
e6c995f
to
3299a62
Compare
4646493
to
b8198b8
Compare
type GCPAcceleratorConfig struct {
	// AcceleratorCount is number of AcceleratorType accelerators (GPUs) to be attached to an instance
	AcceleratorCount int64 `json:"acceleratorCount,omitempty"`
	// AcceleratorType is the type of accelerator (GPU) to be attached to an instance
Perhaps we should expand this with a list of supported accelerators?
Makes sense, since it is for the oc describe output.
Btw, supported accelerators are listed in a structure, supportedGPUTypes.
	guestAccelerators = r.providerSpec.GuestAccelerators
}
// validate zone and then quota
for _, elem := range guestAccelerators {
Stylistic nit: I would typically use a more meaningful name here, e.g. accelerator or guestAccelerator. Normally it's like machine from machines, for example.
}

func (r *Reconciler) validateGuestAccelerators() error {
	if len(r.providerSpec.GuestAccelerators) != 0 || strings.HasPrefix(r.providerSpec.MachineType, "a2-") {
To reduce indentation and make this a bit easier to follow, what about inverting this?
- if len(r.providerSpec.GuestAccelerators) != 0 || strings.HasPrefix(r.providerSpec.MachineType, "a2-") {
+ if len(r.providerSpec.GuestAccelerators) == 0 && !strings.HasPrefix(r.providerSpec.MachineType, "a2-") {
+ 	// no accelerators to validate so return nil
+ 	return nil
+ }
Not exactly. The inversion you proposed would let any regular machineType with 0 accelerators go through and make the pointless API call, which we want to avoid.
Logically, that means that with the current version a regular machineType with 0 accelerators is allowed to go through and make the API call. I think you should double-check the behavior as it is today and update the if statement here; perhaps it should be:
- if len(r.providerSpec.GuestAccelerators) != 0 || strings.HasPrefix(r.providerSpec.MachineType, "a2-") {
+ if (len(r.providerSpec.GuestAccelerators) != 0 && strings.HasPrefix(r.providerSpec.MachineType, "n1-")) || strings.HasPrefix(r.providerSpec.MachineType, "a2-") {
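To make the three conditions in this thread easier to compare, here is a small self-contained sketch that evaluates each one over a handful of cases. The function names and the sample machine type names are assumptions for illustration; each function returns true when validation (and therefore the API call) would run. By De Morgan's law the early-return form is the exact negation of the original condition, so those two should agree on every input; the n1-/a2- variant appears to differ only when accelerators are requested on a machine type outside those two families.

```go
package main

import (
	"fmt"
	"strings"
)

// originalGuard mirrors the condition in the PR: validate when guest
// accelerators are listed or the machine type is in the a2 family.
func originalGuard(numAccelerators int, machineType string) bool {
	return numAccelerators != 0 || strings.HasPrefix(machineType, "a2-")
}

// invertedGuard mirrors the early-return suggestion: skip validation when
// there are no accelerators and the machine type is not a2.
func invertedGuard(numAccelerators int, machineType string) bool {
	skip := numAccelerators == 0 && !strings.HasPrefix(machineType, "a2-")
	return !skip
}

// familyGuard mirrors the second suggestion: validate only n1 machine types
// with accelerators listed, or any a2 machine type.
func familyGuard(numAccelerators int, machineType string) bool {
	return (numAccelerators != 0 && strings.HasPrefix(machineType, "n1-")) ||
		strings.HasPrefix(machineType, "a2-")
}

func main() {
	machineTypes := []string{"e2-standard-4", "n1-standard-4", "a2-highgpu-1g"}
	for _, mt := range machineTypes {
		for _, n := range []int{0, 1} {
			fmt.Printf("%-14s accelerators=%d original=%-5t inverted=%-5t family=%t\n",
				mt, n, originalGuard(n, mt), invertedGuard(n, mt), familyGuard(n, mt))
		}
	}
}
```

Under these assumptions, the one case where the family-based condition skips validation while the original runs it is a non-N1, non-A2 machine type with accelerators listed, so the "does not support accelerators" error would have to be raised somewhere else if that condition were adopted.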
if a2MachineFamily[machineType] != 0 {
	// a2 family machine - has fixed type and count of GPUs
	if err := r.checkQuota(a2MachineFamily[machineType]); err != nil {
		return err
	} else {
		return nil
	}
} else if containsString(n1MachineFamily, machineType) {
	// n1 family machine
	if err := r.checkQuota(0); err != nil {
		return err
	} else {
		return nil
	}
} else {
	// any other machine type
	return machinecontroller.InvalidMachineConfiguration(fmt.Sprintf("MachineType %s does not support accelerators. Only A2 and N1 machine type families support guest acceleartors.", machineType))
}
This might be simpler if it were:
- if a2MachineFamily[machineType] != 0 {
- 	// a2 family machine - has fixed type and count of GPUs
- 	if err := r.checkQuota(a2MachineFamily[machineType]); err != nil {
- 		return err
- 	} else {
- 		return nil
- 	}
- } else if containsString(n1MachineFamily, machineType) {
- 	// n1 family machine
- 	if err := r.checkQuota(0); err != nil {
- 		return err
- 	} else {
- 		return nil
- 	}
- } else {
- 	// any other machine type
- 	return machinecontroller.InvalidMachineConfiguration(fmt.Sprintf("MachineType %s does not support accelerators. Only A2 and N1 machine type families support guest acceleartors.", machineType))
- }
+ switch {
+ case a2MachineFamily[machineType] != 0:
+ 	// a2 family machine - has fixed type and count of GPUs
+ 	return r.checkQuota(a2MachineFamily[machineType])
+ case containsString(n1MachineFamily, machineType):
+ 	// n1 family machine
+ 	return r.checkQuota(0)
+ default:
+ 	// any other machine type
+ 	return machinecontroller.InvalidMachineConfiguration(fmt.Sprintf("MachineType %s does not support accelerators. Only A2 and N1 machine type families support guest acceleartors.", machineType))
+ }
var guestAccelerators = []*compute.AcceleratorConfig{}
for index, ga := range r.providerSpec.GuestAccelerators {
	guestAccelerators = append(guestAccelerators, &compute.AcceleratorConfig{
		AcceleratorType: fmt.Sprintf(acceleratorTypeFmt, zone, r.providerSpec.GuestAccelerators[index].AcceleratorType),
Do you need to use the index to look up the type here? You have ga in scope; I would have expected ga.AcceleratorType to suffice.
My mistake. I assumed that since r.providerSpec.GuestAccelerators is a slice, it might have more than one AcceleratorConfig in it. However, it cannot have more than one.
Same logic applied here.
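For reference, a range loop yields each element's value in turn, so ga.AcceleratorType and slice[index].AcceleratorType refer to the same data no matter how many entries the slice holds. A minimal sketch, assuming a hypothetical acceleratorConfig type and acceleratorTypePaths helper (neither is from the PR):

```go
package main

import "fmt"

// acceleratorConfig is a hypothetical stand-in for the providerSpec entry type.
type acceleratorConfig struct {
	AcceleratorType  string
	AcceleratorCount int64
}

const acceleratorTypeFmt = "zones/%s/acceleratorTypes/%s"

// acceleratorTypePaths builds one path per accelerator using the range value
// directly, without indexing back into the slice.
func acceleratorTypePaths(zone string, accelerators []acceleratorConfig) []string {
	paths := make([]string, 0, len(accelerators))
	for _, ga := range accelerators {
		paths = append(paths, fmt.Sprintf(acceleratorTypeFmt, zone, ga.AcceleratorType))
	}
	return paths
}

func main() {
	accelerators := []acceleratorConfig{{AcceleratorType: "nvidia-tesla-t4", AcceleratorCount: 1}}
	fmt.Println(acceleratorTypePaths("us-central1-a", accelerators))
}
```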
de31e37
to
94c2e39
Compare
	guestAccelerators = r.providerSpec.GuestAccelerators
}
// validate zone and then quota
accelerator := guestAccelerators[0] // guestAccelerators slice cannot be longer than 1
Why is this limited to 1? Is this a GCP limitation, if so, where is this written in the docs?
I did not find it in the docs; however, the GCP web console does not even have an option for attaching more than 1 type of GPU. I also tried to create an instance with 2 different guest accelerators through the API and got a googleapi error: googleapi: Error 413: Value for field 'resource.guestAccelerators' is too large: maximum size 1 element(s); actual size 2., fieldSizeTooLarge
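Since the limit was observed from the API response rather than found in the documentation, one option would be to check it explicitly before indexing into the slice. The sketch below is only an illustration: the constant, the function name, and the error wording are hypothetical, and the limit of 1 is an assumption taken from the error quoted above.

```go
package main

import "fmt"

// maxGuestAcceleratorEntries is an assumption based on the API error quoted
// in this thread ("maximum size 1 element(s)"), not on documented behavior.
const maxGuestAcceleratorEntries = 1

// validateGuestAcceleratorEntries is a hypothetical check that rejects a
// guestAccelerators list longer than the observed limit.
func validateGuestAcceleratorEntries(numEntries int) error {
	if numEntries > maxGuestAcceleratorEntries {
		return fmt.Errorf("guestAccelerators has %d entries, at most %d is allowed",
			numEntries, maxGuestAcceleratorEntries)
	}
	return nil
}

func main() {
	fmt.Println(validateGuestAcceleratorEntries(1))
	fmt.Println(validateGuestAcceleratorEntries(2))
}
```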
f2cc121
to
35bbec3
Compare
35bbec3
to
a43aa80
Compare
a43aa80
to
1bf062b
Compare
1bf062b
to
fe366ad
Compare
/approve
Thanks Sam!
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: JoelSpeed The full list of commands accepted by this bot can be found here. The pull request process is described here
nice work Sam
/lgtm
/retest-required Please review the full test history for this PR and help us cut down flakes.
Supports instances of the "A2" family as well as all five GPUs available with "N1" instances, for both regular and preemptible instances.
Does not support GPUs for graphics workloads.