Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(sdk): accelerator type setting in kfp #11373

Merged
merged 1 commit into from
Nov 13, 2024

Conversation

HumairAK
Copy link
Collaborator

@HumairAK HumairAK commented Nov 13, 2024

In this PR, new API fields were introduced to set accelerator resources via pipeline spec. Note that type and count were deprecated. However in the follow up implementation pr here we can see that these fields are removed entirely from the compilation step. This has basically broken setting accelerator types in KFP, because the driver is expecting type and count to be present, and these are thus not being set at all.

This PR re-introduces these fields, see test case for how the pipeline spec will look like. This is to allow for a proper deprecation period on the api, so we can adjust the driver changes accordingly.

To verify the changes you can use the following sample pipeline on the changes before/after:

from kfp import dsl
from kfp import compiler
@dsl.component(base_image="docker.io/python:3.9.17")
def empty_component():
    pass

@dsl.pipeline(name='pipeline-accel')
def pipeline_accel():
    task = empty_component()
    task.set_accelerator_type("nvidia.com/gpu").set_accelerator_limit("1")


compiler.Compiler().compile(
    pipeline_func=pipeline_accel,
    package_path=__file__.replace('.py', '-v2.yaml'))

After the changes this should compile to:

...
deploymentSpec:
  executors:
    exec-empty-component:
      container:
        args: ..omitted
        command: ..omitted
        image: ...
        resources:
          accelerator:
            count: '1'
            resourceCount: '1'
            resourceType: nvidia.com/gpu
            type: nvidia.com/gpu
...

The resulting executor pod should have:

resources:
        limits:
          nvidia.com/gpu: '1'
        requests:
          nvidia.com/gpu: '1'

Fixes #11374

An api breaking change was introduced in 2.10 that removed deprecate
fields in the compilation step for the accelerator fields. This has
resulted in the driver being unable to fetch the old fields. This change
re-introduces the deprecated fields to allow for a proper timespan for
allowing driver to adjust to the new values.

Signed-off-by: Humair Khan <[email protected]>
@HumairAK
Copy link
Collaborator Author

HumairAK commented Nov 13, 2024

Follow up to this would be to update the driver code, the golang proto changes snuck in here. These should have probably been added with the implementation pr for the accelerator changes. Which makes me think we really should be testing for proto changes and generated compiled diffs. We now just need to update driver code to utilize these fields instead of the deprecated ones.

If we are expecting no other downstream impacts, we can remove these deprecated fields for KFP 2.4 or 2.5

@github-actions github-actions bot added the ci-passed All CI tests on a pull request have passed label Nov 13, 2024
@HumairAK
Copy link
Collaborator Author

cc @chensun PTAL

this one I think is severe enough that it warrants a patch release for 2.10, being unable to use GPUs is no good

@chensun chensun changed the title fix: accelerator type setting in kfp fix(sdk): accelerator type setting in kfp Nov 13, 2024
@chensun
Copy link
Member

chensun commented Nov 13, 2024

cc @chensun PTAL

this one I think is severe enough that it warrants a patch release for 2.10, being unable to use GPUs is no good

Sounds reasonable for a patch release. I'll target it by this week.

Copy link
Member

@chensun chensun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chensun

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gregsheremeta
Copy link
Contributor

this one I think is severe enough that it warrants a patch release for 2.10, being unable to use GPUs is no good

It should be a hard rule that any time we accidentally make a breaking API change, it warrants a patch release to undo the breakage.

Of course, best if we never let breaking API changes sneak in 😅

It would also be good if we have some announcement mechanism to let people know that SDK 2.10.0 is broken in this way. Perhaps we can add a note to the releases page? Shrug.

@HumairAK
Copy link
Collaborator Author

@gregsheremeta agree, FWIW I send out a note in the community slack, but a more permanent notice somewhere would be ideal. We can utilize the KFP discussions for this if we wanted to (we can pin those), or we can also Pin the issue itself

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved ci-passed All CI tests on a pull request have passed lgtm size/XS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[bug] Pipelines generated from kfp 2.10 ignore accelerator
3 participants