
Step Resource "Request" Distribution Regression #6525

Open · skaegi opened this issue Apr 12, 2023 · 9 comments

Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@skaegi (Contributor) commented Apr 12, 2023

This was originally work done in #723, but it looks like part of the functionality was omitted during a refactor in #4176.

The now-missing functionality is:
"Since steps run sequentially, we rewrite their resource "requests" to only have the max request for that resource name in one step container and zero out the requests for that resource name in the other step containers."

The result is that Tekton "requests" more resources than it needs, which significantly reduces bin packing when running a Tekton service.

For example:

apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  name: resourcerequest-taskrun
spec:
  taskSpec:
    steps:
    - name: s1
      image: alpine
      script: sleep 1
      resources:
        requests:
          memory: "32Mi"
          cpu: "250m"
    - name: s2
      image: alpine
      script: sleep 1
      resources:
        requests:
          memory: "64Mi"
          cpu: "125m"
    sidecars:
    - name: sc1
      image: alpine
      script: sleep 30
      resources:
        requests:
          memory: "32Mi"
          cpu: "250m"
    - name: sc2
      image: alpine
      script: sleep 30
      resources:
        requests:
          memory: "64Mi"
          cpu: "125m"

Expected Behavior

Containers:
  step-s1:
    Requests:
      cpu:                250m
      memory:             0
  step-s2:
    Requests:
      cpu:                0
      memory:             64Mi
  sidecar-sc1:
    Requests:
      cpu:                250m
      memory:             32Mi
  sidecar-sc2:
    Requests:
      cpu:                125m
      memory:             64Mi

Actual Behavior

Containers:
  step-s1:
    Requests:
      cpu:                250m
      memory:             32Mi
  step-s2:
    Requests:
      cpu:                125m
      memory:             64Mi
  sidecar-sc1:
    Requests:
      cpu:                250m
      memory:             32Mi
  sidecar-sc2:
    Requests:
      cpu:                125m
      memory:             64Mi
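
For illustration, here is a minimal Go sketch of the removed rewrite described above, using the Kubernetes corev1 and resource packages. The package and function names are illustrative, not the actual code that lived in pkg/pod/resource_request.go:

package resources

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// zeroNonMaxResourceRequests rewrites step container requests in place so
// that, for each resource name, only the step with the largest request keeps
// it and every other step has that request zeroed. Because steps run
// sequentially, the pod never needs more than the per-resource maximum at any
// one time.
func zeroNonMaxResourceRequests(steps []corev1.Container) {
	// First pass: for each resource name, record which step holds the max.
	maxIdx := map[corev1.ResourceName]int{}
	for i, s := range steps {
		for name, q := range s.Resources.Requests {
			if j, ok := maxIdx[name]; !ok || q.Cmp(steps[j].Resources.Requests[name]) > 0 {
				maxIdx[name] = i
			}
		}
	}
	// Second pass: zero out every request that is not the per-resource max.
	for i := range steps {
		for name := range steps[i].Resources.Requests {
			if maxIdx[name] != i {
				steps[i].Resources.Requests[name] = resource.MustParse("0")
			}
		}
	}
}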
@skaegi added the kind/bug label Apr 12, 2023
@vdemeester (Member) commented
cc @lbernick
This changed again because, sometimes, setting things to 0 is a problem.

@lbernick (Member) commented

It seems like we could have been clearer about the fact that this was a breaking change, but I'm hesitant to revert this. This behavior has been in place since 0.28 (about a year and a half) and reverting it may be confusing. Also, another user filed an issue saying that the original behavior was confusing: #2986 (comment)

I think you can achieve the behavior you're interested in with task-level compute resources. Would you consider trying this feature?

Also, regarding what Vincent said: setting compute resources to 0 doesn't interact correctly with LimitRanges, and I think some flavors of k8s don't allow it. That's why, for task-level compute resources, we chose to divide the requests between containers (tektoncd/community@2a66576).
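
A rough sketch of that division approach, reusing the illustrative corev1/resource imports from the sketch above (the function name is hypothetical): a task-level request is split evenly across the step containers, so the pod-level total matches the task-level value without setting any container to 0:

// divideRequestAmongSteps splits one task-level request evenly across
// numSteps containers, rounding down to whole milli-units. A real
// implementation would also have to distribute the remainder and respect
// LimitRange minimums.
func divideRequestAmongSteps(taskRequest resource.Quantity, numSteps int) []resource.Quantity {
	per := resource.NewMilliQuantity(taskRequest.MilliValue()/int64(numSteps), taskRequest.Format)
	out := make([]resource.Quantity, numSteps)
	for i := range out {
		out[i] = *per
	}
	return out
}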

@skaegi (Contributor, Author) commented Apr 13, 2023

Link to the previous code that had this logic: https://github.com/tektoncd/pipeline/blob/release-v0.27.x/pkg/pod/resource_request.go

re: 0 -- instead, we might omit the value entirely:

Containers:
  step-s1:
    Requests:
      cpu:                250m
  step-s2:
    Requests:
      memory:             64Mi
  sidecar-sc1:
    Requests:
      cpu:                250m
      memory:             32Mi
  sidecar-sc2:
    Requests:
      cpu:                125m
      memory:             64Mi

or, alternately, spread it like we do with limit ranges:

Containers:
  step-s1:
    Requests:
      cpu:                125m
      memory:             32Mi
  step-s2:
    Requests:
      cpu:                125m
      memory:             32Mi
  sidecar-sc1:
    Requests:
      cpu:                250m
      memory:             32Mi
  sidecar-sc2:
    Requests:
      cpu:                125m
      memory:             64Mi

I think we really do want to fix this. The new syntax is, well... new, and we have thousands of pre-existing pipelines that are currently requesting resources and reducing our bin packing.

@skaegi (Contributor, Author) commented Apr 13, 2023

@lbernick just educated me a bit more, and it's tricky -- it looks like spread is the way, but we would also have to do the same sort of limit spreading as is done in TEP-0104. We might just re-use that logic and compute the TaskLevelComputeResource by examining the steps.
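
Continuing the illustrative Go from above, a hypothetical sketch of that idea: derive the task-level request as the per-resource maximum across the steps (valid because steps run one at a time), then hand the result to the existing TEP-0104 spreading logic:

// taskLevelRequestsFromSteps derives an effective task-level ResourceList as
// the per-resource maximum across all step requests; the result could then be
// spread across containers the same way explicit task-level resources are.
func taskLevelRequestsFromSteps(steps []corev1.Container) corev1.ResourceList {
	task := corev1.ResourceList{}
	for _, s := range steps {
		for name, q := range s.Resources.Requests {
			if cur, ok := task[name]; !ok || q.Cmp(cur) > 0 {
				task[name] = q
			}
		}
	}
	return task
}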

@lbernick (Member) commented

related: #4347

@lbernick (Member) commented

@skaegi did you have the chance to investigate whether task-level compute resources would work well for you? Is there any action remaining for this issue?

@tekton-robot (Collaborator) commented

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot added the lifecycle/stale label Sep 28, 2023
@tekton-robot (Collaborator) commented

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot added the lifecycle/rotten label and removed the lifecycle/stale label Oct 28, 2023
@sibelius commented

What is the best approach for this?
