TEP-0086: Larger Results via Sidecar Logs #745
Conversation
/assign @pritidesai
/cc @wlynch
2d825e4 to 74f554f Compare
/assign
@tlawrie @imjasonh @Tomcli @ScrapCodes - thought this may be of interest to you, please take a look
We propose injecting a dedicated `Sidecar` alongside the `Steps` which will watch the `Results` path of the `Steps`.
This `Sidecar` will output the name of the `Result` and its contents to stdout in a parsable pattern. The `TaskRun`
controller will access the stdout logs of the `Sidecar`, extract the `Result` and its contents during reconciliation.
What RBAC permissions would the controller need? Is it possible to only grant permissions to the Sidecar pod logs without granting access to the user container logs?
@chitrangpatel please look into whether this is possible in the POC
The controller needs `get` access to the `pods/log` resource. I don't think we can go finer than a pod (but I could be wrong). The sidecar runs in a different container in the same pod as the other steps, so in principle the controller can access the logs of the other sidecars/steps in the same task. While accessing the logs, we need to provide the pod and the container name.
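To make this concrete, here is a minimal sketch (not the actual controller code) of how a reconciler with `get` access to `pods/log` could stream the results sidecar's logs and extract results, assuming the sidecar prints one JSON object per result with `name`/`value` keys; the function, type, and container names are illustrative.

```go
package sidecarlogs

import (
	"bufio"
	"context"
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
)

// sidecarResult is a hypothetical shape for one line of sidecar output:
// {"name": "<result name>", "value": "<result contents>"}
type sidecarResult struct {
	Name  string `json:"name"`
	Value string `json:"value"`
}

// extractResultsFromSidecarLogs streams the logs of the results sidecar
// container and decodes one result per line. The pod and container names
// would be derived from the TaskRun in a real controller.
func extractResultsFromSidecarLogs(ctx context.Context, c kubernetes.Interface, ns, pod, container string) ([]sidecarResult, error) {
	req := c.CoreV1().Pods(ns).GetLogs(pod, &corev1.PodLogOptions{Container: container})
	stream, err := req.Stream(ctx)
	if err != nil {
		return nil, fmt.Errorf("reading logs of %s/%s: %w", pod, container, err)
	}
	defer stream.Close()

	var results []sidecarResult
	scanner := bufio.NewScanner(stream)
	for scanner.Scan() {
		var r sidecarResult
		// Skip lines that are not in the expected pattern (e.g. startup noise).
		if err := json.Unmarshal(scanner.Bytes(), &r); err != nil {
			continue
		}
		results = append(results, r)
	}
	return results, scanner.Err()
}
```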
This is a pretty large downside. 😢
User logs can contain PII and other sensitive information, so granting the Tekton Pipeline controller global read access to all Pod logs in the cluster (incl. non-Tekton pods) by default seems overreaching. IIRC, there's also no good way today to scope this dynamically or to only resources you create without granting the controller global RBAC modify permissions. 😭
I agree.
I see what you mean. This configuration is orthogonal to the feature flag. Let me investigate if we can update the RBAC permissions on the fly from the controller when we generate the sidecar.
We probably don't want the controller or something in Tekton to have the permission to add RBAC configurations. One way might be to add a separate role just for this feature. To enable the feature, users would have to bind this role to the controller's service account along with setting the feature flag.
Re: permission to logs - I agree global access is less than ideal, but the controller already has read access to secrets across the cluster.
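A rough sketch of that separate, opt-in permission, expressed with Go `rbac/v1` types (an equivalent YAML manifest would do the same); the object names are illustrative, not ones Tekton actually ships, and a cluster operator would only apply them when enabling the feature.

```go
package rbacsketch

import (
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// A hypothetical opt-in role for the sidecar-logs feature: read-only access
// to pod logs, kept separate from the controller's default permissions.
var sidecarLogsRole = rbacv1.ClusterRole{
	ObjectMeta: metav1.ObjectMeta{Name: "tekton-pipelines-controller-pod-log-access"},
	Rules: []rbacv1.PolicyRule{{
		APIGroups: []string{""},
		Resources: []string{"pods/log"},
		Verbs:     []string{"get"},
	}},
}

// Binding the role to the controller's service account (default install
// names assumed here; adjust for your installation).
var sidecarLogsBinding = rbacv1.ClusterRoleBinding{
	ObjectMeta: metav1.ObjectMeta{Name: "tekton-pipelines-controller-pod-log-access"},
	Subjects: []rbacv1.Subject{{
		Kind:      "ServiceAccount",
		Name:      "tekton-pipelines-controller",
		Namespace: "tekton-pipelines",
	}},
	RoleRef: rbacv1.RoleRef{
		APIGroup: "rbac.authorization.k8s.io",
		Kind:     "ClusterRole",
		Name:     "tekton-pipelines-controller-pod-log-access",
	},
}
```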
@dibyom that's a much better way. Thanks!
I did not appreciate that the controller had access to the secrets. They probably contain more sensitive information than the logs.
I made this change in the POC.
Re: permission to logs - I agree global access is less than ideal but the controller already has read access to secrets across the cluster
I don't consider this a compelling reason to further expand our access to the cluster. We should work to reduce our permissions so that Tekton can be used in more environments, not use our existing overly-broad access to justify even more access going forward.
- No guarantee you'll be able to access a log before it disappears e.g. logs will not be available via the k8s API
  once a `Pod` is deleted.
- The storage of the result parameter may still be limited by the [1.5 MB CRD size][crd-size].

## Alternatives
Another alternative that might help with some of the sidecar issues mentioned above: an OIDC API, but using the entrypoint as the client instead of a dedicated sidecar (see the sketch after this list).
- You can reuse the existing Pod resources to handle the uploading (no additional sidecar overhead).
- There's no ambiguity when the user container is complete.
- Avoids concerns about result availability since the upload is done from the user Pod itself.
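A minimal sketch of what the entrypoint-as-client idea could look like, assuming a hypothetical task-output API and a projected service account token; the URL scheme, token path, and function name are all illustrative rather than an agreed design.

```go
package entrypointsketch

import (
	"bytes"
	"context"
	"fmt"
	"net/http"
	"os"
)

// uploadResult sketches the "entrypoint as client" idea: after the step
// finishes, the entrypoint reads a result file and uploads it to a
// (hypothetical) task-output API, authenticating with the pod's projected
// service account token.
func uploadResult(ctx context.Context, apiURL, resultName, resultPath string) error {
	value, err := os.ReadFile(resultPath)
	if err != nil {
		return err
	}
	// Projected service account token; the mount path is illustrative.
	token, err := os.ReadFile("/var/run/secrets/tokens/tekton-results-token")
	if err != nil {
		return err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		fmt.Sprintf("%s/results/%s", apiURL, resultName), bytes.NewReader(value))
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Bearer "+string(token))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("uploading result %q: %s", resultName, resp.Status)
	}
	return nil
}
```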
is it the same alternative described in https://github.com/tektoncd/community/blob/main/teps/0086-changing-the-way-result-parameters-are-stored.md#considerations?
in that case, the OIDC approach involves storing the results externally (HTTP server), which is something we can explore alongside the solutions for external storage - that would make this alternative out of scope for this PR
cc @pritidesai
Similar, but without the Sidecar. You also don't need to store the results externally - the OIDC API can sit locally on the cluster and write the Results to the CRD, similar to what's proposed with the log based approach. The OIDC bits are just to ensure that the upload requests are coming from the expected Pods.
thanks @wlynch! Who will be responsible for managing OIDC? Do we need a separate controller to manage the API? How would the pipeline controller work without access to OpenID?
IIRC, the rough plan was to use the built-in cluster service account token projection, then use the TokenReview API to verify the token (or worst case verify via OIDC discovery) to authorize requests coming from Pods. I think it's reasonable to require this, especially since it's been stable since 1.20 and I think most cloud providers have this enabled by default.
The Task output API would be hosted in a container controlled by Tekton itself, probably another Pod in the `tekton-pipelines` namespace so that it can have separate RBAC permissions from the controller.
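For reference, a sketch of how such a task-output API could verify callers with the TokenReview API, as described above; the audience string and function name are illustrative.

```go
package tokenreviewsketch

import (
	"context"
	"fmt"

	authv1 "k8s.io/api/authentication/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// verifyCallerToken sends the caller's projected service account token to the
// TokenReview API and returns the authenticated username, which encodes the
// pod's service account (system:serviceaccount:<namespace>:<name>).
func verifyCallerToken(ctx context.Context, c kubernetes.Interface, token string) (string, error) {
	review, err := c.AuthenticationV1().TokenReviews().Create(ctx, &authv1.TokenReview{
		Spec: authv1.TokenReviewSpec{
			Token: token,
			// Hypothetical audience the projected token would be issued for.
			Audiences: []string{"tekton-results"},
		},
	}, metav1.CreateOptions{})
	if err != nil {
		return "", err
	}
	if !review.Status.Authenticated {
		return "", fmt.Errorf("token rejected: %s", review.Status.Error)
	}
	return review.Status.User.Username, nil
}
```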
@wlynch thanks for sharing this alternative - it sounds worthwhile pursuing and I was wondering if you could add it to the alternatives listed in https://github.com/tektoncd/community/blob/main/teps/0086-changing-the-way-result-parameters-are-stored.md#alternatives 🙏🏾
+1 this sounds promising, we should add it to the alternatives and see if we can try out a PoC for this approach
- If a `Result` file is created first and then contents are written to it, and the `Result` loop happens to read the
  file in the interim, then only partial contents will be considered as `Results`.
- If a `Step` fails to produce a `Result`, the `Sidecar` will continue to look for the `Result` until it times out.
This is a very common case. We can impose the restriction mentioned in tektoncd/pipeline#3497 but even with this restriction, we will have to find a way to fail instead of letting it time out.
I think this was also my worry while I was thinking through the logic. In practice, when a step errors out, I think the controller kills all the sidecar and step containers and fails the task. I don't think timeout is a worry. I will update the text here. I will add some examples to demonstrate this.
cc @vinamra28
Correction.
If a step fails (i.e. crashes), the sidecar will be killed by the controller.
However, if a step runs successfully but just does not write out the results, then the sidecar continues to run until it times out because it's looking for the result. I will see if I can send out a kill signal to the sidecar if all the steps have successfully completed.
Ok, I fixed this behaviour. The sidecar will be stopped by the controller with the `nop` image and will error out immediately instead of waiting for it to time out.
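A sketch of that mechanism, assuming the controller simply swaps the sidecar container's image for the `nop` image once the steps are done (this mirrors how Tekton stops sidecars generally; names and error handling are simplified, and the real implementation differs in details).

```go
package stopsidecar

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// stopResultsSidecar replaces the results sidecar's image with the "nop"
// image so the container exits instead of polling for results until timeout.
func stopResultsSidecar(ctx context.Context, c kubernetes.Interface, ns, podName, sidecarName, nopImage string) error {
	pod, err := c.CoreV1().Pods(ns).Get(ctx, podName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	for i, container := range pod.Spec.Containers {
		if container.Name == sidecarName {
			pod.Spec.Containers[i].Image = nopImage
		}
	}
	_, err = c.CoreV1().Pods(ns).Update(ctx, pod, metav1.UpdateOptions{})
	return err
}
```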
I will provide a suggestion for the text.
/assign @dibyom
7a4c43a to 50e2db6 Compare
Hi @jerop, I think it's fantastic that a PoC was achieved.
I suggest either waiting until the other PoC is finished, or alternatively continuing with a new PoC to determine if one of the other alternatives may be better suited to solving this challenge.
@tlawrie I didn't mean this as the only solution - I intended it as a solution that we provide behind a feature flag so that we can experiment with it and find ways to address the concerns, and we can do the same with the other PoCs when they are ready, so that we have tangible results to help us move this work forward. What do you think? I've clarified the language to reflect that this is an experiment, that we can provide the other solutions behind feature gates as well, and that they can be changed or removed at any time.
If we're not yet sure which solution we are going to land on as the primary answer long term, perhaps we should just leave this under alternatives? By moving this idea out of alternatives and into a top level proposal section, my reaction was that we were selecting this approach over the other alternatives. I don't think any of this blocks experimentation, and if we didn't want to make changes to the upstream controller until we're settled on a design, I think this could be implemented without any changes to the Pipelines controller if we really wanted.
@wlynch happy to move it to alternatives - what I care the most about is if we can make them implementable while gated behind feature flags so that we can experiment - what do you think?
@wlynch @tlawrie @pritidesai @dibyom - moved the solution to the alternatives section and left the proposal to experiment behind feature gates - please take a look
@wlynch it is hard to experiment and gather feedback from forks where visibility is limited. this is why we experimented with the alternatives for remote resolution (TEP-0060) behind feature flags to figure out a way forward before we chose the solution - tektoncd/pipeline#4168. the same experimentation approach i'm suggesting here. i believe this kind of experimentation to get tangible feedback from users and dogfooding/fishfooding is what will give us more confidence.
I think we still need the controller changes for parsing the logs.
I'm assuming in this approach, instead of the pipelinerun controller parsing the logs, it is done by a separate controller?
+1. I agree that while we have to be careful around not rearchitecting too much for an experiment, I think it's worthwhile making it easy to try out the feature, gather feedback, and decide on the way forward. In terms of marking the TEP implementable or not - personally I'm ok with experimenting without the implementable flag given that we haven't selected the way forward.
03dbfab to 2267f0d Compare
17a493a to 59455fe Compare
We propose that we provide multiple solutions, all guarded behind a `larger-results` feature flag, so that we can experiment and figure out a way forward. These gated solutions will be alpha, and can be changed or removed at any time.
In this change, we propose experimenting with Sidecar Logs as a solution for providing larger Results within the CRDs. This will be enabled by setting `larger-results`: `"sidecar-logs"`. This solution can be changed at any time, or removed completely. This will give us an opportunity to gather user feedback and find ways to address the concerns as we figure out a way forward.
/kind tep
59455fe to 08c5999 Compare
@dibyom @pritidesai @vdemeester please take a look
Thanks @jerop
Have we done any measurements around the overhead of 1. adding a sidecar to each TaskRun that requires a result and 2. having the reconciler parse potentially a few MBs worth of results in the controller itself?
the status of `TaskRuns` and `Runs` to improve performance, reduce memory bloat and improve extensibility. Now that
those changes have been implemented, the `PipelineRun` status is set up to handle larger `Results` without
exacerbating the performance and storage issues that were there before. For `ChildReferences` to be populated, the
`embedded-status` must be set to `"minimal"`. Thus, this will require that minimal embedded status is enabled during the
This sentence seems a bit out of place
- when a `Result` is found, it prints it to stdout in a parsable pattern.
- When all the expected `Results` are found, it breaks out of the periodic loop and exits.
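A minimal sketch of this loop, under the same assumption as earlier that each result is emitted as one JSON line with `name`/`value` keys; the results directory, poll interval, and function name are illustrative.

```go
package sidecarlogsloop

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// run periodically checks the results directory, prints each result it finds
// to stdout as a single JSON line, and exits once every expected result has
// been seen.
func run(resultsDir string, expected []string) error {
	seen := map[string]bool{}
	for len(seen) < len(expected) {
		for _, name := range expected {
			if seen[name] {
				continue
			}
			contents, err := os.ReadFile(filepath.Join(resultsDir, name))
			if err != nil {
				continue // result not written yet
			}
			line, err := json.Marshal(map[string]string{"name": name, "value": string(contents)})
			if err != nil {
				return err
			}
			fmt.Println(string(line)) // parsable pattern read later by the controller
			seen[name] = true
		}
		if len(seen) < len(expected) {
			time.Sleep(100 * time.Millisecond)
		}
	}
	return nil
}
```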
#### Caveats of Sidecar Logs
We should also note the RBAC considerations mentioned above
#### Feature Gates for Sidecar Logs

This solution will be gated using a `larger-results` feature flag which users can set to `"sidecar-logs"` to enable it.
This provides an opportunity to experiment with this solution to provide `Results` within the CRDs as we figure out
+1 In general, I'd like us to decouple the storing results outside the CRD problem from the results can only be a few KB problem. I think being able to store results above a few KB will unlock quite a few use cases regardless of whether the result is stored in the CRD or not. And storing it in the CRD is a great first step since we don't have to figure out the interfaces for fetching results from external storage either.
That being said, we should also experiment with other ways of getting these larger results from the TaskRun onto the CRD e.g. sending results over HTTPS
@dibyom I was supposed to do that but I completely missed it. When you say …
No worries :) I think pod/run startup time would also be a good thing to measure.
@wlynch @dibyom @jerop I did the overhead tests.

Overhead with sidecar logs
The figure below shows measurements of CPU and memory usage of the control plane, taken every 10 s while the controller was running and spawning pipeline runs with sidecar logs enabled.

Overhead without sidecar logs
The figure below shows measurements of CPU and memory usage of the control plane, taken every 10 s while the controller was running and spawning pipeline runs with sidecar logs disabled.

Overhead: pod startup times
The figure below shows measurements of startup times of ~50 pods, where startup time is the time to get to …

Let me know if anything is unclear 🙂. I can share the scripts I made to do these measurements.
/assign @afrittoli
@chitrangpatel Thank you so much for the performance benchmarking numbers. The extra 3 s overhead to startup time per TaskRun is a bit concerning. Wondering - does the latency go up if, say, you increase the load (e.g. run 50 PipelineRuns concurrently)? From a mitigation standpoint, it might help if we only inject the sidecars if we know the task is writing a result.
Yes, I was very concerned about the 3 s overhead too. It's an increase of ~50% from the average ~6 s overhead without the results sidecar.
Yes, this is already the case. The TaskRun reconciler will check if the steps in the task are producing any results and only inject a results sidecar in that case.
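Sketched as a tiny gate (the function and its inputs are hypothetical), this combines the `larger-results` flag proposed in this TEP with the check described in this comment that the task actually declares results.

```go
package featuregate

// shouldInjectResultsSidecar returns true only when the (proposed)
// larger-results flag is set to "sidecar-logs" and the Task declares at
// least one result, so tasks without results pay no sidecar overhead.
func shouldInjectResultsSidecar(largerResultsFlag string, declaredResults []string) bool {
	return largerResultsFlag == "sidecar-logs" && len(declaredResults) > 0
}
```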
Given that this is a prototype I think that's ok - we can use these numbers as a baseline to compare to.
API WG - on agenda for further discussion
Issues go stale after 90d of inactivity. /lifecycle stale. Send feedback to tektoncd/plumbing.
opened a new TEP instead of updating TEP-0086 - #887 - so closing this PR
Many thanks to @chitrangpatel for implementing the proof of concept - demo.