TEP-0086: Larger Results via Sidecar Logs #745
Conversation
/assign @pritidesai
/cc @wlynch
2d825e4 to 74f554f Compare
/assign
@tlawrie @imjasonh @Tomcli @ScrapCodes - thought this may be of interest to you, please take a look
We propose injecting a dedicated `Sidecar` alongside the `Steps` which will watch the `Results` path of the `Steps`.
This `Sidecar` will output the name of the `Result` and its contents to stdout in a parsable pattern. The `TaskRun`
controller will access the stdout logs of the `Sidecar`, extract the `Result` and its contents during reconciliation.
What RBAC permissions would the controller need? Is it possible to only grant permissions to the Sidecar pod logs without granting access to the user container logs?
@chitrangpatel please look into whether this is possible in the POC
The controller needs `get` access to the `pods/log` resource. I don't think we can go finer than a pod (but I could be wrong). The sidecar runs in a different container in the same pod as the other steps, so in principle the controller can access the logs of the other sidecars/steps in the same task. While accessing the logs, we need to provide the pod and the container name.
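To make this concrete, here is a minimal sketch (not the actual controller code) of how a reconciler with `get` access to `pods/log` could stream the results sidecar's logs and extract results, assuming the sidecar prints one JSON object per result with `name`/`value` keys; the function, type, and container names are illustrative.

```go
package sidecarlogs

import (
	"bufio"
	"context"
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
)

// sidecarResult is a hypothetical shape for one line of sidecar output:
// {"name": "<result name>", "value": "<result contents>"}
type sidecarResult struct {
	Name  string `json:"name"`
	Value string `json:"value"`
}

// extractResultsFromSidecarLogs streams the logs of the results sidecar
// container and decodes one result per line. The pod and container names
// would be derived from the TaskRun in a real controller.
func extractResultsFromSidecarLogs(ctx context.Context, c kubernetes.Interface, ns, pod, container string) ([]sidecarResult, error) {
	req := c.CoreV1().Pods(ns).GetLogs(pod, &corev1.PodLogOptions{Container: container})
	stream, err := req.Stream(ctx)
	if err != nil {
		return nil, fmt.Errorf("reading logs of %s/%s: %w", pod, container, err)
	}
	defer stream.Close()

	var results []sidecarResult
	scanner := bufio.NewScanner(stream)
	for scanner.Scan() {
		var r sidecarResult
		// Skip lines that are not in the expected pattern (e.g. startup noise).
		if err := json.Unmarshal(scanner.Bytes(), &r); err != nil {
			continue
		}
		results = append(results, r)
	}
	return results, scanner.Err()
}
```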
This is a pretty large downside. 😢
User logs can contain PII and other sensitive information, so granting the Tekton Pipeline controller global read access to all Pod logs in the cluster (incl. non-Tekton pods) by default seems overreaching. IIRC, there's also no good way today to scope this dynamically or to only resources you create without granting the controller global RBAC modify permissions. 😭
I agree.
I see what you mean. This configuration is orthogonal to the feature flag. Let me investigate if we can update the RBAC permissions on the fly from the controller when we generate the sidecar.
We probably don't want the controller or something in Tekton to have the permission to add RBAC configurations. One way might be to add a separate role just for this feature. To enable the feature, users would have to bind this role to the controller's service account along with setting the feature flag.
Re: permission to logs - I agree global access is less than ideal, but the controller already has read access to secrets across the cluster.
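A rough sketch of that separate, opt-in permission, expressed with Go `rbac/v1` types (an equivalent YAML manifest would do the same); the object names are illustrative, not ones Tekton actually ships, and a cluster operator would only apply them when enabling the feature.

```go
package rbacsketch

import (
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// A hypothetical opt-in role for the sidecar-logs feature: read-only access
// to pod logs, kept separate from the controller's default permissions.
var sidecarLogsRole = rbacv1.ClusterRole{
	ObjectMeta: metav1.ObjectMeta{Name: "tekton-pipelines-controller-pod-log-access"},
	Rules: []rbacv1.PolicyRule{{
		APIGroups: []string{""},
		Resources: []string{"pods/log"},
		Verbs:     []string{"get"},
	}},
}

// Binding the role to the controller's service account (default install
// names assumed here; adjust for your installation).
var sidecarLogsBinding = rbacv1.ClusterRoleBinding{
	ObjectMeta: metav1.ObjectMeta{Name: "tekton-pipelines-controller-pod-log-access"},
	Subjects: []rbacv1.Subject{{
		Kind:      "ServiceAccount",
		Name:      "tekton-pipelines-controller",
		Namespace: "tekton-pipelines",
	}},
	RoleRef: rbacv1.RoleRef{
		APIGroup: "rbac.authorization.k8s.io",
		Kind:     "ClusterRole",
		Name:     "tekton-pipelines-controller-pod-log-access",
	},
}
```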
@dibyom that's a much better way. Thanks!
I did not appreciate that the controller had access to the secrets. They probably contain more sensitive information than the logs.
I made this change in the POC.
Re: permission to logs - I agree global access is less than ideal but the controller already has read access to secrets across the cluster
I don't consider this a compelling reason to further expand our access to the cluster. We should work to reduce our permissions so that Tekton can be used in more environments, not use our existing overly-broad access to justify even more access going forward.
- No guarantee you'll be able to access a log before it disappears e.g. logs will not be available via the k8s API
  once a `Pod` is deleted.
- The storage of the result parameter may still be limited by the [1.5 MB CRD size][crd-size].

## Alternatives
Another alternative that might help with some of the sidecar issues mentioned above: an OIDC API, but using the entrypoint as the client instead of a dedicated sidecar (see the sketch after this list).
- You can reuse the existing Pod resources to handle the uploading (no additional sidecar overhead).
- There's no ambiguity when the user container is complete.
- Avoids concerns about result availability since the upload is done from the user Pod itself.
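A minimal sketch of what the entrypoint-as-client idea could look like, assuming a hypothetical task-output API and a projected service account token; the URL scheme, token path, and function name are all illustrative rather than an agreed design.

```go
package entrypointsketch

import (
	"bytes"
	"context"
	"fmt"
	"net/http"
	"os"
)

// uploadResult sketches the "entrypoint as client" idea: after the step
// finishes, the entrypoint reads a result file and uploads it to a
// (hypothetical) task-output API, authenticating with the pod's projected
// service account token.
func uploadResult(ctx context.Context, apiURL, resultName, resultPath string) error {
	value, err := os.ReadFile(resultPath)
	if err != nil {
		return err
	}
	// Projected service account token; the mount path is illustrative.
	token, err := os.ReadFile("/var/run/secrets/tokens/tekton-results-token")
	if err != nil {
		return err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		fmt.Sprintf("%s/results/%s", apiURL, resultName), bytes.NewReader(value))
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Bearer "+string(token))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("uploading result %q: %s", resultName, resp.Status)
	}
	return nil
}
```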
is it the same alternative described in https://github.com/tektoncd/community/blob/main/teps/0086-changing-the-way-result-parameters-are-stored.md#considerations?
in that case, the OIDC approach involves storing the results externally (HTTP server), which is something we can explore alongside the solutions for external storage - that would make this alternative out of scope for this PR
cc @pritidesai
Similar, but without the Sidecar. You also don't need to store the results externally - the OIDC API can sit locally on the cluster and write the Results to the CRD, similar to what's proposed with the log based approach. The OIDC bits are just to ensure that the upload requests are coming from the expected Pods.
thanks @wlynch! Who will be responsible for managing OIDC? Do we need a separate controller to manage the API? How would the pipeline controller work without access to OpenID?
IIRC, the rough plan was to use the built-in cluster service account token projection, then use the TokenReview API to verify the token (or worst case verify via OIDC discovery) to authorize requests coming from Pods. I think it's reasonable to require this, especially since it's been stable since 1.20 and I think most cloud providers have this enabled by default.
The Task output API would be hosted in a container controlled by Tekton itself, probably another Pod in the `tekton-pipelines` namespace so that it can have separate RBAC permissions from the controller.
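For reference, a sketch of how such a task-output API could verify callers with the TokenReview API, as described above; the audience string and function name are illustrative.

```go
package tokenreviewsketch

import (
	"context"
	"fmt"

	authv1 "k8s.io/api/authentication/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// verifyCallerToken sends the caller's projected service account token to the
// TokenReview API and returns the authenticated username, which encodes the
// pod's service account (system:serviceaccount:<namespace>:<name>).
func verifyCallerToken(ctx context.Context, c kubernetes.Interface, token string) (string, error) {
	review, err := c.AuthenticationV1().TokenReviews().Create(ctx, &authv1.TokenReview{
		Spec: authv1.TokenReviewSpec{
			Token: token,
			// Hypothetical audience the projected token would be issued for.
			Audiences: []string{"tekton-results"},
		},
	}, metav1.CreateOptions{})
	if err != nil {
		return "", err
	}
	if !review.Status.Authenticated {
		return "", fmt.Errorf("token rejected: %s", review.Status.Error)
	}
	return review.Status.User.Username, nil
}
```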
@wlynch thanks for sharing this alternative - it sounds worthwhile pursuing and I was wondering if you could add it to the alternatives listed in https://github.com/tektoncd/community/blob/main/teps/0086-changing-the-way-result-parameters-are-stored.md#alternatives 🙏🏾
+1 this sounds promising, we should add it to the alternatives and see if we can try out a PoC for this approach
- If a `Result` file is created first and then contents are written to it, and the `Result` loop happens to read the
  file in the interim, then only partial contents will be considered as `Results`.
- If a `Step` fails to produce a `Result`, the `Sidecar` will continue to look for the `Result` until it times out.
This is a very common case. We can impose the restriction mentioned in tektoncd/pipeline#3497 but even with this restriction, we will have to find a way to fail instead of letting it time out.
I think this was also my worry while I was thinking through the logic. In practice, when a step errors out, I think the controller kills all the sidecar and step containers and fails the task. I don't think timeout is a worry. I will update the text here. I will add some examples to demonstrate this.
cc @vinamra28
Correction.
If a step fails (i.e. crashes), the sidecar will be killed by the controller.
However, if a step runs successfully but just does not write out the results, then the sidecar continues to run until it times out because it's looking for the result. I will see if I can send out a kill signal to the sidecar if all the steps have successfully completed.
Ok, I fixed this behaviour. The sidecar will be stopped by the controller with the `nop` image and will error out immediately instead of waiting for it to time out.
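A sketch of that mechanism, assuming the controller simply swaps the sidecar container's image for the `nop` image once the steps are done (this mirrors how Tekton stops sidecars generally; names and error handling are simplified, and the real implementation differs in details).

```go
package stopsidecar

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// stopResultsSidecar replaces the results sidecar's image with the "nop"
// image so the container exits instead of polling for results until timeout.
func stopResultsSidecar(ctx context.Context, c kubernetes.Interface, ns, podName, sidecarName, nopImage string) error {
	pod, err := c.CoreV1().Pods(ns).Get(ctx, podName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	for i, container := range pod.Spec.Containers {
		if container.Name == sidecarName {
			pod.Spec.Containers[i].Image = nopImage
		}
	}
	_, err = c.CoreV1().Pods(ns).Update(ctx, pod, metav1.UpdateOptions{})
	return err
}
```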
I will provide a suggestion for the text.
/assign @dibyom
7a4c43a to 50e2db6 Compare
Hi @jerop, I think it's fantastic that a PoC was achieved.
I suggest either waiting until the other PoC is finished, or alternatively continuing with a new PoC to determine if one of the other alternatives may be better suited to solving this challenge.
@tlawrie I didn't mean this as the only solution - I intended it as a solution that we provide behind a feature flag so that we can experiment with it and find ways to address the concerns, and we can do the same with the other PoCs when they are ready, so that we have tangible results to help us move this work forward. What do you think? I've clarified the language to reflect that this is an experiment, that we can provide the other solutions behind feature gates as well, and that they can be changed or removed at any time.
If we're not yet sure which solution we are going to land on as the primary answer long term, perhaps we should just leave this under alternatives? By moving this idea out of alternatives and into a top level proposal section, my reaction was that we were selecting this approach over the other alternatives. I don't think any of this blocks experimentation, and if we didn't want to make changes to the upstream controller until we're settled on a design, I think this could be implemented without any changes to the Pipelines controller if we really wanted.
@wlynch happy to move it to alternatives - what I care the most about is if we can make them implementable while gated behind feature flags so that we can experiment - what do you think?
@wlynch @tlawrie @pritidesai @dibyom - moved the solution to the alternatives section and left the proposal to experiment behind feature gates - please take a look
@wlynch it is hard to experiment and gather feedback from forks where visibility is limited. this is why we experimented with the alternatives for remote resolution (TEP-0060) behind feature flags to figure out a way forward before we chose the solution - tektoncd/pipeline#4168. the same experimentation approach i'm suggesting here. i believe this kind of experimentation to get tangible feedback from users and dogfooding/fishfooding is what will give us more confidence.
I think we still need the controller changes for parsing the logs.
I'm assuming in this approach, instead of the pipelinerun controller parsing the logs, it is done by a separate controller?
+1. I agree that while we have to be careful around not rearchitecting too much for an experiment, I think it's worthwhile making it easy to try out the feature, gather feedback, and decide on the way forward. In terms of marking the TEP implementable or not - personally I'm ok with experimenting without the implementable flag given that we haven't selected the way forward.
03dbfab to 2267f0d Compare
17a493a to 59455fe Compare
We propose that we provide multiple solutions, all guarded behind a `larger-results` feature flag, so that we can experiment and figure out a way forward. These gated solutions will be alpha, and can be changed or removed at any time.
In this change, we propose experimenting with Sidecar Logs as a solution for providing larger Results within the CRDs. This will be enabled by setting `larger-results`: `"sidecar-logs"`. This solution can be changed at any time, or removed completely. This will give us an opportunity to gather user feedback and find ways to address the concerns as we figure out a way forward.
/kind tep
59455fe to 08c5999 Compare
@dibyom @pritidesai @vdemeester please take a look
Thanks @jerop
Have we done any measurements around the overhead of 1. adding a sidecar to each TaskRun that requires a result and 2. having the reconciler parse potentially a few MBs worth of results in the controller itself?
the status of `TaskRuns` and `Runs` to improve performance, reduce memory bloat and improve extensibility. Now that
those changes have been implemented, the `PipelineRun` status is set up to handle larger `Results` without
exacerbating the performance and storage issues that were there before. For `ChildReferences` to be populated, the
`embedded-status` must be set to `"minimal"`. Thus, this will require that minimal embedded status is enabled during the
This sentence seems a bit out of place
- when a `Result` is found, it prints it to stdout in a parsable pattern.
- When all the expected `Results` are found, it breaks out of the periodic loop and exits.
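A minimal sketch of this loop, under the same assumption as earlier that each result is emitted as one JSON line with `name`/`value` keys; the results directory, poll interval, and function name are illustrative.

```go
package sidecarlogsloop

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// run periodically checks the results directory, prints each result it finds
// to stdout as a single JSON line, and exits once every expected result has
// been seen.
func run(resultsDir string, expected []string) error {
	seen := map[string]bool{}
	for len(seen) < len(expected) {
		for _, name := range expected {
			if seen[name] {
				continue
			}
			contents, err := os.ReadFile(filepath.Join(resultsDir, name))
			if err != nil {
				continue // result not written yet
			}
			line, err := json.Marshal(map[string]string{"name": name, "value": string(contents)})
			if err != nil {
				return err
			}
			fmt.Println(string(line)) // parsable pattern read later by the controller
			seen[name] = true
		}
		if len(seen) < len(expected) {
			time.Sleep(100 * time.Millisecond)
		}
	}
	return nil
}
```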
#### Caveats of Sidecar Logs
We should also note the RBAC considerations mentioned above
#### Feature Gates for Sidecar Logs

This solution will be gated using a `larger-results` feature flag which users can set to `"sidecar-logs"` to enable it.
This provides an opportunity to experiment with this solution to provide `Results` within the CRDs as we figure out
+1 In general, I'd like us to decouple the storing results outside the CRD problem from the results can only be a few KB problem. I think being able to store results above a few KB will unlock quite a few use cases regardless of whether the result is stored in the CRD or not. And storing it in the CRD is a great first step since we don't have to figure out the interfaces for fetching results from external storage either.
That being said, we should also experiment with other ways of getting these larger results from the TaskRun onto the CRD e.g. sending results over HTTPS
@dibyom I was supposed to do that but I completely missed it. When you say …
No worries :) I think pod/run startup time would also be a good thing to measure.
@wlynch @dibyom @jerop I did the overhead tests.

Overhead with sidecar logs
The figure below shows measurements of CPU and memory usage of the control plane, taken every 10 s while the controller was running and spawning pipeline runs with sidecar logs enabled.

Overhead without sidecar logs
The figure below shows measurements of CPU and memory usage of the control plane, taken every 10 s while the controller was running and spawning pipeline runs with sidecar logs disabled.

Overhead: pod startup times
The figure below shows measurements of startup times of ~50 pods, where startup time is the time to get to …

Let me know if anything is unclear 🙂. I can share the scripts I made to do these measurements.
/assign @afrittoli
@chitrangpatel Thank you so much for the performance benchmarking numbers. The extra 3 s overhead to startup time per TaskRun is a bit concerning. Wondering - does the latency go up if, say, you increase the load (e.g. run 50 PipelineRuns concurrently)? From a mitigation standpoint, it might help if we only inject the sidecars if we know the task is writing a result.
Yes, I was very concerned about the 3 s overhead too. It's an increase of ~50% from the average ~6 s overhead without the results sidecar.
Yes, this is already the case. The TaskRun reconciler will check if the steps in the task are producing any results and only inject a results sidecar in that case.
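Sketched as a tiny gate (the function and its inputs are hypothetical), this combines the `larger-results` flag proposed in this TEP with the check described in this comment that the task actually declares results.

```go
package featuregate

// shouldInjectResultsSidecar returns true only when the (proposed)
// larger-results flag is set to "sidecar-logs" and the Task declares at
// least one result, so tasks without results pay no sidecar overhead.
func shouldInjectResultsSidecar(largerResultsFlag string, declaredResults []string) bool {
	return largerResultsFlag == "sidecar-logs" && len(declaredResults) > 0
}
```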
Given that this is a prototype I think that's ok - we can use these numbers as a baseline to compare to.
API WG - on agenda for further discussion
Issues go stale after 90d of inactivity. /lifecycle stale. Send feedback to tektoncd/plumbing.
opened a new TEP instead of updating TEP-0086 - #887 - so closing this PR
Many thanks to @chitrangpatel for implementing the proof of concept - demo.