| status | title | creation-date | last-updated | authors |
|---|---|---|---|---|
| implementable | Non-falsifiable provenance support | 2021-10-04 | 2022-01-18 | |
- Summary
- Background
- SPIRE Concepts
- Proposed Solution
- Motivation
- Proposal
- Implementation Plan
- Design Details - Signed Results
- Design Details - Signed TaskRuns
- Test Plan
- Design Details - API changes
- Design Details - Performance Implications
- Design Details - Failure Conditions
- Design Details - Verification policy
- Design Details - Verification of data
- Threat Model
- Alternatives
- Infrastructure Needed (optional)
- References (optional)
This TEP covers integrating Tekton and Tekton Chains with SPIFFE/SPIRE, which would provide a more secure supply chain for Tekton users. It would also guarantee non-falsifiable provenance, which is a requirement for SLSA Level 3. With this integration, Tekton will be one step closer to SLSA Level 3 compliance.
Here is a demo of this functionality.
Currently, Tekton Chains observes Tekton and waits for TaskRuns to complete. Once it sees that a TaskRun has completed it tries to sign any artifacts that were built and also generates provenance.
There are a couple issues with this:
- Tekton Chains has no way to verify that the TaskRun it received wasn't modified by anybody other than Tekton, during or after execution
- Tekton Pipelines can't verify that the results it reads weren't modified
This is where SPIRE comes in! SPIRE can be used to request JWTs and certificates (SVIDs) for a given workload (pod for kubernetes).
We're going to use SPIRE to mitigate both of the issues mentioned above.
Before discussing how SPIRE is going to fix these issues, here's a very basic overview of how it works:
A SPIRE deployment relies on two key components, the SPIRE server, and its associated SPIRE agents.
- The SPIRE server serves as a central management system for SPIRE, responsible for interfacing with any key material, authorities, and databases. It is also responsible for attesting the agents that are part of its trust domain.
- The SPIRE agents act as local endpoints for each compute node that is part of the SPIRE deployment.
Pods running in a cluster can interact with SPIRE through the SPIFFE Workload API via the local SPIRE agent socket.
Getting pod identity
- A pod can request an identity by talking to the local agent socket. The SPIRE agent attests the identity of the pod, and requests an identity for the pod from the SPIRE server.
- The workload then receives one or more SPIFFE Verifiable Identity Documents (SVIDs): an x509 certificate together with the associated private key for the pod's identity. The x509 certificate is signed by the SPIRE server's certificate authority.
Verifying pod identity
- A pod can request the trust bundle of a SPIRE server, which includes the public key needed to validate the pod's certificate.
- This trust bundle can be used to verify the SVID of a pod
Note: Alternatively, using the JWT SPIFFE endpoint, one can request a JWT with a specified audience field. Although possible, it may not be idiomatic to use the audience for payload signing purposes. However, there is an open issue discussing adding arbitrary claims as well.
Identity registration is an important aspect of getting SPIRE identities. It defines the subject name of the identities that are minted to pods, as well as the attestation requirements of the workloads. Registration can be tailored to the use case, or the Kubernetes workload registrar can create identities based on the fully qualified canonical pod name.
Provenance can be signed by using the private key that accompanies the workload's SVID to sign a payload. The signature will be verifiable with the x509 certificate of the SVID together with the trust bundle of the SPIRE deployment. The verification would ensure that the payload's provenance comes from the Tekton entrypointer image (running in Pods) or the tekton-pipelines-controller.
- SPIRE server runs as a deployment in kubernetes (for simplicity, we'll assume a single cluster SPIRE deployment).
- SPIRE agents run as a daemonset in the kubernetes cluster, listening on a Unix domain socket on each k8s node.
- SPIRE kubernetes workload registrar would be optionally installed to provide automatic registration of SPIRE workloads. This is optional and could be handled by the tekton-pipelines-controller as well.
- We can use spiffe-csi to mount the SPIRE socket into Pods as a `csi` type Volume so that we don't have to rely on the `hostPath` volume. Users will be responsible for installing this themselves. When creating Pods, we would automatically mount this volume in as appropriate.
This volume mount would look something like this on an arbitrary Pod created by the tekton-pipelines-controller:
    containers:
    - name: my-image
      volumeMounts:
      - name: spiffe-workload-api
        mountPath: /spiffe-workload-api
        readOnly: true
      env:
      - name: SPIFFE_ENDPOINT_SOCKET
        value: unix:///spiffe-workload-api/spire-agent.sock
    volumes:
    - name: spiffe-workload-api
      csi:
        driver: "csi.spiffe.io"
It is possible to run the SPIRE server outside the cluster, and this would be desirable in certain threat models. However, it comes with an operational cost. The SPIRE server has a plugin architecture which makes this easier to reason about.
The main plugins of a SPIRE server (aside from the Attestors) for moving the SPIRE server out of the cluster would be the DataStore plugin as well as the KeyManager and UpstreamAuthority plugins. A list of all these plugins implemented today can be found here.
- The base requirements of a production deployment of SPIRE would be at least an SQL database (by default it uses local storage if not configured). The CA can then be on disk or as another service.
- If higher security assurance is needed for the operation of the SPIRE server, there are a few options for adopting KeyManager and UpstreamAuthority plugins.
- UpstreamAuthority: The upstream CA where all identities would be a part of - the root authority. The plugin system allows the SPIRE server to be configured to talk to existing CA services.
- KeyManager: The location where the signing keys for the intermediate CA to mint the SVIDs are stored.
Note that it may not be as critical in this case to use an upstream CA and/or a remote key manager, if the deployment can be locked down well enough. This is because we are using the keys as ephemeral signing keys (so the cost may outweigh the risk), whereas most SPIRE deployments are about end-to-end authorization of an organization's entire fleet. It is always recommended to be more secure, but this point is worth considering when evaluating against other SPIRE use cases.
1. Tekton Chains has no way to verify that the TaskRun it received wasn't modified by anybody other than Tekton, during or after execution
The solution to this is Signed TaskRuns, where the TaskRun has to be signed and verified every time it has been modified. That way, we can prove that the TaskRun hasn't been tampered with during or after execution.
The tekton-pipelines-controller will need to sign the TaskRun whenever it updates it to prove that it hasn't been tampered with during execution. Roughly, this will look something like this:
- tekton-pipelines-controller initiates a TaskRun, and stores a signature over its status contents as an annotation (on the status)
Each time the TaskRun is being reconciled,
- tekton-pipelines-controller verifies the TaskRun status hasn't been tampered with by checking the signature
- tekton-pipelines-controller requests a SPIRE SVID and then uses it to sign the new modified TaskRun
- tekton-pipelines-controller updates the new SVID x509 and signature annotation on the TaskRun
Things to keep in mind when designing this feature further (details are discussed later in the document):
- There could potentially be a performance implication here, if we need to request signatures every time the TaskRun is modified during execution.
- Which fields are important for detecting tampering, while keeping a level of flexibility for operational use cases (e.g. other operators)?
- What if mutating admission controllers are intentionally changing TaskRuns (e.g. Solarwinds injecting Tasks into a Pipeline)?
- Can we assert that we are verifying the signatures against the proper authority, and what are the potential threat vectors here?
2. Tekton Pipelines can't verify that the results it reads weren't modified
The solution to this is Signed Results. We will modify the entrypointer image to sign results with SPIRE once they're available. The signature and SVID provided by SPIRE will be emitted by the pod via its termination message, which the tekton-pipelines-controller will then consume and validate before updating the TaskRun status.
Here is a brief overview of the architecture for Tekton pipelines and SPIRE.
┌─────────────┐ Register TaskRun Workload Identity ┌──────────┐
│ ├──────────────────────────────────────────────►│ │
│ Tekton │ │ SPIRE │
│ Pipelines │◄───────────┐ │ Server │
│ Controller │ │ Listen on TaskRun │ │
└────────────┬┘◄┐ │ └──────────┘
▲ │ │ ┌───────┴───────────────────────────────┐ ▲
│ │ │ │ Tekton TaskRun │ │
│ │ │ └───────────────────────────────────────┘ │
│ Configure│ │ │ Attest
│ Pod & │ └─────────────────┐ TaskRun Entrypointer │ +
│ check │ │ Sign Result and update │ Request
│ ready │ ┌───────────┐ │ the status with the │ SVIDs
│ └────►│ TaskRun ├──┘ signature + cert │
│ │ Pod │ which will be used by │
│ └───────────┘ tekton-pipelines-controller │
│ ▲ to update TaskRun. │
│ Get │ Get SVID │
│ SPIRE │ │
│ server │ │
│ Credentials │ ▼
┌┴───────────────────┴─────────────────────────────────────────────────────┐
│ │
│ SPIRE Agent ( Runs as ) │
│ + CSI Driver ( Daemonset ) │
│ │
└──────────────────────────────────────────────────────────────────────────┘
Here is a brief overview of the architecture for chains and SPIRE:
┌─────────────┐ ┌──────────┐
│ │ │ │
│ Tekton │ Listen on TaskRun │ SPIRE │
│ Chains │◄────┐ On completion, check if TaskRun │ Server │
│ │ │ is signed and verify that it is │ │
└─────────────┘ │ untampered. └──────────┘
▲ │ ▲
│ ┌──┴────────────────────────────────────┐ │
│ │ Tekton TaskRun │ │
│ └───────────────────────────────────────┘ │ Attest
│ │ +
│ │ Request
│ │ SVIDs
│ │
│ │
│ │
│ │
│ │
│ │
│ │
│ Obtain SPIRE Trust Bundle ▼
┌┴─────────────────────────────────────────────────────────────────────────┐
│ │
│ SPIRE Agent ( Runs as ) │
│ + CSI Driver ( Daemonset ) │
│ │
└──────────────────────────────────────────────────────────────────────────┘
Security! =)
We'll also need this feature for Tekton to achieve SLSA 3, which requires Non-falsifiable Provenance.
- SPIRE is configured to only issue SVIDs to the tekton-pipelines-controller via a k8s Workload Attestor
- Results can be verified as non-falsifiable
- A TaskRun resource can be verified as non-falsifiable
- Clients are confident that TaskRuns were initiated and monitored by the tekton-pipelines-controller
- Protect against a malicious cluster admin (this will be a goal for SLSA 4)
As mentioned above, the basic design looks like this (this is meant to be high level and still needs to be fleshed out a bit):
- Tekton Pipelines receives a TaskRun config, and generates the Pod for it with SPIRE mounted in
- The Pod executes, and the entrypointer requests an SVID and signature over the Results
- Tekton Pipelines verifies the Results against the SPIRE SVID and Trust Bundle and sets the `SignedResultsVerified` condition to `True`.
- Meanwhile, Tekton Pipelines has been verifying that the TaskRun status hasn't been modified during execution
- Tekton Pipelines requests a signature and SVID over the completed TaskRun status from SPIRE
- Tekton Pipelines stores the SVID and signature as annotations on the TaskRun status
- Tekton Chains observes the TaskRun, and verifies that SVID and signature against the TaskRun status
- If verification is successful, Chains proceeds normally. Otherwise, it stops and doesn't sign anything!
Then, Chains gets the TaskRun. Chains will:
- Verify the signature against the SPIRE SVID and Trust Bundle
- If verification fails, Chains will set the `chains.tekton.dev/signed` annotation on the TaskRun to "failed" and move on.
- Otherwise, continue signing stuff!
The plan to achieve non-falsifiable provenance will be implemented in phases.
Phase 1
- Add support for Signed Results with SPIRE (this will primarily involve modifications to the entrypointer image)
- Add support for tekton-pipelines-controller verifying Signed Results
Phase 2
- Implement Signed TaskRuns with SPIRE (requires further design)
- Determine an alternate release process for the SPIRE feature (since the tekton-pipelines-controller will need the SPIRE volume mounted in)
- Add support for Chains verifying Signed TaskRuns
Some ideas that have been mentioned for the release process:
- Having a separate release yaml with the SPIRE volume mounted in (suggested by @bobcatwilson)
- Using the Tekton operator to add the volume in (suggested by @vdemeester)
In parallel with this work, we should:
- Confirm that this meets the SLSA definition of "non-falsifiable", which might require a security audit
We'll need to depend on the github.com/spiffe/go-spiffe library to interact with the SPIRE agent, request SVIDs and signatures.
- The pkg/spire library would have to be kept up to date with new SPIRE releases, and a matrix of feature compatibility with SPIRE versions would have to be maintained
- In terms of deployments, this will be behind a feature flag, and deploying SPIRE and adding it to the deployment model will be optional, so the impact in this regard is minimal
As we depend on a SPIRE deployment for the feature, e2e testing will need to include spinning up a SPIRE deployment. For this, we will use an in-cluster SPIRE Kubernetes deployment (with SPIRE images). Thus, no additional infrastructure is required, and the deployment will have a minimal footprint in a Kubernetes cluster.
For now, this design section will focus on Phase 1 of the implementation plan: Signed Results. This TEP will be updated as we flesh out the design for Signed TaskRuns.
We can add a feature flag `--enforce-nonfalsifiablity=spire`, as described in Customizing the Pipelines Controller behavior, as an alpha feature.
If the feature is enabled, then Pipelines would mount the `csi` Volume into all Pods.
The entrypointer image should also be able to see that this flag is set and accordingly sign Results.
Once results are available to the entrypointer image, it will request an SVID and a signature over each Result. These signed results are verified by the tekton-pipelines-controller and stored as part of the TaskRun status.
For now, signatures of the results will be contained within the termination message of the pod, alongside any additional material required to perform verification. One consideration here is the size of the additional fields required. The cert needed for verification is about 800 bytes, and the signatures take about 100 bytes * (number of result fields + 1). The current termination message size limit is 4K, but TEP-0086 is looking at supporting larger results.
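The size estimate above can be turned into a quick budget calculation. This is a rough sketch that uses only the approximate figures quoted in the text; the constant and function names are illustrative, and the estimate ignores the JSON keys and the result values themselves.

```go
package main

import "fmt"

// Approximate figures quoted in the text; none of them are exact.
const (
	certBytes      = 800  // approximate size of the SVID certificate
	sigBytes       = 100  // approximate size of one signature
	terminationMax = 4096 // termination message size limit (4K)
)

// signingOverhead estimates the bytes that signing adds to the termination
// message for a TaskRun with n results: one certificate plus n+1 signatures
// (one per result and one for the result manifest).
func signingOverhead(n int) int {
	return certBytes + sigBytes*(n+1)
}

func main() {
	for _, n := range []int{1, 2, 10} {
		o := signingOverhead(n)
		fmt.Printf("results=%d overhead=%dB remaining=%dB\n", n, o, terminationMax-o)
	}
}
```

With two results, as in the example that follows, the estimated overhead is 1100 bytes, which leaves roughly 3K of the termination message for the results themselves.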
The scope of signing would be result data itself. Signing of other aspects of pod execution is not something that is in control of Tekton.
An example termination message would be:
message: '[{"key":"RESULT_MANIFEST","value":"foo,bar","type":1},{"key":"RESULT_MANIFEST.sig","value":"MEQCIB4grfqBkcsGuVyoQd9KUVzNZaFGN6jQOKK90p5HWHqeAiB7yZerDA+YE3Af/ALG43DQzygiBpKhTt8gzWGmpvXJFw==","type":1},{"key":"SVID","value":"-----BEGIN
CERTIFICATE-----\nMIICCjCCAbCgAwIBAgIRALH94zAZZXdtPg97O5vG5M0wCgYIKoZIzj0EAwIwHjEL\nMAkGA1UEBhMCVVMxDzANBgNVBAoTBlNQSUZGRTAeFw0yMjAzMTQxNTUzNTlaFw0y\nMjAzMTQxNjU0MDlaMB0xCzAJBgNVBAYTAlVTMQ4wDAYDVQQKEwVTUElSRTBZMBMG\nByqGSM49AgEGCCqGSM49AwEHA0IABPLzFTDY0RDpjKb+eZCIWgUw9DViu8/pM8q7\nHMTKCzlyGqhaU80sASZfpkZvmi72w+gLszzwVI1ZNU5e7aCzbtSjgc8wgcwwDgYD\nVR0PAQH/BAQDAgOoMB0GA1UdJQQWMBQGCCsGAQUFBwMBBggrBgEFBQcDAjAMBgNV\nHRMBAf8EAjAAMB0GA1UdDgQWBBSsUvspy+/Dl24pA1f+JuNVJrjgmTAfBgNVHSME\nGDAWgBSOMyOHnyLLGxPSD9RRFL+Yhm/6qzBNBgNVHREERjBEhkJzcGlmZmU6Ly9l\neGFtcGxlLm9yZy9ucy9kZWZhdWx0L3Rhc2tydW4vbm9uLWZhbHNpZmlhYmxlLXBy\nb3ZlbmFuY2UwCgYIKoZIzj0EAwIDSAAwRQIhAM4/bPAH9dyhBEj3DbwtJKMyEI56\n4DVrP97ps9QYQb23AiBiXWrQkvRYl0h4CX0lveND2yfqLrGdVL405O5NzCcUrA==\n-----END
CERTIFICATE-----\n","type":1},{"key":"bar","value":"world","type":1},{"key":"bar.sig","value":"MEUCIQDOtg+aEP1FCr6/FsHX+bY1d5abSQn2kTiUMg4Uic2lVQIgTVF5bbT/O77VxESSMtQlpBreMyw2GmKX2hYJlaOEH1M=","type":1},{"key":"foo","value":"hello","type":1},{"key":"foo.sig","value":"MEQCIBr+k0i7SRSyb4h96vQE9hhxBZiZb/2PXQqReOKJDl/rAiBrjgSsalwOvN0zgQay0xQ7PRbm5YSmI8tvKseLR8Ryww==","type":1}]'
Parsed, the fields would be:
∙ RESULT_MANIFEST foo,bar
∙ RESULT_MANIFEST.sig MEQCIB4grfqBkcsGuVyoQd9KUVzNZaFGN6jQOKK90p5HWHqeAiB7yZerDA+YE3Af/ALG43DQzygiBpKhTt8gzWGmpvXJFw==
∙ SVID -----BEGIN CERTIFICATE-----
MIICCjCCAbCgAwIBAgIRALH94zAZZXdtPg97O5vG5M0wCgYIKoZIzj0EAwIwHjEL
MAkGA1UEBhMCVVMxDzANBgNVBAoTBlNQSUZGRTAeFw0yMjAzMTQxNTUzNTlaFw0y
MjAzMTQxNjU0MDlaMB0xCzAJBgNVBAYTAlVTMQ4wDAYDVQQKEwVTUElSRTBZMBMG
ByqGSM49AgEGCCqGSM49AwEHA0IABPLzFTDY0RDpjKb+eZCIWgUw9DViu8/pM8q7
HMTKCzlyGqhaU80sASZfpkZvmi72w+gLszzwVI1ZNU5e7aCzbtSjgc8wgcwwDgYD
VR0PAQH/BAQDAgOoMB0GA1UdJQQWMBQGCCsGAQUFBwMBBggrBgEFBQcDAjAMBgNV
HRMBAf8EAjAAMB0GA1UdDgQWBBSsUvspy+/Dl24pA1f+JuNVJrjgmTAfBgNVHSME
GDAWgBSOMyOHnyLLGxPSD9RRFL+Yhm/6qzBNBgNVHREERjBEhkJzcGlmZmU6Ly9l
eGFtcGxlLm9yZy9ucy9kZWZhdWx0L3Rhc2tydW4vbm9uLWZhbHNpZmlhYmxlLXBy
b3ZlbmFuY2UwCgYIKoZIzj0EAwIDSAAwRQIhAM4/bPAH9dyhBEj3DbwtJKMyEI56
4DVrP97ps9QYQb23AiBiXWrQkvRYl0h4CX0lveND2yfqLrGdVL405O5NzCcUrA==
-----END CERTIFICATE-----
∙ bar world
∙ bar.sig MEUCIQDOtg+aEP1FCr6/FsHX+bY1d5abSQn2kTiUMg4Uic2lVQIgTVF5bbT/O77VxESSMtQlpBreMyw2GmKX2hYJlaOEH1M=
∙ foo hello
∙ foo.sig MEQCIBr+k0i7SRSyb4h96vQE9hhxBZiZb/2PXQqReOKJDl/rAiBrjgSsalwOvN0zgQay0xQ7PRbm5YSmI8tvKseLR8Ryww==
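As an illustration of how a consumer might split such a message into results and verification material, here is a minimal Go sketch. The `entry` type mirrors the JSON shown above; `splitSignedResults` and `signedResult` are hypothetical names, not part of Tekton, and the signatures are only paired with their results here, not cryptographically verified.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// entry mirrors one element of the termination message shown above.
type entry struct {
	Key   string `json:"key"`
	Value string `json:"value"`
	Type  int    `json:"type"`
}

// signedResult pairs a result value with its base64-encoded signature.
type signedResult struct {
	Value, Sig string
}

// splitSignedResults (hypothetical helper) separates the SVID from the
// results and fails if the manifest is unsigned or if a result named in
// RESULT_MANIFEST lacks a value or a signature.
func splitSignedResults(msg string) (string, map[string]signedResult, error) {
	var entries []entry
	if err := json.Unmarshal([]byte(msg), &entries); err != nil {
		return "", nil, err
	}
	byKey := make(map[string]string, len(entries))
	for _, e := range entries {
		byKey[e.Key] = e.Value
	}
	svid, ok := byKey["SVID"]
	if !ok {
		return "", nil, fmt.Errorf("termination message has no SVID entry")
	}
	if _, ok := byKey["RESULT_MANIFEST.sig"]; !ok {
		return "", nil, fmt.Errorf("result manifest is not signed")
	}
	results := map[string]signedResult{}
	for _, name := range strings.Split(byKey["RESULT_MANIFEST"], ",") {
		value, hasValue := byKey[name]
		sig, hasSig := byKey[name+".sig"]
		if !hasValue || !hasSig {
			return "", nil, fmt.Errorf("result %q is missing its value or signature", name)
		}
		results[name] = signedResult{Value: value, Sig: sig}
	}
	return svid, results, nil
}

func main() {
	// Short placeholder strings stand in for real signatures and the certificate.
	msg := `[{"key":"RESULT_MANIFEST","value":"foo,bar","type":1},
	{"key":"RESULT_MANIFEST.sig","value":"sig-manifest","type":1},
	{"key":"SVID","value":"<PEM certificate>","type":1},
	{"key":"bar","value":"world","type":1},{"key":"bar.sig","value":"sig-bar","type":1},
	{"key":"foo","value":"hello","type":1},{"key":"foo.sig","value":"sig-foo","type":1}]`
	_, results, err := splitSignedResults(msg)
	if err != nil {
		panic(err)
	}
	fmt.Println(results["foo"].Value, results["bar"].Value)
}
```

Driving the loop from `RESULT_MANIFEST` rather than from the keys that happen to be present means a result that was silently dropped from the message is reported as an error instead of being ignored.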
However, the verification material will be removed from the results that are stored as part of the TaskRun status:
$ tkn tr describe non-falsifiable-provenance
Name: non-falsifiable-provenance
Namespace: default
Service Account: default
Timeout: 1m0s
Labels:
app.kubernetes.io/managed-by=tekton-pipelines
🌡️ Status
STARTED DURATION STATUS
38 seconds ago 36 seconds Succeeded
📝 Results
NAME VALUE
∙ bar world
∙ foo hello
🦶 Steps
NAME STATUS
∙ non-falsifiable Completed
An indication that verification has taken place will be as a condition of the TaskRun status:
    conditions:
    - lastTransitionTime: "2022-03-14T15:54:11Z"
      message: All Steps have completed executing
      reason: Succeeded
      status: "True"
      type: Succeeded
    - lastTransitionTime: "2022-03-14T15:54:11Z"
      message: Successfully verified all spire signed taskrun results
      reason: TaskRunResultsVerified
      status: 'True'
      type: SignedResultsVerified
Each TaskRun status that is written by the tekton-pipelines-controller will be signed to ensure that there is no external tampering of the TaskRun status. Upon each retrieval of the TaskRun, the tekton-pipelines-controller checks if the status is initialized, and that the signature validates the current status. The signature and SVID will be stored as annotations on the TaskRun Status field, and can be verified by a client.
The verification is done on every consumption of the TaskRun except when the TaskRun is uninitialized. When uninitialized, the tekton-pipelines-controller is not influenced by fields in the status and thus will not sign incorrect reflections of the TaskRun.
The spec and TaskRun annotations/labels are not signed as there are valid interactions from other controllers or users (i.e. cancelling taskrun). This is fine as the controller encodes all the necessary information that we care about in the status during initialization. Editing the object annotations/labels or spec will not result in any unverifiable outcome of the status field.
    $ tkn tr describe non-falsifiable-provenance -oyaml
    apiVersion: tekton.dev/v1beta1
    kind: TaskRun
    metadata:
      annotations:
        pipeline.tekton.dev/release: 3ee99ec
      creationTimestamp: "2022-03-04T19:10:46Z"
      generation: 1
      labels:
        app.kubernetes.io/managed-by: tekton-pipelines
      name: non-falsifiable-provenance
      namespace: default
      resourceVersion: "23088242"
      uid: 548ebe99-d40b-4580-a9bc-afe80915e22e
    spec:
      serviceAccountName: default
      taskSpec:
        results:
        - description: ""
          name: foo
        - description: ""
          name: bar
        steps:
        - image: ubuntu
          name: non-falsifiable
          resources: {}
          script: |
            #!/usr/bin/env bash
            sleep 30
            printf "%s" "hello" > "$(results.foo.path)"
            printf "%s" "world" > "$(results.bar.path)"
      timeout: 1m0s
    status:
      annotations:
        tekton.dev/controller-svid: |
          -----BEGIN CERTIFICATE-----
          MIIB7jCCAZSgAwIBAgIRAI8/08uXSn9tyv7cRN87uvgwCgYIKoZIzj0EAwIwHjEL
          MAkGA1UEBhMCVVMxDzANBgNVBAoTBlNQSUZGRTAeFw0yMjAzMDQxODU0NTlaFw0y
          MjAzMDQxOTU1MDlaMB0xCzAJBgNVBAYTAlVTMQ4wDAYDVQQKEwVTUElSRTBZMBMG
          ByqGSM49AgEGCCqGSM49AwEHA0IABL+e9OjkMv+7XgMWYtrzq0ESzJi+znA/Pm8D
          nvApAHg3/rEcNS8c5LgFFRzDfcs9fxGSSkL1JrELzoYul1Q13XejgbMwgbAwDgYD
          VR0PAQH/BAQDAgOoMB0GA1UdJQQWMBQGCCsGAQUFBwMBBggrBgEFBQcDAjAMBgNV
          HRMBAf8EAjAAMB0GA1UdDgQWBBR+ma+yZfo092FKIM4F3yhEY8jgDDAfBgNVHSME
          GDAWgBRKiCg5+YdTaQ+5gJmvt2QcDkQ6KjAxBgNVHREEKjAohiZzcGlmZmU6Ly9l
          eGFtcGxlLm9yZy90ZWt0b24vY29udHJvbGxlcjAKBggqhkjOPQQDAgNIADBFAiEA
          8xVWrQr8+i6yMLDm9IUjtvTbz9ofjSsWL6c/+rxmmRYCIBTiJ/HW7di3inSfxwqK
          5DKyPrKoR8sq8Ne7flkhgbkg
          -----END CERTIFICATE-----
        tekton.dev/status-hash: 76692c9dcd362f8a6e4bda8ccb4c0937ad16b0d23149ae256049433192892511
        tekton.dev/status-hash-sig: MEQCIFv2bW0k4g0Azx+qaeZjUulPD8Ma3uCUn0tXQuuR1FaEAiBHQwN4XobOXmC2nddYm04AZ74YubUyNl49/vnbnR/HcQ==
      completionTime: "2022-03-04T19:11:22Z"
      conditions:
      - lastTransitionTime: "2022-03-04T19:11:22Z"
        message: All Steps have completed executing
        reason: Succeeded
        status: "True"
        type: Succeeded
      - lastTransitionTime: "2022-03-04T19:11:22Z"
        message: Spire verified
        reason: TaskRunResultsVerified
        status: "True"
        type: SignedResultsVerified
      podName: non-falsifiable-provenance-pod
      startTime: "2022-03-04T19:10:46Z"
      steps:
      ...
    <TRUNCATED>
Once Signed TaskRuns are available, we'll add verification of Signed TaskRuns to Chains. If verification fails for a TaskRun, then Chains will not sign it.
Tests for Signed Results:
- Enabling the alpha feature for SPIRE in Tekton
- Verify that the necessary fields are present in the pod status result
- Verify that a TaskRun pod status being modified with incorrect results isn't verified by the tekton-pipelines-controller
Tests for Signed TaskRuns:
- Verify that a TaskRun that's been modified during execution isn't verified by the tekton-pipelines-controller
- Verify that a TaskRun that's been modified after execution isn't verified by the tekton-pipelines-controller
- Verify that a TaskRun that's been modified during execution isn't verified by Chains
- Verify that a TaskRun that's been modified after execution isn't verified by Chains
At the moment, this TEP does not introduce any structural API changes.
We add the condition type `SignedResultsVerified` as a way for the tekton-pipelines-controller to indicate that the TaskRun pod step results are verified.
    - lastTransitionTime: '2022-03-14T12:51:00Z'
      message: Successfully verified all spire signed taskrun results
      reason: TaskRunResultsVerified
      status: 'True'
      type: SignedResultsVerified
For the `SignedResultsVerified` condition, the behavior is as follows:

| status | reason | completionTime is set | Description |
|---|---|---|---|
| True | TaskRunResultsVerified | Yes | The TaskRun results have been verified through validation of their signatures |
| False | TaskRunResultsVerificationFailed | Yes | The TaskRun results' signatures failed to verify |
| Unknown | AwaitingTaskRunResults | No | Waiting for the TaskRun results and signatures to verify |
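The three states of the condition can be expressed as a small mapping. The sketch below is illustrative only: `condition` is a reduced, hypothetical stand-in for the real condition type, and `resultsCondition` is not an existing Tekton function. The status and reason strings are the ones listed in the table.

```go
package main

import "fmt"

// condition is a reduced, hypothetical stand-in for a TaskRun status condition.
type condition struct {
	Type, Status, Reason string
}

// resultsCondition maps verification progress to the SignedResultsVerified
// condition. done reports whether results and signatures are available;
// ok reports whether they verified.
func resultsCondition(done, ok bool) condition {
	c := condition{Type: "SignedResultsVerified"}
	switch {
	case !done:
		c.Status, c.Reason = "Unknown", "AwaitingTaskRunResults"
	case ok:
		c.Status, c.Reason = "True", "TaskRunResultsVerified"
	default:
		c.Status, c.Reason = "False", "TaskRunResultsVerificationFailed"
	}
	return c
}

func main() {
	fmt.Printf("%+v\n", resultsCondition(false, false))
	fmt.Printf("%+v\n", resultsCondition(true, true))
	fmt.Printf("%+v\n", resultsCondition(true, false))
}
```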
The necessary data is stored as part of the termination message of TaskRun pods (as results), and TaskRun signatures are included as part of the embedded `Status` annotation object in the `TaskRun.Status` field. Examples are shown in the sections above.
As the feature matures and graduates to beta/GA, embedding signature metadata into the status object can be considered:
// TaskRunStatus defines the observed state of TaskRun
type TaskRunStatus struct {
duckv1beta1.Status `json:",inline"`
// SignatureMetadata would include the necessary information for signing of the fields of the TaskRun status.
SignatureMetadata `json:",inline"`
// TaskRunStatusFields inlines the status fields.
TaskRunStatusFields `json:",inline"`
}
There are several operations that are considered:
- Signing and Verification operations
- Creating/Deleting a SPIRE entry
- Obtaining an SVID
- Obtaining a Trust Bundle
Performance analysis:
- Signing and verification operations are local operations and should be negligible (in microseconds).
- Creating and deleting a SPIRE entry takes about one round trip time (RTT) to the SPIRE server (plus an RTT from the SPIRE server to the DataStore, on disk or over the network). This is done once during the initialization of the TaskRun (100-500ms).
- Obtaining an SVID takes one RTT to the SPIRE server (plus an RTT from the SPIRE server to the KeyManager, on disk or over the network).
  - For the tekton-pipelines-controller, this is done whenever the certificate expires (usually after one or more hours, configurable). Rather negligible.
  - For each TaskRun pod, the entrypointer obtains the SVID. This should take one RTT to the SPIRE server. However, there is a caveat: the SPIRE server needs to recognize that an entry has been created before it can fulfill the SVID request. Because entries are created just in time (JIT), this can take up to 15 seconds, and it is in the critical path of the entrypointer's signing mechanism.
- Obtaining a Trust Bundle takes a RTT to the SPIRE server. The SPIRE server should have this ready to serve and only needs to refresh it once in a long while (~hours if not days) with the UpstreamAuthority.
Overall, there is minimal performance impact, with a slight consideration for TaskRun pod cold start-up latency.
The main single point of failure (SPOF) around the signing and verification ecosystem is the SPIRE server. Prolonged downtime of a SPIRE server would lead to the inability to sign/verify.
However, if the SPIRE server is only down momentarily, workloads that have already obtained information from the SPIRE server would be able to continue operating for as long as their certificates are valid. In our particular implementation, this implies that:
- The tekton-pipelines-controller and Tekton Chains would be able to continue verifying TaskRuns for the certificate validity period
- The TaskRun pods would be able to sign if they have already obtained their SVIDs
- No new TaskRun pods would be able to sign while the SPIRE server is down, as new entries can't be created and new SVIDs can't be minted
Another failure could be in the SPIRE agents (daemonset), which would make it impossible to obtain SVIDs on the affected node while its agent is down (with similar caveats for temporary failure and cert lifetime).
When the verifiers are unable to verify a document, either because the hashes don't match up or it is unable to obtain the trust bundle, what should the desirable action be? There are several options here:
- Stop execution of a TaskRun
- Indicate that the TaskRun is no longer verifiable but continue execution
While this is at an early stage, we opt for indicating that a document is not verifiable. Future additions can include the ability to configure the action to be taken.
The following details how verification is done with relation to the verification authority (SPIRE server), as well as the materials produced by the signing process. The following are needed in order to perform verification, along with their purpose in the verification:
- Trust Bundle: Verification authority (CA) - provided by SPIRE server
- Cert (x509): Records that a key-pair K belongs to a workload/pod X, with this information endorsed by the verification authority (CA)
- Signature: Evidence, produced by key-pair K, that the content was signed by the holder of K
The verification process is as follows:
- Obtain the Trust Bundle independently from the SPIRE server
- Obtain the x509 cert and the signature from the signed object metadata
- Verify that the x509 cert is endorsed/signed by the authority in the Trust Bundle
- Verify that the x509 cert belongs to the right workload, i.e. if signed by the tekton-pipelines-controller, the cert should indicate the URI of the tekton-pipelines-controller, and the same for each individual TaskRun
- Verify that the signature was signed by the key as indicated in the x509 cert
As we saw above, when an object is signed, we create a signature and accompany it with an x509 certificate that links the signature to the signer (the workload that signs it, whether a TaskRun pod or the tekton-pipelines-controller). The keys used to generate the signature are short-lived, and the SPIRE server does not keep track of each individual certificate it issues. The controller must still be able to verify signatures that executing pods generated with these short-lived keys. Therefore, the x509 certificate (and with it the public key) corresponding to the short-lived key must be stored with the signature data.
There are 5 main components in this threat model:
- tekton-pipelines-controller (Signer/Verifier)
- Chains (Verifier)
- TaskRun Pod/Entrypointer (Signer)
- SPIRE Server (Authority)
- SPIRE Agent (Workload Attestor)
There are several capabilities that we want to ensure are correct:
- Verification that signed Results are signed by the correct TaskRun Pod/Entrypointer step
  - Verified by tekton-pipelines-controller through SPIRE Authority
- Verification that signed TaskRuns are signed by the tekton-pipelines-controller
  - Verified by tekton-pipelines-controller and Chains through SPIRE Authority
Due to the implementation of these different components being similar in nature of how they interact with the architecture (k8s, SPIRE, etc.), the two cases can be evaluated against the same threats.
Threat vectors around verification fall into several categories:
- Ability to create a verifiable false signature
- Ability to cause the verification step to be skipped
- Ability to cause verification against a false authority
Potential threats:
- Access to SPIRE upstream authority lets an attacker sign arbitrary values.
- Access to pod execution environment allows minting false signatures for that TaskRun step
- Through exec'ing into a pod (via api-server).
- Vertical attack from host.
- Ability to trick the minting of a false identity, or underspecified measurement/verification of identity (two signing entities are the same which should not be).
Potential mitigations:
- To prevent upstream authority from being breached, the SPIRE server should be located external to the cluster, and use a non-file upstream authority plugin (i.e. vault, gcp_cas, etc.).
- To prevent pod execution environment access:
- k8s should be configured to disallow exec'ing into pods.
- Underlying host should be hardened and protected.
- Memory introspection features should be disabled to prevent memory introspection of keys.
- Disallow ptrace in pods.
- Ensure the TaskRuns created are uniquely identifiable. The SPIFFE ID used should be uniquely identifiable, e.g. Pod IDs can be included in TaskRun identifiers. Discussion around this is fairly open-ended as of now.
- Ensure protection and monitoring of host and SPIRE agents.
- Ensure that SPIRE attestation of cluster nodes, and tekton-pipelines-controller/Chains workloads are properly configured
Potential threats:
- Ability to influence execution of the verifier.
- Ability to create a denial of service to verifier external service to skip verification step.
Potential mitigations:
- Ensure that the verifier binaries are immutable and verifiable (i.e. binary authorization on controllers).
- Proper error handling on failure cases in the verifier.
Potential threats:
- Ability to intercept and mutate the Trust Bundles obtained by verifier (MITM attack).
- Ability to modify Trust Bundles used by verifier.
- Ability to modify upstream authority.
Potential mitigations:
- Ensure mTLS with the SPIRE server, using the correct authorities.
- To prevent Trust Bundles from being modified, the SPIRE server should be located external to the cluster.
- To better protect the upstream authority, use a non-file upstream authority plugin (e.g. vault, gcp_cas, etc.), and lock down the upstream authority service.
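To show why the integrity of the Trust Bundle matters, the following sketch uses only the Go standard library: a leaf certificate carrying a SPIFFE ID verifies against the bundle that issued it and fails against a swapped bundle. The helpers `newCA`, `newLeaf`, and `verifyAgainstBundle` are hypothetical, and the self-signed CAs merely stand in for SPIRE trust domain roots.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"net/url"
	"time"
)

// newCA creates a self-signed CA, standing in for the root of a trust domain.
func newCA(name string) (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: name},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// newLeaf issues a leaf certificate that carries a SPIFFE ID as a URI SAN.
func newLeaf(ca *x509.Certificate, caKey *ecdsa.PrivateKey, spiffeID string) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	uri, err := url.Parse(spiffeID)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(2),
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		URIs:         []*url.URL{uri},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, ca, &key.PublicKey, caKey)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// verifyAgainstBundle succeeds only if leaf chains to a root in the bundle.
func verifyAgainstBundle(leaf *x509.Certificate, bundle ...*x509.Certificate) error {
	roots := x509.NewCertPool()
	for _, c := range bundle {
		roots.AddCert(c)
	}
	_, err := leaf.Verify(x509.VerifyOptions{
		Roots:     roots,
		KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageAny},
	})
	return err
}

func main() {
	trusted, trustedKey, err := newCA("trusted-root")
	if err != nil {
		panic(err)
	}
	rogue, _, err := newCA("rogue-root")
	if err != nil {
		panic(err)
	}
	leaf, err := newLeaf(trusted, trustedKey, "spiffe://example.org/ns/default/taskrun/demo")
	if err != nil {
		panic(err)
	}
	fmt.Println("genuine bundle accepted:", verifyAgainstBundle(leaf, trusted) == nil)
	fmt.Println("swapped bundle accepted:", verifyAgainstBundle(leaf, rogue) == nil)
}
```

The same property works in the attacker's favor if the bundle can be replaced: a verifier holding the rogue root would accept certificates the rogue authority issues, which is why the bundle's transport and storage need protection.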
Instead of SPIRE, we could potentially use Kubernetes Service Account Token Volume Projection for signing. This is a form of keyless signing, which is described in detail in Zero-friction “keyless signing” with Kubernetes by mattmoor@.
Instead of requesting an SVID and signature from SPIRE, the Tekton Pipelines Controller would use this keyless signing to request a certificate from Fulcio. Chains would verify the signature against this certificate instead.
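The verification step in this alternative can be sketched as follows. This is an illustration only: the certificate is self-signed as a stand-in for one issued by Fulcio, no Fulcio or Sigstore client is involved, and the helper names (`newKey`, `issueCert`, `sign`, `verifyWithCert`) are hypothetical.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/sha256"
	"crypto/x509"
	"crypto/x509/pkix"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// newKey generates the ephemeral signing key used by the signer.
func newKey() (*ecdsa.PrivateKey, error) {
	return ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
}

// issueCert returns a self-signed certificate for key. In the flow described
// above the certificate would come from Fulcio; self-signing is a stand-in.
func issueCert(key *ecdsa.PrivateKey) (*x509.Certificate, error) {
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "tekton-pipelines-controller"},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(10 * time.Minute),
		KeyUsage:     x509.KeyUsageDigitalSignature,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// sign produces an ASN.1 ECDSA signature over the SHA-256 digest of payload.
func sign(key *ecdsa.PrivateKey, payload []byte) ([]byte, error) {
	digest := sha256.Sum256(payload)
	return ecdsa.SignASN1(rand.Reader, key, digest[:])
}

// verifyWithCert checks sig against the public key embedded in cert.
func verifyWithCert(cert *x509.Certificate, payload, sig []byte) error {
	pub, ok := cert.PublicKey.(*ecdsa.PublicKey)
	if !ok {
		return errors.New("certificate does not hold an ECDSA key")
	}
	digest := sha256.Sum256(payload)
	if !ecdsa.VerifyASN1(pub, digest[:], sig) {
		return errors.New("signature does not match payload")
	}
	return nil
}

func main() {
	key, err := newKey()
	if err != nil {
		panic(err)
	}
	cert, err := issueCert(key)
	if err != nil {
		panic(err)
	}
	payload := []byte(`{"taskRun":"demo","results":[]}`)
	sig, err := sign(key, payload)
	if err != nil {
		panic(err)
	}
	fmt.Println("original payload:", verifyWithCert(cert, payload, sig))
	fmt.Println("tampered payload:", verifyWithCert(cert, []byte("tampered"), sig))
}
```

A real verifier would additionally check that the certificate chains to the issuing authority and was valid at signing time; the sketch covers only the signature check itself.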
Pros:
- Much easier to set up, since it wouldn't require installation of a new tool
- Wouldn't require any changes to the release.yaml for Tekton Chains
Cons:
- The cert is tied to a service account rather than to a specific workload (so we could prove the certificate was requested by something running under the tekton-pipelines-controller service account, but not the controller itself)
- SPIRE offers much more control over the policy for granting SVIDs
If we decide to enable non-falsifiability with this method instead of, or in addition to, SPIRE, then we can add it in as another option, e.g. --enforce-nonfalsifiablity=sa-volume-projection.
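One way such an option could be validated is with a custom `flag.Value`, sketched below. Only the flag name and the `sa-volume-projection` value come from the text above; the other accepted values (`none`, `spire`), the default, and the helper names are assumptions for illustration.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// nonFalsifiabilityMode is a flag value restricted to a fixed set of modes.
type nonFalsifiabilityMode string

// allowedModes lists the accepted values. "none" and "spire" are assumed here;
// "sa-volume-projection" is the additional option discussed above.
var allowedModes = map[string]bool{
	"none":                 true,
	"spire":                true,
	"sa-volume-projection": true,
}

// String implements flag.Value.
func (m *nonFalsifiabilityMode) String() string {
	if m == nil {
		return ""
	}
	return string(*m)
}

// Set implements flag.Value and rejects unknown modes.
func (m *nonFalsifiabilityMode) Set(v string) error {
	if !allowedModes[v] {
		return fmt.Errorf("unsupported non-falsifiability mode %q", v)
	}
	*m = nonFalsifiabilityMode(v)
	return nil
}

// parseMode parses args and returns the selected mode, defaulting to "none".
func parseMode(args []string) (string, error) {
	fs := flag.NewFlagSet("controller", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the output in this sketch
	mode := nonFalsifiabilityMode("none")
	fs.Var(&mode, "enforce-nonfalsifiablity", "mechanism used to enforce non-falsifiability")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return string(mode), nil
}

func main() {
	mode, err := parseMode([]string{"--enforce-nonfalsifiablity=sa-volume-projection"})
	if err != nil {
		panic(err)
	}
	fmt.Println(mode)
}
```

Rejecting unknown values at parse time keeps a typo in the option from silently disabling enforcement.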
Service meshes such as Istio or Linkerd can, among many other things, provide an identity and inject policy and verification at the workload level, which is part of what we want to achieve.
Pros:
- Integrates the workload identity and attestation aspects into the installation
- Many different features that could be used by other aspects of Tekton in the future
Cons:
- Many of the other features of service meshes would be unused by Tekton
- Most service meshes rely on an underlying workload identity framework like SPIRE, so a mesh would be a heavyweight solution for what we are trying to accomplish
- Increases the necessary Trusted Computing Base (TCB) by a significant amount, with little gain since we would not use the other features (sidecars, mTLS, etc.)
- Less control of the attestation process handled by the service mesh
Instead of running SPIRE, we could potentially set up a similar minimal infrastructure around Tekton by integrating with secret stores and identity providers individually, and building attestation into the process. There is very little upside to doing this, as we would essentially be re-creating SPIRE. Since SPIRE is already pluggable, there is little incentive to rebuild a similar solution to fit our use case.
More information on each individual component and how it relates to SPIRE is available here.
Other workload identity providers fall into two categories: other generic self-deployable solutions like SPIRE, and vendor-specific solutions that are tied to a cloud provider.
One of the only other known technologies that fits the same space as SPIRE is Athenz. Here is a comparison of the technologies done by Athenz themselves. The main competitive edge that Athenz presents is the management of workload identities. However, this feature is not needed in our case, since most of the management required is automated and handled by the tekton-pipelines-controller.
Cloud providers also have workload identities built in, for example, GKE, Azure, etc. These workload identity offerings provide strong, well-integrated attestation for the workloads on their platforms.
Pros:
- Infrastructure is integrated and provided
- Generally strong attestation, since the cloud provider is well positioned to query infrastructure APIs and perform out-of-band authentication
Cons:
- Provider identity schema and APIs may not match the requirements to attest certain properties on the TaskRun pod level
- Not one-size-fits-all; each provider would need its own integration
- Need to perform federation if working between clusters
We'll probably need a persistent k8s cluster with SPIRE installed to run tests against.
Test clusters will also need Pipelines and Chains installed (currently they only have Pipelines installed).
- Zero-Trust Supply Chains by dlorenc@
- Zero-friction “keyless signing” with Kubernetes by mattmoor@