-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFD 68: Session recording modes #11631
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me
rfd/0064-audit-modes.md
Outdated
**Note**: Currently, the option `record_session.desktop` uses a boolean value to | ||
disable/enable session recording. It will have to be converted into a new | ||
format: `true` will be charged to the default audit mode, and `false` will be | ||
`off`. Disabling session recording is out of the scope of this RFD, meaning that | ||
the `off` value will only be present on the `desktop` kind. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if switching desktop field type from bool to string will be backward comparable change.
rfd/0064-audit-modes.md
Outdated
app: strict|best_effort | ||
db: strict|best_effort | ||
k8s: strict|best_effort | ||
desktop: strict|best_effort|off |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a technical note, probably not relevant here but may cause some confusion during development
on/off
will be treated as bool in yaml: https://yaml.org/type/bool.html
We must surround it with "
to ensure it is treated as string: "off"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
A few questions:
- Can we change
all
todefault
? Then it would look and feel more like a switch statement:
options:
record_sessions:
desktop: strict
ssh: strict
default: best-effort
-
Is
all
/default
required? What happens if it isn't specified? I guess we could assume there's an implicit default of best-effort (today's behavior) if not specified. -
I think we should be clear about precedence, since this is role-based and a user can have multiple roles. For example, what would the behavior be if my user has these two roles:
role1:
options:
record_sessions:
desktop: best-effort
ssh: best-effort
default: strict
role2:
options:
record_sessions:
kube: strict
ssh: strict
My guess - if they:
- start a kube session: role2's strict option should apply
- start an ssh session: role2's strict option should apply, since it is stricter than role1's best-effort option
- start an app session, role1's default of strict applies, because it is stricter than role2's "implicit" default of "best-effort"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I agree about the change and making it like a switch statement;
- It is not going to be required. Currently, the current behavior is kind of mixed. It is strict when starting new sessions, but it is the best effort for ongoing sessions. So, I'm not sure what value should be the default. Do you have any suggestions?
- That's correct. Thanks for your example. I'll add to the RFD to make it more clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the RFD with the explanation and example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little bit concerned that we're taking an option that was meant to enable/disable and making it mean something else.
There's already a lot of complexity for reconciling different roles with the strict and best-effort options. If/when we add 'off' as a state, things get even trickier.
rfd/0064-audit-modes.md
Outdated
| `node` and `proxy` | I/O errors while writing audit logs/session recording data on the server disk. | | ||
| `node-async` and `proxy-async` | Connection errors with Proxy/Auth server while streaming audit logs/session recording data. | | ||
|
||
There are going to be two audit modes: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this RFD should really be called "recording modes" not "audit modes".
Audit events are always emitted directly to the auth server and never written to disk first, right? If this is true, then running out of disk isn't so much of an issue for audit events.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Audit events are always emitted directly to the auth server and never written to disk first, right?
Not really. In some cases, they're also written into the disk. We use a TeeStreamer
when the recording mode is not sync (like in the SSH sessions), so those events are sent to the disk and auth server (except for a few events). For example, looking at the session recording we can see the start and end events:
$ tsh play -f json <session-id>
{"ei":0,"event":"session.start", ...}
{"ei":1,"event":"print", ...}
...
{"ei":10,"event":"print", ...}
{"ei":11,"event":"session.end", ...}
However, failing to emit an audit event will only log an error message since they're already sent to auth server.
If this is true, then running out of disk isn't so much of an issue for audit events.
Right, this is not an audit events issue. But the RFD mentions it will also cover errors while emitting events to auth server (we can consider removing this part if it makes sense).
rfd/0064-audit-modes.md
Outdated
### Configuration | ||
|
||
The configuration is going to be done at the role level. The role option | ||
`record_session` will be extended to hold the audit mode values: `strict` and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
`record_session` will be extended to hold the audit mode values: `strict` and | |
`record_sessions` will be extended to hold the audit mode values: `strict` and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The option is called record_session
. Do you think we should rename it?
rfd/0064-audit-modes.md
Outdated
app: strict|best_effort | ||
db: strict|best_effort | ||
k8s: strict|best_effort | ||
desktop: strict|best_effort|off |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
A few questions:
- Can we change
all
todefault
? Then it would look and feel more like a switch statement:
options:
record_sessions:
desktop: strict
ssh: strict
default: best-effort
-
Is
all
/default
required? What happens if it isn't specified? I guess we could assume there's an implicit default of best-effort (today's behavior) if not specified. -
I think we should be clear about precedence, since this is role-based and a user can have multiple roles. For example, what would the behavior be if my user has these two roles:
role1:
options:
record_sessions:
desktop: best-effort
ssh: best-effort
default: strict
role2:
options:
record_sessions:
kube: strict
ssh: strict
My guess - if they:
- start a kube session: role2's strict option should apply
- start an ssh session: role2's strict option should apply, since it is stricter than role1's best-effort option
- start an app session, role1's default of strict applies, because it is stricter than role2's "implicit" default of "best-effort"
The more I think about this, I wonder if it should be tied to roles at all. Roles are user-based (and since a user can have many roles the behavior quickly gets complicated and hard to understand, see my previous comment about precedence). Whether or not to abort a session if we can't continue recording seems like maybe it should be more based on the resource, not the role (or roles) of the user. I wonder what it would look like if this was based on labels instead. For example, maybe it's a cluster-wide setting with label selectors. recording_preference:
# recording must be strictly enforced for all resources in the prod env
- mode: strict
resources: all
selectors:
env: prod
# recording must also be strictly enforced for desktops and kube clusters
# in the staging env
- mode: strict
resources: [ desktop, kube ]
selectors:
env: staging
# for everything else, best effort is okay
# note: maybe we don't need this, and we always assume best-effort unless
# an explicit match for a strict policy is found..
- mode: best-effort
resources: all
selectors:
'*': '*' Having this in one place might be easier for users to reason about, rather than having to understand our precedence rules and manually compare each of a user's roles when the behavior doesn't match their expectations. This has the added advantage of not conflicting with the enable/disable behavior we have for desktops today. What do you think, @klizhentas? |
I don't think making this a part of cluster recording preference fully solves the original problem tbh. We need to let users specify "strict" recording/audit mode but also provide a "best_effort" escape hatch for certain cases. So the behavior needs to be dynamic and I think roles are a good place for this kind of stuff. Esp. paired with things like access requests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks mostly good to me, I only have a couple of general comments:
- Be more specific about "audit logging" vs "session recording" terminology.
- Make sure that any changes to existing settings record_session.desktop are backwards compatible.
rfd/0064-audit-modes.md
Outdated
|
||
## Details | ||
|
||
Audit modes will define how Teleport proceeds in case of audit failures. Those |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be specific with "audit" vs "session recording" terminology.
For audit events I think we only do "best effort" now since IIRC we basically ignore all errors from EmitAuditEvent (unless I'm mistaken). I would include it in the RFD also but focus on the session recording first to fix the original problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not in all cases. For example, in application access, failing to emit events causes the request to fail.
Do you think we should also cover these cases?
Co-authored-by: Zac Bergquist <[email protected]>
@marcoandredinis @zmb3 @r0mant Thanks for your review. After speaking with @smallinsky, we agreed to reduce the scope of this RFD to only session recordings (removing the audit logging part). I've made some updates to the contents, and some comments might be no longer valid. The idea is to focus this RFD on the initial issue regarding accessing nodes that have disk issues. After covering and implementing it, we can work on supporting more protocols on the session recording modes (for example, desktop) and, if we feel it is necessary, work on another RFD to cover audit logging (events emitted by the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Rendered version.