Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFD 68: Session recording modes #11631

Merged
merged 11 commits into from
Jun 6, 2022
Merged

Conversation

gabrielcorado
Copy link
Contributor

@gabrielcorado gabrielcorado commented Mar 31, 2022

@gabrielcorado gabrielcorado added the rfd Request for Discussion label Mar 31, 2022
@gabrielcorado gabrielcorado self-assigned this Mar 31, 2022
Copy link
Contributor

@klizhentas klizhentas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me

Comment on lines 61 to 65
**Note**: Currently, the option `record_session.desktop` uses a boolean value to
disable/enable session recording. It will have to be converted into a new
format: `true` will be charged to the default audit mode, and `false` will be
`off`. Disabling session recording is out of the scope of this RFD, meaning that
the `off` value will only be present on the `desktop` kind.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if switching desktop field type from bool to string will be backward comparable change.

app: strict|best_effort
db: strict|best_effort
k8s: strict|best_effort
desktop: strict|best_effort|off
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a technical note, probably not relevant here but may cause some confusion during development
on/off will be treated as bool in yaml: https://yaml.org/type/bool.html
We must surround it with " to ensure it is treated as string: "off"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

A few questions:

  1. Can we change all to default? Then it would look and feel more like a switch statement:
options:
  record_sessions:
    desktop: strict
    ssh: strict
    default: best-effort
  1. Is all / default required? What happens if it isn't specified? I guess we could assume there's an implicit default of best-effort (today's behavior) if not specified.

  2. I think we should be clear about precedence, since this is role-based and a user can have multiple roles. For example, what would the behavior be if my user has these two roles:

role1:

options:
  record_sessions:
    desktop: best-effort
    ssh: best-effort
    default: strict

role2:

options:
  record_sessions:
    kube: strict
    ssh: strict

My guess - if they:

  • start a kube session: role2's strict option should apply
  • start an ssh session: role2's strict option should apply, since it is stricter than role1's best-effort option
  • start an app session, role1's default of strict applies, because it is stricter than role2's "implicit" default of "best-effort"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I agree about the change and making it like a switch statement;
  2. It is not going to be required. Currently, the current behavior is kind of mixed. It is strict when starting new sessions, but it is the best effort for ongoing sessions. So, I'm not sure what value should be the default. Do you have any suggestions?
  3. That's correct. Thanks for your example. I'll add to the RFD to make it more clear.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the RFD with the explanation and example.

Copy link
Collaborator

@zmb3 zmb3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little bit concerned that we're taking an option that was meant to enable/disable and making it mean something else.

There's already a lot of complexity for reconciling different roles with the strict and best-effort options. If/when we add 'off' as a state, things get even trickier.

rfd/0064-audit-modes.md Outdated Show resolved Hide resolved
| `node` and `proxy` | I/O errors while writing audit logs/session recording data on the server disk. |
| `node-async` and `proxy-async` | Connection errors with Proxy/Auth server while streaming audit logs/session recording data. |

There are going to be two audit modes:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this RFD should really be called "recording modes" not "audit modes".

Audit events are always emitted directly to the auth server and never written to disk first, right? If this is true, then running out of disk isn't so much of an issue for audit events.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Audit events are always emitted directly to the auth server and never written to disk first, right?

Not really. In some cases, they're also written into the disk. We use a TeeStreamer when the recording mode is not sync (like in the SSH sessions), so those events are sent to the disk and auth server (except for a few events). For example, looking at the session recording we can see the start and end events:

$ tsh play -f json <session-id>
{"ei":0,"event":"session.start", ...}
{"ei":1,"event":"print", ...}
...
{"ei":10,"event":"print", ...}
{"ei":11,"event":"session.end", ...}

However, failing to emit an audit event will only log an error message since they're already sent to auth server.

If this is true, then running out of disk isn't so much of an issue for audit events.

Right, this is not an audit events issue. But the RFD mentions it will also cover errors while emitting events to auth server (we can consider removing this part if it makes sense).

### Configuration

The configuration is going to be done at the role level. The role option
`record_session` will be extended to hold the audit mode values: `strict` and
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
`record_session` will be extended to hold the audit mode values: `strict` and
`record_sessions` will be extended to hold the audit mode values: `strict` and

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The option is called record_session. Do you think we should rename it?

app: strict|best_effort
db: strict|best_effort
k8s: strict|best_effort
desktop: strict|best_effort|off
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

A few questions:

  1. Can we change all to default? Then it would look and feel more like a switch statement:
options:
  record_sessions:
    desktop: strict
    ssh: strict
    default: best-effort
  1. Is all / default required? What happens if it isn't specified? I guess we could assume there's an implicit default of best-effort (today's behavior) if not specified.

  2. I think we should be clear about precedence, since this is role-based and a user can have multiple roles. For example, what would the behavior be if my user has these two roles:

role1:

options:
  record_sessions:
    desktop: best-effort
    ssh: best-effort
    default: strict

role2:

options:
  record_sessions:
    kube: strict
    ssh: strict

My guess - if they:

  • start a kube session: role2's strict option should apply
  • start an ssh session: role2's strict option should apply, since it is stricter than role1's best-effort option
  • start an app session, role1's default of strict applies, because it is stricter than role2's "implicit" default of "best-effort"

rfd/0064-audit-modes.md Outdated Show resolved Hide resolved
@zmb3
Copy link
Collaborator

zmb3 commented Apr 1, 2022

The more I think about this, I wonder if it should be tied to roles at all.

Roles are user-based (and since a user can have many roles the behavior quickly gets complicated and hard to understand, see my previous comment about precedence).

Whether or not to abort a session if we can't continue recording seems like maybe it should be more based on the resource, not the role (or roles) of the user.

I wonder what it would look like if this was based on labels instead.

For example, maybe it's a cluster-wide setting with label selectors.

recording_preference:
  # recording must be strictly enforced for all resources in the prod env
  - mode: strict
    resources: all
    selectors:
      env: prod
  # recording must also be strictly enforced for desktops and kube clusters
  # in the staging env
  - mode: strict
    resources: [ desktop, kube ]
    selectors:
      env: staging
  # for everything else, best effort is okay
  # note: maybe we don't need this, and we always assume best-effort unless
  # an explicit match for a strict policy is found..
  - mode: best-effort
    resources: all
    selectors:
      '*': '*'

Having this in one place might be easier for users to reason about, rather than having to understand our precedence rules and manually compare each of a user's roles when the behavior doesn't match their expectations.

This has the added advantage of not conflicting with the enable/disable behavior we have for desktops today.

What do you think, @klizhentas?

@r0mant
Copy link
Collaborator

r0mant commented Apr 1, 2022

I don't think making this a part of cluster recording preference fully solves the original problem tbh. We need to let users specify "strict" recording/audit mode but also provide a "best_effort" escape hatch for certain cases. So the behavior needs to be dynamic and I think roles are a good place for this kind of stuff. Esp. paired with things like access requests.

Copy link
Collaborator

@r0mant r0mant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks mostly good to me, I only have a couple of general comments:

  • Be more specific about "audit logging" vs "session recording" terminology.
  • Make sure that any changes to existing settings record_session.desktop are backwards compatible.


## Details

Audit modes will define how Teleport proceeds in case of audit failures. Those
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be specific with "audit" vs "session recording" terminology.

For audit events I think we only do "best effort" now since IIRC we basically ignore all errors from EmitAuditEvent (unless I'm mistaken). I would include it in the RFD also but focus on the session recording first to fix the original problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in all cases. For example, in application access, failing to emit events causes the request to fail.

Do you think we should also cover these cases?

rfd/0064-audit-modes.md Outdated Show resolved Hide resolved
rfd/0064-audit-modes.md Outdated Show resolved Hide resolved
@gabrielcorado gabrielcorado changed the title RFD: Audit modes RFD 68: Session recording modes Apr 8, 2022
@gabrielcorado
Copy link
Contributor Author

gabrielcorado commented Apr 8, 2022

@marcoandredinis @zmb3 @r0mant Thanks for your review. After speaking with @smallinsky, we agreed to reduce the scope of this RFD to only session recordings (removing the audit logging part). I've made some updates to the contents, and some comments might be no longer valid.

The idea is to focus this RFD on the initial issue regarding accessing nodes that have disk issues. After covering and implementing it, we can work on supporting more protocols on the session recording modes (for example, desktop) and, if we feel it is necessary, work on another RFD to cover audit logging (events emitted by the EmitAuditEvent function).

Copy link
Contributor

@marcoandredinis marcoandredinis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

rfd/0068-session-recording-modes.md Outdated Show resolved Hide resolved
rfd/0068-session-recording-modes.md Outdated Show resolved Hide resolved
@gabrielcorado gabrielcorado enabled auto-merge (squash) June 6, 2022 20:39
@gabrielcorado gabrielcorado merged commit fdc60b2 into master Jun 6, 2022
@gabrielcorado gabrielcorado deleted the gabrielcorado/rfd-audit-modes branch November 2, 2022 00:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
rfd Request for Discussion
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants