Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add session cleanup audit logging #122419

Merged
merged 8 commits into from
Jan 12, 2022

Conversation

thomheymann
Copy link
Contributor

@thomheymann thomheymann commented Jan 6, 2022

Resolves: #119490

Summary

Add audit logging events for session cleanup

Testing

  1. Configure Kibana:
xpack.security.audit.enabled: true
xpack.security.session.lifespan: "30s"
xpack.security.session.cleanupInterval: "10s"
  1. Log in and immediately close the browser window
  2. You should see the following event get logged in logs/audit.log after 40s
{
  "@timestamp": "2022-01-04T18:56:29.541+00:00",
  "message": "Removing invalid or expired session for user [hash=eb28536c8ead72bf81a0a9226e38fc9bad81f5e07c2081bb801b2a5c8842924e]",
  "event": {
    "action": "session_cleanup",
    "category": [
      "authentication"
    ],
    "outcome": "unknown"
  },
  "user": {
    "hash": "eb28536c8ead72bf81a0a9226e38fc9bad81f5e07c2081bb801b2a5c8842924e"
  },
  "kibana": {
    "session_id": "o6SMKslvp1Ab9Edhm118PqM/zmz9+0m4mWLIc5tA4pQ=",
    "authentication_provider": "basic",
    "authentication_type": "basic"
  }
}

@thomheymann thomheymann added release_note:enhancement Team:Security Team focused on: Auth, Users, Roles, Spaces, Audit Logging, and more! v8.0.0 Feature:Security/Audit Platform Security - Audit Logging feature auto-backport Deprecated - use backport:version if exact versions are needed v8.1.0 v7.17.0 v8.2.0 labels Jan 6, 2022
@thomheymann thomheymann requested a review from a team January 6, 2022 09:46
@thomheymann thomheymann marked this pull request as ready for review January 6, 2022 10:01
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-security (Team:Security)

@jportner jportner removed the v8.2.0 label Jan 6, 2022
Copy link
Contributor

@jportner jportner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments below.

Nice job on this so far, and I like the new audit-related TSdocs!

Comment on lines 494 to 501
const { body: searchResponse } =
await this.options.elasticsearchClient.search<SessionIndexValue>(
{
index: this.indexName,
body: { query: { bool: { should: deleteQueries } } },
_source_includes: 'usernameHash,provider',
},
{ ignore: [409, 404] }
);
}
Copy link
Contributor

@jportner jportner Jan 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we spoke about this last week we talked a bit about scale:

It's not very probable, but if an operator has Kibana configured to clean up sessions very infrequently and/or there are tons of users in a multi-Kibana setup, we could have lots of sessions to clean up, potentially hundreds of thousands.

I believe I suggested generating some test sessions in ES and, in a single task, attempting to delete up to 100k sessions (paging through 10k at a time, which is the max for any given search). This sounded like a reasonable starting point and I thought we should see how long it takes Kibana to do that -- we don't want to "clog up" the task manager, so we might need to dial that down depending on how long it takes (could ask folks in the Alerting team for input regarding this). Anything over that limit would simply be left for the next cleanup run to deal with.

Here are the docs for pagination:

By default, searches return the top 10 matching hits. To page through a larger set of results, you can use the search API's from and size parameters. The from parameter defines the number of hits to skip, defaulting to 0. The size parameter is the maximum number of hits to return. Together, these two parameters define a page of results.
...
By default, you cannot use from and size to page through more than 10,000 hits. This limit is a safeguard set by the index.max_result_window index setting. If you need to page through more than 10,000 hits, use the search_after parameter instead.

The way this code is currently written, Kibana will only find/delete at most 10 sessions during the cleanup phase.


This might be overkill, we could consider just finding/deleting 10k sessions per cleanup (240k per day by default) and leaving it at that. It would probably be fine for 99.99% of our users today, but I wonder if that will still be true 1, 2, or 5 years from now. Also, we can never predict exactly how users will (ab)use our configuration options 😄 we might have some folks who change their session cleanup interval to 30 days!

The pagination approach is more work, but it should allow us to delete up to an order of magnitude more expired sessions during each cleanup, and my gut feeling is that is the better thing to do.

Also, it might not be a bad idea to collect usage data on how many sessions are getting created / deleted every day.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good point, thanks! I mixed up the default with the maximum pagination size.

I've added pagination now so we're removing sessions in batches.

Copy link
Contributor

@jportner jportner Jan 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the confusion on this...

Unfortunately you haven't actually added pagination since you're just using the size option, if there are 10,001 hits you've got an infinite loop because 1. you'll keep fetching the first 10k hits each time it loops, and 2. you are using hits.total which is the grand total for all hits (it's not the total for what was returned in this page).

You have to use either of these combinations:

  • size and from (for < 10k hits)
  • size and search_after (for any number of hits)

In either case you should really use a point-in-time (PIT), as specified in the search docs for pagination using search_after. If you don't do that, the index state could change between pages and your results will be incorrect.

In this situation we are concerned with there potentially being hundreds of thousands of results, so you should use size and search_after.

Also, I don't think you should use a page size of 1k, the size can be up to 10k. Since you aren't fetching a lot of fields, and this isn't a hot code path, I'd suggest using a size of 10k.

So you'll need to change this to:

  1. Create a PIT
  2. Page over 10k results at a time (up to 100k)
  3. Log a warning if there are over 100k total hits
  4. Clean up the PIT
  5. Delete the expired sessions

I'd also like to see data on how this actually performs; we assume that 100k is a reasonable limit (I'd say that shouldn't take more than ~20s total) but we aren't certain. If you can generate some fake session data and run it locally to see how long the cleanup would take, that would be great. You could use console.time and console.timeEnd for that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hold on, I'm deleting the items that are being returned by the first batch. So next time I run the search I'm getting the next 1,000 sessions. I'm repeating that until there are no more sessions left in the index. Am I misunderstanding this?

Copy link
Contributor

@jportner jportner Jan 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I skimmed your changes and misread -- you're searching and deleting in each loop.

I had it in my head to page through all the objects then delete. Sorry about that. That's what I was implying with refresh: false, because if you don't refresh the index then your subsequent searches won't return fresh results.

You don't want to loop it like it's currently structured (search + delete in a single iteration) because then you'll be forced to use refresh: 'wait_for' each time you delete, which will really hurt performance.

EDIT: for that matter, it would probably be a good idea to add a manual index refresh after the bulk-delete.

EDIT 2: the docs state:

By default, Elasticsearch periodically refreshes indices every second, but only on indices that have received one search request or more in the last 30 seconds.

Given that, I'm actually not sure if there will be any benefit to adding a manual index refresh, since we're the only ones searching the session index, and we only do this during cleanup. The index appears to automatically refresh shortly after the cleanup process completes.

@thomheymann thomheymann requested a review from jportner January 10, 2022 16:09
Copy link
Contributor

@jportner jportner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mentioned that this needs to be tested with a large amount of expired sessions to make sure it works. I put together a sample archive of 100,005 expired sessions:
expired-sessions-archive.zip

First, I added some console logs to the cleanup code.

Then, I created this kibana.yml file:

elasticsearch.hosts: ["http://localhost:9200"]
elasticsearch.username: kibana
elasticsearch.password: changeme
xpack.security.session.cleanupInterval: "2m"

Then I started ES with yarn es snapshot and started Kibana with node scripts/kibana -c /path/to/kibana.yml (to run this in a production-like mode, without the base path proxy server)

Kibana immediately cleans up sessions when it starts, but I had to wait for it to finish doing fleet setup (it indexes a large amount of data into ES when it does that).

I waited about 2m and the cleanup happened again, at that point I loaded in the test archive:

node scripts/es_archiver.js --es-url http://elastic:changeme@localhost:9200 --kibana-url http://elastic:changeme@localhost:5601 load /path/to/expired-sessions-archive --docs-only

It take a while (~40-50 seconds) for that to complete.

Finally after another 2m had passed, the session cleanup happened again and 100k expired sessions got deleted. Another 2m later and the remaining 5 expired sessions got deleted:



[2022-01-12T11:47:59.367-05:00][INFO ][plugins.security.session.index] SESSION CLEANUP: deleting 10000 sessions...
[2022-01-12T11:48:00.012-05:00][INFO ][plugins.security.session.index] SESSION CLEANUP: deleting 10000 sessions...
[2022-01-12T11:48:00.361-05:00][INFO ][plugins.security.session.index] SESSION CLEANUP: deleting 10000 sessions...
[2022-01-12T11:48:00.694-05:00][INFO ][plugins.security.session.index] SESSION CLEANUP: deleting 10000 sessions...
[2022-01-12T11:48:00.956-05:00][INFO ][plugins.security.session.index] SESSION CLEANUP: deleting 10000 sessions...
[2022-01-12T11:48:01.234-05:00][INFO ][plugins.security.session.index] SESSION CLEANUP: deleting 10000 sessions...
[2022-01-12T11:48:01.502-05:00][INFO ][plugins.security.session.index] SESSION CLEANUP: deleting 10000 sessions...
[2022-01-12T11:48:01.764-05:00][INFO ][plugins.security.session.index] SESSION CLEANUP: deleting 10000 sessions...
[2022-01-12T11:48:02.071-05:00][INFO ][plugins.security.session.index] SESSION CLEANUP: deleting 10000 sessions...
[2022-01-12T11:48:02.654-05:00][INFO ][plugins.security.session.index] SESSION CLEANUP: deleting 10000 sessions...
SESSION CLEANUP: elapsed time 3.671s
[2022-01-12T11:50:02.284-05:00][INFO ][plugins.security.session.index] SESSION CLEANUP: deleting 5 sessions...
SESSION CLEANUP: elapsed time 71.2ms

I had to make some tweaks to get this working (see comments below). Overall nice job on this, it works really well and it's even faster than I expected 🎉

Comment on lines 556 to 560
// Create point in time snapshot to paginate through sessions
const { body: pitResponse } = await this.options.elasticsearchClient.openPointInTime({
index: this.indexName,
keep_alive: SESSION_INDEX_CLEANUP_KEEP_ALIVE,
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to close the PIT after you are finished searching (elasticsearchClient.closePointInTime)

Comment on lines 562 to 579
// We don't know the total number of sessions until we fetched the first batch so assume we have at least one session to clean.
let total = 1;
let searchAfter: SortResults | undefined;
for (let i = 0; i < total / SESSION_INDEX_CLEANUP_BATCH_SIZE; i++) {
const { body: searchResponse } =
await this.options.elasticsearchClient.search<SessionIndexValue>({
pit: { id: pitResponse.id, keep_alive: SESSION_INDEX_CLEANUP_KEEP_ALIVE },
_source_includes: 'usernameHash,provider',
query: { bool: { should: deleteQueries } },
search_after: searchAfter,
size: SESSION_INDEX_CLEANUP_BATCH_SIZE,
sort: '_shard_doc',
});
if (searchResponse.hits.hits.length > 0) {
yield searchResponse.hits.hits;
total = (searchResponse.hits.total as SearchTotalHits).value;
searchAfter = searchResponse.hits.hits[searchResponse.hits.hits.length - 1].sort;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this and if you have > 10k sessions, this loop terminates after one iteration, because the total is > 10k.

I don't think you need to use the total here at all, in fact after looking into this a bit more, the total hits is not accurate for amounts > 10k anyway and we get a performance improvement if we disable it.

Furthermore, this loop has no limit, as I mentioned above we need some sort of limit.

I suggest changing the loop like so, I tested this and it worked well:

diff --git a/x-pack/plugins/security/server/session_management/session_index.ts b/x-pack/plugins/security/server/session_management/session_index.ts
index 037c47ca791..554d917b834 100644
--- a/x-pack/plugins/security/server/session_management/session_index.ts
+++ b/x-pack/plugins/security/server/session_management/session_index.ts
@@ -7,7 +7,6 @@
 
 import type {
   BulkOperationContainer,
-  SearchTotalHits,
   SortResults,
 } from '@elastic/elasticsearch/lib/api/typesWithBodyKey';
 
@@ -48,6 +47,12 @@ const SESSION_INDEX_TEMPLATE_VERSION = 1;
  */
 const SESSION_INDEX_CLEANUP_BATCH_SIZE = 10_000;
 
+/**
+ * Maximum number of batches per cleanup.
+ * If the batch size is 10,000 and this limit is 10, then Kibana will remove up to 100k sessions per cleanup.
+ */
+const SESSION_INDEX_CLEANUP_BATCH_LIMIT = 10;
+
 /**
  * Number of sessions to remove per batch during cleanup.
  */
@@ -559,10 +564,8 @@ export class SessionIndex {
       keep_alive: SESSION_INDEX_CLEANUP_KEEP_ALIVE,
     });
 
-    // We don't know the total number of sessions until we fetched the first batch so assume we have at least one session to clean.
-    let total = 1;
     let searchAfter: SortResults | undefined;
-    for (let i = 0; i < total / SESSION_INDEX_CLEANUP_BATCH_SIZE; i++) {
+    for (let i = 0; i < SESSION_INDEX_CLEANUP_BATCH_LIMIT; i++) {
       const { body: searchResponse } =
         await this.options.elasticsearchClient.search<SessionIndexValue>({
           pit: { id: pitResponse.id, keep_alive: SESSION_INDEX_CLEANUP_KEEP_ALIVE },
@@ -571,11 +574,15 @@ export class SessionIndex {
           search_after: searchAfter,
           size: SESSION_INDEX_CLEANUP_BATCH_SIZE,
           sort: '_shard_doc',
+          track_total_hits: false, // for performance
         });
-      if (searchResponse.hits.hits.length > 0) {
-        yield searchResponse.hits.hits;
-        total = (searchResponse.hits.total as SearchTotalHits).value;
-        searchAfter = searchResponse.hits.hits[searchResponse.hits.hits.length - 1].sort;
+      const { hits } = searchResponse.hits;
+      if (hits.length > 0) {
+        yield hits;
+        searchAfter = hits[hits.length - 1].sort;
+        if (hits.length < SESSION_INDEX_CLEANUP_BATCH_SIZE) {
+          break;
+        }
       }
     }
   }

In addition: can you please add two unit tests -- one for 10,001 sessions to demonstrate the loop terminates early, and one for 100,001 sessions to demonstrate that the loop doesn't exceed the configured limit.

Comment on lines 51 to 54
/**
* Number of sessions to remove per batch during cleanup.
*/
const SESSION_INDEX_CLEANUP_KEEP_ALIVE = '5m';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/**
* Number of sessions to remove per batch during cleanup.
*/
const SESSION_INDEX_CLEANUP_KEEP_ALIVE = '5m';
/**
* How long the session cleanup search point-in-time should be kept alive.
*/
const SESSION_INDEX_CLEANUP_KEEP_ALIVE = '5m';

Copy link
Contributor

@jportner jportner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran it again with the existing debug logs enabled. This time I also enabled audit logs. It was almost as fast, took about 5s to delete 100k sessions.

[2022-01-12T15:13:27.195-05:00][DEBUG][plugins.security.session.index] Running cleanup routine.
[2022-01-12T15:13:28.214-05:00][DEBUG][plugins.security.session.index] Cleaned up 10000 invalid or expired sessions.
[2022-01-12T15:13:28.768-05:00][DEBUG][plugins.security.session.index] Cleaned up 10000 invalid or expired sessions.
[2022-01-12T15:13:29.197-05:00][DEBUG][plugins.security.session.index] Cleaned up 10000 invalid or expired sessions.
[2022-01-12T15:13:29.614-05:00][DEBUG][plugins.security.session.index] Cleaned up 10000 invalid or expired sessions.
[2022-01-12T15:13:30.004-05:00][DEBUG][plugins.security.session.index] Cleaned up 10000 invalid or expired sessions.
[2022-01-12T15:13:30.453-05:00][DEBUG][plugins.security.session.index] Cleaned up 10000 invalid or expired sessions.
[2022-01-12T15:13:30.857-05:00][DEBUG][plugins.security.session.index] Cleaned up 10000 invalid or expired sessions.
[2022-01-12T15:13:31.301-05:00][DEBUG][plugins.security.session.index] Cleaned up 10000 invalid or expired sessions.
[2022-01-12T15:13:31.714-05:00][DEBUG][plugins.security.session.index] Cleaned up 10000 invalid or expired sessions.
[2022-01-12T15:13:32.167-05:00][DEBUG][plugins.security.session.index] Cleaned up 10000 invalid or expired sessions.
[2022-01-12T15:15:30.237-05:00][DEBUG][plugins.security.session.index] Running cleanup routine.
[2022-01-12T15:15:30.302-05:00][DEBUG][plugins.security.session.index] Cleaned up 5 invalid or expired sessions.

Again, nice job 😄

@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
security 104 102 -2
Unknown metric groups

API count

id before after diff
security 180 181 +1

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@thomheymann thomheymann merged commit 39cef8b into elastic:main Jan 12, 2022
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Jan 12, 2022
* Add session cleanup audit logging

* Update snapshots

* Added suggestions from code review

* Clean up sessions in batches

* Added suggestions form code review

(cherry picked from commit 39cef8b)
@kibanamachine
Copy link
Contributor

💔 Some backports could not be created

Status Branch Result
8.0
7.17 Backport failed because of merge conflicts

How to fix

Re-run the backport manually:

node scripts/backport --pr 122419

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

@thomheymann
Copy link
Contributor Author

We are going to keep hitting issues attempting to back-port any audit log changes to 7.x branch due to:

  • legacy audit logger having been removed from 8.x but not 7.x
  • user access agreement not having been back-ported to 7.x

If it's a business priority to back port to 7.x it can be done but I'd suggest to focus on 8.x for new audit logging capabilities.

@kibanamachine
Copy link
Contributor

Looks like this PR has a backport PR but it still hasn't been merged. Please merge it ASAP to keep the branches relatively in sync.

@kibanamachine kibanamachine added the backport missing Added to PRs automatically when the are determined to be missing a backport. label Jan 14, 2022
@jportner jportner removed the v7.17.0 label Jan 14, 2022
@kibanamachine
Copy link
Contributor

Looks like this PR has a backport PR but it still hasn't been merged. Please merge it ASAP to keep the branches relatively in sync.

@kibanamachine kibanamachine removed the backport missing Added to PRs automatically when the are determined to be missing a backport. label Jan 18, 2022
thomheymann added a commit that referenced this pull request Jan 18, 2022
* Add session cleanup audit logging (#122419)

* Add session cleanup audit logging

* Update snapshots

* Added suggestions from code review

* Clean up sessions in batches

* Added suggestions form code review

(cherry picked from commit 39cef8b)

* Fix types

Co-authored-by: Kibana Machine <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Deprecated - use backport:version if exact versions are needed Feature:Security/Audit Platform Security - Audit Logging feature release_note:enhancement Team:Security Team focused on: Auth, Users, Roles, Spaces, Audit Logging, and more! v8.0.0 v8.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Investigate] Kibana audit logging: Capture session expiration
5 participants