[Monitoring][Alerting] CCR read exceptions alert #85908

igoristic · 2020-12-15T05:47:17Z

Resolves: #79990

The alerting query groups remote clusters with their respective follower indices. This is neither a per-node alert nor a threshold alert; it can only be enabled or disabled.

UI/UX:

Testing:

Set up a basic CCR environment, doc link | x-pack ver
Confirm you have a CCR setup with a leader/follower relationship via Kibana:
.../app/management/data/cross_cluster_replication and that the CCR monitoring metrics are working
I don't know a "legitimate" way of triggering the read exceptions, so I'm doing it manually, e.g.:

PUT .monitoring-es-7-2020.12.15/_doc/abc123
{
  "cluster_uuid": "BcK-0pmsQniyPQfZuauuXw",
  "timestamp": "2020-12-15T04:36:44.402Z",
  "interval_ms": 10000,
  "type": "ccr_stats",
  "source_node": {
    "uuid": "WJaWz2XIR8mqsZDy7eeykA",
    "host": "10.46.8.145",
    "transport_address": "10.46.8.145:19157",
    "ip": "10.46.8.145",
    "name": "instance-0000000000",
    "timestamp": "2020-12-10T07:01:52.391Z"
  },
  "ccr_stats": {
    "remote_cluster": "BcK-0pmsQniyPQfZuauuXw_remote_cluster_1",
    "leader_index": ".leader_index_1",
    "follower_index": ".follower_index_1",
    "shard_id": 1,
    "leader_global_checkpoint": 2,
    "leader_max_seq_no": 3,
    "follower_global_checkpoint": 4,
    "follower_max_seq_no": 5,
    "last_requested_seq_no": 6,
    "outstanding_read_requests": 7,
    "outstanding_write_requests": 8,
    "write_buffer_operation_count": 9,
    "write_buffer_size_in_bytes": 10,
    "follower_mapping_version": 11,
    "follower_settings_version": 12,
    "follower_aliases_version": 13,
    "total_read_time_millis": 14,
    "total_read_remote_exec_time_millis": 15,
    "read_exceptions": [
      {
        "exception": {
          "type": "read_exceptions_type_1",
          "reason": "read_exceptions_reason_1"
        }
      }
    ],
    "successful_read_requests": 1,
    "failed_read_requests": 1,
    "operations_read": 3,
    "bytes_read": 4,
    "total_write_time_millis": 5,
    "successful_write_requests": 6,
    "failed_write_requests": 7,
    "operations_written": 8,
    "time_since_last_read_millis": 9,
    "fatal_exception": {
      "type": "fatal_exception_type",
      "reason": "fatal_exception_reason"
    }
  }
}

Be sure to update the timestamp and the index name so that both are recent.
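To avoid editing the date by hand each time, the index name and timestamp could be derived from the current time. This is only a sketch: `monitoringIndexName` and `buildCcrStatsDoc` are hypothetical helpers (not part of this PR), the document is trimmed to a few of the fields shown above, and it assumes the daily index suffix follows the UTC date as in the example index name.

```typescript
// Hypothetical helpers for generating the manual test document.
interface CcrReadException {
  exception: { type: string; reason: string };
}

// Builds a daily index name such as ".monitoring-es-7-2020.12.15".
// Assumption: the suffix is the UTC date formatted as YYYY.MM.DD.
function monitoringIndexName(date: Date): string {
  const day = date.toISOString().slice(0, 10).replace(/-/g, ".");
  return `.monitoring-es-7-${day}`;
}

// Builds a trimmed ccr_stats document with a fresh timestamp.
function buildCcrStatsDoc(
  clusterUuid: string,
  now: Date,
  readExceptions: CcrReadException[]
) {
  return {
    cluster_uuid: clusterUuid,
    timestamp: now.toISOString(),
    interval_ms: 10000,
    type: "ccr_stats",
    ccr_stats: {
      remote_cluster: `${clusterUuid}_remote_cluster_1`,
      leader_index: ".leader_index_1",
      follower_index: ".follower_index_1",
      shard_id: 1,
      read_exceptions: readExceptions,
    },
  };
}
```

The resulting name and body can then be pasted into the same `PUT <index>/_doc/<id>` request used above.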

elasticmachine · 2020-12-15T05:47:19Z

Pinging @elastic/stack-monitoring (Team:Monitoring)

chrisronline

Great work here! Code looks solid!

A couple of things I noticed off the bat:

The server log needs proper escaping: Server log: CCR read exceptions alert is firing for the following remote clusters: Monitoring. Verify follower/leader index relationships across the affected remote clusters.
I'm not seeing any presence of the alerts on the CCR monitoring pages, even though we link to them from the next steps. I feel like we should follow the same pattern as other alerts where we surface the presence of the firing alert on the CCR listing page, as well as the individual shard pages.
The non-server log actions look off. Typically we have a condensed message in the context.internalShortMessage and something longer (with a deep link back to Kibana) in context.internalFullMessage

I'm going to kick it back to you to start looking into these while I dig more into the review.
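For reference, the short/full message convention mentioned above might look roughly like the following. This is a sketch only: `buildCcrAlertMessages` is a hypothetical function, and the deep-link route is an assumption rather than the path the plugin actually uses.

```typescript
// Sketch of the condensed vs. full action message convention.
interface AlertMessages {
  internalShortMessage: string;
  internalFullMessage: string;
}

function buildCcrAlertMessages(
  remoteClusters: string[],
  kibanaBaseUrl: string
): AlertMessages {
  const clusters = remoteClusters.join(", ");
  // Condensed message, suitable for actions with limited space.
  const internalShortMessage =
    `CCR read exceptions alert is firing for the following remote clusters: ${clusters}. ` +
    `Verify follower and leader index relationships across the affected remote clusters.`;
  // Hypothetical deep link back to Kibana; the real route may differ.
  const link = `${kibanaBaseUrl}/app/monitoring#/elasticsearch/ccr`;
  // Longer message: the short text plus a link back to the CCR pages.
  const internalFullMessage = `${internalShortMessage} [View CCR stats](${link})`;
  return { internalShortMessage, internalFullMessage };
}
```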

chrisronline

Code looks great! Made a few comments

x-pack/plugins/monitoring/server/lib/alerts/fetch_ccr_read_exceptions.ts

x-pack/plugins/monitoring/server/alerts/ccr_read_exceptions_alert.ts

chrisronline · 2020-12-15T15:09:36Z

x-pack/plugins/monitoring/server/lib/alerts/fetch_ccr_read_exceptions.ts

+          ],
+        },
+      },
+      aggs: {


I don't know if we need to do any aggs actually.

The alert description reads:

Alert if any CCR read exceptions have been detected

I think we could get away with just searching for the past {duration} and only looking at documents that have read_exceptions (see my other comment in this file) and return that.

We'd just have to make sure we de-dupe the list, which seems like a fairly easy task (using a Set or creating a byId object and only taking the values)

I don't see any benefit in doing it locally. With aggs we get "less" data back, which I think makes up for any performance degradation (if there is any). Maybe it's fine, since we're aggregating on filtered data anyway.

Or, maybe something I'm missing. I'm willing to explore this option, but perhaps as a post/separate task
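For context, the local de-dupe discussed in this thread could be sketched as below. The `CcrHit` shape and `dedupeByFollower` are hypothetical, and keying on remote cluster plus follower index is an assumption about what counts as a duplicate here.

```typescript
// Hypothetical, flattened shape of a search hit that has read exceptions.
interface CcrHit {
  remoteCluster: string;
  followerIndex: string;
  shardId: number;
}

// Keeps the first hit seen for each remote cluster / follower index pair,
// using a "byId" map and returning only its values.
function dedupeByFollower(hits: CcrHit[]): CcrHit[] {
  const byId = new Map<string, CcrHit>();
  for (const hit of hits) {
    const id = `${hit.remoteCluster}:${hit.followerIndex}`;
    if (!byId.has(id)) {
      byId.set(id, hit);
    }
  }
  return Array.from(byId.values());
}
```

The trade-off raised above still applies: this needs every matching document returned to Kibana, whereas aggregations return only the grouped buckets.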

x-pack/plugins/monitoring/server/alerts/ccr_read_exceptions_alert.ts

igoristic · 2020-12-15T21:46:42Z

@chrisronline Thank you for the quick review!

I'm not seeing any presence of the alerts on the CCR monitoring pages

I agree this is a problem, and we have other areas in our app currently that are missing this level of granularity.

I tried addressing it here, but there's just too much involved. We basically have to make the "by nodes" logic more generic, and I'm worried about causing regressions on other listing/detail pages (with all the "unrelated" code changes). Maybe we can address these UI/UX enhancements in separate/post PRs?

...Verify follower/leader index

I fixed this by changing it to ...follower and leader, since there's no way to encode a forward slash (afaik)


chrisronline · 2020-12-16T16:14:13Z

I tried addressing it here, but there's just too much involved. We basically have to make the "by nodes" logic more generic, and I'm worried about causing regressions on other listing/detail pages (with all the "unrelated" code changes). Maybe we can address these UI/UX enhancements in separate/post PRs?

This confuses me a little. I'd hoped the work done in #83681 would make these sorts of things more manageable in a smaller time window, but maybe I misunderstood.

FWIW, I spent a little bit of time this morning making the changes I imagine are necessary for this and it looks like it isn't too bad. See https://gist.github.com/chrisronline/4fb0534c0d6ba803af56c42c07b2bc97

WDYT about that approach? FWIW, I think there are things to make that code a bit better but it should work for this PR.


igoristic · 2020-12-17T02:16:06Z

@chrisronline This is ready for another review

I think we have indeed improved our alerting development flow, but I agree there's always room for improvements.

Thank you for the example gist, it helped me out a lot! I took a slightly different approach by changing our node* terminology to a more "dumb" item/component ideology and using their respective meta keys for the relevant filters.

I have stated my reasons for splitting this up, but mainly it was to also make the FF

chrisronline · 2020-12-17T15:05:06Z

Thank you for the example gist, it helped me out a lot! I took a slightly different approach by changing our node* terminology to a more "dumb" item/component ideology and using their respective meta keys for the relevant filters.

FWIW, in another PR, I mentioned:

Originally, the missing monitoring data alert did, as we alerted on more than one stack product. I understand your desire to revert this, but it feels like a mistake to have the base alert assume the stack product for the alert. Here and here are examples of this.

This is basically what I meant. We'd run into a scenario that node* didn't match.

igoristic · 2020-12-17T15:17:57Z

This is basically what I meant. We'd run into a scenario that node* didn't match

Yeah, I see what you mean now, but I still don't think "products" would be the best approach here. I was thinking something like: ui.label: 'Node 001' and ui.key: 'nodeId' (would be used as the identifier to filter on). This way we can make everything generic.

I think we should reserve this discussion outside the PR though
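To make the `ui.label` / `ui.key` idea a bit more concrete, here is a rough sketch. Everything in it is hypothetical (the `AlertItem` shape, the `meta` bag, and `filterAlertItems`); it only illustrates filtering on whichever key an item declares instead of assuming a node ID.

```typescript
// Hypothetical generic item: the UI only knows a label and which meta key
// identifies the item, not whether it is a node, an index, or a shard.
interface AlertItem {
  ui: { label: string; key: string };
  meta: Record<string, string>;
}

// Returns the items whose identifying meta value matches the filter value.
function filterAlertItems(items: AlertItem[], filterValue: string): AlertItem[] {
  return items.filter((item) => item.meta[item.ui.key] === filterValue);
}

const exampleItems: AlertItem[] = [
  { ui: { label: "Node 001", key: "nodeId" }, meta: { nodeId: "node-001" } },
  {
    ui: { label: ".follower_index_1", key: "followerIndex" },
    meta: { followerIndex: ".follower_index_1" },
  },
];
```

With this shape, `filterAlertItems(exampleItems, ".follower_index_1")` would match the follower index item without any node-specific logic.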

chrisronline

This is looking great! Thanks for all the hard work here @igoristic!

Code looks pretty good and I think I'm done with that part of the review.

Functionally, I can see us adding one thing:

It'd be nice to somewhere list the actual exception. I think it is available in the monitoring documents and it will help the user understand what the root problem might be.


chrisronline

I reached out to the ES team and the easiest way to simulate an exception is to close the follower index. See this doc

Here is what I see when I do this:

Perhaps there is a better way to show this error than in a <EuiCode> block, since it's a bit more structured? Perhaps it can be baked into the original messaging somehow?

Also, we should enable setup mode on the CCR pages, as they feature alerts and users should be able to access the config without an alert firing. This is an example of supporting this


igoristic · 2020-12-18T18:19:57Z

@chrisronline

Perhaps there is a better way to show this error than in a <EuiCode> block, since it's a bit more structured? Perhaps it can be baked into the original messaging somehow?

I agree the structure is simple, but I decided not to bake it into "our" description for several reasons:

We can't localize it
Your specific example is simple, but what if some of them get "big" and include the trace/dump in the reason

Also, I think the code style here expresses that this is something that came from the server, and not something we assumed (or made generic). I explicitly decided to add ...ui.code so we can throw traces in there and any other body of text that we know nothing about (talking about future alerts)
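As an illustration of that separation, the localized description and the raw server text could travel as distinct parts. This is a sketch under assumptions: `MessagePart`, `buildExceptionMessage`, and `renderPlain` are hypothetical names, not the actual types in the monitoring plugin.

```typescript
// Hypothetical message parts: "text" is ours and localizable, "code" is
// opaque server output that is shown verbatim.
type MessagePart =
  | { kind: "text"; value: string }
  | { kind: "code"; value: string };

interface ServerException {
  type: string;
  reason: string;
}

// Keeps the localized description and the server-provided exception apart.
function buildExceptionMessage(
  localizedDescription: string,
  exception: ServerException
): MessagePart[] {
  return [
    { kind: "text", value: localizedDescription },
    { kind: "code", value: `${exception.type}: ${exception.reason}` },
  ];
}

// Plain-text rendering, e.g. for logs; code parts are wrapped in backticks.
function renderPlain(parts: MessagePart[]): string {
  return parts
    .map((part) => (part.kind === "code" ? "`" + part.value + "`" : part.value))
    .join(" ");
}
```

Because the code part is never interpolated into the translated string, a long trace in `reason` does not affect localization.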

chrisronline

LGTM! Great work here!

kibanamachine · 2020-12-18T20:24:18Z

💚 Build Succeeded

continuous-integration/kibana-ci/pull-request
Commit: b39bd69

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id	before	after	diff
`monitoring`	616	617	+1

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id	before	after	diff
`monitoring`	959.9KB	979.5KB	+19.5KB

Distributable file count

id	before	after	diff
`default`	48057	48062	+5

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id	before	after	diff
`monitoring`	36.4KB	37.5KB	+1.2KB

Unknown metric groups

async chunk count

id	before	after	diff
`monitoring`	7	8	+1

History

💚 Build #95333 succeeded e724db1
💚 Build #95281 succeeded 39de03b
💔 Build #95066 failed 5abba03
💚 Build #94704 succeeded ef76de1
💚 Build #94343 succeeded 55a5adc

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

* CCR read exceptions all branches * cleanup * CR feedback * Added UI/UX to ccr/shards listing and details * Fixed snaps * Added reason for the exception * Added setup mode funtionality and alert status # Conflicts: # x-pack/plugins/monitoring/public/components/elasticsearch/ccr/ccr.js

* [Monitoring][Alerting] CCR read exceptions alert (#85908) * CCR read exceptions all branches * cleanup * CR feedback * Added UI/UX to ccr/shards listing and details * Fixed snaps * Added reason for the exception * Added setup mode funtionality and alert status # Conflicts: # x-pack/plugins/monitoring/public/components/elasticsearch/ccr/ccr.js * Update ccr.js * Update ccr.test.js.snap


igoristic · 2020-12-19T16:20:18Z

Backport:
7.x: ae4307f
7.11: cd163b2


igoristic added 2 commits December 15, 2020 00:09

CCR read exceptions all branches

38c7cf2

cleanup

55a5adc

igoristic added release_note:enhancement Team:Monitoring Stack Monitoring team v8.0.0 v7.11.0 labels Dec 15, 2020

igoristic added this to the Stack Monitoring UI 7.11 milestone Dec 15, 2020

igoristic requested a review from a team December 15, 2020 05:47

chrisronline suggested changes Dec 15, 2020


CR feedback

ef76de1

igoristic requested a review from chrisronline December 15, 2020 21:46

Merge branch 'master' of https://github.com/elastic/kibana into ccr_r…

65b4825

…ead_exceptions_alert

igoristic added 3 commits December 16, 2020 13:27

Merge branch 'master' of https://github.com/elastic/kibana into ccr_r…

ff1a65d

…ead_exceptions_alert

Added UI/UX to ccr/shards listing and details

f71e922

Merge branch 'master' of https://github.com/elastic/kibana into ccr_r…

5abba03

…ead_exceptions_alert

igoristic added the v7.12.0 label Dec 17, 2020

Fixed snaps

f6455f6

chrisronline suggested changes Dec 17, 2020


igoristic added 3 commits December 17, 2020 17:05

Merge branch 'master' of https://github.com/elastic/kibana into ccr_r…

39de03b

…ead_exceptions_alert

Merge branch 'master' of https://github.com/elastic/kibana into ccr_r…

8c3f409

…ead_exceptions_alert

Added reason for the exception

e724db1

igoristic requested a review from chrisronline December 18, 2020 07:08

chrisronline suggested changes Dec 18, 2020


igoristic added 2 commits December 18, 2020 13:10

Added setup mode funtionality and alert status

d708b21

Merge branch 'master' of https://github.com/elastic/kibana into ccr_r…

b39bd69

…ead_exceptions_alert

igoristic requested a review from chrisronline December 18, 2020 18:20

chrisronline approved these changes Dec 18, 2020


igoristic merged commit 94b4945 into elastic:master Dec 18, 2020

igoristic deleted the ccr_read_exceptions_alert branch December 18, 2020 23:09

This was referenced Dec 18, 2020

[7.x] [Monitoring][Alerting] CCR read exceptions alert (#85908) #86583

Merged

[7.11] [Monitoring][Alerting] CCR read exceptions alert (#85908) #86584

Merged

igoristic added the backported label Dec 19, 2020

This was referenced Jun 24, 2021

[Stack Monitoring] Change SM rule types to generate discrete alert instances per node #100136

Closed

[Stack Monitoring] create alert per node, index, or cluster instead of always per cluster #102544

Merged
