[Monitoring][Alerting] CCR read exceptions alert #85908

Merged
merged 13 commits into from
Dec 18, 2020

Conversation

igoristic
Contributor

Resolves: #79990

The alerting query groups remote clusters with their respective follower indices. This is neither a per-node alert nor a threshold alert: it can only be enabled or disabled.

UI/UX:
[Five screenshots of the alert UI, taken 2020-12-15]


Testing:

  1. Set up a basic CCR environment (doc link | x-pack ver).

  2. Confirm you have a CCR setup with a leader/follower relationship via Kibana:
    .../app/management/data/cross_cluster_replication and that the CCR monitoring metrics are working.

  3. I don't know a "legitimate" way of triggering the read exceptions, so I'm doing it manually, e.g.:

PUT .monitoring-es-7-2020.12.15/_doc/abc123
{
  "cluster_uuid": "BcK-0pmsQniyPQfZuauuXw",
  "timestamp": "2020-12-15T04:36:44.402Z",
  "interval_ms": 10000,
  "type": "ccr_stats",
  "source_node": {
    "uuid": "WJaWz2XIR8mqsZDy7eeykA",
    "host": "10.46.8.145",
    "transport_address": "10.46.8.145:19157",
    "ip": "10.46.8.145",
    "name": "instance-0000000000",
    "timestamp": "2020-12-10T07:01:52.391Z"
  },
  "ccr_stats": {
    "remote_cluster": "BcK-0pmsQniyPQfZuauuXw_remote_cluster_1",
    "leader_index": ".leader_index_1",
    "follower_index": ".follower_index_1",
    "shard_id": 1,
    "leader_global_checkpoint": 2,
    "leader_max_seq_no": 3,
    "follower_global_checkpoint": 4,
    "follower_max_seq_no": 5,
    "last_requested_seq_no": 6,
    "outstanding_read_requests": 7,
    "outstanding_write_requests": 8,
    "write_buffer_operation_count": 9,
    "write_buffer_size_in_bytes": 10,
    "follower_mapping_version": 11,
    "follower_settings_version": 12,
    "follower_aliases_version": 13,
    "total_read_time_millis": 14,
    "total_read_remote_exec_time_millis": 15,
    "read_exceptions": [
      {
        "exception": {
          "type": "read_exceptions_type_1",
          "reason": "read_exceptions_reason_1"
        }
      }
    ],
    "successful_read_requests": 1,
    "failed_read_requests": 1,
    "operations_read": 3,
    "bytes_read": 4,
    "total_write_time_millis": 5,
    "successful_write_requests": 6,
    "failed_write_requests": 7,
    "operations_written": 8,
    "time_since_last_read_millis": 9,
    "fatal_exception": {
      "type": "fatal_exception_type",
      "reason": "fatal_exception_reason"
    }
  }
}

Be sure to update the timestamp and the index name so that both are recent.
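To avoid editing those two values by hand, a small helper along the following lines could derive them from the current time. This is only a sketch: `monitoringIndexFor` and `freshTestDocTarget` are hypothetical names, and the daily `.monitoring-es-7-YYYY.MM.DD` pattern (in UTC) is assumed from the example request above.

```typescript
// Hypothetical helper, not part of this PR.
// Assumes the daily index naming seen in the example request: .monitoring-es-7-YYYY.MM.DD (UTC).
function monitoringIndexFor(date: Date): string {
  const pad = (n: number): string => ('0' + n).slice(-2);
  const yyyy = date.getUTCFullYear();
  const mm = pad(date.getUTCMonth() + 1); // getUTCMonth() is zero-based
  const dd = pad(date.getUTCDate());
  return `.monitoring-es-7-${yyyy}.${mm}.${dd}`;
}

// Returns the index to PUT into and the value for the document's top-level `timestamp` field.
function freshTestDocTarget(now: Date = new Date()): { index: string; timestamp: string } {
  return {
    index: monitoringIndexFor(now),
    timestamp: now.toISOString(),
  };
}
```

Calling `freshTestDocTarget()` with no argument uses the current time, so the resulting document should fall inside the alert's lookback window.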

@elasticmachine
Contributor

Pinging @elastic/stack-monitoring (Team:Monitoring)

Contributor

@chrisronline chrisronline left a comment

Great work here! Code looks solid!

A couple of things I noticed off the bat:

  1. The server log needs proper escaping: Server log: CCR read exceptions alert is firing for the following remote clusters: Monitoring. Verify follower/leader index relationships across the affected remote clusters.

  2. I'm not seeing any presence of the alerts on the CCR monitoring pages, even though we link to them from the next steps. I feel like we should follow the same pattern as other alerts where we surface the presence of the firing alert on the CCR listing page, as well as the individual shard pages.

  3. The non-server log actions look off. Typically we have a condensed message in the context.internalShortMessage and something longer (with a deep link back to Kibana) in context.internalFullMessage

Screen Shot 2020-12-15 at 9 40 59 AM

I'm going to kick it back to you to start looking into these while I dig more into the review.
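For illustration, the short/full message pattern from item 3 could look roughly like the sketch below. `buildCcrMessages`, the `kibanaBaseUrl` parameter, and the deep-link path are assumptions made for this example, not the plugin's actual code; only the two context field names and the message wording come from this thread.

```typescript
interface CcrAlertMessages {
  internalShortMessage: string; // condensed message
  internalFullMessage: string;  // longer message with a deep link back to Kibana
}

// Hypothetical builder: the URL layout below is a placeholder, not a confirmed route.
function buildCcrMessages(remoteClusters: string[], kibanaBaseUrl: string): CcrAlertMessages {
  const clusters = remoteClusters.join(', ');
  const internalShortMessage =
    `CCR read exceptions alert is firing for the following remote clusters: ${clusters}.`;
  const ccrLink = `${kibanaBaseUrl}/app/monitoring#/elasticsearch/ccr`;
  const internalFullMessage =
    `${internalShortMessage} Verify follower and leader index relationships ` +
    `across the affected remote clusters. [View CCR stats](${ccrLink})`;
  return { internalShortMessage, internalFullMessage };
}
```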

Contributor

@chrisronline chrisronline left a comment

Code looks great! Made a few comments

],
},
},
aggs: {
Contributor

I don't know if we need to do any aggs actually.

The alert description reads:

Alert if any CCR read exceptions have been detected

I think we could get away with just searching for the past {duration} and only looking at documents that have read_exceptions (see my other comment in this file) and return that.
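A rough sketch of what such an aggregation-free query body might look like follows. The field names are taken from the sample document in the PR description. `buildReadExceptionsQuery` is a hypothetical name, and the `nested` wrapper assumes `ccr_stats.read_exceptions` is mapped as a nested field; with a plain object mapping, a bare `exists` filter would be used instead.

```typescript
// Sketch only, not the query used in this PR.
// Assumption: `ccr_stats.read_exceptions` is a nested field in the monitoring index mapping.
function buildReadExceptionsQuery(clusterUuids: string[], duration: string) {
  return {
    bool: {
      filter: [
        // Only CCR stats documents for the clusters being checked
        { term: { type: 'ccr_stats' } },
        { terms: { cluster_uuid: clusterUuids } },
        // Only documents from the past {duration}, e.g. '1h'
        { range: { timestamp: { gte: `now-${duration}` } } },
        // Only documents that actually carry a read exception
        {
          nested: {
            path: 'ccr_stats.read_exceptions',
            query: { exists: { field: 'ccr_stats.read_exceptions.exception' } },
          },
        },
      ],
    },
  };
}
```

The returned object is meant to be passed as the `query` of a search request; the hits would then be de-duplicated on the Kibana side.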

Contributor

We'd just have to make sure we de-dupe the list, which seems like a fairly easy task (using a Set or creating a byId object and only taking the values)
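As a minimal sketch of the `byId` variant, assuming each hit has already been reduced to a remote cluster, follower index, and shard id (the `CcrReadExceptionHit` shape and the `dedupeByFollower` name are hypothetical):

```typescript
// Hypothetical, simplified shape of a search hit after mapping.
interface CcrReadExceptionHit {
  remoteCluster: string;
  followerIndex: string;
  shardId: number;
}

// Keeps the first hit seen for each remote cluster / follower index pair.
function dedupeByFollower(hits: CcrReadExceptionHit[]): CcrReadExceptionHit[] {
  const byId: Record<string, CcrReadExceptionHit> = {};
  for (const hit of hits) {
    const id = `${hit.remoteCluster}:${hit.followerIndex}`;
    if (!(id in byId)) {
      byId[id] = hit;
    }
  }
  // Take only the values; key order follows insertion order for these string keys.
  return Object.keys(byId).map((id) => byId[id]);
}
```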

Contributor Author

I don't see any benefit to doing it locally. With aggs we get "less" data back, which I think makes up for any performance degradation. Maybe it's fine, since we're aggregating on already-filtered data anyway.

Contributor Author

Or maybe there's something I'm missing. I'm willing to explore this option, but perhaps as a separate follow-up task.

@igoristic
Contributor Author

@chrisronline Thank you for the quick review!

I'm not seeing any presence of the alerts on the CCR monitoring pages

I agree this is a problem, and we have other areas in our app currently that are missing this level of granularity.

I tried addressing it here, but there's just too much involved. We basically have to make the "by nodes" logic more generic, and I'm worried about causing regressions on other listing/detail pages (with all the "unrelated" code changes). Maybe we can address these UI/UX enhancements in separate/post PRs?

...Verify follower/leader index

I fixed this by changing it to ...follower and leader, since there's no way to encode a forward slash (afaik)

@chrisronline
Contributor

I tried addressing it here, but there's just too much involved. We basically have to make the "by nodes" logic more generic, and I'm worried about causing regressions on other listing/detail pages (with all the "unrelated" code changes). Maybe we can address these UI/UX enhancements in separate/post PRs?

This confuses me a little. I'd hoped the work done in #83681 would make these sorts of things more manageable in a smaller time window, but maybe I misunderstood.

FWIW, I spent a little bit of time this morning making the changes I imagine are necessary for this and it looks like it isn't too bad. See https://gist.github.com/chrisronline/4fb0534c0d6ba803af56c42c07b2bc97

WDYT about that approach? FWIW, I think there are things to make that code a bit better but it should work for this PR.

@igoristic
Contributor Author

@chrisronline This is ready for another review

I think we have indeed improved our alerting development flow, but I agree there's always room for improvement.

Thank you for the example gist, it helped me out a lot! I took a slightly different approach by changing our node* terminology to a more "dumb" item/component ideology and using their respective meta keys for the relevant filters.

I have stated my reasons for splitting this up, but mainly it was to also make the FF

@chrisronline
Contributor

Thank you for the example gist, it helped me out a lot! I took a slightly different approach by changing our node* terminology to a more "dumb" item/component ideology and using their respective meta keys for the relevant filters.

FWIW, in another PR, I mentioned:

Originally, the missing monitoring data alert did, as we alerted on more than one stack product. I understand your desire to revert this, but it feels like a mistake to have the base alert assume the stack product for the alert. Here and here are examples of this.

This is basically what I meant. We'd run into a scenario that node* didn't match.

@igoristic
Contributor Author

igoristic commented Dec 17, 2020

This is basically what I meant. We'd run into a scenario that node* didn't match

Yeah, I see what you mean now, but I still don't think "products" would be the best approach here. I was thinking something like: ui.label: 'Node 001' and ui.key: 'nodeId' (would be used as the identifier to filter on). This way we can make everything generic.

I think we should take this discussion outside the PR, though.
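To make the `ui.label` / `ui.key` idea concrete, here is a rough sketch. Every type and function name in it is hypothetical, and it assumes each alert state carries a `meta` object whose field named by `ui.key` holds the item's identifier.

```typescript
// Hypothetical shapes illustrating the generic item idea; not the code in this PR.
interface ItemUi {
  label: string; // human-readable name, e.g. 'Node 001'
  key: string;   // name of the meta field that identifies the item, e.g. 'nodeId'
}

interface AlertItemState {
  ui: ItemUi;
  meta: Record<string, string>;
}

// Returns the states whose identifying meta field (named by `key`) equals `id`.
function statesForItem(states: AlertItemState[], key: string, id: string): AlertItemState[] {
  return states.filter((state) => state.ui.key === key && state.meta[key] === id);
}
```

With this shape, a node listing could filter on `'nodeId'` and a CCR listing on something like `'followerIndex'` without the base alert knowing which stack product it is dealing with.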

Contributor

@chrisronline chrisronline left a comment

This is looking great! Thanks for all the hard work here @igoristic!

Code looks pretty good and I think I'm done with that part of the review.

Functionally, I can see us adding one thing:

Screen Shot 2020-12-17 at 10 46 27 AM

It'd be nice to list the actual exception somewhere. I think it is available in the monitoring documents, and it would help the user understand what the root problem might be.

Contributor

@chrisronline chrisronline left a comment

I reached out to the ES team and the easiest way to simulate an exception is to close the follower index. See this doc

Here is what I see when I do this:
Screen Shot 2020-12-18 at 10 28 10 AM

Perhaps there is a better way to show this error than in a <EuiCode> block, since it's a bit more structured? Perhaps it can be baked into the original messaging somehow?

Also, we should enable setup mode on the CCR pages, as they feature alerts and users should be able to access the config without an alert firing. This is an example of supporting this

@igoristic
Contributor Author

@chrisronline

Perhaps there is a better way to show this error than in a <EuiCode> block, since it's a bit more structured? Perhaps it can be baked into the original messaging somehow?

I agree the structure is simple, but I decided not to bake it into "our" description for several reasons:

  • We can't localize it
  • Your specific example is simple, but what if some of them get "big" and include the trace/dump in the reason?

Also, I think the code style here expresses that this is something that came from the server, and not something we assumed (or made generic). I explicitly decided to add ...ui.code so we can throw traces in there and any other body of text that we know nothing about (talking about future alerts)

Contributor

@chrisronline chrisronline left a comment

LGTM! Great work here!

@kibanamachine
Contributor

💚 Build Succeeded

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id before after diff
monitoring 616 617 +1

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
monitoring 959.9KB 979.5KB +19.5KB

Distributable file count

id before after diff
default 48057 48062 +5

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
monitoring 36.4KB 37.5KB +1.2KB
Unknown metric groups

async chunk count

id before after diff
monitoring 7 8 +1


@igoristic igoristic merged commit 94b4945 into elastic:master Dec 18, 2020
@igoristic igoristic deleted the ccr_read_exceptions_alert branch December 18, 2020 23:09
igoristic added a commit to igoristic/kibana that referenced this pull request Dec 18, 2020
* CCR read exceptions all branches

* cleanup

* CR feedback

* Added UI/UX to ccr/shards listing and details

* Fixed snaps

* Added reason for the exception

* Added setup mode funtionality and alert status
# Conflicts:
#	x-pack/plugins/monitoring/public/components/elasticsearch/ccr/ccr.js
igoristic added a commit that referenced this pull request Dec 19, 2020
* [Monitoring][Alerting] CCR read exceptions alert (#85908)

* CCR read exceptions all branches

* cleanup

* CR feedback

* Added UI/UX to ccr/shards listing and details

* Fixed snaps

* Added reason for the exception

* Added setup mode funtionality and alert status
# Conflicts:
#	x-pack/plugins/monitoring/public/components/elasticsearch/ccr/ccr.js

* Update ccr.js

* Update ccr.test.js.snap
igoristic added a commit that referenced this pull request Dec 19, 2020
)

* [Monitoring][Alerting] CCR read exceptions alert (#85908)

* CCR read exceptions all branches

* cleanup

* CR feedback

* Added UI/UX to ccr/shards listing and details

* Fixed snaps

* Added reason for the exception

* Added setup mode funtionality and alert status
# Conflicts:
#	x-pack/plugins/monitoring/public/components/elasticsearch/ccr/ccr.js

* Update ccr.js

* Update ccr.test.js.snap
@igoristic
Contributor Author

Backport:
7.x: ae4307f
7.4: cd163b2

gmmorris added a commit to gmmorris/kibana that referenced this pull request Dec 21, 2020
* master: (48 commits)
  Fix request with disabled aggregation (elastic#85696)
  [Security Solution][Detections][Threshold Rules] Threshold Rule Bug Fixes (elastic#84918)
  Removed a possibility to define two different names for Alert types on API and UI level. (elastic#86236)
  Bump Node.js from version 14.15.2 to 14.15.3 (elastic#86593)
  [index patterns] Fleep app - Keep saved object field list until field caps provides fields (elastic#85370)
  [Security Solutions] fix timeline tabs + layout (elastic#86581)
  Upgrade to hapi version 20 (elastic#85406)
  App Services: Remove remaining uiActions, expressions, data, embeddable circular dependencies. (elastic#82791)
  Rename chartLibrary setting to legacyChartsLibrary (elastic#86529)
  [CI] TeamCity updates (elastic#85843)
  [Maps] Use Json for mvt-tests (elastic#86492)
  [Rollup Jobs] Added autofocus to cron editor (elastic#86324)
  [Monitoring][Alerting] CCR read exceptions alert (elastic#85908)
  [CI] Bump memory for main CI workers (elastic#86541)
  Explicitly set Elasticsearch heap size during CI and local development (elastic#86513)
  [App Search] Updates to results on the documents view (elastic#86181)
  [Discover] Change default sort handling  (elastic#85561)
  [App Search] Convert DocumentCreationModal to DocumentCreationFlyout (elastic#86508)
  [App Search] Sample Engines should have access to the Crawler (elastic#86502)
  Fixed duplication of create new modal (elastic#86489)
  ...
Development

Successfully merging this pull request may close these issues.

CCR Read Exceptions stack monitoring alert
5 participants