-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Monitoring][Alerting] CCR read exceptions alert #85908
[Monitoring][Alerting] CCR read exceptions alert #85908
Conversation
Pinging @elastic/stack-monitoring (Team:Monitoring) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work here! Code looks solid!
A couple of things I noticed off the bat:
-
The server log needs proper escaping:
Server log: CCR read exceptions alert is firing for the following remote clusters: Monitoring. Verify follower/leader index relationships across the affected remote clusters.
-
I'm not seeing any presence of the alerts on the CCR monitoring pages, even though we link to them from the next steps. I feel like we should follow the same pattern as other alerts where we surface the presence of the firing alert on the CCR listing page, as well as the individual shard pages.
-
The non-server log actions look off. Typically we have a condensed message in the
context.internalShortMessage
and something longer (with a deep link back to Kibana) incontext.internalFullMessage
I'm going to kick it back to you to start looking into these while I dig more into the review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks great! Made a few comments
x-pack/plugins/monitoring/server/lib/alerts/fetch_ccr_read_exceptions.ts
Outdated
Show resolved
Hide resolved
x-pack/plugins/monitoring/server/alerts/ccr_read_exceptions_alert.ts
Outdated
Show resolved
Hide resolved
], | ||
}, | ||
}, | ||
aggs: { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if we need to do any aggs actually.
The alert description reads:
Alert if any CCR read exceptions have been detected
I think we could get away with just searching for the past {duration}
and only looking at documents that have read_exceptions
(see my other comment in this file) and return that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'd just have to make sure we de-dupe the list, which seems like a fairly easy task (using a Set
or creating a byId
object and only taking the values)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any benefits of doing it locally. With aggs
we get "less" data back which I think makes up for its performance degradation (if there are any). Maybe it's fine since we're doing aggregation on filtered data anyways
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or, maybe something I'm missing. I'm willing to explore this option, but perhaps as a post/separate task
x-pack/plugins/monitoring/server/alerts/ccr_read_exceptions_alert.ts
Outdated
Show resolved
Hide resolved
@chrisronline Thank you for the quick review!
I agree this is a problem, and we have other areas in our app currently that are missing this level of granularity. I tried addressing it here, but there's just too much involved. We basically have to make the "by nodes" logic more generic, and I'm worried about causing regressions on other listing/detail pages (with all the "unrelated" code changes). Maybe we can address these UI/UX enhancements in separate/post PRs?
I fixed this by changing it to |
…ead_exceptions_alert
This confuses me a little. I'd hoped the work done in #83681 would make these sorts of things more manageable in a smaller time window, but maybe I misunderstood. FWIW, I spent a little bit of time this morning making the changes I imagine are necessary for this and it looks like it isn't too bad. See https://gist.github.com/chrisronline/4fb0534c0d6ba803af56c42c07b2bc97 WDYT about that approach? FWIW, I think there are things to make that code a bit better but it should work for this PR. |
@chrisronline This is ready for another review I think we have indeed improved our alerting development flow, but I agree there's always room for improvements. Thank you for the example gist, it helped my out a lot! I took a slightly different approach by changing our I have state my reasons for splitting this up, but mainly it was to also make the FF |
FWIW, in another PR, I mentioned:
This is basically what I meant. We'd run into a scenario that |
Yeah, I see what you mean now, but I still don't think "products" would be the best approach here. I was thinking something like: I think we should reserve this discussion outside the PR though |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking great! Thanks for all the hard work here @igoristic!
Code looks pretty good and I think I'm done with that part of the review.
Functionally, I can see us adding one thing:
It'd be nice to somewhere list the actual exception. I think it is available in the monitoring documents and it will help the user understand what the root problem might be.
…ead_exceptions_alert
…ead_exceptions_alert
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reached out to the ES team and the easiest way to simulate an exception is to close the follower index. See this doc
Here is what I see when I do this:
Perhaps there is a better way to show this error than in a <EuiCode>
block, since it's a bit more structured? Perhaps it can be baked into the original messaging somehow?
Also, we should enable setup mode on the CCR pages, as they feature alerts and users should be able to access the config without an alert firing. This is an example of supporting this
I agree the structure is simple, but I decided not to bake it into "our" description for several reasons:
Also, I think the code style here expresses that this is something that came from the server, and not something we assumed (or made generic). I explicitly decided to add |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Great work here!
💚 Build SucceededMetrics [docs]Module Count
Async chunks
Distributable file count
Page load bundle
History
To update your PR or re-run it, just comment with: |
* CCR read exceptions all branches * cleanup * CR feedback * Added UI/UX to ccr/shards listing and details * Fixed snaps * Added reason for the exception * Added setup mode funtionality and alert status # Conflicts: # x-pack/plugins/monitoring/public/components/elasticsearch/ccr/ccr.js
* [Monitoring][Alerting] CCR read exceptions alert (#85908) * CCR read exceptions all branches * cleanup * CR feedback * Added UI/UX to ccr/shards listing and details * Fixed snaps * Added reason for the exception * Added setup mode funtionality and alert status # Conflicts: # x-pack/plugins/monitoring/public/components/elasticsearch/ccr/ccr.js * Update ccr.js * Update ccr.test.js.snap
) * [Monitoring][Alerting] CCR read exceptions alert (#85908) * CCR read exceptions all branches * cleanup * CR feedback * Added UI/UX to ccr/shards listing and details * Fixed snaps * Added reason for the exception * Added setup mode funtionality and alert status # Conflicts: # x-pack/plugins/monitoring/public/components/elasticsearch/ccr/ccr.js * Update ccr.js * Update ccr.test.js.snap
* master: (48 commits) Fix request with disabled aggregation (elastic#85696) [Security Solution][Detections][Threshold Rules] Threshold Rule Bug Fixes (elastic#84918) Removed a possibility to define two different names for Alert types on API and UI level. (elastic#86236) Bump Node.js from version 14.15.2 to 14.15.3 (elastic#86593) [index patterns] Fleep app - Keep saved object field list until field caps provides fields (elastic#85370) [Security Solutions] fix timeline tabs + layout (elastic#86581) Upgrade to hapi version 20 (elastic#85406) App Services: Remove remaining uiActions, expressions, data, embeddable circular dependencies. (elastic#82791) Rename chartLibrary setting to legacyChartsLibrary (elastic#86529) [CI] TeamCity updates (elastic#85843) [Maps] Use Json for mvt-tests (elastic#86492) [Rollup Jobs] Added autofocus to cron editor (elastic#86324) [Monitoring][Alerting] CCR read exceptions alert (elastic#85908) [CI] Bump memory for main CI workers (elastic#86541) Explicitly set Elasticsearch heap size during CI and local development (elastic#86513) [App Search] Updates to results on the documents view (elastic#86181) [Discover] Change default sort handling (elastic#85561) [App Search] Convert DocumentCreationModal to DocumentCreationFlyout (elastic#86508) [App Search] Sample Engines should have access to the Crawler (elastic#86502) Fixed duplication of create new modal (elastic#86489) ...
Resolves: #79990
The alerting query groups remote clusters with their relative follow indices. This isn't a per node type an alert nor is it a threshold type of alert. It can either be enabled or disabled.
UI/UX:
Testing:
Setup a basic CCR environment, doc link | x-pack ver
Confirm you have a CCR setup with a leader/follower relationship via Kibana:
.../app/management/data/cross_cluster_replication
and that the CRR monitoring metrics are workingI don't know a "legitimate" way of triggering the read exceptions, so I'm doing it manually, eg:
Be sure to update the
timestamp
and the index name to be more recent