
[responseOps] execution duration > 112 DAYS! causes error writing event log doc #127802

Closed
pmuellr opened this issue Mar 15, 2022 · 6 comments · Fixed by #130819
Labels: bug, Feature:Alerting, Feature:EventLog, Team:ResponseOps

Comments

pmuellr (Member) commented Mar 15, 2022

Kibana version: 7.17.1

Describe the bug:

Seen in user logs:

invalid event logged: [event.duration]: "event.duration" must be a safe number; {
"@timestamp":"2022-03-15T17:11:57.912Z",
"event":{"provider":"alerting","action":"active-instance","kind":"alert","category":["infrastructure"],
"start":"2021-11-22T14:00:33.597Z",
"duration":9774684315000000},

That's one long rule execution! A back-of-the-envelope calculation of how long 9774684315000000 nanoseconds is: about 113 days, and the GOOG concurs:

(screenshot: Google's conversion of 9,774,684,315,000,000 nanoseconds to roughly 113 days)
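
A quick check of that arithmetic (plain Node, nothing Kibana-specific; the value is the one from the log above):

const durationNs = 9774684315000000;                   // event.duration from the log above
const durationDays = durationNs / 1e9 / 60 / 60 / 24;  // ns -> s -> min -> h -> days
console.log(durationDays);                             // ~113.13
console.log(Number.isSafeInteger(durationNs));         // false - above 2^53 - 1, hence "must be a safe number"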

Steps to reproduce:

  1. not clear

Expected behavior:

An error should not occur.

Seems like we need to catch numbers bigger than Number.MAX_SAFE_INTEGER and either send that max value, or figure out if we can send a floating point or some other representation.
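
A minimal sketch of that kind of guard (hypothetical helper, not the actual event log code):

// Hypothetical guard: values that can't be represented as a safe integer are
// clamped to Number.MAX_SAFE_INTEGER before being written to the event log doc.
// Clamping loses information, which is why the discussion below also considers
// writing a string representation instead.
function toSafeNumber(value: number): number {
  return Number.isSafeInteger(value) ? value : Number.MAX_SAFE_INTEGER;
}

const eventDuration = toSafeNumber(9774684315000000); // 9007199254740991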

pmuellr added the bug label Mar 15, 2022
botelastic bot added the needs-team label Mar 15, 2022
pmuellr added the Team:ResponseOps label Mar 15, 2022
elasticmachine (Contributor) commented:

Pinging @elastic/response-ops (Team:ResponseOps)

botelastic bot removed the needs-team label Mar 15, 2022
pmuellr added the Feature:Alerting and needs-team labels Mar 15, 2022
botelastic bot removed the needs-team label Mar 15, 2022
pmuellr (Member, Author) commented Mar 15, 2022

Seems highly unlikely it was running this long, and I think it could be because of some faulty logic during retries where we count the duration from the first try. But ... wow, that still seems off.

Figured I'd open the issue to track it anyway ...

mikecote moved this from Awaiting Triage to Todo in AppEx: ResponseOps - Execution & Connectors Mar 17, 2022
mikecote (Contributor) commented Apr 6, 2022

I came across this issue as well; in my case, it happened whenever an alert had been active for longer than 112 days.

mikecote (Contributor) commented:

Linking similar issue: #130119

I've seen a large number of these errors in server logs:

invalid event logged: [kibana.task.schedule_delay]: "kibana.task.schedule_delay" must be a safe number; {"@timestamp":"2022-04-13T13:07:38.183Z","event":{"provider":"alerting","action":"execute-start","kind":"alert","category":["monitoring"],"start":"2022-04-13T13:07:38.183Z"},"kibana":{"saved_objects":[{"rel":"primary","type":"alert","id":"{uuid}","type_id":"monitoring_alert_kibana_version_mismatch"}],"task":{"scheduled":"2021-06-25T14:30:35.011Z","schedule_delay":25223823172000000},"server_uuid":"{uuid}","version":"7.16.1"},"rule":{"id":"{uuid}","license":"basic","category":"monitoring_alert_kibana_version_mismatch","ruleset":"monitoring"},"message":"alert execution start: \"{uuid}\"","ecs":{"version":"1.8.0"}})

As we can see, the schedule_delay field has the value 25223823172000000, which is larger than JavaScript's Number.MAX_SAFE_INTEGER (2^53 - 1). Oops.
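
For context, the same safe-integer check against that value (plain Node):

Number.MAX_SAFE_INTEGER;                    // 9007199254740991 (2^53 - 1)
Number.isSafeInteger(25223823172000000);    // false
25223823172000000 / 1e9 / 60 / 60 / 24;     // ~292 - plausible as days if the delay is in nanoseconds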

mikecote self-assigned this Apr 18, 2022
mikecote moved this from Todo to In Progress in AppEx: ResponseOps - Execution & Connectors Apr 18, 2022
pmuellr (Member, Author) commented Apr 20, 2022

Thought I'd drop a few notes on this.

The duration field is a mapped long:

"duration": {
"type": "long"
},

In ES, a long can have a maximum value of 2^63-1.

I was curious about the absolute limit, in days, that we can store in that long:

> node
Welcome to Node.js v16.14.2.
Type ".help" for more information.

> 2**63 /   // limit for nanoseconds
... 1000 /  // in microseconds
... 1000 /  // in milliseconds
... 1000 /  // in seconds
... 60 /    // in minutes
... 60 /    // in hours
... 24      // in days
106751.99116730063

That's almost 300 years, so ... seems safe.

I think we may have to rely on coerce (which defaults to true) to help us: send a string representation of the number. Or perhaps send it in some exponential/scientific form, but that would also be a string, so probably not a more useful path.
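
A rough sketch of that idea (hypothetical helper; relies on Elasticsearch's default coerce: true on numeric fields, which accepts numbers supplied as strings):

// Hypothetical: values that overflow Number.MAX_SAFE_INTEGER are written as
// strings and coerced back into the mapped `long` field by Elasticsearch.
// Note: if the number already lost precision upstream, the string carries that
// loss too - computing the value as a string/BigInt from the start avoids it.
function serializeLong(value: number): number | string {
  return Number.isSafeInteger(value) ? value : value.toString();
}

const doc = {
  event: {
    duration: serializeLong(9774684315000000), // "9774684315000000"
  },
};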

We should probably check whether any other existing fields can overflow like this - maybe we should have a common "stringization" of long fields that we use for all of them. It looks like we only use long for integral numeric types in the mappings right now, and there are 12 of them (some in a PR that I have loaded right now!). The only other immediately suspect one is kibana.task.schedule_delay, but I believe that value is milliseconds, so that would make that limit a million times bigger than 112 days, which seems safe :-)

Since we only ever access the _source of ES docs, presumably we'll have to "fix" them when we process the results from searches. My guess is that dealing with a plain old query will be easy - we can "fix" the values since we know where they are in the hits - however aggs are going to be ... hard. How would we know what to fix? I'd expect aggs over duration to be fairly common. We may have to provide some "normalizer" function that anyone using an agg could call, against agg result fields they know are based on this field, during their processing of the results.
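
Something along the lines of that normalizer, as a sketch (hypothetical function, applied to a hit's _source when reading):

// Hypothetical normalizer: a long field may come back as a number or as a
// string (if it was written as one, per the sketch above); convert it to a
// number for consumers. Converting very large values back to a JS number can
// lose precision, but for duration math/display that is usually acceptable.
function normalizeLong(value: unknown): number | undefined {
  if (typeof value === 'number') return value;
  if (typeof value === 'string') {
    const n = Number(value);
    return Number.isFinite(n) ? n : undefined;
  }
  return undefined;
}

const source = { event: { duration: '9774684315000000' } }; // e.g. hit._source
const durationNs = normalizeLong(source.event.duration);    // 9774684315000000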

Perhaps this is also a good time to evaluate whether we should add a ms duration value to the duration field, somehow. It would likely need to go under kibana, but worth investigating how other folks using ECS have dealt with this ...

A runtime field is perhaps worth looking into as well, but I worry about the performance implications.
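
For completeness, a sketch of what such a runtime field might look like (hypothetical field name event.duration_ms, shown as the object you'd pass in a put-mappings request); the performance concern stands, since the script runs per matching document at query time:

// Hypothetical runtime field exposing event.duration (nanoseconds, mapped long)
// as milliseconds, without changing the stored documents.
const runtimeMapping = {
  runtime: {
    'event.duration_ms': {
      type: 'double',
      script: {
        source: "if (doc['event.duration'].size() > 0) emit(doc['event.duration'].value / 1000000.0);",
      },
    },
  },
};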

mikecote (Contributor) commented:

I think we may have to rely on coerce (which is by default true) to help us. Send a string representation of the number.

Yeah, that's the path I'm thinking we'll have to take. But you're right: if search responses start returning string values, we'll have to assess the impact on things like aggregations and see if it's still a viable option before it gets too complex.
