
[responseOps] execution duration > 112 DAYS! causes error writing event log doc #127802

Closed
pmuellr opened this issue Mar 15, 2022 · 6 comments · Fixed by #130819
Labels: bug, Feature:Alerting, Feature:EventLog, Team:ResponseOps

Comments

pmuellr (Member) commented Mar 15, 2022

Kibana version: 7.17.1

Describe the bug:

Seen in user logs:

invalid event logged: [event.duration]: "event.duration" must be a safe number; {
"@timestamp":"2022-03-15T17:11:57.912Z",
"event":{"provider":"alerting","action":"active-instance","kind":"alert","category":["infrastructure"],
"start":"2021-11-22T14:00:33.597Z",
"duration":9774684315000000},

That's one long rule execution! A back-of-the-envelope calculation of how long 9774684315000000 nanoseconds is: about 113 days, and the GOOG concurs:

(screenshot: Google's conversion of 9,774,684,315,000,000 nanoseconds to roughly 113 days)
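
A quick check of that arithmetic (plain Node, nothing Kibana-specific; the value is the one from the log above):

const durationNs = 9774684315000000;                   // event.duration from the log above
const durationDays = durationNs / 1e9 / 60 / 60 / 24;  // ns -> s -> min -> h -> days
console.log(durationDays);                             // ~113.13
console.log(Number.isSafeInteger(durationNs));         // false - above 2^53 - 1, hence "must be a safe number"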

Steps to reproduce:

  1. not clear

Expected behavior:

An error should not occur.

Seems like we need to catch numbers bigger than Number.MAX_SAFE_INTEGER and either send that max value, or figure out if we can send a floating point or some other representation.
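
A minimal sketch of that kind of guard (hypothetical helper, not the actual event log code):

// Hypothetical guard: values that can't be represented as a safe integer are
// clamped to Number.MAX_SAFE_INTEGER before being written to the event log doc.
// Clamping loses information, which is why the discussion below also considers
// writing a string representation instead.
function toSafeNumber(value: number): number {
  return Number.isSafeInteger(value) ? value : Number.MAX_SAFE_INTEGER;
}

const eventDuration = toSafeNumber(9774684315000000); // 9007199254740991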

pmuellr added the bug label Mar 15, 2022
botelastic bot added the needs-team label Mar 15, 2022
pmuellr added the Team:ResponseOps label Mar 15, 2022
elasticmachine (Contributor) commented:

Pinging @elastic/response-ops (Team:ResponseOps)

botelastic bot removed the needs-team label Mar 15, 2022
pmuellr added the Feature:Alerting and needs-team labels Mar 15, 2022
botelastic bot removed the needs-team label Mar 15, 2022
pmuellr (Member, Author) commented Mar 15, 2022

Seems highly unlikely it was running this long, and I think it could be because of some faulty logic during retries where we count the duration from the first try. But ... wow, that still seems off.

Figured I'd open the issue to track it anyway ...

mikecote moved this from Awaiting Triage to Todo in AppEx: ResponseOps - Execution & Connectors Mar 17, 2022
mikecote (Contributor) commented Apr 6, 2022

I came across this issue as well; in my case, it happened whenever an alert had been active for longer than 112 days.

mikecote (Contributor) commented:

Linking similar issue: #130119

I've seen a large number of these errors in server logs:

invalid event logged: [kibana.task.schedule_delay]: "kibana.task.schedule_delay" must be a safe number; {"@timestamp":"2022-04-13T13:07:38.183Z","event":{"provider":"alerting","action":"execute-start","kind":"alert","category":["monitoring"],"start":"2022-04-13T13:07:38.183Z"},"kibana":{"saved_objects":[{"rel":"primary","type":"alert","id":"{uuid}","type_id":"monitoring_alert_kibana_version_mismatch"}],"task":{"scheduled":"2021-06-25T14:30:35.011Z","schedule_delay":25223823172000000},"server_uuid":"{uuid}","version":"7.16.1"},"rule":{"id":"{uuid}","license":"basic","category":"monitoring_alert_kibana_version_mismatch","ruleset":"monitoring"},"message":"alert execution start: \"{uuid}\"","ecs":{"version":"1.8.0"}})

As we can see, the schedule_delay field has the value 25223823172000000, which is larger than JavaScript's Number.MAX_SAFE_INTEGER (2^53 - 1). Oops.
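
For context, the same safe-integer check against that value (plain Node):

Number.MAX_SAFE_INTEGER;                    // 9007199254740991 (2^53 - 1)
Number.isSafeInteger(25223823172000000);    // false
25223823172000000 / 1e9 / 60 / 60 / 24;     // ~292 - plausible as days if the delay is in nanoseconds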

mikecote self-assigned this Apr 18, 2022
mikecote moved this from Todo to In Progress in AppEx: ResponseOps - Execution & Connectors Apr 18, 2022
pmuellr (Member, Author) commented Apr 20, 2022

Thought I'd drop a few notes on this.

The duration field is a mapped long:

"duration": {
"type": "long"
},

In ES, a long can have a maximum value of 2^63-1.

I was curious about the absolute limit, in days, that we can store in that long:

> node
Welcome to Node.js v16.14.2.
Type ".help" for more information.

> 2**63 /   // limit for nanoseconds
... 1000 /  // in microseconds
... 1000 /  // in milliseconds
... 1000 /  // in seconds
... 60 /    // in minutes
... 60 /    // in hours
... 24      // in days
106751.99116730063

That's almost 300 years, so ... seems safe.

I think we may have to rely on coerce (which defaults to true) to help us: send a string representation of the number. Or perhaps send it in some exponential/scientific form, but that would also be a string, so probably not a more useful path.
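
A rough sketch of that idea (hypothetical helper; relies on Elasticsearch's default coerce: true on numeric fields, which accepts numbers supplied as strings):

// Hypothetical: values that overflow Number.MAX_SAFE_INTEGER are written as
// strings and coerced back into the mapped `long` field by Elasticsearch.
// Note: if the number already lost precision upstream, the string carries that
// loss too - computing the value as a string/BigInt from the start avoids it.
function serializeLong(value: number): number | string {
  return Number.isSafeInteger(value) ? value : value.toString();
}

const doc = {
  event: {
    duration: serializeLong(9774684315000000), // "9774684315000000"
  },
};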

We should probably check whether any other existing fields can overflow like this - maybe we should have a common "stringization" of long fields that we use for all of them. It looks like we only use long for integral numeric types in the mappings right now, and there are 12 of them (some in a PR that I have loaded right now!). The only other immediately suspect one is kibana.task.schedule_delay, but I believe that value is milliseconds, so that would make that limit a million times bigger than 112 days, which seems safe :-)

Since we only ever access the _source of ES docs, presumably we'll have to "fix" them when we process the results from searches. My guess is that dealing with a plain old query will be easy - we can "fix" the values since we know where they are in the hits - however aggs are going to be ... hard. How would we know what to fix? I'd expect aggs over duration to be fairly common. We may have to provide some "normalizer" function that anyone using an agg could call, against agg result fields they know are based on this field, during their processing of the results.
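
Something along the lines of that normalizer, as a sketch (hypothetical function, applied to a hit's _source when reading):

// Hypothetical normalizer: a long field may come back as a number or as a
// string (if it was written as one, per the sketch above); convert it to a
// number for consumers. Converting very large values back to a JS number can
// lose precision, but for duration math/display that is usually acceptable.
function normalizeLong(value: unknown): number | undefined {
  if (typeof value === 'number') return value;
  if (typeof value === 'string') {
    const n = Number(value);
    return Number.isFinite(n) ? n : undefined;
  }
  return undefined;
}

const source = { event: { duration: '9774684315000000' } }; // e.g. hit._source
const durationNs = normalizeLong(source.event.duration);    // 9774684315000000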

Perhaps this is also a good time to evaluate whether we should add a ms duration value to the duration field, somehow. It would likely need to go under kibana, but worth investigating how other folks using ECS have dealt with this ...

A runtime field is perhaps worth looking into as well, but I worry about the performance implications.
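
For completeness, a sketch of what such a runtime field might look like (hypothetical field name event.duration_ms, shown as the object you'd pass in a put-mappings request); the performance concern stands, since the script runs per matching document at query time:

// Hypothetical runtime field exposing event.duration (nanoseconds, mapped long)
// as milliseconds, without changing the stored documents.
const runtimeMapping = {
  runtime: {
    'event.duration_ms': {
      type: 'double',
      script: {
        source: "if (doc['event.duration'].size() > 0) emit(doc['event.duration'].value / 1000000.0);",
      },
    },
  },
};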

mikecote (Contributor) commented:

I think we may have to rely on coerce (which is by default true) to help us. Send a string representation of the number.

Yeah, that's the path I'm thinking we'll have to take. But you're right: if search responses start returning string values, we'll have to assess the impact on things like aggregations and see if it's still a viable option before it gets too complex.
