
watcher fails with "Error response when trying to trigger reporting generation" #41617

Closed
LeeDr opened this issue Jul 19, 2019 · 18 comments · Fixed by #42456
Labels: bug, regression, (Deprecated) Feature:Reporting

Comments

@LeeDr

LeeDr commented Jul 19, 2019

Kibana version: 7.3.0 BC5

Elasticsearch version: 7.3.0 BC5

Server OS version: CentOS 7

Browser version: N/A

Browser OS version: N/A

Original install method (e.g. download page, yum, from source, etc.): rpm default dist

Describe the bug: Two tests in the integration-test repo each create an interval watch that emails a Kibana dashboard report, one as a PDF and one as a PNG. On 7.3.0 BC4 and earlier both tests pass, but on 7.3.0 BC5 the second test always fails. If I swap the order of the PDF and PNG tests, it is still whichever test runs second that fails.

These watches send the reports as email attachments, but based on the error message I don't think that is relevant to the problem.

Here is an example of the .watcher-history doc for one of these failures:

00:25:12.817 .watcher-history*={"took":67,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":0.6931472,"hits":[{"_index":".watcher-history-10-2019.07.19","_type":"_doc","_id":"watcher_report1563554002370_44c82301-867c-4638-a051-5c5090211b9c-2019-07-19T16:45:37.861919Z","_score":0.6931472,"_source":{"watch_id":"watcher_report1563554002370","node":"thvOhxZ2SpqAk0pC-aWvdQ","state":"executed","status":{"state":{"active":true,"timestamp":"2019-07-19T16:45:37.861Z"},"last_checked":"2019-07-19T16:45:37.861Z","last_met_condition":"2019-07-19T16:45:37.861Z","actions":{"email_admin":{"ack":{"timestamp":"2019-07-19T16:45:37.861Z","state":"awaits_successful_execution"},"last_execution":{"timestamp":"2019-07-19T16:45:37.861Z","successful":false,"reason":""}}},"execution_state":"executed","version":-1},"trigger_event":{"type":"schedule","triggered_time":"2019-07-19T16:45:37.861Z","schedule":{"scheduled_time":"2019-07-19T16:45:37.542Z"}},"input":{"none":{}},"condition":{"always":{}},"result":{"execution_time":"2019-07-19T16:45:37.861Z","execution_duration":4,"input":{"type":"none","status":"success","payload":{}},"condition":{"type":"always","status":"success","met":true},"actions":[{"id":"email_admin","type":"email","status":"failure","error":{"root_cause":[{"type":"exception","reason":"Watch[watcher_report1563554002370] reporting[test_report.pdf] Error response when trying to trigger reporting generation host[localhost], port[5601] method[POST], path[/api/reporting/generate/printablePdf], status[408]"}],"type":"exception","reason":"Watch[watcher_report1563554002370] reporting[test_report.pdf] Error response when trying to trigger reporting generation host[localhost], port[5601] method[POST], path[/api/reporting/generate/printablePdf], status[408]"}}]},"messages":[]}}]}}

with the important part being:
Error response when trying to trigger reporting generation host[localhost], port[5601] method[POST], path[/api/reporting/generate/printablePdf], status[408]
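
For context, the watch action is essentially doing the equivalent of this POST against Kibana (a sketch, not the literal request; jobParams is elided and the auth/header details are placeholders from our test setup):

$ curl -X POST -u elastic \
    -H 'kbn-xsrf: reporting' \
    'http://localhost:5601/api/reporting/generate/printablePdf?jobParams=...'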

On a local install I kept Elasticsearch at 7.3.0 BC5 and changed Kibana back to BC4, and the tests pass. Changing Kibana back to BC5 makes them fail again, so it really looks like this is caused by a Kibana change between those two BCs.

Steps to reproduce:

  1. Start with the Kibana default distribution (mine is installed via rpm packages on CentOS 7).
  2. Elasticsearch is configured with an email account, security enabled, and self-signed certificates, but I'll try to reproduce without the email part.
  3. We use the Watcher API to create the first watch, which has a 5 minute interval. We then wait 5 minutes and start checking whether the watch executed successfully. Then we delete that watch.
  4. We create the next watch, with the only difference being the report generation URL so we get a PNG instead of a PDF. We wait 5 minutes, check until it's successful, and delete the watch. (A rough sketch of the watch body follows this list.)
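
For reference, the watch body is roughly the following (a sketch only: the email address, credentials, and reporting URL are placeholders for what our test config actually generates):

PUT _watcher/watch/watcher_report<timestamp>
{
  "trigger": { "schedule": { "interval": "300s" } },
  "input": { "none": {} },
  "condition": { "always": {} },
  "actions": {
    "email_admin": {
      "email": {
        "to": "watchertest@example.com",
        "subject": "Reporting watch",
        "attachments": {
          "test_report.pdf": {
            "reporting": {
              "url": "http://localhost:5601/api/reporting/generate/printablePdf?jobParams=...",
              "auth": { "basic": { "username": "elastic", "password": "<password>" } }
            }
          }
        }
      }
    }
  }
}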

Expected behavior: Both watches should work as they have before.

Screenshots (if relevant):

Errors in browser console (if relevant):

Provide logs and/or server output (if relevant): I haven't found any errors in either Kibana or Elasticsearch logs related to the timeout triggering report generation.

Any additional context:

LeeDr added the bug, regression, (Deprecated) Feature:Reporting, and Team:Stack Services labels on Jul 19, 2019
@elasticmachine
Contributor

Pinging @elastic/kibana-stack-services

@joelgriffith
Contributor

I'll take a look at this first thing Monday.

@joelgriffith joelgriffith self-assigned this Jul 19, 2019
@LeeDr
Author

LeeDr commented Jul 19, 2019

For timing purposes on a failing test, here's the watch being added:

16:46:30.247: { _index: '.watches',
  _type: '_doc',
  _id: 'watcher_report1563572456661',
  _version: 1,
  result: 'created',
  _shards: { total: 1, successful: 1, failed: 0 },
  _seq_no: 15,
  _primary_term: 1 }

The watch has a 300s interval (we want it to complete before it gets triggered a second time).
At this point (just short of 5 minutes in) the watch hasn't run yet:

16:51:24.486: 
.watcher-history*=
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 0,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  }
}

And at the next check (about 15 seconds later) it has failed:

16:51:40.061: 
.watcher-history*=
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.5404451,
    "hits": [
      {
        "_index": ".watcher-history-10-2019.07.19",
        "_type": "_doc",
        "_id": "watcher_report1563572456661_7dd99997-b417-40c1-937a-06554dc8e742-2019-07-19T21:51:30.409402Z",
        "_score": 1.5404451,
        "_source": {
          "watch_id": "watcher_report1563572456661",
          "node": "k9oLryI0R5-Vy68ht9wG6g",
          "state": "executed",
          "status": {
            "state": {
              "active": true,
              "timestamp": "2019-07-19T21:51:30.409Z"
            },
            "last_checked": "2019-07-19T21:51:30.409Z",
            "last_met_condition": "2019-07-19T21:51:30.409Z",
            "actions": {
              "email_admin": {
                "ack": {
                  "timestamp": "2019-07-19T21:51:30.409Z",
                  "state": "awaits_successful_execution"
                },
                "last_execution": {
                  "timestamp": "2019-07-19T21:51:30.409Z",
                  "successful": false,
                  "reason": ""
                }
              }
            },
            "execution_state": "executed",
            "version": -1
          },
          "trigger_event": {
            "type": "schedule",
            "triggered_time": "2019-07-19T21:51:30.409Z",
            "schedule": {
              "scheduled_time": "2019-07-19T21:51:30.243Z"
            }
          },
          "input": {
            "none": {}
          },
          "condition": {
            "always": {}
          },
          "result": {
            "execution_time": "2019-07-19T21:51:30.409Z",
            "execution_duration": 15,
            "input": {
              "type": "none",
              "status": "success",
              "payload": {}
            },
            "condition": {
              "type": "always",
              "status": "success",
              "met": true
            },
            "actions": [
              {
                "id": "email_admin",
                "type": "email",
                "status": "failure",
                "error": {
                  "root_cause": [
                    {
                      "type": "exception",
                      "reason": "Watch[watcher_report1563572456661] reporting[test_report.png] Error response when trying to trigger reporting generation host[localhost], port[5601] method[POST], path[/api/reporting/generate/png], status[408]"
                    }
                  ],
                  "type": "exception",
                  "reason": "Watch[watcher_report1563572456661] reporting[test_report.png] Error response when trying to trigger reporting generation host[localhost], port[5601] method[POST], path[/api/reporting/generate/png], status[408]"
                }
              }
            ]
          },
          "messages": []
        }
      }
    ]
  }
}

@LeeDr
Author

LeeDr commented Jul 19, 2019

I ran again with logging.events.log: ['warning', 'error', 'fatal', 'info', 'reporting'] in my kibana.yml
timeout3.log

I don't see anything revealing in that log. The timestamps in the log are UTC, which is 5 hours ahead of my local time and of the test logging. So against the above log, here's when the watch was created:

17:08:05.440: { _index: '.watches',
  _type: '_doc',
  _id: 'watcher_report1563573751941',
  _version: 1,
  result: 'created',
  _shards: { total: 1, successful: 1, failed: 0 },
  _seq_no: 21,
  _primary_term: 1 }
>> PASS: chrome 11 on any platform - reporting app - watcher app - PNG Reporting watch - should successfully add a new watch for png Reporting (9ms)

Here's my theory now, based on the log lines that came right at the end of the previous test and just before this one:

{"type":"log","@timestamp":"2019-07-19T22:07:44Z","tags":["reporting","screenshots","debug"],"pid":28493,"message":"screenshot took 839ms"}
{"type":"log","@timestamp":"2019-07-19T22:07:45Z","tags":["reporting","browser-driver","chromium-driver-factory","debug"],"pid":28493,"message":"deleting chromium user data directory at [/tmp/chromium-L2BcNI]"}

Maybe this cleanup step of deleting a directory is interfering with the creation of the next watch. I'll try putting some sleep between the tests and see if that changes the results.

@LeeDr
Author

LeeDr commented Jul 19, 2019

Update: Adding a 30 second sleep between deleting the first watch and creating the second watch didn't have any impact on the failure.

@LeeDr
Author

LeeDr commented Jul 20, 2019

Could it possibly be something like the ES client connection I use for the first test getting disconnected for some reason, so it can't post the second reporting URL?

Now that I think about it more, I don't think that's it. The ES client is only used to post the watch to Elasticsearch, and that works; it isn't used to generate the report.

@tsullivan
Member

There's definitely a bug here, with the error being Error response when trying to trigger reporting generation, but from the logs it looks like Kibana did generate a report.

@joelgriffith
Contributor

The timing is interesting as well, since the two log messages are so close together: 07:44Z vs 07:45Z. There isn't enough time between steps to even run the job.

I'm taking a look now, but my guess is that there's an issue with a timeout set someplace.

@tsullivan
Member

I've been running a watch that generates a PDF locally for a while, and I'm not hitting any issues.

It repeats every 5 minutes, which makes it look like Watcher is not trying to hold a connection open between triggers.

@LeeDr
Author

LeeDr commented Jul 22, 2019

If I set server.keepaliveTimeout: 1200000 (20 minutes) in my kibana.yml then my tests pass. Without it they fail.
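
That is, in kibana.yml (the value is in milliseconds, so 1200000 ms = 20 minutes, comfortably longer than the 300s watch interval):

# workaround: keep idle sockets open longer than the watch interval
server.keepaliveTimeout: 1200000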

@peterschretlen
Contributor

Here's the snippet in Watcher that makes the initial report-generation request. So the request must have been returning 200 before but is returning 408 now? This happens before the polling / attachment retries begin.

https://github.com/elastic/elasticsearch/blob/master/x-pack/plugin/watcher/src/main/java/org/elasticsearch/xpack/watcher/notification/email/attachment/ReportingAttachmentParser.java#L182-L194

@peterschretlen
Contributor

Quoting an email from @jasontedor

There is a breaking difference in behavior between Kibana before this change was integrated and after.

The timeout itself is unchanged, 120s before and 120s after (which is why this was seen as a safe change); what is different is how Kibana handles the socket timeout.

Prior to the change being integrated, Kibana would quietly close the connection on a socket that had timed out by sending a TCP FIN.

After the change was integrated, Kibana now closes the connection on a socket that has timed out first by sending an HTTP response (literally “HTTP/1.1 408 Request Timeout”), and then proceeding to close the connection by sending a TCP FIN.

This behavior change causes the HTTP client to treat the response from Kibana as an actual response, which is delivered to the application (Watcher), which treats the non-200 response as a failure.

Prior to this change, the HTTP client would treat the FIN as the connection having been closed, and would re-establish a new one on which to send the POST request. In this case the application (again, Watcher) would be none the wiser to the fact that a managed HTTP connection had been closed and re-opened on its behalf.

It can be reproduced without Watcher; the steps below confirm what he describes and reproduce the problem on the 6.8, 7.2, and 7.3 branches.

  1. Start a persistent TCP connection:
$ socat TCP:localhost:5601 UNIX-LISTEN:/tmp/curl.sock,fork
  2. You should be able to see the new connection in netstat:
$ netstat -na | grep "5601 "
Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)
tcp4       0      0  127.0.0.1.5601         127.0.0.1.53524        ESTABLISHED
tcp4       0      0  127.0.0.1.53524        127.0.0.1.5601         ESTABLISHED
tcp4       0      0  127.0.0.1.5601         *.*                    LISTEN  
  3. After the connection times out, it looks like this (notice the 32 bytes sitting in the receive queue):
$ netstat -na | grep "5601 "
Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)
tcp4       0      0  127.0.0.1.5601         127.0.0.1.53524        FIN_WAIT_2 
tcp4      32      0  127.0.0.1.53524        127.0.0.1.5601         CLOSE_WAIT 
tcp4       0      0  127.0.0.1.5601         *.*                    LISTEN   
  4. Now send a curl request over that TCP connection, and you'll get a 408:
$ curl -v --unix-socket /tmp/curl.sock http://localhost:5601/fov/api
*   Trying /tmp/curl.sock...
* Connected to localhost (/tmp/curl.sock) port 80 (#0)
> GET /fov/api HTTP/1.1
> Host: localhost:5601
> User-Agent: curl/7.54.0
> Accept: */*
> 
< HTTP/1.1 408 Request Timeout
* no chunk, no close, no size. Assume close to signal end
< 
* Closing connection 0

Without the change, the difference is that those 32 bytes aren't sitting in the recv-q, and when you send the curl request you get "Empty reply from server" instead of an HTTP response.

@joelgriffith
Contributor

I've removed myself from this since it's not in reporting territory, but I'm still confused/concerned about the 408 response on timeout. It seems to me that we shouldn't be "holding" this response on a keep-alive socket, though I'm not sure that's something we can change, since it appears to be how Node behaves. From my perspective we're treating the underlying socket's idle timeout as the timeout, rather than the HTTP request itself. Ideally this timeout would apply per HTTP request, not per socket.
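
Roughly, in plain Node terms (a minimal sketch, not Kibana's actual server code), the difference between the two behaviors looks like this:

// sketch: socket-level idle timeout on a Node HTTP server
const http = require('http');

const server = http.createServer((req, res) => {
  res.end('ok');
});

// The idle timeout applies to the socket, not to an in-flight request.
server.setTimeout(120000); // 120s

server.on('timeout', (socket) => {
  // Old behavior (no 'timeout' listener at all): Node destroys the socket,
  // so the client only ever sees a TCP FIN and transparently reconnects.
  // socket.destroy();

  // New behavior (roughly what the 408 amounts to): write a raw HTTP
  // response first, which the client's HTTP layer delivers to the
  // application as if it were the reply to its next request.
  socket.end('HTTP/1.1 408 Request Timeout\r\n\r\n');
});

server.listen(5601); // illustrative port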

@peterschretlen
Contributor

peterschretlen commented Jul 24, 2019

@jbudz What would be the drawback of removing just the 408 response on timeout - then it should behave as it did before? Do we use / intend to use the 408 to provide messaging in the Kibana app?

We'll potentially see this anywhere an idle client connection can be kept alive longer than the Kibana server's timeout. If we configure the Kibana server timeout very low, we should even be able to see this in the Kibana app itself (some will be legitimate 408s, others not; it should be clear in the network view).

@joelgriffith
Contributor

From the light reading I've done in the Node.js docs, I'm leaning towards @peterschretlen's suggestion as well. Our clients in Kibana seem to have the current behavior built in, and I wonder what other users might run into once the 408 starts showing up. I think simply removing that timeout handler would work.

@pmuellr
Member

pmuellr commented Jul 25, 2019

I'm always a bit leery of "unusual" HTTP responses - I had to look up 408, I'd never seen that one before! I would think most clients would treat it as a generic 4XX response, meaning user error, don't try this again. The reaction from a "smart" client, if it thought the timeout was transient and success might be possible with a longer timeout, would be to retry the request and ask for a longer timeout via some option.

I don't know enough about the actual scenario to suggest alternatives, though.

@jbudz jbudz self-assigned this Jul 30, 2019
@jbudz
Member

jbudz commented Jul 30, 2019

The drawback is mostly around messaging and debugging. The messier side of this is browser -> server interactions that result in undefined errors, more so when there's a proxy in front adding another layer of interpretation.

I'd argue the Watcher client should be managing its own long-polling timeout (although maybe it is; the timestamps look iffy). Anything in between Kibana and Watcher could have different behavior and its own timeout, whether that's a 504, 502, or 408. I'll remove the message and we can bring the rest up for discussion later.

@jasontedor
Member

I'd argue the watcher client should be managing its own long polling timeout (although maybe it is, timestamps look iffy).

As I mentioned in the email that @peterschretlen posted here, the underlying client library does handle this transparently as long as it happens at the socket level; but once it happens at the HTTP level, I think we are doing the right thing by treating the 408 as an error (the underlying client library certainly can't know better). Right now we treat report creation as binary: it either succeeded or it didn't. I am not a fan of making this more complex on the client side.
