watcher fails with "Error response when trying to trigger reporting generation" #41617
Comments
Pinging @elastic/kibana-stack-services
I'll take a look at this first thing Monday.
For timing purposes on a failing test, here's the watch being added:
The watch has a 300s interval (we want it to complete before it gets triggered a second time).
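The watch body itself didn't survive into this thread, so purely as a sketch of its shape (the recipient, credentials, and jobParams URL below are placeholders, and creating it through the 7.x JavaScript client is an assumption; the tests may do it differently), it looks roughly like this:

```ts
// Illustrative only: approximate shape of the test watch, not the actual body used.
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function createReportWatch() {
  await client.watcher.putWatch({
    id: 'watcher_report1563554002370',
    body: {
      trigger: { schedule: { interval: '300s' } }, // long interval so it only fires once per test
      input: { none: {} },
      condition: { always: {} },
      actions: {
        email_admin: {
          email: {
            to: 'someone@example.com', // placeholder recipient
            subject: 'Kibana dashboard report',
            attachments: {
              'test_report.pdf': {
                reporting: {
                  // placeholder jobParams; the real URL is the reporting POST URL copied from Kibana
                  url: 'http://localhost:5601/api/reporting/generate/printablePdf?jobParams=...',
                  auth: { basic: { username: 'elastic', password: 'changeme' } } // placeholders
                }
              }
            }
          }
        }
      }
    }
  });
}

createReportWatch().catch(console.error);
```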
At the next check (about 15 seconds later), it has failed:
I ran again with more verbose logging, but I don't see anything revealing in that log. The timestamps in the log are UTC, which is 5 hours ahead of my local time and of the test logging. So, lining that up with the log above, here's when the watch was created:
Here's my current theory, based on the log entries right at the end of the previous test and just before this one:
Maybe this cleanup step of deleting a directory is interfering with the creation of the next watch. I'll try putting some sleep between the tests and see if that changes the results.
Update: adding a 30-second sleep between deleting the first watch and creating the second one had no impact on the failure.
Could it be something like the ES client connection I use for the first test getting disconnected for some reason and failing to post the second reporting URL? Now that I think about it more, I don't think that's it: the ES client is only used to post the watch to Elasticsearch, and that works; it isn't what generates the report.
There's definitely a bug here, with the error being …
The timing is interesting as well, since the time between messages is so close. I'm taking a look now, but my guess is that there's a timeout set someplace that's causing this.
I've been running a watch that generates a PDF locally for a while and haven't hit any issues. It repeats every 5 minutes, which makes it look like Watcher is not trying to hold a connection open between triggers.
If I set …
Here's the snippet in watcher for the initial report generation. So the request must have been returning a 408.
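The Elasticsearch snippet itself isn't reproduced above, but judging from the fields in the error message (host, port, method, path, status), the initial call boils down to something like the following, with any non-2xx status treated as a failure, so a 408 from Kibana fails the email action outright. This is a TypeScript paraphrase for illustration, not the actual watcher source:

```ts
// Paraphrase of the watcher-side check, not the real implementation.
import http from 'http';

function triggerReportGeneration(path: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const req = http.request(
      {
        host: 'localhost',
        port: 5601,
        method: 'POST',
        path,
        // Kibana requires a kbn-xsrf header on POST requests; the value is arbitrary.
        headers: { 'kbn-xsrf': 'reporting' }
      },
      (res) => {
        if (res.statusCode && res.statusCode >= 200 && res.statusCode < 300) {
          let body = '';
          res.on('data', (chunk) => (body += chunk));
          res.on('end', () => resolve(body)); // body contains the job path to poll later
        } else {
          // This is the branch the watch hits: status[408] -> action failure.
          reject(new Error(`Error response when trying to trigger reporting generation, status[${res.statusCode}]`));
        }
      }
    );
    req.on('error', reject);
    req.end();
  });
}

// Usage (placeholder jobParams):
// triggerReportGeneration('/api/reporting/generate/printablePdf?jobParams=...');
```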
Quoting an email from @jasontedor:
It can be reproduced without needing watcher; the steps below confirm what he describes. They reproduce the problem on the 6.8, 7.2, and 7.3 branches.
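The exact steps aren't captured here, but as I read the description, the behavior can be approximated with a single keep-alive socket: make one request, let the connection sit idle past the server's socket timeout, and the 408 that Kibana now writes on timeout is what the next request on that socket reads back. A rough sketch (the port, path, and idle wait are assumptions):

```ts
// Rough approximation of the described repro; the port, path, and idle wait are assumptions.
import net from 'net';

const socket = net.connect(5601, 'localhost', () => {
  // First request on a keep-alive connection.
  socket.write('GET /api/status HTTP/1.1\r\nHost: localhost:5601\r\nConnection: keep-alive\r\n\r\n');
});

socket.on('data', (chunk) => {
  // After the idle period, the 408 written by the server shows up here
  // (until then it sits in the client's Recv-Q waiting to be read).
  console.log(new Date().toISOString(), chunk.toString().split('\r\n')[0]);
});

socket.on('close', () => console.log(new Date().toISOString(), 'socket closed'));

// Let the connection sit idle longer than the server's socket timeout,
// then try a second request on the same socket.
setTimeout(() => {
  socket.write('GET /api/status HTTP/1.1\r\nHost: localhost:5601\r\nConnection: keep-alive\r\n\r\n');
}, 150_000);
```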
Without the change, the difference is that those 32 bytes aren't there in the Recv-Q, and when you send the curl request, you get …
I've removed myself from this since it's not in reporting territory; however, I'm still confused/concerned about the 408 response on timeout. It seems to me that we shouldn't be "holding" this response on a keep-alive socket, and I'm not sure whether this is something we can change, since it appears to be how Node behaves. From my perspective we're applying the timeout to the underlying socket rather than to the HTTP request itself. Ideally we'd have this timeout per HTTP request, not per socket.
@jbudz What would be the drawback of removing just the 408 response on timeout, so that it behaves as it did before? Do we use, or intend to use, the 408 to provide messaging in the Kibana app? We'll potentially see this anywhere an idle client connection can be kept alive longer than the Kibana server connection. If we configure the Kibana server timeout very low, we should even be able to see this in the Kibana app itself (some will be legitimate 408s, others not; it should be clear in the network view).
From the light reading I've done of the Node.js docs, @peterschretlen's comment is what I'm leaning towards as well. Our client(s) in Kibana seem to have the current behavior built in, and I wonder what other users might hit once the 408 starts showing up. I think simply removing that timeout handler would work.
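To make the trade-off concrete, here's a minimal sketch of the two behaviors being discussed, using a plain Node http server rather than Kibana's actual server code (the timeout value is arbitrary): by default Node destroys an idle socket when it times out, whereas registering a 'timeout' listener that writes a 408 is what leaves that response queued for the next request on the connection. Removing the handler restores the old destroy-on-timeout behavior.

```ts
// Minimal sketch of the two timeout behaviors; not Kibana's actual server code.
import http from 'http';

const server = http.createServer((req, res) => {
  res.end('ok');
});

// Idle-socket timeout (Kibana's server.socketTimeout setting plays the analogous role).
server.setTimeout(30_000);

// Variant A (default, no 'timeout' listener registered): Node destroys the idle
// socket, and a keep-alive client library transparently reconnects and retries.

// Variant B (the behavior under discussion): respond with 408 on timeout. The 408
// is written to a socket with no in-flight request, so it is what the client reads
// back when it eventually reuses the connection.
server.on('timeout', (socket) => {
  socket.end('HTTP/1.1 408 Request Timeout\r\nConnection: close\r\nContent-Length: 0\r\n\r\n');
});

server.listen(3000);
```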
I'm always a bit leery of "unusual" HTTP responses; I had to look up 408, I've never seen that one before! I would think most clients would treat it as a generic 4XX response, meaning user error, don't try this again. The reaction from a "smart" client should probably be to retry the request, asking (via some option) for a longer timeout, if it thought the timeout was transient and success might be possible given more time. I don't know enough about the actual scenario to suggest alternatives, though.
The drawback is mostly around messaging and debugging. The messier side of this is browser -> server interactions that result in undefined errors, more so when there's a proxy in front adding another layer of interpretation. I'd argue the watcher client should be managing its own long-polling timeout (although maybe it is; the timestamps look iffy). Anything in between Kibana and watcher could have different behavior and its own timeout, whether that's a 504, 502, or 408. I'll remove the message and we can bring that up for discussion later.
As I mentioned in the email that @peterschretlen posted here, the underlying client library handles this transparently as long as it happens at the socket level, but once it's done at the HTTP level I think we are doing the right thing by treating the 408 as an error (the underlying client library certainly has no idea what to do with it). Right now we treat report creation as binary: it either succeeded or it didn't. I am not a fan of making this more complex on the client side.
Kibana version: 7.3.0 BC5
Elasticsearch version: 7.3.0 BC5
Server OS version: CentOS 7
Browser version: N/A
Browser OS version: N/A
Original install method (e.g. download page, yum, from source, etc.): rpm default dist
Describe the bug: Two tests in the integration-test repo create an interval watch to email a Kibana dashboard report, one as PDF and one as PNG. On 7.3.0 BC4 and earlier both tests pass, but on 7.3.0 BC5 the second test always fails. I can swap the order of the PDF and PNG reports and it's always the second test that fails.
These watches send the reports as email attachments, but based on the error message I don't think that's relevant to the problem:
An example of the .watcher-history doc for one of these failures:
00:25:12.817 .watcher-history*=
{
  "took": 67,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 1, "relation": "eq" },
    "max_score": 0.6931472,
    "hits": [
      {
        "_index": ".watcher-history-10-2019.07.19",
        "_type": "_doc",
        "_id": "watcher_report1563554002370_44c82301-867c-4638-a051-5c5090211b9c-2019-07-19T16:45:37.861919Z",
        "_score": 0.6931472,
        "_source": {
          "watch_id": "watcher_report1563554002370",
          "node": "thvOhxZ2SpqAk0pC-aWvdQ",
          "state": "executed",
          "status": {
            "state": { "active": true, "timestamp": "2019-07-19T16:45:37.861Z" },
            "last_checked": "2019-07-19T16:45:37.861Z",
            "last_met_condition": "2019-07-19T16:45:37.861Z",
            "actions": {
              "email_admin": {
                "ack": { "timestamp": "2019-07-19T16:45:37.861Z", "state": "awaits_successful_execution" },
                "last_execution": { "timestamp": "2019-07-19T16:45:37.861Z", "successful": false, "reason": "" }
              }
            },
            "execution_state": "executed",
            "version": -1
          },
          "trigger_event": {
            "type": "schedule",
            "triggered_time": "2019-07-19T16:45:37.861Z",
            "schedule": { "scheduled_time": "2019-07-19T16:45:37.542Z" }
          },
          "input": { "none": {} },
          "condition": { "always": {} },
          "result": {
            "execution_time": "2019-07-19T16:45:37.861Z",
            "execution_duration": 4,
            "input": { "type": "none", "status": "success", "payload": {} },
            "condition": { "type": "always", "status": "success", "met": true },
            "actions": [
              {
                "id": "email_admin",
                "type": "email",
                "status": "failure",
                "error": {
                  "root_cause": [
                    {
                      "type": "exception",
                      "reason": "Watch[watcher_report1563554002370] reporting[test_report.pdf] Error response when trying to trigger reporting generation host[localhost], port[5601] method[POST], path[/api/reporting/generate/printablePdf], status[408]"
                    }
                  ],
                  "type": "exception",
                  "reason": "Watch[watcher_report1563554002370] reporting[test_report.pdf] Error response when trying to trigger reporting generation host[localhost], port[5601] method[POST], path[/api/reporting/generate/printablePdf], status[408]"
                }
              }
            ]
          },
          "messages": []
        }
      }
    ]
  }
}
with the important part being:
Error response when trying to trigger reporting generation host[localhost], port[5601] method[POST], path[/api/reporting/generate/printablePdf], status[408]
On a local install, I kept Elasticsearch 7.3.0 BC5 and switched Kibana back to BC4, and the tests pass. Switch Kibana back to BC5 and they fail again, so this really does appear to be caused by a change between those two BCs.
Steps to reproduce:
Expected behavior: Both watches should work as they have before.
Screenshots (if relevant):
Errors in browser console (if relevant):
Provide logs and/or server output (if relevant): I haven't found any errors in either the Kibana or Elasticsearch logs related to the timeout when triggering report generation.
Any additional context: