-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML] Add a timeout option to file structure finder #34117
[ML] Add a timeout option to file structure finder #34117
Conversation
This can be used to restrict the amount of CPU a single structure finder request can use. The timeout is not implemented precisely, so requests may run for slightly longer than the timeout before aborting. The default is 25 seconds, which is a little below Kibana's default timeout of 30 seconds for calls to Elasticsearch APIs.
Pinging @elastic/ml-core |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to avoid the volatile reads in TimeoutChecker.check
. Have to you evaluated if there is a performance impact?
I don't see anything wrong in this change however, I wonder if so many checks need to be made. The description mentions that the default timeout is 25s so it will reasonably timeout in under the ES default of 30s which provides 5s to play with.
"type" : "timeout_exception", | ||
"reason" : "Aborting structure analysis during [delimited record parsing] as it has taken longer than the timeout of [1s]" | ||
}, | ||
"status" : 500 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think timeout should be 500 error. Do you know what ES returns for timeouts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TransportCloseJobAction
throws IllegalStateException
on timeout and TransportOpenJobAction
throws ElasticsearchException
. Both these map to status 500.
For an example outside of X-Pack, TransportRestoreSnapshotAction
throws ProcessClusterEventTimeoutException
, and this maps to 503. But the description of 503 says it's for temporary conditions which will be alleviated after some delay, so I'm not sure that's right for the file structure finder, because if an analysis is too time consuming first time it's likely to stay that way (at least for the same hardware).
Also, whoever originally wrote the ElasticsearchTimeoutException
class had the opportunity to specify the most appropriate status for it and chose 500.
So although 500 seems suboptimal, it's probably the best there is and most consistent with the rest of ES.
/** | ||
* Stops the timer if running. | ||
*/ | ||
public void close() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: missing @Override
@@ -283,6 +308,7 @@ FileStructureFinder makeBestStructureFinder(List<String> explanation, String sam | |||
String line; | |||
while ((line = bufferedReader.readLine()) != null && ++lineCount <= maxLines) { | |||
sample.append(line).append('\n'); | |||
timeoutChecker.check("sample line splitting"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a thin loop to be doing the check every time. There is a similar problem in DelimitedFileStructureFinder
I don't want to get into premature optimisations but it would be nice if the check wasn't run every iteration. Possibly the overhead of the logic to check that would be heavier than the fast call to timeoutChecker.check(...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly the overhead of the logic to check that would be heavier
That's my assumption - see #34117 (comment)
I'll profile it to confirm.
if (grok.match(sampleMessage) == false) { | ||
return false; | ||
} | ||
timeoutChecker.check("full message Grok pattern matching"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be moved outside the loop? How many sample messages does the call expect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Grok pattern matching is one of the most expensive operations. In Ingest pipeline a watchdog that's a little bit similar to the TimeoutChecker
of this PR is installed to possibly interrupt every single cal to match()
:
threadWatchdog.register(); |
With the structure finder's use of Grok that inner watchdog is the no-op watchdog, so we're not double checking for slow Grok patterns. But it still seems reasonable to check once per Grok pattern match given that it's such an expensive operation.
...k/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/filestructurefinder/TimeoutChecker.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This can be used to restrict the amount of CPU a single structure finder request can use. The timeout is not implemented precisely, so requests may run for slightly longer than the timeout before aborting. The default is 25 seconds, which is a little below Kibana's default timeout of 30 seconds for calls to Elasticsearch APIs.
This change fixes a potential deadlock problem in the unit test introduced in elastic#34117. It also removes a piece of debug code and corrects a docs formatting problem that were both added in that same PR.
This change fixes a potential deadlock problem in the unit test introduced in #34117. It also removes a piece of debug code and corrects a docs formatting problem that were both added in that same PR.
This change fixes a potential deadlock problem in the unit test introduced in #34117. It also removes a piece of debug code and corrects a docs formatting problem that were both added in that same PR.
This can be used to restrict the amount of CPU a single structure finder request can use. The timeout is not implemented precisely, so requests may run for slightly longer than the timeout before aborting. The default is 25 seconds, which is a little below Kibana's default timeout of 30 seconds for calls to Elasticsearch APIs.
This change fixes a potential deadlock problem in the unit test introduced in #34117. It also removes a piece of debug code and corrects a docs formatting problem that were both added in that same PR.
This can be used to restrict the amount of CPU a single
structure finder request can use.
The timeout is not implemented precisely, so requests
may run for slightly longer than the timeout before
aborting.
The default is 25 seconds, which is a little below
Kibana's default timeout of 30 seconds for calls to
Elasticsearch APIs.