Query resilience to failures #2909
From David on Slack:

Qubole has this: https://docs.qubole.com/en/latest/user-guide/engines/presto/presto-query-retry.html and https://www.qubole.com/blog/spot-nodes-in-presto-on-qubole/

"Using the Presto Query Retrying Mechanism

Qubole has added a query retry mechanism to handle query failures where possible. It is useful when Qubole adds nodes to the cluster during autoscaling, or after a Spot node loss (that is, when the cluster composition contains Spot nodes). The new query retry mechanism retries a query that failed with the LocalMemoryExceeded error once new nodes are added to the cluster or are in the process of being added.

Uses of the Query Retry Mechanism

The query retry mechanism is useful, for example, when a query triggers upscaling but fails with the LocalMemoryExceeded error because it ran on the smaller, pre-upscale cluster. The retry mechanism ensures that the failed query is automatically retried on the upscaled cluster.

You can enable this feature at the cluster and session levels using the corresponding properties. At the cluster level, override retry.autoRetry=true in the Presto cluster overrides. On the Presto Cluster UI, you can override a cluster property under Advanced Configuration > PRESTO SETTINGS > Override Presto Configuration. Note that the session property is mainly useful for disabling the retry feature at the query level when autoRetry is enabled at the cluster level.

You can configure these parameters: retrier.max-wait-time-local-memory-exceeded is the maximum time Presto waits for new nodes to join the cluster before giving up on retrying a query that failed with the LocalMemoryExceeded error. Its value is given in seconds or minutes (for example, 2s or 2m); the default is 5m. If no new node joins the cluster within this period, Qubole returns the original query failure response.

Query retries can occur multiple times. By default, up to three retries occur if all conditions are met. The main condition is that the error is retryable: currently LocalMemoryExceeded and the node-loss errors REMOTE_HOST_GONE, TOO_MANY_REQUESTS_FAILED, and PAGE_TRANSPORT_TIMEOUT are considered retryable. The list of node-loss errors is configurable via the retry.nodeLostErrors property."

"Smart Query Retries

Even with the Spot termination handling described above, queries can still fail due to Spot loss. This can happen when intermediate stages take a long time, or when leaf stages take a long time and a node terminates before processing completes. For all such cases we need a query retry mechanism that is smart enough to decide whether a query can be retried and, if so, to pick the right time to retry it to improve its chances of success. To this end, we have built the Smart Query Retry Mechanism inside the Presto server, which fulfills the following requirements: query retries should be transparent to clients and work with all Presto clients (Java client, JDBC drivers, Ruby client, etc.); the Presto server offers a single place for the retry logic rather than replicating it in each client.
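Pulling the property names out of the quoted documentation, a cluster-level override would look roughly like this (only the property names and defaults come from the docs above; this fragment is illustrative, not verified against a running cluster):

```properties
# Presto cluster override (Advanced Configuration > PRESTO SETTINGS > Override Presto Configuration)
retry.autoRetry=true
# Maximum wait for new nodes before giving up on a LocalMemoryExceeded retry (default 5m)
retrier.max-wait-time-local-memory-exceeded=5m
# Node-loss error codes considered retryable (default list per the quoted docs)
retry.nodeLostErrors=REMOTE_HOST_GONE,TOO_MANY_REQUESTS_FAILED,PAGE_TRANSPORT_TIMEOUT
```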
Transparent retries imply that no changes are required on the client side for retries to work. All Presto clients submit the query to the server and then poll for status in a loop until the query completes. The Presto server internally retries the query as a new query on failure, while the client continues polling uninterrupted and eventually gets the results from the new query. The trick to make this work is to change the nextUri returned to the client so that it points at the retried query.
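A minimal sketch of the idea, assuming (hypothetically) that the coordinator can hand back a nextUri belonging to the retried query; all names, URI shapes, and the FAILED_RETRYING state are invented for illustration:

```python
# Hypothetical sketch: every Presto client polls nextUri in a loop. If the
# server retries the query internally, it just returns a nextUri that belongs
# to the new query; the client's loop is completely unchanged.

def poll_until_done(fetch_status, initial_uri):
    """fetch_status(uri) -> dict with keys 'state', optional 'nextUri', 'data'."""
    uri, rows = initial_uri, []
    while uri is not None:
        status = fetch_status(uri)
        rows.extend(status.get("data", []))
        uri = status.get("nextUri")  # may silently point at a retried query
    return rows

# Simulated server: query q1 fails before returning data, so the server
# transparently redirects the client to the retry, q2.
responses = {
    "/v1/query/q1/0": {"state": "FAILED_RETRYING", "nextUri": "/v1/query/q2/0", "data": []},
    "/v1/query/q2/0": {"state": "RUNNING", "nextUri": "/v1/query/q2/1", "data": [1, 2]},
    "/v1/query/q2/1": {"state": "FINISHED", "data": [3]},
}
print(poll_until_done(responses.__getitem__, "/v1/query/q1/0"))  # [1, 2, 3]
```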
As described earlier, Presto can make partial results available to clients while the query is still running due to pipelined execution. Retrying a query when partial results have already been processed by the client would lead to duplicate data in the results. This is a difficult problem to solve, as storing the state of results consumed by a client can be very difficult. In the future, retries will be extended to SELECT queries provided the query has not returned any results before failing.
Blind retry of a failed insert query can lead to duplicate results (as described above), as some partial results may have been written before the query failed. To solve this issue, the Query Retry Mechanism tracks the rollback status of the query and only initiates a retry once the rollback is complete. Only the Presto server is privy to this information about the rollback status as well as information about the retry mechanism, which also sits inside the server.
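The rollback gating described above might be sketched like this (all names are hypothetical; the real logic lives inside the Presto server, which is the only component that knows the rollback status):

```python
import enum
import time

class RollbackState(enum.Enum):
    IN_PROGRESS = 1
    DONE = 2

def retry_after_rollback(get_rollback_state, resubmit, poll_interval=0.01, timeout=5.0):
    """Only resubmit a failed INSERT once its rollback has completed,
    so the retry cannot duplicate partially written rows."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_rollback_state() is RollbackState.DONE:
            return resubmit()
        time.sleep(poll_interval)
    raise TimeoutError("rollback did not complete; surfacing original failure")

# Simulated rollback that completes after a couple of polls.
states = iter([RollbackState.IN_PROGRESS, RollbackState.IN_PROGRESS, RollbackState.DONE])
result = retry_after_rollback(lambda: next(states), lambda: "retried-query-id")
print(result)  # retried-query-id
```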
Retrying in cases where the query would fail again only wastes resources. For example, retrying a query that failed due to a resource shortage after the cluster has shrunk would almost certainly fail again. The retry mechanism is therefore made aware of the query failure codes, the cluster state in which the query failed, and the current cluster state at which the retry is being considered. Using this information, it can make an informed decision on whether or not to retry the query."
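That decision might be sketched as follows. The error codes are the ones named in the quoted docs; the cluster-state rule (never retry on a cluster smaller than the one that failed) is a simplified assumption, not Qubole's actual heuristic:

```python
# Hypothetical "should we retry?" decision: retry only if the error is in the
# retryable set AND the cluster has not shrunk since the failure (retrying a
# resource-shortage failure on a smaller cluster would likely just fail again).
RETRYABLE_ERRORS = {
    "LOCAL_MEMORY_EXCEEDED",    # wait for upscaling, then retry
    "REMOTE_HOST_GONE",         # node-loss errors, per retry.nodeLostErrors
    "TOO_MANY_REQUESTS_FAILED",
    "PAGE_TRANSPORT_TIMEOUT",
}

def should_retry(error_code, nodes_at_failure, nodes_now, attempts, max_retries=3):
    if attempts >= max_retries:
        return False
    if error_code not in RETRYABLE_ERRORS:
        return False
    # A shrunken cluster means a resource-shortage failure would repeat.
    return nodes_now >= nodes_at_failure

print(should_retry("LOCAL_MEMORY_EXCEEDED", 8, 10, attempts=0))  # True
print(should_retry("SYNTAX_ERROR", 8, 10, attempts=0))           # False
print(should_retry("REMOTE_HOST_GONE", 10, 6, attempts=0))       # False
```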
prestodb/presto#15857 has a fix.
Related to #9101.
@tooptoop4 I would love to create a parent issue for failure recovery that would encompass everything we are designing and working on in that direction. Ideally the issue summary would be well structured, with an overview and a list of work items. Do you mind if I close this issue in favor of #9101, which I can edit to provide more details?
Closed in favor of #9101. |
Worker instances can die at any time (spot instances, scale-down, infra issues, etc.). Even with graceful shutdown, queries can fail (e.g. if a query launched just before shutdown takes longer to run than the grace period). This issue is about adding config (session/server side) so that the Presto server transparently reruns the query from the start (no need to resume from particular stages/operators). The user who submits the query should have no idea that the first attempt failed internally; they simply get a normal successful query result from the second attempt. In other words, it should be indistinguishable to the user whether a retry happened.
Some options:
only-retry-if-error-related-to-worker-gone (false means any type of error triggers a rerun; true means only errors related to a missing worker, page transport timeout, etc. I envision the first cut of the PR retrying on all errors; with further refinement it can be restricted to worker-gone-type errors)
max-retries
interval-between-retries
max-duration-of-attempt (i.e. if an attempt took 40 minutes to error out, we might not want to wait another 80+ minutes on retries)
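The four options above could combine into a single retry policy along these lines (all names are hypothetical; this is a sketch of the proposed semantics, not Presto code):

```python
import time

# Hypothetical set of "worker gone" error codes (node loss, transport timeout).
WORKER_GONE_ERRORS = {"REMOTE_HOST_GONE", "PAGE_TRANSPORT_TIMEOUT"}

def run_with_retries(run_query, *,
                     only_retry_if_worker_gone=False,
                     max_retries=3,
                     interval_between_retries=0.0,
                     max_duration_of_attempt=None):
    """run_query() -> (ok, result_or_error_code, duration_seconds)."""
    for attempt in range(max_retries + 1):
        ok, payload, duration = run_query()
        if ok:
            return payload
        if attempt == max_retries:
            break  # max-retries exhausted
        if only_retry_if_worker_gone and payload not in WORKER_GONE_ERRORS:
            break  # only-retry-if-error-related-to-worker-gone
        if max_duration_of_attempt is not None and duration > max_duration_of_attempt:
            break  # max-duration-of-attempt: the failed run already took too long
        time.sleep(interval_between_retries)  # interval-between-retries
    raise RuntimeError(f"query failed: {payload}")

# First attempt hits a worker loss, second succeeds; the caller only sees success.
attempts = iter([(False, "REMOTE_HOST_GONE", 1.0), (True, [42], 1.0)])
print(run_with_retries(lambda: next(attempts), only_retry_if_worker_gone=True))  # [42]
```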
prestodb/presto#6006
some guesses of where this could be implemented:
/protocol/ExecutingStatementResource
/protocol/Query
QueuedStatementResource
DispatchManager
SqlQueryManager
SqlQueryExecution
StatementClientV1
DispatchQueryFactory
LocalDispatchQueryFactory
HttpPageBufferClient (https://github.com/ernestrc/sonicd/blob/627a32576f0417011f25835296e6bca31514b821/server/src/main/scala/build/unstable/sonicd/source/PrestoSource.scala#L318-L335 has some notes)
QueryStateInfoResource
QueryResource
https://github.com/prestosql/presto/pull/1660/files shows many classes likely involved
One complication mentioned in the prestodb issue is that the client may have already consumed a partial result set from the first attempt; the second attempt would then resend the same rows along with the full remaining results, producing duplicates. But I think this feature, for SELECT sql, could initially be targeted at cases where no results have been sent to the client yet.
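That restriction amounts to a one-line guard (names hypothetical; only the coordinator knows whether any page has been shipped to the client):

```python
def can_transparently_retry_select(rows_sent_to_client, error_is_retryable):
    # Once even one row has reached the client, a blind rerun would resend
    # those rows and produce duplicates, so the failure must be surfaced.
    return error_is_retryable and rows_sent_to_client == 0

print(can_transparently_retry_select(0, True))    # True
print(can_transparently_retry_select(250, True))  # False
```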
_Someone implemented that (Qubole?) by having the coordinator start a new query and using the nextUri to redirect the client to the new query. This works, assuming the query hasn't produced any data yet (the coordinator knows this). It seems a bit hacky and has questions around statistics, query events, etc. But with those caveats it seems fairly straightforward (though I might be missing some things).
We have a different idea for spot instances in particular, which involves adding support for quickly evacuating a worker by moving all the tasks to different nodes. But this is a large project.
For retry, the big question is if spot instance revocations happen in batches. If so, and this happens infrequently relative to the execution duration of your queries, a simple retry policy could work.
Losing a bunch of workers at once is much better than losing one worker at a time._