Query resilience to failures #2909

Closed
tooptoop4 opened this issue Feb 22, 2020 · 6 comments
Labels: enhancement (New feature or request), roadmap (Top level issues for major efforts in the project)

Comments

tooptoop4 (Contributor) commented Feb 22, 2020

Worker instances can die at any time (spot instance reclamation, scale-down, infrastructure issues, etc.). Even with graceful shutdown, queries can still fail, for example if a query is launched just before shutdown and runs longer than the grace period. This issue proposes configuration (session- and/or server-side) so that the Presto server transparently reruns the query from the start (no need to resume from specific stages/operators). The user who submits the query should have no idea that the first attempt failed internally; they simply get a standard successful result from the second attempt. In other words, it should be indistinguishable to the user whether a retry happened.

Some options (a hypothetical sketch of these knobs follows this list):

- only-retry-if-error-related-to-worker-gone (false means any type of error triggers a rerun; true means only errors related to a missing worker, page transport timeout, etc. I envision the first cut of the PR retrying on all errors; with further refinement it can be restricted to worker-gone type errors)
- max-retries
- interval-between-retries
- max-duration-of-attempt (i.e. if something took 40 minutes to error out, you might not want to wait another 80+ minutes)
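To make the proposal concrete, here is a minimal sketch of how these knobs might be grouped into a coordinator-side config holder. This is not Presto's actual API; every name here is hypothetical.

```java
import java.time.Duration;

// Hypothetical holder for the proposed retry knobs; all names are illustrative only.
public class QueryRetryConfig
{
    // true: only rerun on worker-gone style errors (REMOTE_HOST_GONE, PAGE_TRANSPORT_TIMEOUT, ...);
    // false: any query failure triggers a rerun
    private boolean onlyRetryIfWorkerGone = true;

    // how many transparent reruns the coordinator may attempt before surfacing the failure
    private int maxRetries = 3;

    // pause between an attempt failing and the next attempt being dispatched
    private Duration intervalBetweenRetries = Duration.ofSeconds(30);

    // give up on retrying once a single attempt has run this long
    private Duration maxDurationOfAttempt = Duration.ofMinutes(40);

    public boolean isOnlyRetryIfWorkerGone() { return onlyRetryIfWorkerGone; }
    public int getMaxRetries() { return maxRetries; }
    public Duration getIntervalBetweenRetries() { return intervalBetweenRetries; }
    public Duration getMaxDurationOfAttempt() { return maxDurationOfAttempt; }
}
```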

prestodb/presto#6006

Some guesses of where this could be implemented:

/protocol/ExecutingStatementResource
/protocol/Query
QueuedStatementResource
DispatchManager
SqlQueryManager
SqlQueryExecution
StatementClientV1
DispatchQueryFactory
LocalDispatchQueryFactory
HttpPageBufferClient (https://github.com/ernestrc/sonicd/blob/627a32576f0417011f25835296e6bca31514b821/server/src/main/scala/build/unstable/sonicd/source/PrestoSource.scala#L318-L335 has some notes)
QueryStateInfoResource
QueryResource
https://github.com/prestosql/presto/pull/1660/files shows many classes likely involved

One complication mentioned in the prestodb issue is that the client may have already consumed a partial result set from the first attempt; the second attempt would then re-send those same rows along with the rest of the results, producing duplicates. But I think this feature, for SELECT queries, could be targeted at cases where no results have been sent to the client yet (see the guard sketched below).
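A rough sketch of that guard, assuming the coordinator can tell whether any result pages have been flushed to the client. The class, method, and error-code strings are illustrative, not Presto's API.

```java
import java.util.Set;

// Illustrative guard: only rerun a SELECT whose client has not consumed any results yet,
// and only for worker-loss style failures.
public class SelectRetryGuard
{
    private static final Set<String> WORKER_GONE_ERRORS =
            Set.of("REMOTE_HOST_GONE", "TOO_MANY_REQUESTS_FAILED", "PAGE_TRANSPORT_TIMEOUT");

    public static boolean shouldRetry(String errorCode, boolean resultsSentToClient)
    {
        if (resultsSentToClient) {
            // the client may already hold partial rows; rerunning would duplicate them
            return false;
        }
        return WORKER_GONE_ERRORS.contains(errorCode);
    }
}
```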

_Someone implemented that (Qubole?) by having the coordinator start a new query and using the nextUri to redirect the client to the new query. This works, assuming the query hasn't produced any data yet (the coordinator knows this). It seems a bit hacky and has questions around statistics, query events, etc. But with those caveats it seems fairly straightforward (though I might be missing some things).
We have a different idea for spot instances in particular, which involves adding support for quickly evacuating a worker by moving all the tasks to different nodes. But this is a large project.

For retry, the big question is if spot instance revocations happen in batches. If so, and this happens infrequently relative to the execution duration of your queries, a simple retry policy could work.
Losing a bunch of workers at once is much better than losing one worker at a time._

tooptoop4 (Contributor, Author) commented:

From David on Slack:
This is a very complicated issue from a design perspective. If it was easy, we would have done it a long time ago. Some questions to resolve:
How does a retry affect the query ID? Is each attempt a new query ID?
How is retry communicated to a client?
How does retry affect statistics and events?

tooptoop4 (Contributor, Author) commented Mar 7, 2020

Qubole has this: https://docs.qubole.com/en/latest/user-guide/engines/presto/presto-query-retry.html and https://www.qubole.com/blog/spot-nodes-in-presto-on-qubole/
Slide 29 from Treasure Data: https://www.slideshare.net/treasure-data/2015-0311td-techtalkinternalsofprestoservice

"Using the Presto Query Retrying Mechanism

Qubole has added a query retry mechanism to handle query failures (if possible). It is useful in cases when Qubole adds nodes to the cluster during autoscaling or after a spot node loss (that is when the cluster composition contains Spot nodes). The new query retry mechanism:

Retries a query that failed with the LocalMemoryExceeded error when new nodes are added to the cluster or are in the process of being added.
Retries a query that failed with an error due to worker node loss.
In the above two scenarios, there is a waiting time period to let new nodes join the cluster before Presto retries the failed query. To avoid an endless waiting time period, Qubole has added appropriate timeouts. Qubole has also ensured that any actions performed by the failed query’s partial execution are rolled back before retrying the failed query.

Uses of the Query Retry Mechanism

The query retry mechanism is useful in these two cases:

When a query triggers upscaling but fails with the LocalMemoryExceeded error as it is run on a smaller-size cluster. The retry mechanism ensures that the failed query is automatically retried on that upscaled cluster.
When a Spot node loss happens during the query execution. The retry mechanism ensures that the failed query is automatically retried when new nodes join the cluster (when there is a Spot node loss, Qubole automatically adds new nodes to stabilize the cluster after it receives a Spot termination notification. Hence, immediately after the Spot node loss, a new node joins the cluster).
Enabling the Query Retry Mechanism

You can enable this feature at cluster and session levels by using the corresponding properties:

At the cluster level: Override retry.autoRetry=true in the Presto cluster overrides. On the Presto Cluster UI, you can override a cluster property under Advanced Configuration > PRESTO SETTINGS > Override Presto Configuration.
At the session level: Set auto_retry=true in the specific query’s session.

Note

The session property is more useful as an option to disable the retry feature at query level when autoRetry is enabled at the cluster level.
Configuring the Query Retry Mechanism

You can configure these parameters:

retrier.max-wait-time-local-memory-exceeded: It is the maximum time to wait for Presto to give up on retrying while waiting for new nodes to join the cluster, if the query has failed with the LocalMemoryExceeded error. Its value is configured in seconds or minutes. For example, its value can be 2s, or 2m, and so on. Its default value is 5m. If a new node does not join the cluster within this time period, Qubole returns the original query failure response.
retrier.max-wait-time-node-loss: It is the maximum time to wait for Presto to give up on retrying while waiting for new nodes to join the cluster if the query has failed due to the Spot node loss. Its value is configured in seconds or minutes. For example, its value can be 2s, or 2m, and so on. Its default value is 3m. If a new node does not join the cluster within this configured time period, the failed query is retried on the smaller-sized cluster.
retry.nodeLostErrors: It is a comma-separated list of Presto errors (in a string form) that signify the node loss. The default value of this property is "REMOTE_HOST_GONE","TOO_MANY_REQUESTS_FAILED","PAGE_TRANSPORT_TIMEOUT".
Understanding the Query Retry Mechanism

The query retries can occur multiple times. By default, three retries can occur if all conditions are met. The conditions on which the retries happen are:

The error is retryable. Currently, LocalMemoryExceeded and node loss errors: REMOTE_HOST_GONE, TOO_MANY_REQUESTS_FAILED, and PAGE_TRANSPORT_TIMEOUT are considered retryable. This list of node loss errors is configurable using the retry.nodeLostErrors property.
Only INSERT OVERWRITE DIRECTORY queries are considered retryable.
The actions of a failed query are rolled back successfully. If the rollback fails or if Qubole times out waiting, then it does not retry it.
A failed query has a chance to succeed if retried:
For the LocalMemoryExceeded error: The query has a chance to succeed if the current number of workers is greater than the number of workers handling the Aggregation stage. If Qubole times out waiting to get to this state, it does not retry.
For the node loss errors: The query has a chance to succeed if the current number of workers is greater than or equal to the number of workers that the query ran on earlier. If Qubole times out waiting to get to this state, it goes ahead with the retry as the query may still pass in the smaller-sized cluster."
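Reading the description above, the wait-then-decide behaviour might look roughly like the sketch below (not from the quoted docs). The two timeouts mirror retrier.max-wait-time-local-memory-exceeded and retrier.max-wait-time-node-loss; everything else, including the class name and the worker-count supplier, is assumed for illustration.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.IntSupplier;

// Illustrative sketch of the wait-then-retry behaviour described in the Qubole docs above.
public class RetryWaiter
{
    // default timeouts taken from the quoted documentation
    private static final Duration MAX_WAIT_LOCAL_MEMORY_EXCEEDED = Duration.ofMinutes(5);
    private static final Duration MAX_WAIT_NODE_LOSS = Duration.ofMinutes(3);

    /**
     * Returns true if the failed query should be retried now. For LocalMemoryExceeded,
     * give up entirely if enough workers never arrive; for node loss, retry anyway on
     * the smaller cluster once the wait expires.
     */
    public static boolean waitAndDecide(boolean localMemoryExceeded, int requiredWorkers, IntSupplier currentWorkers)
            throws InterruptedException
    {
        Duration maxWait = localMemoryExceeded ? MAX_WAIT_LOCAL_MEMORY_EXCEEDED : MAX_WAIT_NODE_LOSS;
        Instant deadline = Instant.now().plus(maxWait);
        while (Instant.now().isBefore(deadline)) {
            if (currentWorkers.getAsInt() >= requiredWorkers) {
                return true; // cluster is large enough again, the retry has a chance
            }
            Thread.sleep(5_000); // poll the cluster size
        }
        // timed out: LocalMemoryExceeded -> surface the original failure; node loss -> retry anyway
        return !localMemoryExceeded;
    }
}
```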

"Smart Query Retries

Even with the Spot termination handling described above, there are still chances for queries to fail due to Spot loss. This loss can occur due to intermediate stages taking a long time, or even leaf stages taking a long time and the node terminating before completion of processing.

For all such cases we need a query retry mechanism that is smart enough to decide if the query can be retried and, if so, determine the right time to retry the query to improve its chances of success. To this end, we have built the Smart Query Retry Mechanism inside Presto Server that fulfills the following requirements:

Query retries should be transparent to clients and work with all Presto clients: Java client, JDBC Drivers, Ruby client, etc.
The retry should happen only if it is guaranteed that no partial results have been sent to the client.
The retry should happen only if changes (if any) caused by the failed query have been rolled back (e.g. in the case of Inserts, data written by the failed query has been deleted).
The retry should happen only if there is a chance of successful query execution.
We made the Presto server responsible for query retries as opposed to clients for two reasons:

The Presto server offers a single place to have retry logic rather than replicating it in each client.
The server has much better insights into the state of query that is needed for a few of the requirements described below.
In the following sections, we will see how the aforementioned requirements are met in implementation.

  1. Transparent Retries

Transparent retries imply that no changes should be required on the client side for retries to work. All Presto clients submit the query to the server and then poll for status in a loop until the query completes. The Presto server will internally retry the query as a new query in case of failure, but the client should continue with its polling uninterrupted and eventually should get the results from the new query.

The trick to make this work is to change the next_uri field in the Presto server’s response to a client poll request. The ‘next_uri’ field tells the client what URI to poll in the next iteration to get the status of query. Changing this field to point to the new query instead of the original query redirects the client to the new query without any changes to the client.
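(Not part of the quoted post.) A minimal sketch of that next_uri trick, assuming a hypothetical registry that maps a failed query id to its replacement; none of these names, nor the URI path shown, are from the actual Presto codebase.

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: rewrite the polling URI so the client transparently follows the retried query.
public class NextUriRedirector
{
    // failed query id -> id of the query submitted as its retry
    private final Map<String, String> retriedQueries = new ConcurrentHashMap<>();

    public void recordRetry(String failedQueryId, String newQueryId)
    {
        retriedQueries.put(failedQueryId, newQueryId);
    }

    /**
     * Called when building the status response for a client poll. If the query was retried,
     * point next_uri at the new query; otherwise keep the normal URI.
     */
    public URI nextUri(String queryId, URI normalNextUri, URI baseUri)
    {
        String replacement = retriedQueries.get(queryId);
        if (replacement == null) {
            return normalNextUri;
        }
        // client keeps polling as usual, but is now following the retried query
        return baseUri.resolve("/v1/statement/" + replacement);
    }
}
```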

  2. Retry Only When No Partial Results Have Been Returned

As described earlier, Presto can make partial results available to clients while the query is still running due to pipelined execution. Retrying a query when partial results have already been processed by the client would lead to duplicate data in the results. This is a difficult problem to solve, as storing the state of results consumed by a client can be very difficult.
For this, the INSERT OVERWRITE DIRECTORY (IOD) feature of Qubole comes into play. In Qubole, all user queries are converted into IOD queries and write their data to a directory. Clients get the results only after the query has completely finished. This gives us the flexibility to delete partially written results in the directory and start over for the retry, ensuring no duplicate results will be returned to the client.

In the future, retries will be extended to SELECT queries provided the query has not returned any results before failing.

  3. Retry Only If Rollback Is Successful

Blind retry of a failed insert query can lead to duplicate results (as described above), as some partial results may have been written before the query failed. To solve this issue, the Query Retry Mechanism tracks the rollback status of the query and only initiates a retry once the rollback is complete. Only the Presto server is privy to this information about the rollback status as well as information about the retry mechanism, which also sits inside the server.

  4. Avoid Unnecessary Retries

Retrying in cases where the query would fail again will only lead to resource wastage. For example, retrying a query after the cluster size has shrunk from a failed query due to resource shortage would definitely lead to additional failed queries. The retry mechanism is made aware of the query failure codes, the cluster state in which the query failed, and the current cluster state at which the retry is being considered. Using this information, the mechanism can make informed decisions on whether to retry the query or not."

https://cdn.oreillystatic.com/en/assets/1/event/290/Cost-effective%20Presto%20on%20AWS%20with%20Spot%20nodes%20Presentation.pptx

tooptoop4 (Contributor, Author) commented:

prestodb/presto#15857 has a fix

tooptoop4 (Contributor, Author) commented:

related to #9101

findepi changed the title from "fault tolerance - config to retry entire query" to "Query resilience to hardware errors" on Sep 3, 2021
findepi mentioned this issue on Sep 3, 2021
findepi added the enhancement (New feature or request) and roadmap (Top level issues for major efforts in the project) labels on Sep 3, 2021
findepi changed the title from "Query resilience to hardware errors" to "Query resilience to failures" on Sep 3, 2021
arhimondr (Contributor) commented:
@tooptoop4 I would love to create a parent issue for failure recovery that encompasses everything we are designing and working on in that direction. Ideally the issue summary would be well structured, with an overview and a list of work items. Do you mind if I close this issue in favor of #9101, which I can edit to provide more details?

findepi (Member) commented Nov 12, 2021

Closed in favor of #9101.
