
backupccl: avoid hanging forever on intents #51624

Merged: 1 commit merged into cockroachdb:master on Aug 9, 2020

Conversation

@dt dt (Member) commented Jul 20, 2020

BACKUP, like any read, cannot export a range if it contains an unresolved intent.
If we hit an intent during iteration, we typically return a WriteIntentError like
any scan and let the store resolve it and retry the request before replying to the
sender.

However, blocking on intent resolution means a single txn left open can cause backups
to hang, sometimes for hours or even days in some observed cases. While operators
typically want bulk, background jobs like BACKUP to have as little impact on their
foreground traffic as possible, even more importantly they want their BACKUPs to
actually complete to maintain their RPO SLA. So while you typically would not want
a BACKUP to immediately run at high priority and abort any transactions from OLTP
foreground traffic as soon as it starts, you do, eventually, want it to do whatever
it needs to do to finish.

Backdating BACKUPs with AS OF SYSTEM TIME '-30s' is a good best practice to reduce
the number of still-running transactions they may encounter, but an abandoned or very
long-running transaction can still hang such a backup. This change extends BACKUP to
initially ask individual ranges to back up but bail out if they encounter an
intent, rather than attempting to resolve it and holding up the whole BACKUP while
doing so. If a range replies that it bailed out, its span is put at the back of the queue
to come back to later, and BACKUP continues to work on other ranges, in the hope that
by the time it is done with them, the intent may already have been resolved.

When it comes back to a range that it has already attempted to back up the limit
number of times (default 3), it instead asks the range to back up without bailing out
on intents, falling back to the usual behavior of trying to resolve them. It additionally
sets the request priority to high, so that during that resolution other transactions
are aborted and the backup can finish.

Release note (enterprise change): BACKUP takes priority over other transactions if its initial attempts to export a range fail.
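
To make the flow above concrete, here is a minimal sketch of the described retry loop. The names (spanAttempt, exportFn, exportAllSpans, maxPlainAttempts) are illustrative only; this is not the PR's actual code.

```
import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// spanAttempt tracks how many times we have tried to export a span.
type spanAttempt struct {
	span     roachpb.Span
	attempts int
}

// maxPlainAttempts mirrors the "default 3" limit described above.
const maxPlainAttempts = 3

// exportFn stands in for sending an ExportRequest for one span; it reports
// whether the range bailed out on an intent instead of exporting.
type exportFn func(ctx context.Context, sp roachpb.Span, failOnIntent, highPriority bool) (hitIntent bool, err error)

func exportAllSpans(ctx context.Context, export exportFn, spans []roachpb.Span) error {
	queue := make([]spanAttempt, 0, len(spans))
	for _, sp := range spans {
		queue = append(queue, spanAttempt{span: sp})
	}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		// Until the attempt limit is reached, ask the range to bail out on
		// intents rather than resolving them; after that, run at high
		// priority and let the usual resolution path push conflicting txns.
		failOnIntent := cur.attempts < maxPlainAttempts
		hitIntent, err := export(ctx, cur.span, failOnIntent, !failOnIntent)
		if err != nil {
			return err
		}
		if hitIntent {
			cur.attempts++
			queue = append(queue, cur) // come back to this span later
		}
	}
	return nil
}
```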

@dt dt requested review from pbardea, ajwerner and a team July 20, 2020 21:54
@cockroach-teamcity (Member)

This change is Reviewable

@dt dt (Member, Author) commented Jul 20, 2020

I'm still working on the test for this, but wanted to get it out for an early look in the meantime.

@miretskiy miretskiy (Contributor) left a comment

I like it very much. :lgtm_strong:

I can take another look when you get tests going (though, I assume, it shouldn't be that bad)

Reviewed 5 of 7 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @dt, and @pbardea)


pkg/ccl/backupccl/backup_processor.go, line 180 at r1 (raw file):

					header.UserPriority = roachpb.MaxUserPriority
				}
				log.Infof(ctx, "sending ExportRequest for span %s (attempt %d, priority %d)", span.span, span.attempts+1, header.UserPriority.String())

wrap very long line(s)?


pkg/ccl/backupccl/backup_processor.go, line 183 at r1 (raw file):

				if pErr != nil {
					if err := pErr.Detail.GetTransactionRetry(); err != nil && err.Reason == roachpb.RETRY_REASON_UNKNOWN {

Should this check be moved & encapsulated in some sort of utility function?
Perhaps move it to export.go -- where it is used?

@dt dt (Member, Author) left a comment

I was making the test much harder than it needed to be with multiple goroutines, but realized all we need is 4 things: begin, update, backup (with timeout), commit. Added a nice simple test.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @miretskiy, and @pbardea)


pkg/ccl/backupccl/backup_processor.go, line 180 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

wrap very long line(s)?

Done.


pkg/ccl/backupccl/backup_processor.go, line 183 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…
				if pErr != nil {
					if err := pErr.Detail.GetTransactionRetry(); err != nil && err.Reason == roachpb.RETRY_REASON_UNKNOWN {

Should this check be moved & encapsulated in some sort of utility function?
Perhaps move it to export.go -- where it is used?

eh, I realized it could just be simplified in place.
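
For reference, the kind of standalone helper being suggested might have looked roughly like this, built around the check quoted above. The name isBailOnIntentErr is hypothetical, and as noted the PR kept the check inline instead.

```
import "github.com/cockroachdb/cockroach/pkg/roachpb"

// isBailOnIntentErr reports whether pErr is the synthetic retry error an
// ExportRequest returns when it bails out on an intent instead of resolving
// it (the check quoted above, from an early revision of this PR).
func isBailOnIntentErr(pErr *roachpb.Error) bool {
	if pErr == nil {
		return false
	}
	retryErr := pErr.Detail.GetTransactionRetry()
	return retryErr != nil && retryErr.Reason == roachpb.RETRY_REASON_UNKNOWN
}
```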

@ajwerner ajwerner (Contributor) left a comment

:lgtm: with some nits. I'd like somebody from KV to chime in on the Reason.

Reviewed 1 of 7 files at r1, 3 of 4 files at r2.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @ajwerner, @dt, @miretskiy, and @pbardea)


pkg/ccl/backupccl/backup_processor.go, line 184 at r2 (raw file):

				rawRes, pErr := kv.SendWrappedWith(ctx, flowCtx.Cfg.DB.NonTransactionalSender(), header, req)
				if pErr != nil {
					if err := pErr.Detail.GetTransactionRetry(); err != nil {

I'd love to see a brief comment here pointing at the FailOnIntents field of the request and how it led to this retry error.


pkg/ccl/backupccl/backup_test.go, line 4367 at r2 (raw file):

	// observe that the backup aborted our txn.
	require.Error(t, tx.Commit())

nit: pick this error apart and validate that it has the right code inside.
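
A hedged sketch of that validation, assuming the abort surfaces to the SQL client as a lib/pq error with the serialization_failure SQLSTATE (40001); the exact error shape and the helper name requireTxnAborted are assumptions, not taken from the PR.

```
import (
	"errors"
	"testing"

	"github.com/lib/pq"
	"github.com/stretchr/testify/require"
)

// requireTxnAborted asserts that committing the conflicting transaction fails
// with SQLSTATE 40001, i.e. it was pushed aside by the high-priority backup.
func requireTxnAborted(t *testing.T, commitErr error) {
	t.Helper()
	require.Error(t, commitErr)
	var pqErr *pq.Error
	require.True(t, errors.As(commitErr, &pqErr))
	require.Equal(t, pq.ErrorCode("40001"), pqErr.Code)
}
```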


pkg/ccl/storageccl/export.go, line 160 at r2 (raw file):

		if err != nil {
			if err, ok := err.(*roachpb.WriteIntentError); ok && args.FailOnIntents {
				wrapped := roachpb.NewTransactionRetryError(roachpb.RETRY_REASON_UNKNOWN, err.Error())

I'd support creating a new retry reason though I'd like to have somebody on the KV team aware and supportive of that too and so I'm nominating @andreimatei to help decide if that's a good idea.

@miretskiy miretskiy (Contributor) left a comment

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @dt and @pbardea)


pkg/ccl/backupccl/backup_test.go, line 4359 at r2 (raw file):

	// limit how long we'll try to backup so if it hangs we find out.
	timeoutCtx, fn := context.WithTimeout(ctx, time.Second*10)

perhaps a bit higher? particularly w/ stressrace it might take longer?
Or, just disable under stress/race?

@nvanbenschoten nvanbenschoten (Member) left a comment

I'd like somebody from KV to chime in on the Reason

I'd encourage us to avoid a TransactionRetryError here. We're not running in a transaction so it's entirely unexpected for a TransactionRetryError to be returned. I'm actually surprised this doesn't blow up on the client when passing through a non-transactional sender. TransactionRetryError has a specific meaning, it's not just a generic RetryError, so if we're going to go with this approach, introducing a new error type would be preferable.

That said, I still don't think I fully understand the approach here and what it's optimizing for. We discussed on slack the alternative of simply running all BACKUPS as high-priority, which avoids all of this complexity and is much easier to think about. Of course, this could be disruptive if users don't run BACKUP with AS OF SYSTEM TIME '-30s', which David mentioned was "a good best practice" but I assume is not something we actually expect most users to be doing (right?). But I don't actually think the approach we have here is guaranteed to be any less disruptive. We retry 3 times at normal priority before upgrading to high priority, but there's no lower bound on how long those retries will take or how long we'll wait between retries. So it's still possible given a small enough backup that we end up pushing transactions that are arbitrarily up-to-date even with all of this added complexity. Don't we want a hard guarantee that we don't disrupt transactions below some threshold of staleness?

I wonder if there's any way to salvage an "always backup with high priority" approach. If so, I think that's worth exploring. For instance, what if we just ... waited until the backup timestamp was more than 10 seconds stale before sending the first Export request? For users using the best practice of running BACKUP with AS OF SYSTEM TIME, this would be free. For users not doing this, this would place a 10-second penalty on the latency of a backup. Unless we're optimizing for backup speed on very small tables, which I don't think we are, this penalty is likely completely negligible. And even then, follow the best practices as a user and you'll be fine.
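
A minimal sketch of the "wait until the backup timestamp is stale enough, then always read at high priority" alternative being floated here; readAsOf, minStaleness, and waitUntilStale are illustrative names, not from the PR.

```
import (
	"context"
	"time"
)

// waitUntilStale blocks until the backup's read timestamp is at least
// minStaleness in the past, so high-priority reads only ever push
// transactions older than that threshold.
func waitUntilStale(ctx context.Context, readAsOf time.Time, minStaleness time.Duration) error {
	if wait := minStaleness - time.Since(readAsOf); wait > 0 {
		select {
		case <-time.After(wait):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return nil
}
```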

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @dt and @pbardea)

@dt dt (Member, Author) commented Jul 21, 2020

I don't particularly like slapping a blanket latency penalty on every non-backdated backup: it unnecessarily penalizes backups on tables that weren't seeing any traffic, and it isn't even clear what the right amount would be -- 1s might be appropriate in some scenarios but 30s in others, depending on the workload we're trying to accommodate. I'd much rather only penalize a backup that actually hits an intent. So to do that, we'd still need a way to signal "I hit an intent" back to the sender.

This change attempts to use the fact that we usually have other work we can do in the meantime, instead of just sitting on our hands and doing nothing while we wait. As you point out, for very small backups we may get back to the range quickly and thus run at high priority quite soon, aborting foreground traffic -- I should have included a TODO documenting our plan there: we're thinking that on subsequent retries we'd wait until the back-date is something like priorAttempts * 10s -- so if we're backdated by 10s, the initial attempt goes out immediately, as does the first retry, but we'd then impose that 10s and then 20s latency penalty on retries 2 and then 3.

Another thing that motivated this approach, and I also should have left a TODO about it, is that we're trying to address a constant frustration with many bulk operations: "why is this taking so long?" is often hard to answer. Right now, if we're stuck waiting on an intent, there's no indication of that -- we just see the request and the overall BACKUP taking longer / not making progress. Informing the job that we hit an intent lets us send that information back to the coordinator, so we can then surface that status, e.g. something like "waiting on uncommitted transactions in 6 ranges". That in turn gives the operator a hint that maybe they need to back-date more, or maybe they need to look for some lost/stuck txns.

I don't particularly like just slapping priority high on all backups without an imposed backdate either, since impact on foreground traffic is one of the top concerns we hear every time we talk to production operators.

@pbardea pbardea (Contributor) left a comment

Reviewed 1 of 4 files at r2.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @dt)


pkg/ccl/backupccl/backup_processor.go, line 42 at r2 (raw file):

	"kv.bulk_io_read.backup_priority_attempt_limit",
	"number of reads BACKUP should attempt before using high priority",
	3,

Alternatively (or maybe in addition) to having a backoff for every retry, I wonder if it would be useful to have a knob where the user could control how much time should pass before we start using high-priority transactions.

This would work similarly to defaulting to AOST -$cluster_setting seconds, but would be enforced by default as a cluster setting. Also, instead of reading in the past it would read now, but wait the period of the cluster setting if it gets blocked. This may help control the tradeoff between "I don't want my backups affecting my foreground traffic" and "But if my backup is taking really long I can sacrifice some retries on already long transactions".

@dt dt force-pushed the backup-hang branch 3 times, most recently from b576652 to dc427c4 on July 23, 2020 03:26
@dt dt (Member, Author) commented Jul 23, 2020

Okay, switched TransactionRetryError to the existing UnhandledRetryableError which is what we already return to callers when wrapping other retryable errors. I went ahead and added the time-based priority escalation instead of the simple count, and the min delay between tries.
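
Roughly, the time-based escalation and per-attempt delay described here can be pictured like this. A simplified sketch with illustrative names; the actual thresholds come from the bulkio.backup.read_with_priority_after and bulkio.backup.read_retry_delay settings discussed below.

```
import "time"

// planAttempt decides, for the next try at a span, whether to read at high
// priority and how far past the read-as-of time to delay before sending.
// Both thresholds are measured from the backup's read-as-of time.
func planAttempt(
	readAsOf time.Time, priorAttempts int, priorityAfter, retryDelay time.Duration,
) (highPriority bool, notBefore time.Time) {
	// Escalate to high priority once the read timestamp is old enough that
	// transactions we might push have had a fair chance to finish.
	highPriority = time.Since(readAsOf) > priorityAfter
	// Each prior attempt pushes the next try further past the read-as-of time.
	notBefore = readAsOf.Add(time.Duration(priorAttempts) * retryDelay)
	return highPriority, notBefore
}
```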

@miretskiy miretskiy requested review from ajwerner and pbardea July 23, 2020 12:19
@miretskiy miretskiy (Contributor) left a comment

:lgtm:

Reviewed 7 of 13 files at r3, 4 of 8 files at r4.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @dt, @miretskiy, and @pbardea)


pkg/ccl/backupccl/backup_processor.go, line 187 at r4 (raw file):

				// check to see if it is time to switch to priority.
				if !priority && span.attempts > 0 {
					// Check if this is starting a new pass and we should delay first.

want to expand the comment to explain why it's okay to wait (everything else is lower priority anyway)?


pkg/roachpb/api.proto, line 1433 at r4 (raw file):

  int64 target_file_size = 10;

  // FailOnIntents causes the request to return a TransactionRetryError if it

s/TransactionRetryError/UnhandledRetryeError/?

@miretskiy miretskiy self-requested a review July 23, 2020 12:19
@dt dt requested a review from nvanbenschoten July 27, 2020 21:42
@dt dt force-pushed the backup-hang branch 2 times, most recently from 32cd26f to 19ead9f on July 27, 2020 21:54
@nvanbenschoten nvanbenschoten (Member) left a comment

The mechanics in the second commit now :lgtm:. Thanks for adding in the read_with_priority_after window.

The remaining question I have is whether we want to repurpose this error type.

Reviewed 2 of 13 files at r3, 8 of 8 files at r5, 8 of 8 files at r6.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (and 1 stale) (waiting on @dt)


pkg/ccl/backupccl/backup_processor.go, line 42 at r6 (raw file):

	)
	priorityAfter = settings.RegisterNonNegativeDurationSetting(
		"bulkio.backup.read_with_priority_after",

s/priority/high_priority/ here and in the sentence below.


pkg/ccl/backupccl/backup_processor.go, line 49 at r6 (raw file):

		"bulkio.backup.read_retry_delay",
		"amount of time since the read-as-of time, per-prior attempt, to wait before making another attempt",
		time.Second*5,

nit: 5*time.Second


pkg/ccl/backupccl/backup_processor.go, line 223 at r6 (raw file):

				if pErr != nil {
					if err := pErr.Detail.GetUnhandledRetry(); err != nil {
						span.lastTried = timeutil.Now()

Do we want to use the HLC clock here and talk in terms of HLC timestamps?


pkg/ccl/storageccl/export.go, line 160 at r6 (raw file):

		if err != nil {
			if err, ok := err.(*roachpb.WriteIntentError); ok && args.FailOnIntents {
				wrapped := &roachpb.UnhandledRetryableError{}

The fact that we're not wrapping an error is an indication that we don't want to repurpose this error type, but I'll still defer to @andreimatei on that.


pkg/roachpb/errors.proto, line 518 at r5 (raw file):

    LeaseRejectedError lease_rejected = 13;
    NodeUnavailableError node_unavailable = 14;
    UnhandledRetryableError unhandled_retry =  40;

@andreimatei do you mind taking a look at this? My intuition is that we don't want to repurpose this error type and certainly don't want to add it to this union because its purpose is to wrap these errors in this union when converted into a Go error (see Error.GoError). Adding it to the union forces us to ask the question of what an UnhandledRetryableError{Err: UnhandledRetryableError{...}} means. I think you have a clearer picture of what the vision of this error type was though.

@dt dt (Member, Author) left a comment

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei and @dt)


pkg/ccl/storageccl/export.go, line 160 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

The fact that we're not wrapping an error is an indication that we don't want to repurpose this error type, but I'll still defer to @andreimatei on that.

we are wrapping an error, aren't we? I set PErr on the line below to the actual WriteIntentError.


pkg/roachpb/errors.proto, line 518 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

@andreimatei do you mind taking a look at this? My intuition is that we don't want to repurpose this error type and certainly don't want to add it to this union because its purpose is to wrap these errors in this union when converted into a Go error (see Error.GoError). Adding it to the union forces us to ask the question of what an UnhandledRetryableError{Err: UnhandledRetryableError{...}} means. I think you have a clearer picture of what the vision of this error type was though.

I'm also eager to learn more about the history and motivation for this error type, but my brief spelunking made me think it was appropriate for this use -- as I understood it from the comments, it is a type of error that callers should expect to see when they're returned a retryable error, wrapping some underlying error that explains why they're being asked to retry.

By my read, that is exactly what I want here -- my caller to see a retryable error, wrapping the underlying (WriteIntent) error. What I believe is different about my usage, and why I wanted to put it in the union, is that I wanted to give the server the choice to wrap it before even replying.

I don't expect distsender to wrap an error that is already UnhandledRetryableError in another layer of UnhandledRetryableError, so I don't think we're any more forced to consider what UnhandledRetryableError(UnhandledRetryableError(x)) means than we were before.

@dt dt requested a review from nvanbenschoten July 30, 2020 03:52
@dt dt requested a review from andreimatei July 30, 2020 03:52
@andreimatei andreimatei (Contributor) left a comment

Is it just intents that exports need to worry about, or is it also unreplicated locks? Unless there's something special about ExportRequests, those should block you just as much. If so, I think you want to expand the handling of the new flag and consider it when evaluation looks through the lock table (not sure where exactly, but I can dig or we can ask Nathan). And even intents, I think, can be discovered through the lock table, not only through scans, depending on whether the Export is the first guy to run into a specific intent, or whether other requests ran into it recently. I'm sorry if I'm off.

Another thought I have is that I think it's pretty brutal how the patch currently hides the WriteIntentError. Part of handling that error is putting the intent into the lock table. I think it'd be nice to structure the code such that that part still happens, even if the request then doesn't actually wait for the intent to be resolved. It's not a big deal if Nathan didn't feel the same, but I'm surprised he didn't. This dove-tails with my other comment about how maybe we don't want to wrap the WriteIntentError in anything.

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @dt, and @nvanbenschoten)


pkg/ccl/backupccl/backup_processor.go, line 218 at r6 (raw file):

					req.FailOnIntents = true
				}
				log.Infof(ctx, "sending ExportRequest for span %s (attempt %d, priority %s)",

did you want to leave this at Info level?


pkg/roachpb/api.proto, line 1433 at r6 (raw file):

  int64 target_file_size = 10;

  // FailOnIntents causes the request to return a UnhandledRetryeError if it

UnhandledRetry*e*Error


pkg/roachpb/errors.proto, line 518 at r5 (raw file):

As I understood it from the comments, it is a type of error that callers should expect to see when they're returned a retryable error, wrapping some underlying error that explains why they're being asked to retry.

Not really, depending on what "callers" we're talking about. Generally code under the TxnCoordSender deals with specific retriable errors (e.g. TransactionRetryError), and code above the TxnCoordSender deals with TransactionRetryWithProtoRefreshError (which corresponds to a "handled" retriable error - the name cracks me up every time).

I don't think UnhandledRetryableError belongs here. UnhandledRetryableError is used by DistSQL leaf transactions to communicate with the DistSQL root transaction (it is produced on kvclient side); it's not meant for communication between kvserver and kvclient. There's also this assertion here that verifies that generally the user of a transaction doesn't get this error.

log.Fatalf(ctx, "unexpected UnhandledRetryableError at the txn.exec() level: %s", err)

I also don't think that UnhandledRetryableError should ever wrap a WriteIntentError; a write intent is not retriable in the same sense as the errors that implement transactionRestartError - those are the retriable ones.

If ExportRequest needs to return some new error type, then I'd much rather we introduce a brand new error type in this enum. But I think a better idea might be to just return WriteIntentError to the client. In fact, have you considered making fail_on_intents a field on the RequestHeader? It could apply in the same way to any request - it would inhibit the call to handleWriteIntentError() in replica.executeBatchWithConcurrencyRetries(). FWIW, if we were to do that, we'd change the character of WriteIntentError a bit - I think currently this error is never returned from the server to the client, and is only defined as a proto because we had the bad idea to standardize this Sender interface that returns pErr and use it both on the client side and on the server side. I'm suggesting we start returning WriteIntentError to the client - but that makes sense to me.

@nvanbenschoten nvanbenschoten (Member) commented

If so, I think you want to expand the handling of the new flag and consider it when evaluation looks through the lock table (not sure where exactly, but I can dig or we can ask Nathan). And even intents, I think, can be discovered through the lock table, not only through scans, depending whether the Export is the first guy to run into a specific intent, or if other requests ran into it recently. I'm sorry if I'm off.

Great point, I completely missed this, but @andreimatei's correct. The lock table needs to be made aware of this flag as well. As is, this is only a partial solution.

This is a pretty strong indication that we should generalize this concept. Specifically, I think it's time to consider adding a WaitPolicy option to the BatchRequest header that will start off with two options, Block and Error. Block would be the default and requests would behave as they do today. Error would return a new form of error when it encounters any locks that it would need to block on. The lock table would make this easy to implement, we'd need just a single check here.

WaitPolicy.Error would be used for this PR and it would also be used to implement the SELECT ... FOR UPDATE NOWAIT option that is starting to come up in #40476. We already support the syntax for this, but don't support it in the KV layer:

// LockingWaitPolicy represents the possible policies for dealing with rows
// being locked by FOR UPDATE/SHARE clauses (i.e., it represents the NOWAIT
// and SKIP LOCKED options).
type LockingWaitPolicy byte

I'm a little less eager to support SELECT ... FOR UPDATE SKIP LOCKED because it's a complete violation of serializability, but that's a discussion we can have at a later point.

Anyway, @dt this would require a bit of KV work but then we could remove all of the pkg/kv changes from this PR. I'd be happy to work on this, but won't be able to get to it until later next week. Does that work for your timeline?

@dt dt (Member, Author) commented Jul 30, 2020

Anyway, @dt this would require a bit of KV work but then we could remove all of the pkg/kv changes from this PR. I'd be happy to work on this, but won't be able to get to it until later next week. Does that work for your timeline?

@nvanbenschoten More than happy to wait (har har) to rebase this if you think it's feasible that it'll be in the KV API in time to get this in for 20.2. A generalized WaitPolicy sounds like exactly what we want, and is on the path, I believe, to some of our other desires as well (e.g. limit how long we wait trying to contact unavailable ranges or other places we currently see hangs). Thanks!

@dt dt (Member, Author) commented Jul 30, 2020

Oh, I meant to say: I might suggest WaitPolicy be a message type with this enum nested in a field within it, so we can add other fields alongside it later (like my desired dial timeout)?

@nvanbenschoten nvanbenschoten (Member) commented

@dt I just pushed #52388, which should provide you with a better interface to build this PR on top of. You should be able to set a WaitPolicy: lock.WaitPolicy_Error on the ExportRequest's Batch.Header and catch WriteIntentErrors without any of the associated KV changes in this PR.

Oh, I meant to say: I might suggest WaitPolicy be a message type with this enum nested in a field within it, so we can add other fields alongside it later (like my desired dial timeout)?

I didn't end up going down this route. The nested enum approach is tricky to keep a) free in the common case and b) backwards compatible when expressed as protobufs. I'm also not convinced that it's needed. The idea of a dial timeout is well out of scope for this work and I don't think it would be a good idea to conflate it with what we're talking about here. The lock.WaitPolicy is very specifically talking about behavior under contention and when encountering conflicting locks. It's analogous to the "locking clause options" in Postgres: https://www.postgresql.org/docs/current/sql-select.html.
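
A hedged sketch of what the backup-side usage might look like on top of #52388, assuming the batch header field is named WaitPolicy and the enum lives in the concurrency lock package; the exportSpan helper is illustrative, not the PR's final code.

```
import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/kv"
	"github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/lock"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// exportSpan sends one ExportRequest. At normal priority it asks KV to error
// out on conflicting locks instead of blocking; a WriteIntentError then means
// "requeue this span and come back later".
func exportSpan(
	ctx context.Context, db *kv.DB, req *roachpb.ExportRequest, highPriority bool,
) (retryLater bool, _ error) {
	var header roachpb.Header
	if highPriority {
		header.UserPriority = roachpb.MaxUserPriority
	} else {
		header.WaitPolicy = lock.WaitPolicy_Error
	}
	_, pErr := kv.SendWrappedWith(ctx, db.NonTransactionalSender(), header, req)
	if pErr != nil {
		if _, ok := pErr.GetDetail().(*roachpb.WriteIntentError); ok && !highPriority {
			return true, nil // conflicting lock: try this span again later
		}
		return false, pErr.GoError()
	}
	return false, nil
}
```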

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 5, 2020
Informs cockroachdb#40476.
Informs cockroachdb#51624.

This commit introduces the concept of request "wait policies". A
WaitPolicy specifies the behavior of a request when it encounters
conflicting locks held by other active transactions. The default
behavior is to block until the conflicting lock is released, but other
policies can make sense in special situations.

Within this new formalization, the commit creates two initial wait
policy variants:
```
// Block indicates that if a request encounters a conflicting lock held by
// another active transaction, it should wait for the conflicting lock to be
// released before proceeding.
Block = 0;

// Error indicates that if a request encounters a conflicting lock held by
// another active transaction, it should raise an error instead of blocking.
Error = 1;
```

`Block` is equivalent to the current behavior, so there's no further
action needed for that variant.

However, the `Error` policy is new, and this commit teaches the
`lockTableWaiter` about it. It ensures that when a request with the
`Error` wait policy encounters a conflicting lock, it uses a PUSH_TOUCH
PushRequest to determine whether the lock is abandoned or whether its
holder is still active. If the holder is abandoned, the request cleans
up the lock and proceeds. If the holder is still active, a
WriteIntentError is returned to the client.

This will unblock both of the referenced issues.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 5, 2020
@dt dt (Member, Author) commented Aug 5, 2020

Thanks for jumping on this @nvanbenschoten !

I've rebased this PR on yours and am happy to see it working exactly as expected. I'll rebase again after yours merges but otherwise I think this is all set for a final review from bulk-folks just looking at the second commit.

@nvanbenschoten just to clarify my API suggestion: I certainly agree anything beyond error-vs-block is out of scope for now -- I was just thinking that if we add more details later, like a dial timeout or an alternative block-up-to-limit versus just block-or-error, then it might be nice to have a message type wrapper to add them to. But I'm also completely happy with this as-is -- it was just a suggestion.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 6, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 7, 2020
craig bot pushed a commit that referenced this pull request Aug 7, 2020
52388: kv: add WaitPolicy option to BatchRequest r=nvanbenschoten a=nvanbenschoten

Co-authored-by: Nathan VanBenschoten <[email protected]>
@nvanbenschoten nvanbenschoten (Member) commented

I'll rebase again after yours merges but otherwise I think this is all set for a final review from bulk-folks just looking at the second commit.

Just merged, so you should be all set!

BACKUP, like any read, cannot export a range if it contains an unresolved intent.
If we hit an intent during iteration, we typically return a WriteIntentError like
any scan and let the store resolve it and retry the request before replying to the
sender.

However, blocking on intent resolution means a single txn left open can cause backups
to hang, sometimes for hours or even days in some observed cases. While operators
typically want bulk, background jobs like BACKUP to have as little impact on their
foreground traffic as possible, even more importantly they want their BACKUPs to
actually complete to maintain their RPO SLA. So while you typically would not want
a BACKUP to immediately run at high priority and abort any transactions from OLTP
foreground traffic as soon as it starts, you do, eventually, want it to do whatever
it needs to do to finish.

Backdating BACKUPs with AS OF SYSTEM TIME '-30s' is a good best practice to reduce
the number of still-running transactions they may encounter, but an abandoned or very
long-running transaction can still hang such a backup. This change extends BACKUP to
initially ask individual ranges to back up but bail out if they encounter another
transaction's intent, rather than attempting to resolve it and holding up the whole
BACKUP while doing so. If a range replies that it bailed out, its span is put at the
back of the queue to come back to later, and BACKUP continues to work on other ranges
in the hope that by the time it is done with them, the intent may already have been resolved.

When it comes back to a range that it has already attempted to back up, and the time
since the backup's read time is greater than a configurable limit, it instead sets the
priority to high, meaning any transaction encountered will be pushed so the backup can complete.

Release note (enterprise change): BACKUP takes priority over other transactions if its initial attempts to export a range fail.
@dt dt (Member, Author) commented Aug 9, 2020

Thanks @nvanbenschoten, working like a charm!

And I realized the bulk-reviewers already LGTM'ed so I guess this is now all good to go.

bors r+

@craig craig bot (Contributor) commented Aug 9, 2020

Build succeeded:

@craig craig bot merged commit fdf54eb into cockroachdb:master Aug 9, 2020
@dt dt deleted the backup-hang branch August 11, 2020 02:02
vxio pushed a commit to vxio/cockroach that referenced this pull request Aug 11, 2020