perf: investigate loadgen/kv slowdown #15797

Closed
tbg opened this issue May 9, 2017 · 18 comments

tbg commented May 9, 2017

From cockroachdb/loadgen#50:

Something odd I've noticed is that this:

```
go run main.go --cycle-length 1 --concurrency 8
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p95(ms)__p99(ms)_pMax(ms)
      1s        0          344.6          344.6    100.7    134.2    285.2
      2s        0          319.2          331.9    104.9    209.7    285.2
      3s        0          325.0          329.6     96.5    184.5    251.7
      4s        0          322.9          327.9    117.4    167.8    209.7
```

is much slower than this:

```
go run main.go --cycle-length 10 --concurrency 8
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p95(ms)__p99(ms)_pMax(ms)
      1s        0         3165.0         3164.9      5.5      7.9     10.5
      2s        0         3194.6         3179.8      5.2      8.9     12.6
      3s        0         3093.7         3151.1      5.5      8.4     12.1
      4s        0         2807.6         3065.3      6.3     11.0     13.1
```

Unless I'm mistaken, the eight clients "never" intersect, so the only difference is that in the first example each of them is hitting its own single key, and in the latter its own eight keys. Perhaps there is more range parallelism in the latter, but you wouldn't expect it. The difference disappears with --concurrency=1.

@petermattis for triage.

tbg added the C-performance label (Perf of queries or internals; solution not expected to change functional behavior) on May 9, 2017

tbg commented May 9, 2017

@petermattis pointed out to me that the generators hash the sequence numbers verbatim and don't mix in a per-generator seed to make their hashes unique. So this is simply expected: one case has no overlapping writes, the other has only overlapping writes. Duh!
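
To make that concrete, here is a minimal, hypothetical sketch of the key-generation behavior being described (all names are invented; this is not the actual loadgen code): the key is derived purely from the shared sequence value with no per-worker seed mixed in, so with --cycle-length 1 every worker maps every operation to the same key.

```
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// nextKey is illustrative only: the key depends solely on the shared sequence
// value, so any two workers that draw the same sequence value write the same key.
func nextKey(seq, cycleLength uint64) uint64 {
	var buf [8]byte
	binary.LittleEndian.PutUint64(buf[:], seq%cycleLength)
	h := fnv.New64()
	h.Write(buf[:])
	return h.Sum64()
}

func main() {
	// --cycle-length 1: every op from every worker hits the same key.
	fmt.Println(nextKey(17, 1) == nextKey(42, 1)) // true
	// --cycle-length 10: writes spread over ten keys, so overlap is much rarer.
	fmt.Println(nextKey(17, 10) == nextKey(42, 10)) // false (different residues)
}
```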

tbg closed this as completed on May 9, 2017
petermattis (Collaborator) commented:

Contention vs no contention explains some of the difference between --cycle-length=1 and --cycle-length=10. But there is something else that deserves investigation:

```
~ kv --cycle-length=1 --concurrency=1 --duration 30s
...
_elapsed___errors____________ops___ops/sec(cum)__p95(ms)__p99(ms)_pMax(ms)
   30.0s        0          60413         2013.7      0.7      2.0     13.1

~ kv --cycle-length=1 --concurrency=2 --duration 30s
_elapsed___errors____________ops___ops/sec(cum)__p95(ms)__p99(ms)_pMax(ms)
   30.0s        0          78992         2632.8      3.0      4.2     28.3

~ kv --cycle-length=1 --concurrency=4 --duration 30s
_elapsed___errors____________ops___ops/sec(cum)__p95(ms)__p99(ms)_pMax(ms)
   30.0s        0          27628          920.9     12.6     46.1    260.0

~ kv --cycle-length=1 --concurrency=8 --duration 30s
_elapsed___errors____________ops___ops/sec(cum)__p95(ms)__p99(ms)_pMax(ms)
   30.0s        0          14067          468.9     39.8    318.8   2550.1
```

I would expect latencies to increase with higher concurrency, but throughput to remain relatively steady. Instead, throughput drops through the floor.

petermattis reopened this on May 9, 2017
petermattis added this to the 1.1 milestone on May 9, 2017
petermattis (Collaborator) commented:

Not sure why, but with --cycle-length=1 --concurrency=2 (or anything higher than 1) I'm seeing write-intent errors. Lots of them. This is curious because I would have expected this workload to produce only 1PC transactions, which do not leave intents.


petermattis commented May 10, 2017

Looks like what is happening is that two concurrent writers to the same key, each in a 1PC transaction, will eventually cause a WriteTooOldError because the transactions are performed in a different order than their transaction timestamps. When the WriteTooOldError is encountered, the 1PC evaluation is aborted and we fall back to the normal code path, which immediately encounters the WriteTooOldError again. In the normal code path this causes an intent to be laid down for the key and the transaction to be retried. At this point, the next transaction will encounter the intent, causing it to be kicked out of the 1PC code path as well, and performance is whack.

I think we could handle the WriteTooOldError in the 1PC code path and retry 1PC evaluation (after forwarding the transaction timestamp). Anyone see a reason this would be problematic?
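
As a rough, hypothetical sketch of that retry-in-place (all names are invented and this is not the actual evaluateTxnWriteBatch code; as the comments below explain, the idea is unsafe as-is because the bumped timestamp never re-checks the timestamp cache):

```
// Illustrative pseudocode only, not the actual cockroach code.
type writeTooOldError struct{ existingTS int64 }

func (e *writeTooOldError) Error() string { return "write too old" }

// evaluate1PC retries the one-phase-commit evaluation in place, forwarding the
// transaction timestamp past the conflicting write instead of falling back to
// the normal (intent-laying) code path.
func evaluate1PC(txnTS int64, eval func(ts int64) error) error {
	for {
		err := eval(txnTS)
		wto, ok := err.(*writeTooOldError)
		if !ok {
			return err // committed, or failed for some other reason
		}
		txnTS = wto.existingTS + 1
	}
}
```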


tbg commented May 10, 2017

It would bypass the timestamp cache. The "normal" txn path does this too, but it doesn't commit, so if there's a bug there it maybe doesn't manifest. For example, a Begin, DelRange [A,B), Put C, Commit in which only the Put C runs into the WriteTooOld would be bumped up to a higher timestamp, allowing a later Put to write into [A,B) in violation of the DelRange?

petermattis added a commit to petermattis/cockroach that referenced this issue May 10, 2017
As @tschottdorf points out, this likely isn't safe. Seems like we'd want
to retry the 1PC at a higher level so that we go through the timestamp
cache again.

Fixes cockroachdb#15797

bdarnell commented May 10, 2017

Retrying at a higher timestamp in evaluateTxnWriteBatch is unsafe because it ignores the timestamp cache, but it would be safe to simply return the WriteTooOldError instead of retrying on the 2PC path. Then the retry would be another 1PC at the newer timestamp (issued by the gateway), which would probably be faster than the 2PC retry. The catch is that because no intents are laid down, you could keep getting pushed further and further into the future; you'd have no ability to ratchet the transaction priority or hold a place in line. Under low contention it's better to retry as 1PC; under high contention it could lead to unbounded backoffs and retries.

@tschottdorf In your example, where is the timestamp moved after the timestamp cache has been checked? If I'm reading this correctly the 2PC retry uses the original Transaction's timestamp, not one influenced by a possible WriteTooOldError.

bdarnell (Contributor) commented:

The unbounded retry scenario could be mitigated by returning the WriteTooOldError if the transaction's epoch is less than N, and doing the 2PC retry for later epochs.
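
A minimal, hypothetical sketch of that epoch gate (the threshold and all names are invented, not actual cockroach code):

```
// For the first few epochs, return the WriteTooOldError so the gateway retries
// the batch as another cheap 1PC attempt at a newer timestamp; after that, fall
// back to the 2PC path, which lays down intents and lets the txn hold its place
// in line.
const maxOnePCEpochs = 3 // illustrative value of "N"

func retryAsOnePC(txnEpoch uint32) bool {
	return txnEpoch < maxOnePCEpochs
}
```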


tbg commented May 11, 2017

@bdarnell @petermattis' suggestion was:

I think we could handle the WriteTooOldError in the 1PC code path and retry 1PC evaluation (after forwarding the transaction timestamp). Anyone see a reason this would be problematic?

and that's also the code in https://github.com/cockroachdb/cockroach/pull/15863/files?w=1#diff-0f4e61b240de001f1b97a87fc31af043R4057.

That would mean that my example would do the following:

  • try at t=0, catch WriteTooOld(t=5), bump txn timestamp to t=6
  • try at t=6, ok (but no timestamp cache update)
  • other writer can write to [A,B) at t=1.

Ok, now that I wrote that I see that you're probably assuming that there's a problem with the current code - no, probably not, though we are laying down an intent at a higher timestamp in the 2PC code, no? At least if I remember correctly how WriteTooOld is handled:

  • write DeleteRange at t=0
  • catch WriteTooOld at C, and lay an intent on top of that, but flag the txn as WriteTooOld with the higher timestamp
  • retry on commit

bdarnell (Contributor) commented:

Yes, @petermattis's suggestion was what I was referring to as "retrying in evaluateTxnWriteBatch", and it's unsafe (definitely in this example with DelRange, and probably without DelRange too).

My second paragraph (where I asked where the timestamp was moved) was because I thought your DelRange example was raising a concern about the current 2PC code. There's not a problem when a 1PC txn retries as 2PC because the request's txn timestamp isn't moved (as far as I can see).


tbg commented May 11, 2017

Wouldn't the 2PC example go like this:

  • tscache: register t=0 for [A,B), C
  • lay down intents (t=0) for individual keys affected in [A,B) (say, none)
  • try C at t=0; encountering the existing write at t=5, it lays down an intent at t=6, returns no error but a bumped txn timestamp, and flags the txn as WriteTooOld
  • commit realizes the WriteTooOld flag and forces a retry (with new timestamp t=6)
  • client.Txn retries, writes everything again at t=6, and commits, so no anomaly here, but I would disagree with the statement that "the request's txn timestamp isn't moved".

Let me know if I got this wrong (it's from memory).

The corresponding hook that catches it on commit is here (also copying the RetryOnPush flag because it might generate similar problems when trying to be clever about 1PC):

```
func isEndTransactionTriggeringRetryError(
	headerTxn, currentTxn *roachpb.Transaction,
) (bool, roachpb.TransactionRetryReason) {
	// If we saw any WriteTooOldErrors, we must restart to avoid lost
	// update anomalies.
	if headerTxn.WriteTooOld {
		return true, roachpb.RETRY_WRITE_TOO_OLD
	}

	isTxnPushed := currentTxn.Timestamp != headerTxn.OrigTimestamp

	// If pushing requires a retry and the transaction was pushed, retry.
	if headerTxn.RetryOnPush && isTxnPushed {
		return true, roachpb.RETRY_DELETE_RANGE
	}
```

bdarnell (Contributor) commented:

I think we're in agreement here but we're just confusing each other because we're not being clear about when we're talking about the current code and when we're talking about hypothetical changes. #15797 (comment) describes the current behavior, and everything is fine. The retry in #15863 is unsafe because it advances the timestamp inside evaluateTxnWriteBatch. We have to return an error up the stack to leave the command queue and reenter at a new timestamp.

We currently handle WriteTooOldErrors by doing two things: falling back to 2PC mode and returning an error to the client.Txn to retry. The proposal here is to separate the two: retry at least once while staying in 1PC mode before falling back to the much more expensive 2PC.

petermattis (Collaborator) commented:

@tschottdorf I'm going to bump this issue over to you as you're more familiar with this area of the code than I am. #15863 clearly isn't the right approach.

petermattis assigned tbg and unassigned petermattis on May 31, 2017
tbg added a commit to tbg/cockroach that referenced this issue Jul 19, 2017
See cockroachdb#15797 for context.

The main problem here is `WriteTooOldError`s, which lay down an intent. If
this happens frequently, performance suffers dramatically since there is almost
always an intent on the key, forcing everyone into conflict resolution (which
is sure to lay down another intent thanks to `WriteTooOld`).

With this change we don't lay down intents for this workload, but the 1PC
transactions could be starved.

We could do even better (making the hack in cockroachdb#15863 correct): If we knew that
the 1PC transaction wasn't previously used for any reads (or it's SNAPSHOT), we
could in theory commit it directly again from the Store (saving the roundtrip
to the gateway/client). If it's SERIALIZABLE and there were previous reads, we
can't simulate a restart internally and the client must do it properly.

It also stands to reason that on a restart, the client shouldn't take the
existing timestamp but simply use `hlc.Now()`, which is less stale.

Neither of the last two paragraphs is explored in this PR. Here, we should
discuss whether it's OK to expose 1PC txns to starvation in the first place.

```
go run $GOPATH/src/github.com/cockroachdb/loadgen/kv/main.go --cycle-length 1 --concurrency $c --duration 30s
// this PR, c=10
_elapsed___errors____________ops___ops/sec(cum)__p95(ms)__p99(ms)_pMax(ms)_____seq(begin/end)
   30.0s        0          14905          496.8     16.3     17.8     29.4              0 / 0
// this PR and master, c=1
_elapsed___errors____________ops___ops/sec(cum)__p95(ms)__p99(ms)_pMax(ms)_____seq(begin/end)
   30.0s        0          22204          740.1      1.6      3.5     15.7              0 / 0
// master, c=10
_elapsed___errors____________ops___ops/sec(cum)__p95(ms)__p99(ms)_pMax(ms)_____seq(begin/end)
   30.0s        0           3185          106.2    369.1    604.0   1811.9              0 / 0
```
petermattis modified the milestones: 1.2, 1.1 on Aug 17, 2017

tbg commented Sep 20, 2017

Since we didn't like #17121 and we also don't want to put in a lot of complexity for the 1PC path (which ultimately users only see when they have quite well-adjusted insert patterns), nothing has been done so far.

One thought is that users could explicitly opt out of starvation avoidance for such workloads. That is, the transaction doesn't lay down intents on WriteTooOld. Or the SQL layer decides on its own to take that approach, namely when it has the whole transaction and sees that there's only one insert (and in particular, the gateway can run the potential retries). The latter seems hit-or-miss and the former adds client surface, which we may also want to avoid.

Or perhaps there's value in exploring a general heuristic in which intents are laid down only after the n-th write. Say, for n=20, if my txn writes 2 keys, it would only start writing intents after having retried 10 times. If the transaction writes 400 keys, it'll write intents for writes 20-400 and for any writes in restarts after that. But that leaves the question of the optimal value of n, which again depends on the workload; it also matters how many ranges you're hitting (if it's a single one for the whole txn, it's probably better to use a larger n to avoid the problem in the initial post).
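
A minimal, hypothetical sketch of that heuristic (threshold and names invented, not actual cockroach code):

```
const intentThreshold = 20 // "n" above; the right value is workload-dependent

// txnWriteState counts a transaction's writes across all of its attempts.
type txnWriteState struct {
	writesSoFar int
}

// shouldLayIntent starts laying down intents only once the cumulative write
// count passes the threshold: a txn writing 2 keys per attempt crosses it
// after roughly 10 retries, while a txn writing 400 keys lays intents for most
// of its writes on the very first attempt.
func (s *txnWriteState) shouldLayIntent() bool {
	s.writesSoFar++
	return s.writesSoFar > intentThreshold
}
```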

petermattis (Collaborator) commented:

@jordanlewis mentioned today that TPC-C exhibits contention on rows. Probably better to motivate improvements here based on that benchmark which is somewhat more realistic than a single key.

jordanlewis (Member) commented:

Yes. There are two primary examples of that in TPC-C.

  1. In NewOrder, the most frequent transaction of TPC-C, the "next order number" on a district (10 districts per warehouse, warehouses are the main entity of partitioning) is atomically bumped:

https://github.com/cockroachdb/loadgen/blob/2c013fdbd26b74e6d6a2317b9e7ce02323968354/tpcc/new_order.go#L134-L144

  2. In Payment, the second-most frequent transaction, a Warehouse's YTD profit is atomically updated, along with the YTD profits of a district:

https://github.com/cockroachdb/loadgen/blob/2c013fdbd26b74e6d6a2317b9e7ce02323968354/tpcc/payment.go#L116-L136

Since we've been testing with just 1 warehouse for the time being, the first one can probably be thought of as similar to kv --cycle-length=10, while the second is a hybrid of that and kv --cycle-length=1. The main difference is that the transactions are longer than the single contended statement.
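
For readers without the loadgen source handy, here is a rough sketch of the contended statement in NewOrder (illustrative Go and SQL only, using the standard TPC-C column names; see the links above for the real code). Every NewOrder transaction against the same district serializes on this single row, much like the kv workers sharing a key:

```
import "database/sql"

// bumpNextOrderID is illustrative only, not copied from loadgen. All NewOrder
// txns for the same district contend on this one UPDATE.
func bumpNextOrderID(tx *sql.Tx, wID, dID int) (next int, err error) {
	err = tx.QueryRow(
		`UPDATE district
		   SET d_next_o_id = d_next_o_id + 1
		 WHERE d_w_id = $1 AND d_id = $2
		 RETURNING d_next_o_id`,
		wID, dID,
	).Scan(&next)
	return next, err
}
```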

jordanlewis (Member) commented:

To amend that last statement: there are actually so many things different about TPC-C and kv that it's probably not too useful to think about the comparison I drew, except for the fact that they both write to a contended row.


tbg commented Nov 16, 2017

The TPC-C examples don't look like they could ever pass for 1PC txns, but they would indeed be harmed by not laying down intents there. I'm inclined to say there isn't much we're going to be able to do here in the foreseeable future. A real client running TPC-C would likely be advised to run an individual txn to bump the ID, though (risking the creation of IDs that are never used, which seems OK); that makes this a 1PC txn again, which would then profit from better handling of 1PC txns that run into a WriteTooOld.
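
A hypothetical sketch of that workaround (not part of loadgen): reserve the order number in its own implicit transaction, which can commit via the 1PC path, and only then run the rest of NewOrder in a separate transaction; if that later transaction aborts, the reserved ID simply goes unused.

```
// Illustrative only: issuing the bump through db (not a tx) makes it its own
// implicit transaction, which CockroachDB can commit via the 1PC path.
func reserveOrderID(db *sql.DB, wID, dID int) (int, error) {
	var next int
	err := db.QueryRow(
		`UPDATE district SET d_next_o_id = d_next_o_id + 1
		 WHERE d_w_id = $1 AND d_id = $2 RETURNING d_next_o_id`,
		wID, dID,
	).Scan(&next)
	return next, err
}
```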


tbg commented Mar 3, 2018

@spencerkimball fixed this.

tbg closed this as completed on Mar 3, 2018