perf: investigate loadgen/kv slowdown #15797

Closed
tbg opened this issue May 9, 2017 · 18 comments

tbg commented May 9, 2017

From cockroachdb/loadgen#50:

Something odd I've noticed is that this:

```
go run main.go --cycle-length 1 --concurrency 8
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p95(ms)__p99(ms)_pMax(ms)
      1s        0          344.6          344.6    100.7    134.2    285.2
      2s        0          319.2          331.9    104.9    209.7    285.2
      3s        0          325.0          329.6     96.5    184.5    251.7
      4s        0          322.9          327.9    117.4    167.8    209.7
```

is much slower than this:

```
go run main.go --cycle-length 10 --concurrency 8
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p95(ms)__p99(ms)_pMax(ms)
      1s        0         3165.0         3164.9      5.5      7.9     10.5
      2s        0         3194.6         3179.8      5.2      8.9     12.6
      3s        0         3093.7         3151.1      5.5      8.4     12.1
      4s        0         2807.6         3065.3      6.3     11.0     13.1
```

Unless I'm mistaken, the eight clients "never" intersect, so the only difference is that in the first example each of them is hitting its own single key, and in the latter its own eight keys. Perhaps there is more range parallelism in the latter, but you wouldn't expect it. The difference disappears with --concurrency=1.

@petermattis for triage.

tbg added the C-performance label (Perf of queries or internals; solution not expected to change functional behavior) on May 9, 2017

tbg commented May 9, 2017

@petermattis pointed out to me that the generators hash the sequence numbers verbatim and don't mix in a per-generator seed to make their hashes unique. So this is simply expected: one case has no overlapping writes, the other has only overlapping writes. Duh!
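
To make that concrete, here is a minimal, hypothetical sketch of the key-generation behavior being described (all names are invented; this is not the actual loadgen code): the key is derived purely from the shared sequence value with no per-worker seed mixed in, so with --cycle-length 1 every worker maps every operation to the same key.

```
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// nextKey is illustrative only: the key depends solely on the shared sequence
// value, so any two workers that draw the same sequence value write the same key.
func nextKey(seq, cycleLength uint64) uint64 {
	var buf [8]byte
	binary.LittleEndian.PutUint64(buf[:], seq%cycleLength)
	h := fnv.New64()
	h.Write(buf[:])
	return h.Sum64()
}

func main() {
	// --cycle-length 1: every op from every worker hits the same key.
	fmt.Println(nextKey(17, 1) == nextKey(42, 1)) // true
	// --cycle-length 10: writes spread over ten keys, so overlap is much rarer.
	fmt.Println(nextKey(17, 10) == nextKey(42, 10)) // false (different residues)
}
```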

tbg closed this as completed on May 9, 2017
petermattis (Collaborator) commented:

Contention vs no contention explains some of the difference between --cycle-length=1 and --cycle-length=10. But there is something else that deserves investigation:

```
~ kv --cycle-length=1 --concurrency=1 --duration 30s
...
_elapsed___errors____________ops___ops/sec(cum)__p95(ms)__p99(ms)_pMax(ms)
   30.0s        0          60413         2013.7      0.7      2.0     13.1

~ kv --cycle-length=1 --concurrency=2 --duration 30s
_elapsed___errors____________ops___ops/sec(cum)__p95(ms)__p99(ms)_pMax(ms)
   30.0s        0          78992         2632.8      3.0      4.2     28.3

~ kv --cycle-length=1 --concurrency=4 --duration 30s
_elapsed___errors____________ops___ops/sec(cum)__p95(ms)__p99(ms)_pMax(ms)
   30.0s        0          27628          920.9     12.6     46.1    260.0

~ kv --cycle-length=1 --concurrency=8 --duration 30s
_elapsed___errors____________ops___ops/sec(cum)__p95(ms)__p99(ms)_pMax(ms)
   30.0s        0          14067          468.9     39.8    318.8   2550.1
```

I would expect latencies to increase with higher concurrency, but throughput to remain relatively steady. Instead, throughput drops through the floor.

petermattis reopened this on May 9, 2017
petermattis added this to the 1.1 milestone on May 9, 2017
petermattis (Collaborator) commented:

Not sure why, but with --cycle-length=1 --concurrency=2 (or anything higher than 1) I'm seeing write-intent errors. Lots of them. This is curious because I would have expected this workload to produce only 1PC transactions, which do not leave intents.


petermattis commented May 10, 2017

Looks like what is happening is that two concurrent writers to the same key, each in a 1PC transaction, will eventually cause a WriteTooOldError because the transactions are performed in a different order than their transaction timestamps. When the WriteTooOldError is encountered, the 1PC evaluation is aborted and we fall back to the normal code path, which immediately encounters the WriteTooOldError again. In the normal code path this causes an intent to be laid down for the key and the transaction to be retried. At this point, the next transaction will encounter the intent, causing it to be kicked out of the 1PC code path as well, and performance is whack.

I think we could handle the WriteTooOldError in the 1PC code path and retry 1PC evaluation (after forwarding the transaction timestamp). Anyone see a reason this would be problematic?
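
As a rough, hypothetical sketch of that retry-in-place (all names are invented and this is not the actual evaluateTxnWriteBatch code; as the comments below explain, the idea is unsafe as-is because the bumped timestamp never re-checks the timestamp cache):

```
// Illustrative pseudocode only, not the actual cockroach code.
type writeTooOldError struct{ existingTS int64 }

func (e *writeTooOldError) Error() string { return "write too old" }

// evaluate1PC retries the one-phase-commit evaluation in place, forwarding the
// transaction timestamp past the conflicting write instead of falling back to
// the normal (intent-laying) code path.
func evaluate1PC(txnTS int64, eval func(ts int64) error) error {
	for {
		err := eval(txnTS)
		wto, ok := err.(*writeTooOldError)
		if !ok {
			return err // committed, or failed for some other reason
		}
		txnTS = wto.existingTS + 1
	}
}
```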


tbg commented May 10, 2017

It would bypass the timestamp cache. The "normal" txn path does this too, but it doesn't commit, so if there's a bug there it maybe doesn't manifest. For example, a Begin, DelRange [A,B), Put C, Commit in which only the Put C runs into the WriteTooOld would be bumped up to a higher timestamp, allowing a later Put to write into [A,B) in violation of the DelRange?

petermattis added a commit to petermattis/cockroach that referenced this issue May 10, 2017
As @tschottdorf points out, this likely isn't safe. Seems like we'd want
to retry the 1PC at a higher level so that we go through the timestamp
cache again.

Fixes cockroachdb#15797

bdarnell commented May 10, 2017

Retrying at a higher timestamp in evaluateTxnWriteBatch is unsafe because it ignores the timestamp cache, but it would be safe to simply return the WriteTooOldError instead of retrying on the 2PC path. Then the retry would be another 1PC at the newer timestamp (issued by the gateway), which would probably be faster than the 2PC retry. The catch is that because no intents are laid down, you could keep getting pushed further and further into the future; you'd have no ability to ratchet the transaction priority or hold a place in line. Under low contention it's better to retry as 1PC; under high contention it could lead to unbounded backoffs and retries.

@tschottdorf In your example, where is the timestamp moved after the timestamp cache has been checked? If I'm reading this correctly the 2PC retry uses the original Transaction's timestamp, not one influenced by a possible WriteTooOldError.

bdarnell (Contributor) commented:

The unbounded retry scenario could be mitigated by returning the WriteTooOldError if the transaction's epoch is less than N, and doing the 2PC retry for later epochs.
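
A minimal, hypothetical sketch of that epoch gate (the threshold and all names are invented, not actual cockroach code):

```
// For the first few epochs, return the WriteTooOldError so the gateway retries
// the batch as another cheap 1PC attempt at a newer timestamp; after that, fall
// back to the 2PC path, which lays down intents and lets the txn hold its place
// in line.
const maxOnePCEpochs = 3 // illustrative value of "N"

func retryAsOnePC(txnEpoch uint32) bool {
	return txnEpoch < maxOnePCEpochs
}
```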


tbg commented May 11, 2017

@bdarnell @petermattis' suggestion was:

I think we could handle the WriteTooOldError in the 1PC code path and retry 1PC evaluation (after forwarding the transaction timestamp). Anyone see a reason this would be problematic?

and that's also the code in https://github.com/cockroachdb/cockroach/pull/15863/files?w=1#diff-0f4e61b240de001f1b97a87fc31af043R4057.

That would mean that my example would do the following:

  • try at t=0, catch WriteTooOld(t=5), bump txn timestamp to t=6
  • try at t=6, ok (but no timestamp cache update)
  • other writer can write to [A,B) at t=1.

Ok, now that I wrote that I see that you're probably assuming that there's a problem with the current code - no, probably not, though we are laying down an intent at a higher timestamp in the 2PC code, no? At least if I remember correctly how WriteTooOld is handled:

  • write DeleteRange at t=0
  • catch WriteTooOld at C, and lay an intent on top of that, but flag the txn as WriteTooOld with the higher timestamp
  • retry on commit

bdarnell (Contributor) commented:

Yes, @petermattis's suggestion was what I was referring to as "retrying in evaluateTxnWriteBatch", and it's unsafe (definitely in this example with DelRange, and probably without DelRange too).

My second paragraph (where I asked where the timestamp was moved) was because I thought your DelRange example was raising a concern about the current 2PC code. There's not a problem when a 1PC txn retries as 2PC because the request's txn timestamp isn't moved (as far as I can see).


tbg commented May 11, 2017

Wouldn't the 2PC example go like this:

  • tscache: register t=0 for [A,B), C
  • lay down intents (t=0) for individual keys affected in [A,B) (say, none)
  • try C at t=0; encountering the existing write at t=5, it lays down an intent at t=6, returns no error but a bumped txn timestamp, and flags the txn as WriteTooOld
  • commit realizes the WriteTooOld flag and forces a retry (with new timestamp t=6)
  • client.Txn retries, writes everything again at t=6, and commits, so no anomaly here, but I would disagree with the statement that "the request's txn timestamp isn't moved".

Let me know if I got this wrong (it's from memory).

The corresponding hook that catches it on commit is here (also copying the RetryOnPush flag because it might generate similar problems when trying to be clever about 1PC):

```
func isEndTransactionTriggeringRetryError(
	headerTxn, currentTxn *roachpb.Transaction,
) (bool, roachpb.TransactionRetryReason) {
	// If we saw any WriteTooOldErrors, we must restart to avoid lost
	// update anomalies.
	if headerTxn.WriteTooOld {
		return true, roachpb.RETRY_WRITE_TOO_OLD
	}

	isTxnPushed := currentTxn.Timestamp != headerTxn.OrigTimestamp

	// If pushing requires a retry and the transaction was pushed, retry.
	if headerTxn.RetryOnPush && isTxnPushed {
		return true, roachpb.RETRY_DELETE_RANGE
	}
```

bdarnell (Contributor) commented:

I think we're in agreement here but we're just confusing each other because we're not being clear about when we're talking about the current code and when we're talking about hypothetical changes. #15797 (comment) describes the current behavior, and everything is fine. The retry in #15863 is unsafe because it advances the timestamp inside evaluateTxnWriteBatch. We have to return an error up the stack to leave the command queue and reenter at a new timestamp.

We currently handle WriteTooOldErrors by doing two things: falling back to 2PC mode and returning an error to the client.Txn to retry. The proposal here is to separate the two: retry at least once while staying in 1PC mode before falling back to the much more expensive 2PC.

petermattis (Collaborator) commented:

@tschottdorf I'm going to bump this issue over to you as you're more familiar with this area of the code than I am. #15863 clearly isn't the right approach.

petermattis assigned tbg and unassigned petermattis on May 31, 2017
tbg added a commit to tbg/cockroach that referenced this issue Jul 19, 2017
See cockroachdb#15797 for context.

The main problem here is `WriteTooOldError`s, which lay down an intent. If
this happens frequently, performance suffers dramatically since there is almost
always an intent on the key, forcing everyone into conflict resolution (which
is sure to lay down another intent thanks to `WriteTooOld`).

With this change we don't lay down intents for this workload, but the 1PC
transactions could be starved.

We could do even better (making the hack in cockroachdb#15863 correct): If we knew that
the 1PC transaction wasn't previously used for any reads (or it's SNAPSHOT), we
could in theory commit it directly again from the Store (saving the roundtrip
to the gateway/client). If it's SERIALIZABLE and there were previous reads, we
can't simulate a restart internally and the client must do it properly.

It also stands to reason that on a restart, the client shouldn't take the
existing timestamp but simply use `hlc.Now()`, which is less stale.

Neither of the last two paragraphs is explored in this PR. Here, we should
discuss whether it's OK to expose 1PC txns to starvation in the first place.

```
go run $GOPATH/src/github.com/cockroachdb/loadgen/kv/main.go --cycle-length 1 --concurrency $c --duration 30s
// this PR, c=10
_elapsed___errors____________ops___ops/sec(cum)__p95(ms)__p99(ms)_pMax(ms)_____seq(begin/end)
   30.0s        0          14905          496.8     16.3     17.8     29.4              0 / 0
// this PR and master, c=1
_elapsed___errors____________ops___ops/sec(cum)__p95(ms)__p99(ms)_pMax(ms)_____seq(begin/end)
   30.0s        0          22204          740.1      1.6      3.5     15.7              0 / 0
// master, c=10
_elapsed___errors____________ops___ops/sec(cum)__p95(ms)__p99(ms)_pMax(ms)_____seq(begin/end)
   30.0s        0           3185          106.2    369.1    604.0   1811.9              0 / 0
```
petermattis modified the milestones: 1.2, 1.1 on Aug 17, 2017

tbg commented Sep 20, 2017

Since we didn't like #17121 and we also don't want to put in a lot of complexity for the 1PC path (which ultimately users only see when they have quite well-adjusted insert patterns), nothing has been done so far.

One thought is that users could explicitly opt out of starvation avoidance for such workloads. That is, the transaction doesn't lay down intents on WriteTooOld. Or the SQL layer decides on its own to take that approach, namely when it has the whole transaction and sees that there's only one insert (and in particular, the gateway can run the potential retries). The latter seems hit-or-miss and the former adds client surface, which we may also want to avoid.

Or perhaps there's value in exploring a general heuristic in which intents are laid down only after the n-th write. Say, for n=20, if my txn writes 2 keys, it would only start writing intents after having retried 10 times. If the transaction writes 400 keys, it'll write intents for writes 20-400 and for any writes in restarts after that. But that leaves the question of the optimal value of n, which again depends on the workload; it also matters how many ranges you're hitting (if it's a single one for the whole txn, it's probably better to use a larger n to avoid the problem in the initial post).
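
A minimal, hypothetical sketch of that heuristic (threshold and names invented, not actual cockroach code):

```
const intentThreshold = 20 // "n" above; the right value is workload-dependent

// txnWriteState counts a transaction's writes across all of its attempts.
type txnWriteState struct {
	writesSoFar int
}

// shouldLayIntent starts laying down intents only once the cumulative write
// count passes the threshold: a txn writing 2 keys per attempt crosses it
// after roughly 10 retries, while a txn writing 400 keys lays intents for most
// of its writes on the very first attempt.
func (s *txnWriteState) shouldLayIntent() bool {
	s.writesSoFar++
	return s.writesSoFar > intentThreshold
}
```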

petermattis (Collaborator) commented:

@jordanlewis mentioned today that TPC-C exhibits contention on rows. Probably better to motivate improvements here based on that benchmark which is somewhat more realistic than a single key.

jordanlewis (Member) commented:

Yes. There are two primary examples of that in TPC-C.

  1. In NewOrder, the most frequent transaction of TPC-C, the "next order number" on a district (10 districts per warehouse, warehouses are the main entity of partitioning) is atomically bumped:

https://github.com/cockroachdb/loadgen/blob/2c013fdbd26b74e6d6a2317b9e7ce02323968354/tpcc/new_order.go#L134-L144

  2. In Payment, the second-most frequent transaction, a Warehouse's YTD profit is atomically updated, along with the YTD profits of a district:

https://github.com/cockroachdb/loadgen/blob/2c013fdbd26b74e6d6a2317b9e7ce02323968354/tpcc/payment.go#L116-L136

Since we've been testing with just 1 warehouse for the time being, the first one can probably be thought of as similar to kv --cycle-length=10, while the second is a hybrid of that and kv --cycle-length=1. The main difference is that the transactions are longer than the single contended statement.
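
For readers without the loadgen source handy, here is a rough sketch of the contended statement in NewOrder (illustrative Go and SQL only, using the standard TPC-C column names; see the links above for the real code). Every NewOrder transaction against the same district serializes on this single row, much like the kv workers sharing a key:

```
import "database/sql"

// bumpNextOrderID is illustrative only, not copied from loadgen. All NewOrder
// txns for the same district contend on this one UPDATE.
func bumpNextOrderID(tx *sql.Tx, wID, dID int) (next int, err error) {
	err = tx.QueryRow(
		`UPDATE district
		   SET d_next_o_id = d_next_o_id + 1
		 WHERE d_w_id = $1 AND d_id = $2
		 RETURNING d_next_o_id`,
		wID, dID,
	).Scan(&next)
	return next, err
}
```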

jordanlewis (Member) commented:

To amend that last statement: there are actually so many things different about TPC-C and kv that it's probably not too useful to think about the comparison I drew, except for the fact that they both write to a contended row.


tbg commented Nov 16, 2017

The TPC-C examples don't look like they could ever pass for 1PC txns, but they would indeed be harmed by not laying down intents there. I'm inclined to say there isn't much we're going to be able to do here in the foreseeable future. A real client running TPC-C would likely be advised to run an individual txn to bump the ID, though (risking the creation of IDs that are never used, which seems OK); that makes this a 1PC txn again, which would then profit from better handling of 1PC txns that run into a WriteTooOld.
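
A hypothetical sketch of that workaround (not part of loadgen): reserve the order number in its own implicit transaction, which can commit via the 1PC path, and only then run the rest of NewOrder in a separate transaction; if that later transaction aborts, the reserved ID simply goes unused.

```
// Illustrative only: issuing the bump through db (not a tx) makes it its own
// implicit transaction, which CockroachDB can commit via the 1PC path.
func reserveOrderID(db *sql.DB, wID, dID int) (int, error) {
	var next int
	err := db.QueryRow(
		`UPDATE district SET d_next_o_id = d_next_o_id + 1
		 WHERE d_w_id = $1 AND d_id = $2 RETURNING d_next_o_id`,
		wID, dID,
	).Scan(&next)
	return next, err
}
```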


tbg commented Mar 3, 2018

@spencerkimball fixed this.

tbg closed this as completed on Mar 3, 2018