Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cache: investigate mutex contention with TPCC #1997

Closed
sumeerbhola opened this issue Oct 5, 2022 · 2 comments
Closed

cache: investigate mutex contention with TPCC #1997

sumeerbhola opened this issue Oct 5, 2022 · 2 comments

Comments

@sumeerbhola
Copy link
Collaborator

@ajwerner reported high mutex contention on the block cache when running TPCC with 16 core machines (internal link: https://cockroachlabs.slack.com/archives/CAC6K3SLU/p1664480089719489), specifically with nodes under high load (running tpccbench). And also pointed to a discussion regarding poor RWMutex scaling with high core count golang/go#17973.

The diff from the 2 profiles (that are 60s apart) in that internal link is shown below and shows a total contention of 122s. 34s is in the sstable cache.
Screen Shot 2022-10-05 at 5 28 00 PM

There doesn't seem to be an easy fix here but we should reproduce this experiment and measure (a) mutex contention, (b) mutex related overhead in a cpu profile, (c) the number of concurrent MVCCGet calls. If (b) is "small" and (c) is "large", the contention walltime may be acceptable, since it would not impact the peak throughput we can achieve on that node.

@andrewbaptist
Copy link

In a recent customer investigation (https://github.com/cockroachlabs/support/issues/2066) I was looking at what the runnable goroutines on the system were doing to try and determine why requests were taking longer than expected. There are a large number (445) of runnable goroutines on the system. Looking more into what they were doing, about half of them were waiting in the pebble layer. Note that this does not include goroutines that are in the semacquire state which would be what I naively expected.

Instead, these were busy attempting to get a lock, not waiting for the OS signal.
There are a few things that they are waiting for, but breaking them down it looks like they are almost all waiting on the pebble shard lock. I see that it is set to 2xgoprocs, but maybe that is not enough?

goroutine_dump.2023-02-12T06_50_56.716.double_since_last_dump.000002714.txt

Looking at what all the runnable threads are doing:

% grep -h -A1 "^goroutine.*runnable" goroutine_dump.2023-02-12T06_50_56.716.double_since_last_dump.000002714.txt | grep -v goroutine | grep -v "^--" | grep -o '^.*(' | sort | uniq -c | sort -n  | tail -10
   3 google.golang.org/grpc/internal/transport.(*recvBufferReader).read(
   3 sync.runtime_notifyListWait(
   5 github.com/cockroachdb/cockroach/pkg/kv/kvserver/spanlatch.(*Manager).waitForSignal(
   5 google.golang.org/grpc/internal/transport.(*controlBuffer).get(
   5 syscall.Syscall6(
   7 sync.runtime_Semacquire(
  15 github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2(
  21 internal/poll.runtime_pollWait(
 152 google.golang.org/grpc.(*Server).serveStreams.func1.2(
 211 sync.runtime_SemacquireMutex(

Looking at just where the runnable mutexes came from:

% grep -h -A10 "^goroutine.*runnable" goroutine_dump.2023-02-12T06_50_56.716.double_since_last_dump.000002714.txt | grep -v goroutine | grep -v "^--" | grep -A1 "GOROOT/src/sync/rwmutex.go" | grep -o "^github.*("  | sort | uniq -c | sort -n
   1 github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).GetLiveness(
   1 github.com/cockroachdb/pebble/internal/cache.(*shard).Delete(
   3 github.com/cockroachdb/pebble/internal/cache.(*shard).Set(
 207 github.com/cockroachdb/pebble/internal/cache.(*shard).Get(

jbowens added a commit to jbowens/pebble that referenced this issue Jul 13, 2023
During compactions, avoid populating the block cache with input files' blocks.
These files will soon be removed from the LSM, making it less likely any
iterator will need to read these blocks. While Pebble uses a scan-resistant
block cache algorithm (ClockPRO), the act of inserting the blocks into the
cache increases contention on the block cache mutexes (cockroachdb#1997). This contention
has been observed to significantly contribute to tail latencies, both for reads
and for writes during memtable reservation. Additionally, although these blocks
may be soon replaced with more useful blocks due to ClockPRO's scan resistance,
they may be freed by a different thread inducing excessive TLB shootdowns
(cockroachdb#2693).

A compaction only requires a relatively small working set of buffers during its
scan across input sstables. In this commit, we introduce a per-compaction
BufferPool that is used to allocate buffers during cache misses. Buffers are
reused throughout the compaction and only freed to the memory allocator when
they're too small or the compaction is finished. This reduces pressure on the
memory allocator and the block cache.
jbowens added a commit to jbowens/pebble that referenced this issue Jul 13, 2023
When opening a sstable for the first time, we read two 'meta' blocks: one
describes the layout of sstable, encoding block handles for the index block,
properties block, etc. The other contains table properties. These blocks are
decoded, and the necessary state is copied onto the heap, and then blocks are
released. In typical configurations with sufficiently large table caches, these
blocks are never read again or only much later after the table has been evicted
from the table cache.

This commit uses a BufferPool to hold these temporary blocks, and refrains from
adding these blocks to the block cache. This reduces contention on the block
cache mutexes (cockroachdb#1997).
jbowens added a commit to jbowens/pebble that referenced this issue Jul 14, 2023
During compactions, avoid populating the block cache with input files' blocks.
These files will soon be removed from the LSM, making it less likely any
iterator will need to read these blocks. While Pebble uses a scan-resistant
block cache algorithm (ClockPRO), the act of inserting the blocks into the
cache increases contention on the block cache mutexes (cockroachdb#1997). This contention
has been observed to significantly contribute to tail latencies, both for reads
and for writes during memtable reservation. Additionally, although these blocks
may be soon replaced with more useful blocks due to ClockPRO's scan resistance,
they may be freed by a different thread inducing excessive TLB shootdowns
(cockroachdb#2693).

A compaction only requires a relatively small working set of buffers during its
scan across input sstables. In this commit, we introduce a per-compaction
BufferPool that is used to allocate buffers during cache misses. Buffers are
reused throughout the compaction and only freed to the memory allocator when
they're too small or the compaction is finished. This reduces pressure on the
memory allocator and the block cache.
jbowens added a commit to jbowens/pebble that referenced this issue Jul 14, 2023
When opening a sstable for the first time, we read two 'meta' blocks: one
describes the layout of sstable, encoding block handles for the index block,
properties block, etc. The other contains table properties. These blocks are
decoded, and the necessary state is copied onto the heap, and then blocks are
released. In typical configurations with sufficiently large table caches, these
blocks are never read again or only much later after the table has been evicted
from the table cache.

This commit uses a BufferPool to hold these temporary blocks, and refrains from
adding these blocks to the block cache. This reduces contention on the block
cache mutexes (cockroachdb#1997).
jbowens added a commit that referenced this issue Jul 16, 2023
During compactions, avoid populating the block cache with input files' blocks.
These files will soon be removed from the LSM, making it less likely any
iterator will need to read these blocks. While Pebble uses a scan-resistant
block cache algorithm (ClockPRO), the act of inserting the blocks into the
cache increases contention on the block cache mutexes (#1997). This contention
has been observed to significantly contribute to tail latencies, both for reads
and for writes during memtable reservation. Additionally, although these blocks
may be soon replaced with more useful blocks due to ClockPRO's scan resistance,
they may be freed by a different thread inducing excessive TLB shootdowns
(#2693).

A compaction only requires a relatively small working set of buffers during its
scan across input sstables. In this commit, we introduce a per-compaction
BufferPool that is used to allocate buffers during cache misses. Buffers are
reused throughout the compaction and only freed to the memory allocator when
they're too small or the compaction is finished. This reduces pressure on the
memory allocator and the block cache.
jbowens added a commit that referenced this issue Jul 16, 2023
When opening a sstable for the first time, we read two 'meta' blocks: one
describes the layout of sstable, encoding block handles for the index block,
properties block, etc. The other contains table properties. These blocks are
decoded, and the necessary state is copied onto the heap, and then blocks are
released. In typical configurations with sufficiently large table caches, these
blocks are never read again or only much later after the table has been evicted
from the table cache.

This commit uses a BufferPool to hold these temporary blocks, and refrains from
adding these blocks to the block cache. This reduces contention on the block
cache mutexes (#1997).
jbowens added a commit to jbowens/pebble that referenced this issue Jul 26, 2023
We've observed significant block cache mutex contention originating from
EvictFile. When evicting a file with a significant count of blocks currently
held within the block cache, EvictFile may hold the block cache shard mutex for
a long duration, increasing latency for requests that must access the same
shard. This commit periodically drops the mutex to provide these other requests
with an opportunity to make progress.

Informs cockroachdb#1997.
jbowens added a commit to jbowens/pebble that referenced this issue Jul 26, 2023
We've observed significant block cache mutex contention originating from
EvictFile. When evicting a file with a significant count of blocks currently
held within the block cache, EvictFile may hold the block cache shard mutex for
a long duration, increasing latency for requests that must access the same
shard. This commit periodically drops the mutex to provide these other requests
with an opportunity to make progress.

Informs cockroachdb#1997.
jbowens added a commit to jbowens/pebble that referenced this issue Jul 26, 2023
We've observed significant block cache mutex contention originating from
EvictFile. When evicting a file with a significant count of blocks currently
held within the block cache, EvictFile may hold the block cache shard mutex for
a long duration, increasing latency for requests that must access the same
shard.

This commit mitigates this contention with two approaches. EvictFile now
periodically drops the mutex to provide these other requests with an
opportunity to make progress. Additionally, rather than free()-ing manually
managed memory back to the memory allocator while holding the mutex, the free
is deferred until the shard mutex has been dropped.

Informs cockroachdb#1997.
jbowens added a commit to jbowens/pebble that referenced this issue Jul 26, 2023
We've observed block cache mutex contention originating from EvictFile. When
evicting a file with a significant count of blocks currently held within the
block cache, EvictFile may hold the block cache shard mutex for an extended
duration, increasing latency for requests that must access the same shard.

This commit mitigates this contention with two approaches. EvictFile now
periodically drops the mutex to provide these other requests with an
opportunity to make progress. Additionally, rather than free()-ing manually
managed memory back to the memory allocator while holding the mutex, the free
is deferred until the shard mutex has been dropped.

Informs cockroachdb#1997.
jbowens added a commit to jbowens/pebble that referenced this issue Jul 27, 2023
We've observed large allocations like the 64MB memtable allocation take 10ms+.
This can contribute to batch commit tail latencies by slowing down WAL/memtable
rotation during which the entire commit pipeline is stalled. This commit adapts
the memtable lifecycle to keep the most recent obsolete memtable around for use
as the next memtable.

This reduces the commit latency hiccup during a memtable rotation, and it also
reduces block cache mutex contention (cockroachdb#1997) by reducing the number of times we
must reserve memory from the block cache.

```
goos: linux
goarch: amd64
pkg: github.com/cockroachdb/pebble
cpu: Intel(R) Xeon(R) CPU @ 2.30GHz
                   │   old.txt   │               new.txt               │
                   │   sec/op    │   sec/op     vs base                │
RotateMemtables-24   120.7µ ± 2%   102.8µ ± 4%  -14.85% (p=0.000 n=25)

                   │   old.txt    │               new.txt               │
                   │     B/op     │     B/op      vs base               │
RotateMemtables-24   124.3Ki ± 0%   124.0Ki ± 0%  -0.27% (p=0.000 n=25)

                   │  old.txt   │              new.txt              │
                   │ allocs/op  │ allocs/op   vs base               │
RotateMemtables-24   114.0 ± 0%   111.0 ± 0%  -2.63% (p=0.000 n=25)
```

Informs cockroachdb#2646.
jbowens added a commit to jbowens/pebble that referenced this issue Jul 27, 2023
We've observed block cache mutex contention originating from EvictFile. When
evicting a file with a significant count of blocks currently held within the
block cache, EvictFile may hold the block cache shard mutex for an extended
duration, increasing latency for requests that must access the same shard.

This commit mitigates this contention with two approaches. EvictFile now
periodically drops the mutex to provide these other requests with an
opportunity to make progress. Additionally, rather than free()-ing manually
managed memory back to the memory allocator while holding the mutex, the free
is deferred until the shard mutex has been dropped.

Informs cockroachdb#1997.
jbowens added a commit to jbowens/pebble that referenced this issue Jul 27, 2023
We've observed block cache mutex contention originating from EvictFile. When
evicting a file with a significant count of blocks currently held within the
block cache, EvictFile may hold the block cache shard mutex for an extended
duration, increasing latency for requests that must access the same shard.

This commit mitigates this contention with two approaches. EvictFile now
periodically drops the mutex to provide these other requests with an
opportunity to make progress. Additionally, rather than free()-ing manually
managed memory back to the memory allocator while holding the mutex, the free
is deferred until the shard mutex has been dropped.

Informs cockroachdb#1997.
jbowens added a commit to jbowens/pebble that referenced this issue Jul 28, 2023
We've observed large allocations like the 64MB memtable allocation take 10ms+.
This can contribute to batch commit tail latencies by slowing down WAL/memtable
rotation during which the entire commit pipeline is stalled. This commit adapts
the memtable lifecycle to keep the most recent obsolete memtable around for use
as the next memtable.

This reduces the commit latency hiccup during a memtable rotation, and it also
reduces block cache mutex contention (cockroachdb#1997) by reducing the number of times we
must reserve memory from the block cache.

```
goos: linux
goarch: amd64
pkg: github.com/cockroachdb/pebble
cpu: Intel(R) Xeon(R) CPU @ 2.30GHz
                   │   old.txt   │               new.txt               │
                   │   sec/op    │   sec/op     vs base                │
RotateMemtables-24   120.7µ ± 2%   102.8µ ± 4%  -14.85% (p=0.000 n=25)

                   │   old.txt    │               new.txt               │
                   │     B/op     │     B/op      vs base               │
RotateMemtables-24   124.3Ki ± 0%   124.0Ki ± 0%  -0.27% (p=0.000 n=25)

                   │  old.txt   │              new.txt              │
                   │ allocs/op  │ allocs/op   vs base               │
RotateMemtables-24   114.0 ± 0%   111.0 ± 0%  -2.63% (p=0.000 n=25)
```

Informs cockroachdb#2646.
jbowens added a commit to jbowens/pebble that referenced this issue Jul 28, 2023
We've observed large allocations like the 64MB memtable allocation take 10ms+.
This can add latency to the WAL/memtable rotation critical section during which
the entire commit pipeline is stalled, contributing to batch commit tail
latencies. This commit adapts the memtable lifecycle to keep the most recent
obsolete memtable around for use as the next mutable memtable.

This reduces the commit latency hiccup during a memtable rotation, and it also
reduces block cache mutex contention (cockroachdb#1997) by reducing the number of times we
must reserve memory from the block cache.

```
goos: linux
goarch: amd64
pkg: github.com/cockroachdb/pebble
cpu: Intel(R) Xeon(R) CPU @ 2.30GHz
                   │   old.txt   │               new.txt               │
                   │   sec/op    │   sec/op     vs base                │
RotateMemtables-24   120.7µ ± 2%   102.8µ ± 4%  -14.85% (p=0.000 n=25)

                   │   old.txt    │               new.txt               │
                   │     B/op     │     B/op      vs base               │
RotateMemtables-24   124.3Ki ± 0%   124.0Ki ± 0%  -0.27% (p=0.000 n=25)

                   │  old.txt   │              new.txt              │
                   │ allocs/op  │ allocs/op   vs base               │
RotateMemtables-24   114.0 ± 0%   111.0 ± 0%  -2.63% (p=0.000 n=25)
```

Informs cockroachdb#2646.
jbowens added a commit that referenced this issue Jul 28, 2023
We've observed large allocations like the 64MB memtable allocation take 10ms+.
This can add latency to the WAL/memtable rotation critical section during which
the entire commit pipeline is stalled, contributing to batch commit tail
latencies. This commit adapts the memtable lifecycle to keep the most recent
obsolete memtable around for use as the next mutable memtable.

This reduces the commit latency hiccup during a memtable rotation, and it also
reduces block cache mutex contention (#1997) by reducing the number of times we
must reserve memory from the block cache.

```
goos: linux
goarch: amd64
pkg: github.com/cockroachdb/pebble
cpu: Intel(R) Xeon(R) CPU @ 2.30GHz
                   │   old.txt   │               new.txt               │
                   │   sec/op    │   sec/op     vs base                │
RotateMemtables-24   120.7µ ± 2%   102.8µ ± 4%  -14.85% (p=0.000 n=25)

                   │   old.txt    │               new.txt               │
                   │     B/op     │     B/op      vs base               │
RotateMemtables-24   124.3Ki ± 0%   124.0Ki ± 0%  -0.27% (p=0.000 n=25)

                   │  old.txt   │              new.txt              │
                   │ allocs/op  │ allocs/op   vs base               │
RotateMemtables-24   114.0 ± 0%   111.0 ± 0%  -2.63% (p=0.000 n=25)
```

Informs #2646.
jbowens added a commit to jbowens/pebble that referenced this issue Jul 28, 2023
We've observed block cache mutex contention originating from EvictFile. When
evicting a file with a significant count of blocks currently held within the
block cache, EvictFile may hold the block cache shard mutex for an extended
duration, increasing latency for requests that must access the same shard.

This commit mitigates this contention with two approaches. EvictFile now
periodically drops the mutex to provide these other requests with an
opportunity to make progress. Additionally, rather than free()-ing manually
managed memory back to the memory allocator while holding the mutex, the free
is deferred until the shard mutex has been dropped.

Informs cockroachdb#1997.
jbowens added a commit that referenced this issue Jul 28, 2023
We've observed block cache mutex contention originating from EvictFile. When
evicting a file with a significant count of blocks currently held within the
block cache, EvictFile may hold the block cache shard mutex for an extended
duration, increasing latency for requests that must access the same shard.

This commit mitigates this contention with two approaches. EvictFile now
periodically drops the mutex to provide these other requests with an
opportunity to make progress. Additionally, rather than free()-ing manually
managed memory back to the memory allocator while holding the mutex, the free
is deferred until the shard mutex has been dropped.

Informs #1997.
@jbowens
Copy link
Collaborator

jbowens commented Jul 28, 2023

We've merged several different patches to mitigate block cache mutex contention. I'll close this out, and we can reopen if we uncover it again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Archived in project
Development

No branches or pull requests

3 participants