Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

admission: disk bandwidth as a bottleneck resource #82898

Closed
sumeerbhola opened this issue Jun 14, 2022 · 1 comment · Fixed by #86063
Closed

admission: disk bandwidth as a bottleneck resource #82898

sumeerbhola opened this issue Jun 14, 2022 · 1 comment · Fixed by #86063
Assignees
Labels
A-admission-control C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-storage Storage Team

Comments

@sumeerbhola
Copy link
Collaborator

sumeerbhola commented Jun 14, 2022

Admission control is unaware of disk bandwidth as a constrained resource.
This leads to issues like https://github.com/cockroachlabs/support/issues/1628

First-class support for the disk bandwidth resource would address this.
Draft PR #82813

Jira issue: CRDB-16717

Epic CRDB-14607

@sumeerbhola sumeerbhola added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-admission-control labels Jun 14, 2022
@blathers-crl blathers-crl bot added T-storage Storage Team T-kv KV Team labels Jun 14, 2022
@exalate-issue-sync exalate-issue-sync bot added T-kv KV Team and removed T-kv KV Team T-storage Storage Team labels Jun 14, 2022
@mwang1026 mwang1026 added T-storage Storage Team and removed T-kv KV Team labels Jun 27, 2022
@nicktrav
Copy link
Collaborator

Linking this to #83490, which seems more general, but would most likely make use of whatever we work out where for the disk bandwidth signal.

sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Aug 7, 2022
We assume that:
- There is a provisioned known limit on the sum of read and write
  bandwidth. This limit is allowed to change.
- Admission control can only shape the rate of admission of writes. Writes
  also cause reads, since compactions do reads and writes.

There are multiple challenges:
- We are unable to precisely track the causes of disk read bandwidth, since
  we do not have observability into what reads missed the OS page cache.
  That is, we don't know how much of the reads were due to incoming reads
  (that we don't shape) and how much due to compaction read bandwidth.
- We don't shape incoming reads.
- There can be a large time lag between the shaping of incoming writes, and when
  it affects actual writes in the system, since compaction backlog can
  build up in various levels of the LSM store.
- Signals of overload are coarse, since we cannot view all the internal
  queues that can build up due to resource overload. For instance,
  different examples of bandwidth saturation exhibit different
  latency effects, presumably because the queue buildup is different. So it
  is non-trivial to approach full utilization without risking high latency.

Due to these challenges, and previous design attempts that were quite
complicated (and incomplete), we adopt a goal of simplicity of design, and strong
abstraction boundaries.
- The disk load is abstracted using an enum. The diskLoadWatcher can be
  evolved independently.
- The approach uses easy to understand additive increase and multiplicative
  decrease, (unlike what we do for flush and compaction tokens, where we
  try to more precisely calculate the sustainable rates).

Since we are using a simple approach that is somewhat coarse in its behavior,
we start by limiting its application to two kinds of writes:
- Incoming writes that are deemed "elastic": This can be done by
  introducing a work-class (in addition to admissionpb.WorkPriority), or by
  implying a work-class from the priority (e.g. priorities < NormalPri are
  deemed elastic). This prototype does the latter.
- Optional compactions: We assume that the LSM store is configured with a
  ceiling on number of regular concurrent compactions, and if it needs more
  it can request resources for additional (optional) compactions. These
  latter compactions can be limited by this approach. See
  cockroachdb/pebble/issues/1329 for motivation. This control on compactions
  is not currently implemented and is future work (though the prototype
  in cockroachdb#82813 had code for
  it).

The reader should start with disk_bandwidth.go, consisting of
- diskLoadWatcher: which computes load levels.
- diskBandwidthLimiter: It composes the previous two objects and
  uses load information to limit write tokens for elastic writes
  and limit compactions.

There is significant refactoring and changes in granter.go and
work_queue.go. This is driven by the fact that:
- Previously the tokens were for L0 and now we need to support tokens for
  bytes into L0 and tokens for bytes into the LSM (the former being a subset
  of the latter).
- Elastic work is in a different WorkQueue than regular work, but they
  are competing for the same tokens.

The latter is handled by allowing kvSlotGranter to multiplex across
multiple requesters, via multiple child granters. A number of interfaces
are adjusted to make this viable. In general, the GrantCoordinator
is now slightly dumber and some of that logic is moved into the granters.

For the former (handling two kinds of tokens), I considered adding multiple
resource dimensions to the granter-requester interaction but found it
too complicated. Instead we rely on the observation that we request
tokens based on the total incoming bytes of the request (not just L0),
and when the request is completed, tell the granter how many bytes
went into L0. The latter allows us to return tokens to L0. So at the
time the request is completed, we can account separately for the L0
tokens and these new tokens for all incoming bytes (which we are calling
disk bandwidth tokens, since they are constrained based on disk bandwidth).

This is a cleaned up version of the prototype in
cockroachdb#82813 which contains the
experimental results. The plumbing from the KV layer to populate the
disk reads, writes and provisioned bandwidth is absent in this PR,
and will be added in a subsequent PR.

Disk bandwidth bottlenecks are considered only if both the following
are true:
- DiskStats.ProvisionedBandwidth is non-zero.
- The cluster setting admission.disk_bandwidth_tokens.elastic.enabled
  is true (defaults to true).

Informs cockroachdb#82898

Release note: None (the cluster setting mentioned earlier is useless
since the integration with CockroachDB will be in a future PR).
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Aug 8, 2022
We assume that:
- There is a provisioned known limit on the sum of read and write
  bandwidth. This limit is allowed to change.
- Admission control can only shape the rate of admission of writes. Writes
  also cause reads, since compactions do reads and writes.

There are multiple challenges:
- We are unable to precisely track the causes of disk read bandwidth, since
  we do not have observability into what reads missed the OS page cache.
  That is, we don't know how much of the reads were due to incoming reads
  (that we don't shape) and how much due to compaction read bandwidth.
- We don't shape incoming reads.
- There can be a large time lag between the shaping of incoming writes, and when
  it affects actual writes in the system, since compaction backlog can
  build up in various levels of the LSM store.
- Signals of overload are coarse, since we cannot view all the internal
  queues that can build up due to resource overload. For instance,
  different examples of bandwidth saturation exhibit different
  latency effects, presumably because the queue buildup is different. So it
  is non-trivial to approach full utilization without risking high latency.

Due to these challenges, and previous design attempts that were quite
complicated (and incomplete), we adopt a goal of simplicity of design, and strong
abstraction boundaries.
- The disk load is abstracted using an enum. The diskLoadWatcher can be
  evolved independently.
- The approach uses easy to understand additive increase and multiplicative
  decrease, (unlike what we do for flush and compaction tokens, where we
  try to more precisely calculate the sustainable rates).

Since we are using a simple approach that is somewhat coarse in its behavior,
we start by limiting its application to two kinds of writes:
- Incoming writes that are deemed "elastic": This can be done by
  introducing a work-class (in addition to admissionpb.WorkPriority), or by
  implying a work-class from the priority (e.g. priorities < NormalPri are
  deemed elastic). This prototype does the latter.
- Optional compactions: We assume that the LSM store is configured with a
  ceiling on number of regular concurrent compactions, and if it needs more
  it can request resources for additional (optional) compactions. These
  latter compactions can be limited by this approach. See
  cockroachdb/pebble/issues/1329 for motivation. This control on compactions
  is not currently implemented and is future work (though the prototype
  in cockroachdb#82813 had code for
  it).

The reader should start with disk_bandwidth.go, consisting of
- diskLoadWatcher: which computes load levels.
- diskBandwidthLimiter: It composes the previous two objects and
  uses load information to limit write tokens for elastic writes
  and limit compactions.

There is significant refactoring and changes in granter.go and
work_queue.go. This is driven by the fact that:
- Previously the tokens were for L0 and now we need to support tokens for
  bytes into L0 and tokens for bytes into the LSM (the former being a subset
  of the latter).
- Elastic work is in a different WorkQueue than regular work, but they
  are competing for the same tokens.

The latter is handled by allowing kvSlotGranter to multiplex across
multiple requesters, via multiple child granters. A number of interfaces
are adjusted to make this viable. In general, the GrantCoordinator
is now slightly dumber and some of that logic is moved into the granters.

For the former (handling two kinds of tokens), I considered adding multiple
resource dimensions to the granter-requester interaction but found it
too complicated. Instead we rely on the observation that we request
tokens based on the total incoming bytes of the request (not just L0),
and when the request is completed, tell the granter how many bytes
went into L0. The latter allows us to return tokens to L0. So at the
time the request is completed, we can account separately for the L0
tokens and these new tokens for all incoming bytes (which we are calling
disk bandwidth tokens, since they are constrained based on disk bandwidth).

This is a cleaned up version of the prototype in
cockroachdb#82813 which contains the
experimental results. The plumbing from the KV layer to populate the
disk reads, writes and provisioned bandwidth is absent in this PR,
and will be added in a subsequent PR.

Disk bandwidth bottlenecks are considered only if both the following
are true:
- DiskStats.ProvisionedBandwidth is non-zero.
- The cluster setting admission.disk_bandwidth_tokens.elastic.enabled
  is true (defaults to true).

Informs cockroachdb#82898

Release note: None (the cluster setting mentioned earlier is useless
since the integration with CockroachDB will be in a future PR).
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Aug 8, 2022
We assume that:
- There is a provisioned known limit on the sum of read and write
  bandwidth. This limit is allowed to change.
- Admission control can only shape the rate of admission of writes. Writes
  also cause reads, since compactions do reads and writes.

There are multiple challenges:
- We are unable to precisely track the causes of disk read bandwidth, since
  we do not have observability into what reads missed the OS page cache.
  That is, we don't know how much of the reads were due to incoming reads
  (that we don't shape) and how much due to compaction read bandwidth.
- We don't shape incoming reads.
- There can be a large time lag between the shaping of incoming writes, and when
  it affects actual writes in the system, since compaction backlog can
  build up in various levels of the LSM store.
- Signals of overload are coarse, since we cannot view all the internal
  queues that can build up due to resource overload. For instance,
  different examples of bandwidth saturation exhibit different
  latency effects, presumably because the queue buildup is different. So it
  is non-trivial to approach full utilization without risking high latency.

Due to these challenges, and previous design attempts that were quite
complicated (and incomplete), we adopt a goal of simplicity of design, and strong
abstraction boundaries.
- The disk load is abstracted using an enum. The diskLoadWatcher can be
  evolved independently.
- The approach uses easy to understand additive increase and multiplicative
  decrease, (unlike what we do for flush and compaction tokens, where we
  try to more precisely calculate the sustainable rates).

Since we are using a simple approach that is somewhat coarse in its behavior,
we start by limiting its application to two kinds of writes:
- Incoming writes that are deemed "elastic": This can be done by
  introducing a work-class (in addition to admissionpb.WorkPriority), or by
  implying a work-class from the priority (e.g. priorities < NormalPri are
  deemed elastic). This prototype does the latter.
- Optional compactions: We assume that the LSM store is configured with a
  ceiling on number of regular concurrent compactions, and if it needs more
  it can request resources for additional (optional) compactions. These
  latter compactions can be limited by this approach. See
  cockroachdb/pebble/issues/1329 for motivation. This control on compactions
  is not currently implemented and is future work (though the prototype
  in cockroachdb#82813 had code for
  it).

The reader should start with disk_bandwidth.go, consisting of
- diskLoadWatcher: which computes load levels.
- diskBandwidthLimiter: It composes the previous two objects and
  uses load information to limit write tokens for elastic writes
  and limit compactions.

There is significant refactoring and changes in granter.go and
work_queue.go. This is driven by the fact that:
- Previously the tokens were for L0 and now we need to support tokens for
  bytes into L0 and tokens for bytes into the LSM (the former being a subset
  of the latter).
- Elastic work is in a different WorkQueue than regular work, but they
  are competing for the same tokens.

The latter is handled by allowing kvSlotGranter to multiplex across
multiple requesters, via multiple child granters. A number of interfaces
are adjusted to make this viable. In general, the GrantCoordinator
is now slightly dumber and some of that logic is moved into the granters.

For the former (handling two kinds of tokens), I considered adding multiple
resource dimensions to the granter-requester interaction but found it
too complicated. Instead we rely on the observation that we request
tokens based on the total incoming bytes of the request (not just L0),
and when the request is completed, tell the granter how many bytes
went into L0. The latter allows us to return tokens to L0. So at the
time the request is completed, we can account separately for the L0
tokens and these new tokens for all incoming bytes (which we are calling
disk bandwidth tokens, since they are constrained based on disk bandwidth).

This is a cleaned up version of the prototype in
cockroachdb#82813 which contains the
experimental results. The plumbing from the KV layer to populate the
disk reads, writes and provisioned bandwidth is absent in this PR,
and will be added in a subsequent PR.

Disk bandwidth bottlenecks are considered only if both the following
are true:
- DiskStats.ProvisionedBandwidth is non-zero.
- The cluster setting admission.disk_bandwidth_tokens.elastic.enabled
  is true (defaults to true).

Informs cockroachdb#82898

Release note: None (the cluster setting mentioned earlier is useless
since the integration with CockroachDB will be in a future PR).
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Aug 10, 2022
We assume that:
- There is a provisioned known limit on the sum of read and write
  bandwidth. This limit is allowed to change.
- Admission control can only shape the rate of admission of writes. Writes
  also cause reads, since compactions do reads and writes.

There are multiple challenges:
- We are unable to precisely track the causes of disk read bandwidth, since
  we do not have observability into what reads missed the OS page cache.
  That is, we don't know how much of the reads were due to incoming reads
  (that we don't shape) and how much due to compaction read bandwidth.
- We don't shape incoming reads.
- There can be a large time lag between the shaping of incoming writes, and when
  it affects actual writes in the system, since compaction backlog can
  build up in various levels of the LSM store.
- Signals of overload are coarse, since we cannot view all the internal
  queues that can build up due to resource overload. For instance,
  different examples of bandwidth saturation exhibit different
  latency effects, presumably because the queue buildup is different. So it
  is non-trivial to approach full utilization without risking high latency.

Due to these challenges, and previous design attempts that were quite
complicated (and incomplete), we adopt a goal of simplicity of design, and strong
abstraction boundaries.
- The disk load is abstracted using an enum. The diskLoadWatcher can be
  evolved independently.
- The approach uses easy to understand additive increase and multiplicative
  decrease, (unlike what we do for flush and compaction tokens, where we
  try to more precisely calculate the sustainable rates).

Since we are using a simple approach that is somewhat coarse in its behavior,
we start by limiting its application to two kinds of writes:
- Incoming writes that are deemed "elastic": This can be done by
  introducing a work-class (in addition to admissionpb.WorkPriority), or by
  implying a work-class from the priority (e.g. priorities < NormalPri are
  deemed elastic). This prototype does the latter.
- Optional compactions: We assume that the LSM store is configured with a
  ceiling on number of regular concurrent compactions, and if it needs more
  it can request resources for additional (optional) compactions. These
  latter compactions can be limited by this approach. See
  cockroachdb/pebble/issues/1329 for motivation. This control on compactions
  is not currently implemented and is future work (though the prototype
  in cockroachdb#82813 had code for
  it).

The reader should start with disk_bandwidth.go, consisting of
- diskLoadWatcher: which computes load levels.
- diskBandwidthLimiter: It composes the previous two objects and
  uses load information to limit write tokens for elastic writes
  and limit compactions.

There is significant refactoring and changes in granter.go and
work_queue.go. This is driven by the fact that:
- Previously the tokens were for L0 and now we need to support tokens for
  bytes into L0 and tokens for bytes into the LSM (the former being a subset
  of the latter).
- Elastic work is in a different WorkQueue than regular work, but they
  are competing for the same tokens.

The latter is handled by allowing kvSlotGranter to multiplex across
multiple requesters, via multiple child granters. A number of interfaces
are adjusted to make this viable. In general, the GrantCoordinator
is now slightly dumber and some of that logic is moved into the granters.

For the former (handling two kinds of tokens), I considered adding multiple
resource dimensions to the granter-requester interaction but found it
too complicated. Instead we rely on the observation that we request
tokens based on the total incoming bytes of the request (not just L0),
and when the request is completed, tell the granter how many bytes
went into L0. The latter allows us to return tokens to L0. So at the
time the request is completed, we can account separately for the L0
tokens and these new tokens for all incoming bytes (which we are calling
disk bandwidth tokens, since they are constrained based on disk bandwidth).

This is a cleaned up version of the prototype in
cockroachdb#82813 which contains the
experimental results. The plumbing from the KV layer to populate the
disk reads, writes and provisioned bandwidth is absent in this PR,
and will be added in a subsequent PR.

Disk bandwidth bottlenecks are considered only if both the following
are true:
- DiskStats.ProvisionedBandwidth is non-zero.
- The cluster setting admission.disk_bandwidth_tokens.elastic.enabled
  is true (defaults to true).

Informs cockroachdb#82898

Release note: None (the cluster setting mentioned earlier is useless
since the integration with CockroachDB will be in a future PR).
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Aug 10, 2022
We assume that:
- There is a provisioned known limit on the sum of read and write
  bandwidth. This limit is allowed to change.
- Admission control can only shape the rate of admission of writes. Writes
  also cause reads, since compactions do reads and writes.

There are multiple challenges:
- We are unable to precisely track the causes of disk read bandwidth, since
  we do not have observability into what reads missed the OS page cache.
  That is, we don't know how much of the reads were due to incoming reads
  (that we don't shape) and how much due to compaction read bandwidth.
- We don't shape incoming reads.
- There can be a large time lag between the shaping of incoming writes, and when
  it affects actual writes in the system, since compaction backlog can
  build up in various levels of the LSM store.
- Signals of overload are coarse, since we cannot view all the internal
  queues that can build up due to resource overload. For instance,
  different examples of bandwidth saturation exhibit different
  latency effects, presumably because the queue buildup is different. So it
  is non-trivial to approach full utilization without risking high latency.

Due to these challenges, and previous design attempts that were quite
complicated (and incomplete), we adopt a goal of simplicity of design, and strong
abstraction boundaries.
- The disk load is abstracted using an enum. The diskLoadWatcher can be
  evolved independently.
- The approach uses easy to understand additive increase and multiplicative
  decrease, (unlike what we do for flush and compaction tokens, where we
  try to more precisely calculate the sustainable rates).

Since we are using a simple approach that is somewhat coarse in its behavior,
we start by limiting its application to two kinds of writes:
- Incoming writes that are deemed "elastic": This can be done by
  introducing a work-class (in addition to admissionpb.WorkPriority), or by
  implying a work-class from the priority (e.g. priorities < NormalPri are
  deemed elastic). This prototype does the latter.
- Optional compactions: We assume that the LSM store is configured with a
  ceiling on number of regular concurrent compactions, and if it needs more
  it can request resources for additional (optional) compactions. These
  latter compactions can be limited by this approach. See
  cockroachdb/pebble/issues/1329 for motivation. This control on compactions
  is not currently implemented and is future work (though the prototype
  in cockroachdb#82813 had code for
  it).

The reader should start with disk_bandwidth.go, consisting of
- diskLoadWatcher: which computes load levels.
- diskBandwidthLimiter: It used the load level computed by diskLoadWatcher
  to limit write tokens for elastic writes and in the future will also
  limit compactions.

There is significant refactoring and changes in granter.go and
work_queue.go. This is driven by the fact that:
- Previously the tokens were for L0 and now we need to support tokens for
  bytes into L0 and tokens for bytes into the LSM (the former being a subset
  of the latter).
- Elastic work is in a different WorkQueue than regular work, but they
  are competing for the same tokens.

The latter is handled by allowing kvSlotGranter to multiplex across
multiple requesters, via multiple child granters. A number of interfaces
are adjusted to make this viable. In general, the GrantCoordinator
is now slightly dumber and some of that logic is moved into the granters.

For the former (handling two kinds of tokens), I considered adding multiple
resource dimensions to the granter-requester interaction but found it
too complicated. Instead we rely on the observation that we request
tokens based on the total incoming bytes of the request (not just L0),
and when the request is completed, tell the granter how many bytes
went into L0. The latter allows us to return tokens to L0. So at the
time the request is completed, we can account separately for the L0
tokens and these new tokens for all incoming bytes (which we are calling
disk bandwidth tokens, since they are constrained based on disk bandwidth).

This is a cleaned up version of the prototype in
cockroachdb#82813 which contains the
experimental results. The plumbing from the KV layer to populate the
disk reads, writes and provisioned bandwidth is absent in this PR,
and will be added in a subsequent PR.

Disk bandwidth bottlenecks are considered only if both the following
are true:
- DiskStats.ProvisionedBandwidth is non-zero.
- The cluster setting admission.disk_bandwidth_tokens.elastic.enabled
  is true (defaults to true).

Informs cockroachdb#82898

Release note: None (the cluster setting mentioned earlier is useless
since the integration with CockroachDB will be in a future PR).
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Aug 11, 2022
We assume that:
- There is a provisioned known limit on the sum of read and write
  bandwidth. This limit is allowed to change.
- Admission control can only shape the rate of admission of writes. Writes
  also cause reads, since compactions do reads and writes.

There are multiple challenges:
- We are unable to precisely track the causes of disk read bandwidth, since
  we do not have observability into what reads missed the OS page cache.
  That is, we don't know how much of the reads were due to incoming reads
  (that we don't shape) and how much due to compaction read bandwidth.
- We don't shape incoming reads.
- There can be a large time lag between the shaping of incoming writes, and when
  it affects actual writes in the system, since compaction backlog can
  build up in various levels of the LSM store.
- Signals of overload are coarse, since we cannot view all the internal
  queues that can build up due to resource overload. For instance,
  different examples of bandwidth saturation exhibit different
  latency effects, presumably because the queue buildup is different. So it
  is non-trivial to approach full utilization without risking high latency.

Due to these challenges, and previous design attempts that were quite
complicated (and incomplete), we adopt a goal of simplicity of design, and strong
abstraction boundaries.
- The disk load is abstracted using an enum. The diskLoadWatcher can be
  evolved independently.
- The approach uses easy to understand additive increase and multiplicative
  decrease, (unlike what we do for flush and compaction tokens, where we
  try to more precisely calculate the sustainable rates).

Since we are using a simple approach that is somewhat coarse in its behavior,
we start by limiting its application to two kinds of writes:
- Incoming writes that are deemed "elastic": This can be done by
  introducing a work-class (in addition to admissionpb.WorkPriority), or by
  implying a work-class from the priority (e.g. priorities < NormalPri are
  deemed elastic). This prototype does the latter.
- Optional compactions: We assume that the LSM store is configured with a
  ceiling on number of regular concurrent compactions, and if it needs more
  it can request resources for additional (optional) compactions. These
  latter compactions can be limited by this approach. See
  cockroachdb/pebble/issues/1329 for motivation. This control on compactions
  is not currently implemented and is future work (though the prototype
  in cockroachdb#82813 had code for
  it).

The reader should start with disk_bandwidth.go, consisting of
- diskLoadWatcher: which computes load levels.
- diskBandwidthLimiter: It used the load level computed by diskLoadWatcher
  to limit write tokens for elastic writes and in the future will also
  limit compactions.

There is significant refactoring and changes in granter.go and
work_queue.go. This is driven by the fact that:
- Previously the tokens were for L0 and now we need to support tokens for
  bytes into L0 and tokens for bytes into the LSM (the former being a subset
  of the latter).
- Elastic work is in a different WorkQueue than regular work, but they
  are competing for the same tokens.

The latter is handled by allowing kvSlotGranter to multiplex across
multiple requesters, via multiple child granters. A number of interfaces
are adjusted to make this viable. In general, the GrantCoordinator
is now slightly dumber and some of that logic is moved into the granters.

For the former (handling two kinds of tokens), I considered adding multiple
resource dimensions to the granter-requester interaction but found it
too complicated. Instead we rely on the observation that we request
tokens based on the total incoming bytes of the request (not just L0),
and when the request is completed, tell the granter how many bytes
went into L0. The latter allows us to return tokens to L0. So at the
time the request is completed, we can account separately for the L0
tokens and these new tokens for all incoming bytes (which we are calling
disk bandwidth tokens, since they are constrained based on disk bandwidth).

This is a cleaned up version of the prototype in
cockroachdb#82813 which contains the
experimental results. The plumbing from the KV layer to populate the
disk reads, writes and provisioned bandwidth is absent in this PR,
and will be added in a subsequent PR.

Disk bandwidth bottlenecks are considered only if both the following
are true:
- DiskStats.ProvisionedBandwidth is non-zero.
- The cluster setting admission.disk_bandwidth_tokens.elastic.enabled
  is true (defaults to true).

Informs cockroachdb#82898

Release note: None (the cluster setting mentioned earlier is useless
since the integration with CockroachDB will be in a future PR).
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Aug 11, 2022
We assume that:
- There is a provisioned known limit on the sum of read and write
  bandwidth. This limit is allowed to change.
- Admission control can only shape the rate of admission of writes. Writes
  also cause reads, since compactions do reads and writes.

There are multiple challenges:
- We are unable to precisely track the causes of disk read bandwidth, since
  we do not have observability into what reads missed the OS page cache.
  That is, we don't know how much of the reads were due to incoming reads
  (that we don't shape) and how much due to compaction read bandwidth.
- We don't shape incoming reads.
- There can be a large time lag between the shaping of incoming writes, and when
  it affects actual writes in the system, since compaction backlog can
  build up in various levels of the LSM store.
- Signals of overload are coarse, since we cannot view all the internal
  queues that can build up due to resource overload. For instance,
  different examples of bandwidth saturation exhibit different
  latency effects, presumably because the queue buildup is different. So it
  is non-trivial to approach full utilization without risking high latency.

Due to these challenges, and previous design attempts that were quite
complicated (and incomplete), we adopt a goal of simplicity of design, and strong
abstraction boundaries.
- The disk load is abstracted using an enum. The diskLoadWatcher can be
  evolved independently.
- The approach uses easy to understand additive increase and multiplicative
  decrease, (unlike what we do for flush and compaction tokens, where we
  try to more precisely calculate the sustainable rates).

Since we are using a simple approach that is somewhat coarse in its behavior,
we start by limiting its application to two kinds of writes:
- Incoming writes that are deemed "elastic": This can be done by
  introducing a work-class (in addition to admissionpb.WorkPriority), or by
  implying a work-class from the priority (e.g. priorities < NormalPri are
  deemed elastic). This prototype does the latter.
- Optional compactions: We assume that the LSM store is configured with a
  ceiling on number of regular concurrent compactions, and if it needs more
  it can request resources for additional (optional) compactions. These
  latter compactions can be limited by this approach. See
  cockroachdb/pebble/issues/1329 for motivation. This control on compactions
  is not currently implemented and is future work (though the prototype
  in cockroachdb#82813 had code for
  it).

The reader should start with disk_bandwidth.go, consisting of
- diskLoadWatcher: which computes load levels.
- diskBandwidthLimiter: It used the load level computed by diskLoadWatcher
  to limit write tokens for elastic writes and in the future will also
  limit compactions.

There is significant refactoring and changes in granter.go and
work_queue.go. This is driven by the fact that:
- Previously the tokens were for L0 and now we need to support tokens for
  bytes into L0 and tokens for bytes into the LSM (the former being a subset
  of the latter).
- Elastic work is in a different WorkQueue than regular work, but they
  are competing for the same tokens.

The latter is handled by allowing kvSlotGranter to multiplex across
multiple requesters, via multiple child granters. A number of interfaces
are adjusted to make this viable. In general, the GrantCoordinator
is now slightly dumber and some of that logic is moved into the granters.

For the former (handling two kinds of tokens), I considered adding multiple
resource dimensions to the granter-requester interaction but found it
too complicated. Instead we rely on the observation that we request
tokens based on the total incoming bytes of the request (not just L0),
and when the request is completed, tell the granter how many bytes
went into L0. The latter allows us to return tokens to L0. So at the
time the request is completed, we can account separately for the L0
tokens and these new tokens for all incoming bytes (which we are calling
disk bandwidth tokens, since they are constrained based on disk bandwidth).

This is a cleaned up version of the prototype in
cockroachdb#82813 which contains the
experimental results. The plumbing from the KV layer to populate the
disk reads, writes and provisioned bandwidth is absent in this PR,
and will be added in a subsequent PR.

Disk bandwidth bottlenecks are considered only if both the following
are true:
- DiskStats.ProvisionedBandwidth is non-zero.
- The cluster setting admission.disk_bandwidth_tokens.elastic.enabled
  is true (defaults to true).

Informs cockroachdb#82898

Release note: None (the cluster setting mentioned earlier is useless
since the integration with CockroachDB will be in a future PR).
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Aug 11, 2022
We assume that:
- There is a provisioned known limit on the sum of read and write
  bandwidth. This limit is allowed to change.
- Admission control can only shape the rate of admission of writes. Writes
  also cause reads, since compactions do reads and writes.

There are multiple challenges:
- We are unable to precisely track the causes of disk read bandwidth, since
  we do not have observability into what reads missed the OS page cache.
  That is, we don't know how much of the reads were due to incoming reads
  (that we don't shape) and how much due to compaction read bandwidth.
- We don't shape incoming reads.
- There can be a large time lag between the shaping of incoming writes, and when
  it affects actual writes in the system, since compaction backlog can
  build up in various levels of the LSM store.
- Signals of overload are coarse, since we cannot view all the internal
  queues that can build up due to resource overload. For instance,
  different examples of bandwidth saturation exhibit different
  latency effects, presumably because the queue buildup is different. So it
  is non-trivial to approach full utilization without risking high latency.

Due to these challenges, and previous design attempts that were quite
complicated (and incomplete), we adopt a goal of simplicity of design, and strong
abstraction boundaries.
- The disk load is abstracted using an enum. The diskLoadWatcher can be
  evolved independently.
- The approach uses easy to understand small multiplicative increase and
  large multiplicative decrease, (unlike what we do for flush and compaction
  tokens, where we try to more precisely calculate the sustainable rates).

Since we are using a simple approach that is somewhat coarse in its behavior,
we start by limiting its application to two kinds of writes:
- Incoming writes that are deemed "elastic": This can be done by
  introducing a work-class (in addition to admissionpb.WorkPriority), or by
  implying a work-class from the priority (e.g. priorities < NormalPri are
  deemed elastic). This prototype does the latter.
- Optional compactions: We assume that the LSM store is configured with a
  ceiling on number of regular concurrent compactions, and if it needs more
  it can request resources for additional (optional) compactions. These
  latter compactions can be limited by this approach. See
  cockroachdb/pebble/issues/1329 for motivation. This control on compactions
  is not currently implemented and is future work (though the prototype
  in cockroachdb#82813 had code for
  it).

The reader should start with disk_bandwidth.go, consisting of
- diskLoadWatcher: which computes load levels.
- diskBandwidthLimiter: It used the load level computed by diskLoadWatcher
  to limit write tokens for elastic writes and in the future will also
  limit compactions.

There is significant refactoring and changes in granter.go and
work_queue.go. This is driven by the fact that:
- Previously the tokens were for L0 and now we need to support tokens for
  bytes into L0 and tokens for bytes into the LSM (the former being a subset
  of the latter).
- Elastic work is in a different WorkQueue than regular work, but they
  are competing for the same tokens. A different WorkQueue is needed to
  prevent a situation where elastic work for one tenant is queued ahead
  of regualar work from another tenant, and stops the latter from making
  progress due to lack of elastic tokens.

The latter is handled by allowing kvSlotGranter to multiplex across
multiple requesters, via multiple child granters. A number of interfaces
are adjusted to make this viable. In general, the GrantCoordinator
is now slightly dumber and some of that logic is moved into the granters.

For the former (handling two kinds of tokens), I considered adding multiple
resource dimensions to the granter-requester interaction but found it
too complicated. Instead we rely on the observation that we request
tokens based on the total incoming bytes of the request (not just L0),
and when the request is completed, tell the granter how many bytes
went into L0. The latter allows us to return tokens to L0. So at the
time the request is completed, we can account separately for the L0
tokens and these new tokens for all incoming bytes (which we are calling
disk bandwidth tokens, since they are constrained based on disk bandwidth).

This is a cleaned up version of the prototype in
cockroachdb#82813 which contains the
experimental results. The plumbing from the KV layer to populate the
disk reads, writes and provisioned bandwidth is absent in this PR,
and will be added in a subsequent PR.

Disk bandwidth bottlenecks are considered only if both the following
are true:
- DiskStats.ProvisionedBandwidth is non-zero.
- The cluster setting admission.disk_bandwidth_tokens.elastic.enabled
  is true (defaults to true).

Informs cockroachdb#82898

Release note: None (the cluster setting mentioned earlier is useless
since the integration with CockroachDB will be in a future PR).
craig bot pushed a commit that referenced this issue Aug 12, 2022
85722: admission: add support for disk bandwidth as a bottleneck resource r=tbg,irfansharif a=sumeerbhola

We assume that:
- There is a provisioned known limit on the sum of read and write
  bandwidth. This limit is allowed to change.
- Admission control can only shape the rate of admission of writes. Writes
  also cause reads, since compactions do reads and writes.

There are multiple challenges:
- We are unable to precisely track the causes of disk read bandwidth, since
  we do not have observability into what reads missed the OS page cache.
  That is, we don't know how much of the reads were due to incoming reads
  (that we don't shape) and how much due to compaction read bandwidth.
- We don't shape incoming reads.
- There can be a large time lag between the shaping of incoming writes, and when
  it affects actual writes in the system, since compaction backlog can
  build up in various levels of the LSM store.
- Signals of overload are coarse, since we cannot view all the internal
  queues that can build up due to resource overload. For instance,
  different examples of bandwidth saturation exhibit different
  latency effects, presumably because the queue buildup is different. So it
  is non-trivial to approach full utilization without risking high latency.

Due to these challenges, and previous design attempts that were quite
complicated (and incomplete), we adopt a goal of simplicity of design, and strong
abstraction boundaries.
- The disk load is abstracted using an enum. The diskLoadWatcher can be
  evolved independently.
- The approach uses easy to understand small multiplicative increase and 
  large multiplicative decrease, (unlike what we do for flush and compaction 
  tokens, where we try to more precisely calculate the sustainable rates).

Since we are using a simple approach that is somewhat coarse in its behavior,
we start by limiting its application to two kinds of writes:
- Incoming writes that are deemed "elastic": This can be done by
  introducing a work-class (in addition to admissionpb.WorkPriority), or by
  implying a work-class from the priority (e.g. priorities < NormalPri are
  deemed elastic). This prototype does the latter.
- Optional compactions: We assume that the LSM store is configured with a
  ceiling on number of regular concurrent compactions, and if it needs more
  it can request resources for additional (optional) compactions. These
  latter compactions can be limited by this approach. See
  cockroachdb/pebble#1329 for motivation. This control on compactions
  is not currently implemented and is future work (though the prototype
  in #82813 had code for
  it).

The reader should start with disk_bandwidth.go, consisting of
- diskLoadWatcher: which computes load levels.
- diskBandwidthLimiter: It used the load level computed by diskLoadWatcher
   to limit write tokens for elastic writes and in the future will also
   limit compactions.

There is significant refactoring and changes in granter.go and
work_queue.go. This is driven by the fact that:
- Previously the tokens were for L0 and now we need to support tokens for
  bytes into L0 and tokens for bytes into the LSM (the former being a subset
  of the latter).
- Elastic work is in a different WorkQueue than regular work, but they
  are competing for the same tokens. A different WorkQueue is needed to
  prevent a situation where elastic work for one tenant is queued ahead
  of regualar work from another tenant, and stops the latter from making
  progress due to lack of elastic tokens.

The latter is handled by allowing kvSlotGranter to multiplex across
multiple requesters, via multiple child granters. A number of interfaces
are adjusted to make this viable. In general, the GrantCoordinator
is now slightly dumber and some of that logic is moved into the granters.

For the former (handling two kinds of tokens), I considered adding multiple
resource dimensions to the granter-requester interaction but found it
too complicated. Instead we rely on the observation that we request
tokens based on the total incoming bytes of the request (not just L0),
and when the request is completed, tell the granter how many bytes
went into L0. The latter allows us to return tokens to L0. So at the
time the request is completed, we can account separately for the L0
tokens and these new tokens for all incoming bytes (which we are calling
disk bandwidth tokens, since they are constrained based on disk bandwidth).

This is a cleaned up version of the prototype in
#82813 which contains the
experimental results. The plumbing from the KV layer to populate the
disk reads, writes and provisioned bandwidth is absent in this PR,
and will be added in a subsequent PR.

Disk bandwidth bottlenecks are considered only if both the following
are true:
- DiskStats.ProvisionedBandwidth is non-zero.
- The cluster setting admission.disk_bandwidth_tokens.elastic.enabled
  is true (defaults to true).

Informs #82898

Release note: None (the cluster setting mentioned earlier is useless
since the integration with CockroachDB will be in a future PR).

85786: sql: support UDFs with named args, strictness, and volatility r=mgartner a=mgartner

#### sql: UDF with empty result should evaluate to NULL

If the last statement in a UDF returns no rows, the UDF will evaluate to
NULL. Prior to this commit the evaluation of the UDF would panic.

Release note: None

#### sql: support UDFs with named arguments

UDFs with named arguments can now be evaluated.

During query planning, statements in the function body are built with a
scope that includes the named arguments for the function as columns.
This allows references to arguments to be resolved as variables.

During evaluation, the input expressions are first evaluated into
datums. When a plan is built for each statement in the UDF, the argument
columns in the expression are replaced with the input datums before the
expression is optimized.

Note that anonymous arguments and integer references to arguments (e.g.,
`$1`) are not yet supported.

Also, the formatting of `UDFExpr`s has been improved to show argument
columns and input expressions.

Release note: None

#### sql: do not evaluate strict UDFs if any input values are NULL

A UDF can have one of two behaviors when it is invoked with NULL inputs:

  1. If the UDF is `CALLED ON NULL INPUT` (the default) then the
     function is evaluated regardless of whether or not any of the input
     values are NULL.
  2. If the UDF `RETURNS NULL ON NULL INPUT` or is `STRICT` then the
     function is not evaluated if any of the input values are NULL.
     Instead, the function directly results in NULL.

This commit implements these two behaviors.

In the future, we can add a normalization rule that folds a strict UDF
if any of its inputs are constant NULL values.

Release note: None

#### sql: make mutations visible to volatile UDFs

The volatility of a UDF affects the visibility of mutations made by the
statement calling the function. A volatile function will see these
mutations. Also, statements within a volatile function's body will see
changes made by previous statements the function body (note that this is
left untested in this commit because we do not currently support
mutations within UDF bodies). In contrast, a stable, immutable, or
leakproof function will see a snapshot of the data as of the start of
the statement calling the function.

Release note: None


Co-authored-by: sumeerbhola <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Aug 15, 2022
A previous PR cockroachdb#85722 added support for disk bandwidth as a bottlneck
resource, in the admission control package. To utilize this, admission
control needs to be provided the provisioned bandwidth and the observed
read and write bytes. This PR adds configuration support for this via
the StoreSpec (that uses the --store flag). The StoreSpec now has an
optional ProvisionedRateSpec that contains the name of the disk
corresponding to the store, and an optional provisioned bandwidth,
that are specified as
provisioned-rate=name=<disk-name>[:bandwidth=<bandwidth-bytes>].

The disk-name is used to map the DiskStats, retrieved via the existing
code in status.GetDiskCounters to the correct Store. These DiskStats
contain the read and write bytes. The optional bandwidth is used to
override the provisioned bandwidth set via the new cluster setting
kv.store.admission.provisioned_bandwidth.

Release note (ops change): Disk bandwidth constraint can now be
used to control admission of elastic writes. This requires configuration
for each store, via the --store flag, that now contains a
provisioned-rate field. The provisioned-rate field needs to provide
a disk-name for the store and optionally a disk bandwidth. If the
disk bandwidth is not provided the cluster setting
kv.store.admission.provisioned_bandwidth will be used. The cluster
setting defaults to 0. If the effective disk bandwidth, i.e., using
the possibly overridden cluster setting is 0, there is no disk
bandwidth constraint. Additionally, the admission control cluster
setting admission.disk_bandwidth_tokens.elastic.enabled (defaults
to true) can be used to turn off enforcement even when all the other
configuration has been setup. Turning off enforcement will still
output all the relevant information about disk bandwidth usage, so
can be used to observe part of the mechanism in action.

Fixes cockroachdb#82898
craig bot pushed a commit that referenced this issue Aug 25, 2022
85392: streamingccl: support sst event in random stream client r=gh-casper a=gh-casper

Previously random stream client lacks test coverage
for AddSSTable operation, this PR enables random stream
client to generate and validate random SSTableEvents.

This PR also cleans up the InterceptableStreamClient
to avoid unnecessary abstraction, making code cleaner.

Release justification: low risk, high benefit changes to 
existing functionality

Release note : None

86063: base,kvserver,server: configuration of provisioned bandwidth for a store r=irfansharif a=sumeerbhola

A previous PR #85722 added support for disk bandwidth as a bottlneck
resource, in the admission control package. To utilize this, admission
control needs to be provided the provisioned bandwidth and the observed
read and write bytes. This PR adds configuration support for this via
the StoreSpec (that uses the --store flag). The StoreSpec now has an
optional ProvisionedRateSpec that contains the name of the disk
corresponding to the store, and an optional provisioned bandwidth,
that are specified as
provisioned-rate=name=<disk-name>[:bandwidth=<bandwidth-bytes>].

The disk-name is used to map the DiskStats, retrieved via the existing
code in status.GetDiskCounters to the correct Store. These DiskStats
contain the read and write bytes. The optional bandwidth is used to
override the provisioned bandwidth set via the new cluster setting
kv.store.admission.provisioned_bandwidth.

Fixes #82898

Release note (ops change): Disk bandwidth constraint can now be
used to control admission of elastic writes. This requires configuration
for each store, via the --store flag, that now contains an optional
provisioned-rate field. The provisioned-rate field, if specified,
needs to provide a disk-name for the store and optionally a disk
bandwidth. If the disk bandwidth is not provided the cluster setting
kv.store.admission.provisioned_bandwidth will be used. The cluster
setting defaults to 0 (which means that the disk bandwidth constraint
is disabled). If the effective disk bandwidth, i.e., using the possibly
overridden cluster setting is 0, there is no disk bandwidth constraint.
Additionally, the admission control cluster setting
admission.disk_bandwidth_tokens.elastic.enabled (defaults to true)
can be used to turn off enforcement even when all the other
configuration has been setup. Turning off enforcement will still
output all the relevant information about disk bandwidth usage, so
can be used to observe part of the mechanism in action.
To summarize, to enable this for a cluster with homogenous disk,
provide a disk-name in the provisioned-rate field in the store-spec,
and set the kv.store.admission.provisioned_bandwidth cluster setting
to the bandwidth limit. To only get information about disk bandwidth
usage by elastic traffic (currently via logs, not metrics), do the
above but also set admission.disk_bandwidth_tokens.elastic.enabled
to false.

Release justification: Low risk, high benefit change that allows
an operator to enable new functionality (disabled by default).


86558: spanconfigtestutils: copy gc job testing knobs to tenant r=ajwerner a=stevendanna

Some of the sqltranslator tests require that we can disable the GC job
related to an index creation so that we can observe the span
configuration on the temporary indexes.  We pause the GC job via a
testing knob.

Fixes #85507

Release justification: Test-only change

Release note: None

86596: ui/cluster-ui: create statements insights api r=xinhaoz a=xinhaoz

This commit adds a function to the `cluster-ui` pkg that queries
`crdb_internal.cluster_execution_insights` to surface slow running
queries. The information retrieved  is intended to be used in the
insights statement overview and details pages. A new field
`statementInsights` is added to the `cachedData` object in the
db-console redux store, and the corresponding function to issue the
data fetch is also added in `apiReducers`. This commit does not add
any reducers  or sagas to CC to fetch and store this data.

Release justification: non-production code change
Release note: None

86684: release: update cockroachdb version to v22.1.6 r=rhu713 a=rail

Release justification: not a part of the build
Release note: None

Co-authored-by: Casper <[email protected]>
Co-authored-by: sumeerbhola <[email protected]>
Co-authored-by: Steven Danna <[email protected]>
Co-authored-by: Xin Hao Zhang <[email protected]>
Co-authored-by: Rail Aliiev <[email protected]>
@craig craig bot closed this as completed in 833a031 Aug 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-admission-control C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-storage Storage Team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants