Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: add x-region, x-zone Raft msg metrics to Store #105122

Merged
merged 1 commit into from
Jun 30, 2023

Conversation

wenyihu6
Copy link
Contributor

@wenyihu6 wenyihu6 commented Jun 18, 2023

Previously, there were no metrics to observe cross-region, cross-zone traffic in
raft messages requests sent / received at each store.

To improve this issue, this commit adds six new store metrics -

"raft.rcvd.bytes"
"raft.sent.bytes"
"raft.rcvd.cross_region.bytes"
"raft.sent.cross_region.bytes"
"raft.rcvd.cross_zone.bytes"
"raft.sent.cross_zone.bytes"

The first two metrics track the total byte of raft messages received and sent in
a store. Additionally, there are four metrics to track the aggregate byte count
of cross-region, cross-zone Raft messages sent and received in the store.

Note that these metrics capture the byte count of requests immediately upon
message reception and just prior to message transmission. In the case of
messages containing heartbeats or heartbeat_resps, they capture the byte count
of requests with coalesced heartbeats.

To facilitate metrics updating, this commit also introduces a new raft message
handler interface OutgoingRaftMessageHandler. This interface captures outgoing
messages right before they are sent to raftSendQueue. Note that the message
may not be successfully queued if the outgoing queue is full.

Resolves: #103983

Release note (ops change): Six new metrics -
"raft.rcvd.bytes"
"raft.sent.bytes"
"raft.rcvd.cross_region.bytes"
"raft.sent.cross_region.bytes"
"raft.rcvd.cross_zone.bytes"
"raft.sent.cross_zone.bytes" - have now been added for each store.

For accurate metrics, follow these assumptions:

  • Configure region and zone tier keys consistently across nodes.
  • Within a node locality, ensure unique region and zone tier keys.
  • Maintain consistent configuration of region and zone tiers across nodes.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@wenyihu6 wenyihu6 force-pushed the outgoing-handler branch 3 times, most recently from 7aa20ec to eb9fe42 Compare June 20, 2023 02:20
@wenyihu6 wenyihu6 force-pushed the outgoing-handler branch 2 times, most recently from 7383221 to 9f95c9c Compare June 20, 2023 04:47
wenyihu6 added a commit to wenyihu6/cockroach that referenced this pull request Jun 20, 2023
The commit renames `RaftMessageHandler` to `IncomingRaftMessageHandler`. Another
PR is introducing a new interface `OutgoingRaftMessageHandler`, dedicated to
managing messages sent. The main purpose of this PR is to make the future PR
cleaner by handling the renaming process. Note that this commit does not change
any existing functionality.

Part of: cockroachdb#103983
Related: cockroachdb#105122
Release Note: None
@wenyihu6 wenyihu6 marked this pull request as ready for review June 20, 2023 05:09
@wenyihu6 wenyihu6 requested a review from a team as a code owner June 20, 2023 05:09
@wenyihu6 wenyihu6 requested a review from a team June 20, 2023 05:09
@wenyihu6 wenyihu6 requested a review from a team as a code owner June 20, 2023 05:09
@wenyihu6 wenyihu6 requested a review from tbg June 20, 2023 05:09
wenyihu6 added a commit to wenyihu6/cockroach that referenced this pull request Jun 20, 2023
The commit renames `RaftMessageHandler` to `IncomingRaftMessageHandler`. Another
PR is introducing a new interface `OutgoingRaftMessageHandler`, dedicated to
managing messages sent. The main purpose of this PR is to make the future PR
cleaner by handling the renaming process. Note that this commit does not change
any existing functionality.

Part of: cockroachdb#103983
Related: cockroachdb#105122
Release Note: None
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking pretty good (only looked at last commit here)! Some of the same ideas I had in the base PRs apply here as well. It probably makes sense to settle these dependencies first, after that I think this PR will be in pretty good shape automatically.

pkg/kv/kvserver/client_raft_helpers_test.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/client_raft_test.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/client_raft_test.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/client_raft_test.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/store_raft.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/store_raft.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/store_raft.go Show resolved Hide resolved
@wenyihu6 wenyihu6 requested a review from kvoli June 20, 2023 15:07
@wenyihu6 wenyihu6 force-pushed the outgoing-handler branch 2 times, most recently from b894549 to 026ab0c Compare June 21, 2023 12:28
@wenyihu6 wenyihu6 requested a review from tbg June 21, 2023 16:51
wenyihu6 added a commit to wenyihu6/cockroach that referenced this pull request Jun 25, 2023
The commit renames `RaftMessageHandler`, `Listen`, and `Stop` to
`IncomingRaftMessageHandler`, `ListenIncomingRaftMessages`, and
`StopIncomingRaftMessages`. Another PR is introducing a new interface
`OutgoingRaftMessageHandler`, dedicated to managing messages sent. The main
purpose of this PR is to make the future PR cleaner by handling the renaming
process. Note that this commit does not change any existing functionality.

Part of: cockroachdb#103983

Related: cockroachdb#105122

Release Note: None
wenyihu6 added a commit to wenyihu6/cockroach that referenced this pull request Jun 25, 2023
The commit renames `RaftMessageHandler`, `Listen`, and `Stop` to
`IncomingRaftMessageHandler`, `ListenIncomingRaftMessages`, and
`StopIncomingRaftMessages`. Another PR is introducing a new interface
`OutgoingRaftMessageHandler`, dedicated to managing messages sent. The main
purpose of this PR is to make the future PR cleaner by handling the renaming
process. Note that this commit does not change any existing functionality.

Part of: cockroachdb#103983

Related: cockroachdb#105122

Release Note: None
@wenyihu6
Copy link
Contributor Author

pkg/kv/kvserver/raft_transport.go line 386 at r8 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

I didn’t rename getHandler primarily because I didn’t want to cause too many changes on the existing interface. I was also worried that this would lead to inconsistency. If we decide to rename everything to Incoming, we should probably change Listen, Stop to ListenIncoming, StopIncoming as well. What do you think?

Update: decided to just rename everything and have updated the rename PR

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, almost there! Main point is trimming down the tests.

pkg/kv/kvserver/raft_transport.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/raft_transport.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/store_raft.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/client_raft_test.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/metrics.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/metrics.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/metrics.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/metrics.go Show resolved Hide resolved
pkg/kv/kvserver/metrics.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/metrics.go Outdated Show resolved Hide resolved
craig bot pushed a commit that referenced this pull request Jun 26, 2023
105123: kvserver: rename RaftMessageHandler to IncomingRaftMessageHandler r=tbg,kvoli a=wenyihu6

The commit renames `RaftMessageHandler`, `Listen`, and `Stop` to
`IncomingRaftMessageHandler`, `ListenIncomingRaftMessages`, and
`StopIncomingRaftMessages`. Another PR is introducing a new interface
`OutgoingRaftMessageHandler`, dedicated to managing messages sent. The main
purpose of this PR is to make the future PR cleaner by handling the renaming
process. Note that this commit does not change any existing functionality.

Part of: #103983

Related: #105122

Release Note: None

105519: kvserver: fix and re-enable slow lease application logging r=erikgrinaker a=erikgrinaker

Resolves #97209.

Epic: none
Release note: None

Co-authored-by: Wenyi <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
@wenyihu6 wenyihu6 force-pushed the outgoing-handler branch 5 times, most recently from 6c0f4e8 to 40368c6 Compare June 27, 2023 02:29
@wenyihu6 wenyihu6 requested a review from tbg June 27, 2023 02:39
@wenyihu6
Copy link
Contributor Author

Looking good, almost there! Main point is trimming down the tests.

Ack. Replaced the end to end test with one integration test for handlers and one unit test for cross locality metrics.

pkg/kv/kvserver/store_snapshot.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/client_raft_helpers_test.go Show resolved Hide resolved
require.Equal(t, expected, actual)
})

t.Run("sent raft message", func(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Subtests need to be independent of each other, I think these ones aren't. So either make this just one test (remove the subtests) or set up the environment anew in each subtest.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking it back, they are independent due to the use of diffs. So this is "fine", I still think my top-level comment about merging the tests makes sense - could always have overlooked something.

@tbg
Copy link
Member

tbg commented Jun 30, 2023

Thanks for the updates! The code looks good, the test could be improved somewhat. In the cross-region specific test, you have a nice tabledriven setup. But then you also have the other test, which tests a few more slices of the stack (in a way that seems to work out really well) but which doesn't have nice tabledriven layout. Would you consider consolidating the two? I'm thinking ditch the non-tabledriven test, moving its coverage into the tabledriven test instead and renaming that test so that it is a bit more general (testing all send/receive metrics that have to do w the raft transport). After that (and my other trivial comments above), this is good to 🚀 !

Approving preemptively.

require.Equal(t, expected, actual)
})

t.Run("sent raft message", func(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking it back, they are independent due to the use of diffs. So this is "fine", I still think my top-level comment about merging the tests makes sense - could always have overlooked something.

Previously, there were no metrics to observe cross-region, cross-zone traffic in
raft messages requests sent / received at each store.

To improve this issue, this commit adds six new store metrics -

```
"raft.rcvd.bytes"
"raft.sent.bytes"
"raft.rcvd.cross_region.bytes"
"raft.sent.cross_region.bytes"
"raft.rcvd.cross_zone.bytes"
"raft.sent.cross_zone.bytes"
```

The first two metrics track the total byte of raft messages received and sent in
a store. Additionally, there are four metrics to track the aggregate byte count
of cross-region, cross-zone Raft messages sent and received in the store.

Note that these metrics capture the byte count of requests immediately upon
message reception and just prior to message transmission. In the case of
messages containing heartbeats or heartbeat_resps, they capture the byte count
of requests with coalesced heartbeats.

To facilitate metrics updating, this commit also introduces a new raft message
handler interface `OutgoingRaftMessageHandler`. This interface captures outgoing
messages right before they are sent to `raftSendQueue`. Note that the message
may not be successfully queued if the outgoing queue is full.

Resolves: cockroachdb#103983

Release note (ops change): Six new metrics -
"raft.rcvd.bytes"
"raft.sent.bytes"
"raft.rcvd.cross_region.bytes"
"raft.sent.cross_region.bytes"
"raft.rcvd.cross_zone.bytes"
"raft.sent.cross_zone.bytes" - have now been added for each store.

For accurate metrics, follow these assumptions:
- Configure region and zone tier keys consistently across nodes.
- Within a node locality, ensure unique region and zone tier keys.
- Maintain consistent configuration of region and zone tiers across nodes.
@wenyihu6
Copy link
Contributor Author

Thanks for the updates! The code looks good, the test could be improved somewhat. In the cross-region specific test, you have a nice tabledriven setup. But then you also have the other test, which tests a few more slices of the stack (in a way that seems to work out really well) but which doesn't have nice tabledriven layout. Would you consider consolidating the two? I'm thinking ditch the non-tabledriven test, moving its coverage into the tabledriven test instead and renaming that test so that it is a bit more general (testing all send/receive metrics that have to do w the raft transport). After that (and my other trivial comments above), this is good to 🚀 !

Approving preemptively.

Discussed this over slack -
The two tests couldn't be merged easily since passing in locality info directly to HandleRaftRequest and HandleRaftResponse is not possible. They both require getNodeLocality to check if it is cross region or cross zone. And to obtain this information, we would need to go back to the e2e tests as they just can't the locality info of a node without setting up a cluster iiuc. So we decided to just keep the way it is.

The last push was to rebase on master and cherry pick the last commit. Will merge when CI goes green.

@wenyihu6
Copy link
Contributor Author

Thanks a lot for the review!

bors r=tbg

@craig
Copy link
Contributor

craig bot commented Jun 30, 2023

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kv: add metrics for cross-region, cross-zone batch requests / responses, snapshots, raft activities
4 participants