VAULT-31755: Add removed and HA health to the sys/health endpoint #28991

Open
wants to merge 6 commits into
base: main

miagilepner
Contributor

Description

This PR adds two new statuses to the sys/health endpoint: removed and HA unhealthy.

  • If a node has been removed from the cluster, its status code will be 530 (or the value from the removedcode query parameter)
  • If a node has missed 2 request forwarding heartbeats, its status code will be 474 (or the value from the haunhealthycode query parameter)
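
For anyone wiring this into a load balancer or monitoring check, here is a minimal sketch of how a client might override and interpret the two new status codes. The query parameter names and default codes come from the description above; the `/v1/sys/health` path prefix, the helper names `healthURL` and `classify`, and the stub server are assumptions for illustration only, and the stub is not Vault.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
	"strconv"
)

// Default status codes as stated in the PR description.
const (
	defaultRemovedCode     = 530
	defaultHAUnhealthyCode = 474
)

// healthURL builds a sys/health URL that sets both new query parameters.
// The "/v1/sys/health" prefix is an assumption about the deployment.
func healthURL(addr string, removedCode, haUnhealthyCode int) string {
	q := url.Values{}
	q.Set("removedcode", strconv.Itoa(removedCode))
	q.Set("haunhealthycode", strconv.Itoa(haUnhealthyCode))
	return addr + "/v1/sys/health?" + q.Encode()
}

// classify is a hypothetical helper that maps a response status code to a
// label, given the codes the caller asked the server to use.
func classify(status, removedCode, haUnhealthyCode int) string {
	switch status {
	case removedCode:
		return "removed"
	case haUnhealthyCode:
		return "ha-unhealthy"
	default:
		return "other"
	}
}

func main() {
	// Stub server standing in for a removed node: it echoes back whatever
	// removedcode the caller requested. This is NOT Vault's handler.
	stub := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		code, err := strconv.Atoi(r.URL.Query().Get("removedcode"))
		if err != nil {
			code = defaultRemovedCode
		}
		w.WriteHeader(code)
	}))
	defer stub.Close()

	resp, err := http.Get(healthURL(stub.URL, 599, defaultHAUnhealthyCode))
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println(classify(resp.StatusCode, 599, defaultHAUnhealthyCode))
}
```

The check order in `classify` (removed first, then HA unhealthy) is a choice made for this sketch; the description above does not say which status takes precedence when both conditions hold.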

The health response has new fields:

type HealthResponse struct {
...
  RemovedFromCluster  *bool    // present when raft is the HA/storage backend
  HAConnectionHealthy *bool    // present when the node is a standby/perf standby

  LastRequestForwardingHeartbeatMillis int64   // non-zero when the node has completed at least one request forwarding heartbeat 
}
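
Since the two boolean fields are pointers, a consumer has to distinguish "not reported" from "false". The sketch below shows one way to do that when decoding a response body. The struct is a trimmed, hypothetical copy of the fields listed above, and the JSON key names are assumptions made for illustration; they are not confirmed by this description.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// healthFields mirrors only the new fields. The JSON keys are assumed.
type healthFields struct {
	RemovedFromCluster                   *bool `json:"removed_from_cluster,omitempty"`
	HAConnectionHealthy                  *bool `json:"ha_connection_healthy,omitempty"`
	LastRequestForwardingHeartbeatMillis int64 `json:"last_request_forwarding_heartbeat_ms,omitempty"`
}

// parse decodes a response body into healthFields.
func parse(body []byte) (healthFields, error) {
	var h healthFields
	err := json.Unmarshal(body, &h)
	return h, err
}

// describe renders an optional boolean, treating nil as "not reported".
func describe(name string, v *bool) string {
	if v == nil {
		return name + ": not reported"
	}
	return fmt.Sprintf("%s: %t", name, *v)
}

func main() {
	// Example body for a node that reports removal status but no HA health.
	h, err := parse([]byte(`{"removed_from_cluster": false}`))
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(describe("removed", h.RemovedFromCluster))
	fmt.Println(describe("ha healthy", h.HAConnectionHealthy))
}
```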

TODO only if you're a HashiCorp employee

  • Backport Labels: If this fix needs to be backported, use the appropriate backport/ label that matches the desired release branch. Note that in the CE repo, the latest release branch will look like backport/x.x.x, but older release branches will be backport/ent/x.x.x+ent.
    • LTS: If this fixes a critical security vulnerability or severity 1 bug, it will also need to be backported to the current LTS versions of Vault. To ensure this, use all available enterprise labels.
  • ENT Breakage: If this PR either 1) removes a public function OR 2) changes the signature
    of a public function, even if that change is in a CE file, double check that
    applying the patch for this PR to the ENT repo and running tests doesn't
    break any tests. Sometimes ENT only tests rely on public functions in CE
    files.
  • Jira: If this change has an associated Jira, it's referenced either
    in the PR description, commit message, or branch name.
  • RFC: If this change has an associated RFC, please link it in the description.
  • ENT PR: If this change has an associated ENT PR, please link it in the
    description. Also, make sure the changelog is in this PR, not in your ENT PR.

@miagilepner miagilepner added this to the 1.19.0-rc milestone Nov 22, 2024
@github-actions github-actions bot added the hashicorp-contributed-pr If the PR is HashiCorp (i.e. not-community) contributed label Nov 22, 2024
github-actions bot commented Nov 22, 2024

CI Results:
All Go tests succeeded! ✅

@miagilepner miagilepner marked this pull request as ready for review November 22, 2024 16:04
@miagilepner miagilepner requested a review from a team as a code owner November 22, 2024 16:04
Build Results:
All builds succeeded! ✅

Contributor

@kubawi kubawi left a comment

LGTM!

Comment on lines +119 to +136
// Partition forces the inmem layer to disconnect itself from peers and prevents
// creating new connections. The returned function will add all peers back
// and re-enable connections
func (l *InmemLayer) Partition() (unpartition func()) {
	l.l.Lock()
	peersCopy := make([]*InmemLayer, 0, len(l.peers))
	for _, peer := range l.peers {
		peersCopy = append(peersCopy, peer)
	}
	l.l.Unlock()
	l.DisconnectAll()
	return func() {
		for _, peer := range peersCopy {
			l.Connect(peer)
		}
	}
}

Contributor

Very neat!
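
To make the snapshot-and-restore pattern in Partition easier to follow outside the test cluster machinery, here is a self-contained sketch. `toyLayer` and all of its methods are hypothetical stand-ins written for this example; they are not the real `cluster.InmemLayer`. Connections here are one-directional, and the "prevents creating new connections" part of Partition is not modeled.

```go
package main

import (
	"fmt"
	"sync"
)

// toyLayer is a hypothetical stand-in for an in-memory network layer.
type toyLayer struct {
	mu    sync.Mutex
	addr  string
	peers map[string]*toyLayer
}

func newToyLayer(addr string) *toyLayer {
	return &toyLayer{addr: addr, peers: make(map[string]*toyLayer)}
}

// Connect registers peer on l (one direction only, to keep the sketch small).
func (l *toyLayer) Connect(peer *toyLayer) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.peers[peer.addr] = peer
}

// DisconnectAll drops every registered peer.
func (l *toyLayer) DisconnectAll() {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.peers = make(map[string]*toyLayer)
}

// PeerCount reports how many peers are currently registered.
func (l *toyLayer) PeerCount() int {
	l.mu.Lock()
	defer l.mu.Unlock()
	return len(l.peers)
}

// Partition snapshots the current peers under the lock, disconnects them,
// and returns a function that reconnects the snapshot.
func (l *toyLayer) Partition() (unpartition func()) {
	l.mu.Lock()
	peersCopy := make([]*toyLayer, 0, len(l.peers))
	for _, peer := range l.peers {
		peersCopy = append(peersCopy, peer)
	}
	l.mu.Unlock()
	l.DisconnectAll()
	return func() {
		for _, peer := range peersCopy {
			l.Connect(peer)
		}
	}
}

func main() {
	a, b, c := newToyLayer("a"), newToyLayer("b"), newToyLayer("c")
	a.Connect(b)
	a.Connect(c)

	unpartition := a.Partition()
	fmt.Println("peers while partitioned:", a.PeerCount())

	unpartition()
	fmt.Println("peers after unpartition:", a.PeerCount())
}
```

Copying the peers before releasing the lock matters because DisconnectAll takes the same lock; holding it across that call would deadlock with a non-reentrant mutex.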

@@ -977,6 +977,10 @@ func (c *TestClusterCore) ClusterListener() *cluster.Listener {
return c.getClusterListener()
}

func (c *TestClusterCore) NetworkLayer() cluster.NetworkLayer {
Contributor

Hey, I made an identical getter function in a yet-unmerged PR, and wrote this comment out, so I figured I might as well suggest the godoc here 😄

I know that the function itself is super simple, but our overall test cluster machinery is complex, so I think it's nice to clarify intent behind any exported functions for future devs, etc. Feel free to discard this if it's over the top.

Suggested change
func (c *TestClusterCore) NetworkLayer() cluster.NetworkLayer {
// NetworkLayer returns the network layer for the cluster core. This can be used
// in conjunction with the cluster.InmemLayer to disconnect specific nodes from
// the cluster when we need to simulate abrupt node failure or a network
// partition in NewTestCluster tests.
func (c *TestClusterCore) NetworkLayer() cluster.NetworkLayer {

@kubawi kubawi added the core/ha specific to high-availability label Nov 28, 2024
Labels
core/ha specific to high-availability hashicorp-contributed-pr If the PR is HashiCorp (i.e. not-community) contributed
2 participants