[CC-7044] Start HCP manager as part of link creation #20312

Merged
mkam merged 14 commits into main from mkam/CC-7044/link-starts-hcp-manager on Jan 29, 2024

Conversation

@mkam mkam commented Jan 23, 2024

Description

Depends on #20306

This PR starts the HCP manager when an HCP link resource is created instead of when Consul is started. This allows the linking process to be initiated via the HCP link API.

These changes are best viewed commit-by-commit. A summary of the changes:

  • Check explicitly for ACL policies: the HCP manager will create a token, so acl:write is now required in addition to operator:write
  • Change the HCP manager’s Run method to a Start method that tracks whether the manager is running and only allows it to be started once
  • Always initialize the required HCP components (i.e., SCADA provider, HCP metrics sink) when Consul starts
  • Pass the HCP manager as a dependency to the link controller
  • When a link is created, update the HCP manager with the HCP configs and start it

This PR also introduces a breaking change, though that change actually fixes Consul's behavior to match what we have documented. The agent telemetry docs and the agent configuration docs for telemetry.disable_hostname both state that, by default, the hostname of the Consul agent should prefix gauge-type metrics. However, before this PR, if no additional metrics sinks were enabled, hostname prefixing was disabled. Because the HCP metrics sink is now always enabled, gauge-type metrics are always prefixed with the hostname by default.
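
A minimal sketch of the behavior change described above, as a simplified model rather than Consul's actual telemetry code. The function name `hostnamePrefixEnabled` and its parameters are hypothetical; the only facts taken from this PR are that prefixing used to be switched off when no additional sinks were configured, and that the HCP sink is now always present.

```go
package main

import "fmt"

// hostnamePrefixEnabled models whether gauge-type metrics get the hostname
// prefix. All names here are hypothetical and exist only for illustration.
//   - disableHostname mirrors the telemetry.disable_hostname setting
//   - configuredSinks is the number of user-configured metrics sinks
//   - hcpSinkAlwaysOn models this PR, where the HCP sink is always enabled
func hostnamePrefixEnabled(disableHostname bool, configuredSinks int, hcpSinkAlwaysOn bool) bool {
	sinks := configuredSinks
	if hcpSinkAlwaysOn {
		sinks++
	}
	if sinks == 0 {
		// Behavior before this PR: with no additional sinks, prefixing was
		// turned off regardless of the documented default.
		return false
	}
	return !disableHostname
}

func main() {
	fmt.Println("before, defaults, no sinks:", hostnamePrefixEnabled(false, 0, false))
	fmt.Println("after, defaults, no sinks: ", hostnamePrefixEnabled(false, 0, true))
	fmt.Println("after, disable_hostname:   ", hostnamePrefixEnabled(true, 0, true))
}
```

In this model, operators who relied on the old unprefixed gauge names with no extra sinks would need to set telemetry.disable_hostname explicitly to keep them.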

Testing & Reproduction steps

Linking via API:

  1. Start Consul without a cloud configuration
  2. Make a PUT request to create a link resource
  3. Inspect logs to see that HCP manager has started
  4. Check in HCP portal that all features are working as expected (server info and services syncing, observability, global workflows, etc.)
  5. Make a GET request to the link and check the status is successful

Other variations tested:

  • Linking via configuration rather than the API
  • Bootstrapping via configuration

Links

PR Checklist

  • updated test coverage
  • external facing docs updated
  • appropriate backport labels added
  • not a security concern

@github-actions github-actions bot added the theme/config Relating to Consul Agent configuration, including reloading label Jan 23, 2024
@mkam mkam force-pushed the mkam/CC-7044/link-starts-hcp-manager branch 3 times, most recently from f8d0b98 to c71d5a8 Compare January 23, 2024 16:36
@mkam mkam force-pushed the mkam/CC-7044/link-starts-hcp-manager branch from c71d5a8 to 048440b Compare January 23, 2024 17:41
@mkam mkam force-pushed the mkam/CC-7063/fetch-bootstrap-in-link branch from 36b4dab to 2a656de Compare January 23, 2024 20:06
@mkam mkam force-pushed the mkam/CC-7044/link-starts-hcp-manager branch 2 times, most recently from b166213 to 8f9f407 Compare January 23, 2024 20:54
@mkam mkam marked this pull request as ready for review January 23, 2024 21:16
@mkam mkam requested a review from a team as a code owner January 23, 2024 21:16
Comment on lines 91 to 95
if m.isRunning() {
	m.logger.Trace("HCP manager already started")
	return nil
}
m.setRunning(true)
Contributor

I think there's a possibility of a race here, even with the lock. Two goroutines can call m.isRunning() sequentially and both can see false, then both would proceed to call the rest of the code. I think we'd have to lock the lock, then do compare and set in one atomic operation like:

func (m *Manager) setRunningIfNotRunning() bool {
	m.runLock.Lock()
	defer m.runLock.Unlock()
	if m.running {
		return false
	}
	m.running = true
	return true
}

Contributor Author

Ah, great point, will make the change!
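
For reference, a self-contained sketch of the check-and-set approach suggested in this thread. The `Manager` type here is a stand-in that models only the run-state bookkeeping; the `starts` counter and the `startConcurrently` helper are hypothetical and exist only to show that the start-up work runs once.

```go
package main

import (
	"fmt"
	"sync"
)

// Manager is a stand-in for the HCP manager; only the run-state
// bookkeeping is modelled here.
type Manager struct {
	runLock sync.Mutex
	running bool
	starts  int // hypothetical: counts how often the start-up work ran
}

// setRunningIfNotRunning checks and sets the flag under a single lock
// acquisition, so at most one caller can observe "not running".
func (m *Manager) setRunningIfNotRunning() bool {
	m.runLock.Lock()
	defer m.runLock.Unlock()
	if m.running {
		return false
	}
	m.running = true
	return true
}

// Start performs the start-up work at most once, however many
// goroutines call it.
func (m *Manager) Start() {
	if !m.setRunningIfNotRunning() {
		return
	}
	// Only the single caller that won the check-and-set reaches this point.
	m.starts++
}

// startConcurrently calls Start from n goroutines and reports how many
// times the start-up work actually ran.
func startConcurrently(n int) int {
	m := &Manager{}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			m.Start()
		}()
	}
	wg.Wait()
	return m.starts
}

func main() {
	fmt.Println(startConcurrently(50))
}
```

With separate isRunning() and setRunning(true) calls, two goroutines could both pass the check before either sets the flag; folding both steps into one locked section closes that window.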

Base automatically changed from mkam/CC-7063/fetch-bootstrap-in-link to main January 24, 2024 15:51
@mkam mkam force-pushed the mkam/CC-7044/link-starts-hcp-manager branch from 8f9f407 to 2912114 Compare January 24, 2024 15:55
Comment on lines +123 to +146
existingCfg := r.hcpManager.GetCloudConfig()
newCfg := config.CloudConfig{
	ResourceID:   link.ResourceId,
	ClientID:     link.ClientId,
	ClientSecret: link.ClientSecret,
}
cfg := config.Merge(existingCfg, newCfg)
hcpClient, err := r.hcpClientFn(cfg)
Contributor

I think this makes sense, but just FYI, we'll have a merge conflict with Nick E's change: https://github.com/hashicorp/consul/pull/20257/files

Contributor Author

Thanks for the heads up, will keep this in mind!

@NickCellino NickCellino left a comment

Still looking through this but have to stop now for a meeting. Will resume later

StatusFn:          s.hcpServerStatus(flat),
Logger:            logger.Named("hcp_manager"),
SCADAProvider:     flat.HCP.Provider,
TelemetryProvider: flat.HCP.TelemetryProvider,
ManagementTokenUpserterFn: func(name, secretId string) error {
-	if s.IsLeader() {
+	if s.config.ACLsEnabled && s.IsLeader() {
Contributor

Not sure I've seen the reasoning behind this change, can we clarify?

Contributor

This was missed by me when I originally added this — if ACLs are not enabled, upsertManagementToken will error, so we'd end up with unwanted error logs.

Contributor

@mkam happy to look into this myself in a followup, but just wanted to bring it up: maybe we also want to check s.InPrimaryDatacenter() so we are being consistent with how this works:

if s.InPrimaryDatacenter() {
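
A rough sketch of the guard discussed in this thread, using a stand-in `server` type rather than Consul's real one. The field names, the error text, and the `tokenUpserter` helper are assumptions made for illustration; the only behavior taken from the discussion is that the upsert errors when ACLs are disabled, so the callback should skip it in that case.

```go
package main

import (
	"errors"
	"fmt"
)

// server is a hypothetical stand-in for the Consul server; it carries
// only the two conditions the guard looks at.
type server struct {
	aclsEnabled bool
	leader      bool
}

// upsertManagementToken models the real method's failure mode as
// described in the thread: it errors when ACLs are not enabled.
func (s *server) upsertManagementToken(name, secretID string) error {
	if !s.aclsEnabled {
		return errors.New("ACLs are not enabled")
	}
	return nil
}

// tokenUpserter returns a callback that only attempts the upsert when
// ACLs are enabled and this server is the leader; otherwise it is a no-op.
func (s *server) tokenUpserter() func(name, secretID string) error {
	return func(name, secretID string) error {
		if s.aclsEnabled && s.leader {
			return s.upsertManagementToken(name, secretID)
		}
		return nil
	}
}

func main() {
	s := &server{aclsEnabled: false, leader: true}
	// With the guard in place, a leader without ACLs skips the upsert
	// instead of logging an error.
	fmt.Println(s.tokenUpserter()("hcp", "secret"))
}
```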

Comment on lines +123 to +145
existingCfg := r.hcpManager.GetCloudConfig()
newCfg := config.CloudConfig{
	ResourceID:   link.ResourceId,
	ClientID:     link.ClientId,
	ClientSecret: link.ClientSecret,
}
cfg := config.Merge(existingCfg, newCfg)
Contributor

Can we clarify why we need to merge instead of changing the existingCfg values directly?

Contributor Author

I'll add a comment about this! The reasoning is:

  1. Some values can be set at startup like what HCP API endpoint to use. We don't currently have a way in the linking API to set this, but support for this will be added later, so I thought a merge made more sense in the long term.
  2. The NodeID and the NodeName of the cluster are set on the cloud config when Consul starts, so we want to continue to use these values. I considered some different strategies around this like passing these two values as dependencies to the controller, but I figured I'll want to implement a merge anyway because of (1).
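
To illustrate the two reasons above, here is a minimal sketch of a merge in which non-empty values from the link override the existing config and everything else is preserved. This is an assumption about the semantics, not the actual config.Merge implementation; the trimmed `CloudConfig` struct and its `Hostname` field (standing in for a startup-only setting such as the HCP API endpoint) are hypothetical.

```go
package main

import "fmt"

// CloudConfig is a trimmed, hypothetical version of the cloud config.
type CloudConfig struct {
	ResourceID   string
	ClientID     string
	ClientSecret string
	Hostname     string // hypothetical startup-only setting (e.g. API endpoint)
	NodeID       string // set when Consul starts
	NodeName     string // set when Consul starts
}

// Merge returns a copy of existing in which every non-empty field of
// incoming takes precedence. Fields left empty in incoming keep their
// existing values.
func Merge(existing, incoming CloudConfig) CloudConfig {
	out := existing
	if incoming.ResourceID != "" {
		out.ResourceID = incoming.ResourceID
	}
	if incoming.ClientID != "" {
		out.ClientID = incoming.ClientID
	}
	if incoming.ClientSecret != "" {
		out.ClientSecret = incoming.ClientSecret
	}
	if incoming.Hostname != "" {
		out.Hostname = incoming.Hostname
	}
	if incoming.NodeID != "" {
		out.NodeID = incoming.NodeID
	}
	if incoming.NodeName != "" {
		out.NodeName = incoming.NodeName
	}
	return out
}

func main() {
	// Values known at startup.
	existing := CloudConfig{Hostname: "api.example.test", NodeID: "node-1", NodeName: "server-a"}
	// Values supplied later by the link resource.
	fromLink := CloudConfig{ResourceID: "res-123", ClientID: "id", ClientSecret: "secret"}

	cfg := Merge(existing, fromLink)
	fmt.Printf("%+v\n", cfg)
}
```

Overwriting fields on the existing config directly would work for the three link fields today, but a merge keeps the controller unchanged if the linking API later gains more settings.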

mkam added 9 commits January 25, 2024 11:32
Link eventually will be creating a token, so require acl:write.
Start as part of link creation rather than always starting. Update
the HCP manager with values from the link before starting as well.
The HCP metrics sink will always be enabled, so the length of sinks will
always be greater than zero. This also means that we will also always
default to prefixing metrics with the hostname, which is what our
documentation states is the expected behavior anyway.
@mkam mkam force-pushed the mkam/CC-7044/link-starts-hcp-manager branch from 7ec7b41 to 5719b08 Compare January 25, 2024 17:40
loshz commented Jan 25, 2024

This all makes sense and the code looks good.

Just to clarify the breaking change, this is for all metrics or just HCP related?

mkam commented Jan 25, 2024

> Just to clarify the breaking change, this is for all metrics or just HCP related?

@loshz It'll be for all metrics (of a specific type) and not just ones that are exported to HCP. For example, this table of the metrics that Consul collects has a column Type, and any metric whose type is gauge will be affected.

loshz commented Jan 25, 2024

> It'll be for all metrics (of a specific type) and not just ones that are exported to HCP. For example, this table of the metrics that Consul collects has a column Type, and any metric whose type is gauge will be affected.

Makes sense! I'm wondering if we need to make this change more widely known in the org, just to get another perspective. Is it also worth fixing this in a separate PR? Not sure who to tag, but it might be worth bringing it up in Slack. WDYT?

@mkam mkam force-pushed the mkam/CC-7044/link-starts-hcp-manager branch from b177812 to 8660a03 Compare January 29, 2024 22:06
@mkam mkam merged commit 3b9bb8d into main Jan 29, 2024
91 checks passed
@mkam mkam deleted the mkam/CC-7044/link-starts-hcp-manager branch January 29, 2024 22:31
Labels
pr/no-backport theme/config Relating to Consul Agent configuration, including reloading
3 participants