Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-9825 control: Update Telemetry Endpoint to use HTTPS #15216

Open
wants to merge 19 commits into
base: master
Choose a base branch
from

Conversation

ravalsam
Copy link
Contributor

  • Adding new option for telemetry config in server, control and agent yaml file.
  • Telemetry endpoint can have option to run in both secure (https) and insecure (http) mode.
    telemetry_config: allow_insecure: false server_cert: /etc/daos/certs/telemetryserver.crt server_key: /etc/daos/certs/telemetryserver.key ca_cert: /etc/daos/certs/daosTelemetryCA.crt
  • Telemetry old configuration option is supported bur recommend to use the new options.
  • Updated dmg to create the correct prometheus config install based on telemetry is insecure mode or not.
    # cat /root/.prometheus.yml scheme: https tls_config: ca_file: /etc/daos/certs/daosTelemetryCA.crt

Features: control telemetry

Required-githooks: true

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follows the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

- Adding new option for telemetry config in server, control and agent yaml file.
- Telemetry endpoint can have option to run in both secure (https) and insecure (http) mode.
    telemetry_config:
      allow_insecure: false
      server_cert: /etc/daos/certs/telemetryserver.crt
      server_key: /etc/daos/certs/telemetryserver.key
      ca_cert: /etc/daos/certs/daosTelemetryCA.crt
- Telemetry old configuration option is supported bur recommend to use the new options.
- Updated dmg to create the correct prometheus config install based on telemetry is insecure mode or not.
  # cat /root/.prometheus.yml
    scheme: https
    tls_config:
      ca_file: /etc/daos/certs/daosTelemetryCA.crt

Features: control telemetry

Required-githooks: true

Signed-off-by: Samir Raval <[email protected]>
@ravalsam ravalsam requested review from a team as code owners September 27, 2024 20:36
Copy link

github-actions bot commented Sep 27, 2024

Ticket title is 'Update Telemetry Endpoint to use HTTPS'
Status is 'In Review'
Labels: 'Checkmarx,scrubbed_2.6,scrubbed_2.8'
https://daosio.atlassian.net/browse/DAOS-9825

@daosbuild1
Copy link
Collaborator

Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15216/1/execution/node/1169/log

Features: control telemetry

Required-githooks: true

Signed-off-by: Samir Raval <[email protected]>
@ravalsam ravalsam force-pushed the samirrav/DAOS-9825-Final branch from fa93839 to 91ddce3 Compare September 30, 2024 17:53
Features: control telemetry

Required-githooks: true

Signed-off-by: Samir Raval <[email protected]>
@ravalsam ravalsam force-pushed the samirrav/DAOS-9825-Final branch from 91ddce3 to d3f9941 Compare September 30, 2024 17:56
@ravalsam ravalsam changed the title DAOS-9825 control:Update Telemetry Endpoint to use HTTPS DAOS-9825 control: Update Telemetry Endpoint to use HTTPS Sep 30, 2024
@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15216/5/execution/node/1507/log

Required-githooks: true

Signed-off-by: Samir Raval <[email protected]>
@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15216/6/execution/node/1438/log

Copy link
Contributor

@daltonbohning daltonbohning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably needs to run with one or both of these

Features: telemetry control

Comment on lines 290 to 292
self.manager.job.copy_telemetry_certificates(
get_log_file("daosTelemetryCA"), self._hosts)
self.manager.job.generate_telemetry_certificates(self._hosts, "daos_agent")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm missing something but how can we copy them before generating?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the certificate will be created on two stage first to create the private CA on Admin node and copy and create the individual certificate on server/client.
I see your point on naming confusion, let me change the function name to avoid confusion.

Comment on lines 50 to 63
def get_certificate_data(self, name_list):
"""Get certificate data.

Args:
name_list (list): list of certificate attribute names.

Returns:
data (dict): a dictionary of parameter directory name keys and
value.

"""
data = super().get_certificate_data(name_list)
return data

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function just returns the super() so it's not necessary

Suggested change
def get_certificate_data(self, name_list):
"""Get certificate data.
Args:
name_list (list): list of certificate attribute names.
Returns:
data (dict): a dictionary of parameter directory name keys and
value.
"""
data = super().get_certificate_data(name_list)
return data

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will remove it.

Comment on lines 1083 to 1085
result = run_pcmd(hosts, command, 30)
if result[0]['exit_status'] != 0:
self.log.info(" WARNING: command %s failed", command)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use run_remote instead of run_pcmd

Suggested change
result = run_pcmd(hosts, command, 30)
if result[0]['exit_status'] != 0:
self.log.info(" WARNING: command %s failed", command)
result = run_remote(self.log, hosts, command, 30)
if not result.passed:
self.log.info(" WARNING: command %s failed", command)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update it.

"""Telemetry credentials listing certificates for secure communication."""

def __init__(self, namespace, title, log_dir):
"""Initialize a TelemetryConfig object.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Initialize a TelemetryConfig object.
"""Initialize a TelemetryCredentials object.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Telemetry has more than credential so TelemetryConfig is make sense from that context.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment just needs to match whatever the class name is

Comment on lines 810 to 811
title (str, optional): namespace under which to place the
parameters when creating the yaml file. Defaults to None.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
title (str, optional): namespace under which to place the
parameters when creating the yaml file. Defaults to None.
title (str): namespace under which to place the
parameters when creating the yaml file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix.

"""Telemetry credentials listing certificates for secure communication."""

def __init__(self, log_dir=os.path.join(os.sep, "tmp")):
"""Initialize a TelemetryConfig object."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Initialize a TelemetryConfig object."""
"""Initialize a DaosServerTelemetryCredentials object."""

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix

Comment on lines 77 to 90
def get_certificate_data(self, name_list):
"""Get certificate data.

Args:
name_list (list): list of certificate attribute names.

Returns:
data (dict): a dictionary of parameter directory name keys and
value.

"""
data = super().get_certificate_data(name_list)
return data

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def get_certificate_data(self, name_list):
"""Get certificate data.
Args:
name_list (list): list of certificate attribute names.
Returns:
data (dict): a dictionary of parameter directory name keys and
value.
"""
data = super().get_certificate_data(name_list)
return data

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will remove it.

Features: control telemetry

Required-githooks: true

Signed-off-by: Samir Raval <[email protected]>
TelemetryPort int `yaml:"telemetry_port,omitempty"`
TelemetryEnabled bool `yaml:"telemetry_enabled,omitempty"`
TelemetryRetain time.Duration `yaml:"telemetry_retain,omitempty"`
TelemetryConfig *security.TelemetryConfig `yaml:"telemetry_config"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we'd still want to support the old telemetry config format as well, for at least one version, so we don't force people to change config files without warning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, it's just that those variables moved under the TelemetryConfig struct so from user point of view nothing has changed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An old config file using telemetry_port no longer works, though. A previously-working config file will just stop working, and they will need to change it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point, let me change the name to support that old config too.

return nil, errors.New("telemetry_enabled requires telemetry_port")
}

if cfg.TelemetryConfig.AllowInsecure == false {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit - slightly more idiomatic way of saying the same

Suggested change
if cfg.TelemetryConfig.AllowInsecure == false {
if !cfg.TelemetryConfig.AllowInsecure {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update

Comment on lines 51 to 53
# # Server certificate for use in TLS handshakes
# # DAOS client is the HTTPS server to open secure telemetry endpoint.
# server_cert: /etc/daos/certs/telemetryserver.crt
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name for these is a little confusing.

Suggested change
# # Server certificate for use in TLS handshakes
# # DAOS client is the HTTPS server to open secure telemetry endpoint.
# server_cert: /etc/daos/certs/telemetryserver.crt
# # HTTPS certificate for telemetry endpoint
# https_cert: /etc/daos/certs/telemetryserver.crt

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will Update

Comment on lines 55 to 57
# # Key portion of Server Certificate
# # DAOS client is the HTTPS server to open secure telemetry endpoint.
# server_key: /etc/daos/certs/telemetryserver.key
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# # Key portion of Server Certificate
# # DAOS client is the HTTPS server to open secure telemetry endpoint.
# server_key: /etc/daos/certs/telemetryserver.key
# # Private key for HTTPS certificate for telemetry endpoint
# https_key: /etc/daos/certs/telemetryserver.key

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update

Comment on lines +47 to +49
# # In order to disable transport security, uncomment and set allow_insecure
# # to true. Not recommended for production configurations.
# allow_insecure: false
Copy link
Contributor

@kjacque kjacque Oct 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still feel unsure about having this option be explicit, rather than implied by the cert+key not being defined. I think this endpoint is fundamentally different from the transport security we use for DAOS overall. The telemetry endpoint is not authenticated via certs (or at all) the way that the component communications are. The HTTPS encryption is to guarantee the data read by prometheus (or any other tool) hasn't been tampered with.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will also changes once we use the tool suggested by you. So let me try and see how that works and will change the code accordingly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code was modified to use the system certificate. But to be aligned with Transport config option I think it's better we can keep it. This will also gives user a chance to use specific security level with certificates.

Comment on lines 174 to 176
if req.AllowInsecure == false && req.CaCertPath == "" {
return nil, errors.New("Provide the CA certificate path")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't right. The dmg config shouldn't need the user to supply a root cert if it is from a trusted CA that the OS recognizes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will have to check this and change the code.

Comment on lines 148 to 149
AllowInsecure bool // Set the https end point secure
CaCertPath string // CA Cert path for telemetry
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should need these from the dmg side.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will have to check this and change the code.

url *url.URL
getFn httpGetFn
allowInsecure *bool
cacertpath *string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned before, we shouldn't need to include a path to a root cert.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will have to check this and change the code.

@@ -128,6 +165,22 @@ func httpGetBody(ctx context.Context, url *url.URL, get httpGetFn, timeout time.
return nil, errors.New("nil get function")
}

if *allowInsecure == false {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it a pointer? If it's nil, dereferencing will segfault, as in C.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to bool.

// and return the http.Get
func httpsGetFunc(cert []byte) (httpGetFn, error) {
caCertPool := x509.NewCertPool()
result := caCertPool.AppendCertsFromPEM(cert)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to harvest what's in the OS list of trusted certs, similar to what web browsers do. I think x509.SystemCertPool does this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will have to check this and change the code.

Copy link
Contributor

@daltonbohning daltonbohning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ftest LGTM but will defer to Kris on the config/interface since ftest just mirrors that

Features: control telemetry

Required-githooks: true

Signed-off-by: Samir Raval <[email protected]>
@ravalsam ravalsam force-pushed the samirrav/DAOS-9825-Final branch from 5ab0cf4 to 3b9be62 Compare October 7, 2024 18:06
@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15216/9/execution/node/1520/log

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15216/9/execution/node/1566/log

Copy link
Contributor

@tanabarr tanabarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than change requests from kjacque this LGTM. So once those concerns have been addressed I'm happy to approve.

Features: control telemetry

Required-githooks: true

Signed-off-by: Samir Raval <[email protected]>
@daosbuild1
Copy link
Collaborator

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15216/10/testReport/

Features: control telemetry

Required-githooks: true
Signed-off-by: Samir Raval <[email protected]>
Features: control telemetry

Required-githooks: true

Signed-off-by: Samir Raval <[email protected]>
@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15216/12/execution/node/1210/log

Features: control telemetry

Required-githooks: true
Signed-off-by: Samir Raval <[email protected]>
Features: control telemetry

Required-githooks: true
Signed-off-by: Samir Raval <[email protected]>
@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15216/16/execution/node/1211/log

@daosbuild1
Copy link
Collaborator

Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15216/18/display/redirect

Features: control telemetry

Required-githooks: true
Signed-off-by: Samir Raval <[email protected]>
Features: control telemetry

Required-githooks: true

Signed-off-by: Samir Raval <[email protected]>
Features: control telemetry
Required-githooks: true

Signed-off-by: Samir Raval <[email protected]>
@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15216/24/testReport/

This reverts commit f67aed2.

Signed-off-by: Samir Raval <[email protected]>
Features: control telemetry

Required-githooks: true

Signed-off-by: Samir Raval <[email protected]>
Features: control telemetry
Required-githooks: true

Signed-off-by: Samir Raval <[email protected]>
@tanabarr
Copy link
Contributor

need to resolve conflicts

Features: control telemetry

Required-githooks: true
Signed-off-by: Samir Raval <[email protected]>
Features: control telemetry

Required-githooks: true
Signed-off-by: Samir Raval <[email protected]>
@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15216/28/testReport/

Copy link
Contributor

@tanabarr tanabarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

5 participants