DAOS-9825 control: Update Telemetry Endpoint to use HTTPS #15216
base: master
- Add a new telemetry config section to the server, control, and agent YAML files.
- The telemetry endpoint can run in either secure (HTTPS) or insecure (HTTP) mode:

    telemetry_config:
      allow_insecure: false
      server_cert: /etc/daos/certs/telemetryserver.crt
      server_key: /etc/daos/certs/telemetryserver.key
      ca_cert: /etc/daos/certs/daosTelemetryCA.crt

- The old telemetry configuration options are still supported, but the new options are recommended.
- Update dmg to generate the correct Prometheus config depending on whether telemetry runs in insecure mode:

    # cat /root/.prometheus.yml
    scheme: https
    tls_config:
      ca_file: /etc/daos/certs/daosTelemetryCA.crt

Features: control telemetry
Required-githooks: true
Signed-off-by: Samir Raval <[email protected]>
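To make the secure/insecure switch in the description concrete, here is a minimal Go sketch of how such a section might be validated. The struct, its field names, and the `Scheme`/`Validate` methods are assumptions for illustration only; they are not taken from the PR's actual `security.TelemetryConfig` type.

```go
package main

import (
	"errors"
	"fmt"
)

// TelemetryConfig mirrors the YAML keys shown in the description.
// The Go field names are hypothetical, not the PR's actual struct.
type TelemetryConfig struct {
	AllowInsecure bool
	ServerCert    string
	ServerKey     string
	CACert        string
}

// Scheme returns the URL scheme the endpoint would be served on.
func (c TelemetryConfig) Scheme() string {
	if c.AllowInsecure {
		return "http"
	}
	return "https"
}

// Validate rejects a secure configuration that lacks a certificate or key.
func (c TelemetryConfig) Validate() error {
	if c.AllowInsecure {
		return nil
	}
	if c.ServerCert == "" || c.ServerKey == "" {
		return errors.New("secure telemetry requires server_cert and server_key")
	}
	return nil
}

func main() {
	cfg := TelemetryConfig{
		ServerCert: "/etc/daos/certs/telemetryserver.crt",
		ServerKey:  "/etc/daos/certs/telemetryserver.key",
		CACert:     "/etc/daos/certs/daosTelemetryCA.crt",
	}
	fmt.Println("scheme:", cfg.Scheme(), "validation error:", cfg.Validate())
}
```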
Ticket title is 'Update Telemetry Endpoint to use HTTPS'
Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15216/1/execution/node/1169/log
Features: control telemetry Required-githooks: true Signed-off-by: Samir Raval <[email protected]>
fa93839 to 91ddce3
Features: control telemetry Required-githooks: true Signed-off-by: Samir Raval <[email protected]>
91ddce3 to d3f9941
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15216/5/execution/node/1507/log
Required-githooks: true Signed-off-by: Samir Raval <[email protected]>
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15216/6/execution/node/1438/log
This probably needs to run with one or both of these
Features: telemetry control
src/tests/ftest/util/agent_utils.py
Outdated
self.manager.job.copy_telemetry_certificates(
    get_log_file("daosTelemetryCA"), self._hosts)
self.manager.job.generate_telemetry_certificates(self._hosts, "daos_agent")
Maybe I'm missing something but how can we copy them before generating?
The certificates are created in two stages: first the private CA is created on the admin node, then it is copied to the servers/clients, where the individual certificates are created.
I see your point about the naming confusion; let me rename the function to avoid it.
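The two-stage flow described in this reply (a private CA first, then per-host certificates signed by it) can be sketched with Go's standard `crypto/x509` package. This is only an illustration under assumptions: the function names, the ECDSA P-256 key type, the validity periods, and the host name `server-1.example` are hypothetical, and the PR's test harness drives its own tooling rather than this code.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// newCA creates a self-signed CA certificate and its key (stage one).
func newCA() (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "example telemetry CA"},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(24 * time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageDigitalSignature,
	}
	// Template and parent are the same certificate: self-signed.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// newLeaf issues a server certificate for host, signed by the CA (stage two).
func newLeaf(ca *x509.Certificate, caKey *ecdsa.PrivateKey, host string) (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(2),
		Subject:      pkix.Name{CommonName: host},
		DNSNames:     []string{host},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(24 * time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, ca, &key.PublicKey, caKey)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// verify checks that leaf chains to ca and is valid for host.
func verify(ca, leaf *x509.Certificate, host string) error {
	roots := x509.NewCertPool()
	roots.AddCert(ca)
	_, err := leaf.Verify(x509.VerifyOptions{Roots: roots, DNSName: host})
	return err
}

func main() {
	ca, caKey, err := newCA()
	if err != nil {
		fmt.Println("CA error:", err)
		return
	}
	leaf, _, err := newLeaf(ca, caKey, "server-1.example")
	if err != nil {
		fmt.Println("leaf error:", err)
		return
	}
	fmt.Println("verification error:", verify(ca, leaf, "server-1.example"))
}
```

The ordering question raised in the review falls out of this structure: the CA material has to exist and be available before any per-host certificate can be signed.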
def get_certificate_data(self, name_list):
    """Get certificate data.

    Args:
        name_list (list): list of certificate attribute names.

    Returns:
        data (dict): a dictionary of parameter directory name keys and
            value.

    """
    data = super().get_certificate_data(name_list)
    return data
This function just returns the super() result, so it's not necessary.
- def get_certificate_data(self, name_list):
-     """Get certificate data.
-     Args:
-         name_list (list): list of certificate attribute names.
-     Returns:
-         data (dict): a dictionary of parameter directory name keys and
-             value.
-     """
-     data = super().get_certificate_data(name_list)
-     return data
Will remove it.
result = run_pcmd(hosts, command, 30)
if result[0]['exit_status'] != 0:
    self.log.info(" WARNING: command %s failed", command)
We should use run_remote instead of run_pcmd
- result = run_pcmd(hosts, command, 30)
- if result[0]['exit_status'] != 0:
-     self.log.info(" WARNING: command %s failed", command)
+ result = run_remote(self.log, hosts, command, 30)
+ if not result.passed:
+     self.log.info(" WARNING: command %s failed", command)
Will update it.
"""Telemetry credentials listing certificates for secure communication."""

def __init__(self, namespace, title, log_dir):
    """Initialize a TelemetryConfig object.
-     """Initialize a TelemetryConfig object.
+     """Initialize a TelemetryCredentials object.
Will update
Telemetry covers more than credentials, so TelemetryConfig makes sense in that context.
This comment just needs to match whatever the class name is
title (str, optional): namespace under which to place the
    parameters when creating the yaml file. Defaults to None.
- title (str, optional): namespace under which to place the
-     parameters when creating the yaml file. Defaults to None.
+ title (str): namespace under which to place the
+     parameters when creating the yaml file.
Will fix.
"""Telemetry credentials listing certificates for secure communication."""

def __init__(self, log_dir=os.path.join(os.sep, "tmp")):
    """Initialize a TelemetryConfig object."""
-     """Initialize a TelemetryConfig object."""
+     """Initialize a DaosServerTelemetryCredentials object."""
Will fix
def get_certificate_data(self, name_list):
    """Get certificate data.

    Args:
        name_list (list): list of certificate attribute names.

    Returns:
        data (dict): a dictionary of parameter directory name keys and
            value.

    """
    data = super().get_certificate_data(name_list)
    return data
- def get_certificate_data(self, name_list):
-     """Get certificate data.
-     Args:
-         name_list (list): list of certificate attribute names.
-     Returns:
-         data (dict): a dictionary of parameter directory name keys and
-             value.
-     """
-     data = super().get_certificate_data(name_list)
-     return data
Will remove it.
Features: control telemetry Required-githooks: true Signed-off-by: Samir Raval <[email protected]>
TelemetryPort    int                       `yaml:"telemetry_port,omitempty"`
TelemetryEnabled bool                      `yaml:"telemetry_enabled,omitempty"`
TelemetryRetain  time.Duration             `yaml:"telemetry_retain,omitempty"`
TelemetryConfig  *security.TelemetryConfig `yaml:"telemetry_config"`
I thought we'd still want to support the old telemetry config format as well, for at least one version, so we don't force people to change config files without warning.
Right, it's just that those variables moved under the TelemetryConfig struct, so from the user's point of view nothing has changed.
An old config file using telemetry_port no longer works, though. A previously-working config file will just stop working, and they will need to change it.
I see your point, let me change the name to support that old config too.
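One way to keep previously working config files valid, as discussed in this thread, is to accept the legacy top-level keys and fold them into the new nested section after parsing. The sketch below is a hedged illustration: the type names, field names, and the `normalize` helper are hypothetical and do not reflect how the PR ultimately resolved this.

```go
package main

import "fmt"

// TelemetrySection stands in for the new nested telemetry section.
// All names here are hypothetical.
type TelemetrySection struct {
	Port    int
	Enabled bool
}

// AgentConfig holds both the legacy top-level keys and the new section.
type AgentConfig struct {
	LegacyPort    int  // e.g. telemetry_port in an older file
	LegacyEnabled bool // e.g. telemetry_enabled in an older file
	Telemetry     *TelemetrySection
}

// normalize copies legacy values into the nested section when the new
// section does not already set them, so old files keep working.
func normalize(c *AgentConfig) {
	if c.Telemetry == nil {
		c.Telemetry = &TelemetrySection{}
	}
	if c.Telemetry.Port == 0 {
		c.Telemetry.Port = c.LegacyPort
	}
	if !c.Telemetry.Enabled {
		c.Telemetry.Enabled = c.LegacyEnabled
	}
}

func main() {
	// An "old" file that only sets the legacy keys.
	old := &AgentConfig{LegacyPort: 9192, LegacyEnabled: true}
	normalize(old)
	fmt.Println(old.Telemetry.Port, old.Telemetry.Enabled)
}
```

With this shape, values in the new section take precedence and legacy keys act only as a fallback, which leaves room to deprecate them with a warning in a later release.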
src/control/cmd/daos_agent/config.go
Outdated
	return nil, errors.New("telemetry_enabled requires telemetry_port")
}

if cfg.TelemetryConfig.AllowInsecure == false {
nit - slightly more idiomatic way of saying the same
- if cfg.TelemetryConfig.AllowInsecure == false {
+ if !cfg.TelemetryConfig.AllowInsecure {
Will update
utils/config/daos_agent.yml
Outdated
# # Server certificate for use in TLS handshakes
# # DAOS client is the HTTPS server to open secure telemetry endpoint.
# server_cert: /etc/daos/certs/telemetryserver.crt
The name for these is a little confusing.
- # # Server certificate for use in TLS handshakes
- # # DAOS client is the HTTPS server to open secure telemetry endpoint.
- # server_cert: /etc/daos/certs/telemetryserver.crt
+ # # HTTPS certificate for telemetry endpoint
+ # https_cert: /etc/daos/certs/telemetryserver.crt
Will Update
utils/config/daos_agent.yml
Outdated
# # Key portion of Server Certificate
# # DAOS client is the HTTPS server to open secure telemetry endpoint.
# server_key: /etc/daos/certs/telemetryserver.key
- # # Key portion of Server Certificate
- # # DAOS client is the HTTPS server to open secure telemetry endpoint.
- # server_key: /etc/daos/certs/telemetryserver.key
+ # # Private key for HTTPS certificate for telemetry endpoint
+ # https_key: /etc/daos/certs/telemetryserver.key
Will update
# # In order to disable transport security, uncomment and set allow_insecure
# # to true. Not recommended for production configurations.
# allow_insecure: false
I still feel unsure about having this option be explicit, rather than implied by the cert+key not being defined. I think this endpoint is fundamentally different from the transport security we use for DAOS overall. The telemetry endpoint is not authenticated via certs (or at all) the way that the component communications are. The HTTPS encryption is to guarantee the data read by prometheus (or any other tool) hasn't been tampered with.
I think this will also change once we use the tool you suggested. Let me try it, see how it works, and change the code accordingly.
The code was modified to use the system certificates. But to stay aligned with the transport config option, I think it's better to keep it. This also gives users a chance to choose a specific security level with certificates.
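For comparison, the alternative raised earlier in this thread (inferring the mode from whether a certificate and key are defined, instead of an explicit `allow_insecure` flag) could look roughly like the following Go sketch. The `endpointTLS` type and `inferScheme` function are hypothetical names used only to illustrate the trade-off.

```go
package main

import (
	"errors"
	"fmt"
)

// endpointTLS holds only the certificate and key paths; there is no
// explicit allow_insecure flag in this variant. Names are hypothetical.
type endpointTLS struct {
	Cert string
	Key  string
}

// inferScheme derives the endpoint mode from which paths are set.
func inferScheme(c endpointTLS) (string, error) {
	switch {
	case c.Cert == "" && c.Key == "":
		return "http", nil // nothing configured: plain HTTP
	case c.Cert != "" && c.Key != "":
		return "https", nil // both configured: serve over TLS
	default:
		return "", errors.New("certificate and key must be set together")
	}
}

func main() {
	scheme, err := inferScheme(endpointTLS{
		Cert: "/etc/daos/certs/telemetryserver.crt",
		Key:  "/etc/daos/certs/telemetryserver.key",
	})
	fmt.Println(scheme, err)
}
```

The implied form has fewer knobs, while the explicit flag kept in the PR makes an insecure endpoint a deliberate choice and matches the existing transport config.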
src/control/lib/control/telemetry.go
Outdated
if req.AllowInsecure == false && req.CaCertPath == "" {
	return nil, errors.New("Provide the CA certificate path")
}
This isn't right. The dmg config shouldn't need the user to supply a root cert if it is from a trusted CA that the OS recognizes.
I will have to check this and change the code.
src/control/lib/control/telemetry.go
Outdated
AllowInsecure bool   // Set the https end point secure
CaCertPath    string // CA Cert path for telemetry
I don't think we should need these from the dmg side.
I will have to check this and change the code.
src/control/lib/control/http.go
Outdated
url           *url.URL
getFn         httpGetFn
allowInsecure *bool
cacertpath    *string
As mentioned before, we shouldn't need to include a path to a root cert.
I will have to check this and change the code.
src/control/lib/control/http.go
Outdated
@@ -128,6 +165,22 @@ func httpGetBody(ctx context.Context, url *url.URL, get httpGetFn, timeout time.
		return nil, errors.New("nil get function")
	}

	if *allowInsecure == false {
Why is it a pointer? If it's nil, dereferencing it will panic, much like a segfault in C.
Changed to bool.
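The hazard behind this exchange is easy to reproduce in isolation. The following Go sketch contrasts a `*bool` parameter, where a nil argument causes a runtime panic on dereference, with a plain `bool`, whose zero value is simply `false`. The function names are hypothetical and exist only for this illustration.

```go
package main

import "fmt"

// isSecurePtr dereferences a *bool. If allowInsecure is nil, the dereference
// panics at runtime; the deferred recover turns that into an error here so
// the example can keep running.
func isSecurePtr(allowInsecure *bool) (secure bool, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered from panic: %v", r)
		}
	}()
	return !*allowInsecure, nil
}

// isSecure takes a plain bool, so there is no nil case to handle.
func isSecure(allowInsecure bool) bool {
	return !allowInsecure
}

func main() {
	_, err := isSecurePtr(nil)
	fmt.Println("pointer variant with nil:", err)
	fmt.Println("value variant with zero value:", isSecure(false))
}
```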
src/control/lib/control/http.go
Outdated
// and return the http.Get
func httpsGetFunc(cert []byte) (httpGetFn, error) {
	caCertPool := x509.NewCertPool()
	result := caCertPool.AppendCertsFromPEM(cert)
We should be able to harvest what's in the OS list of trusted certs, similar to what web browsers do. I think x509.SystemCertPool does this.
I will have to check this and change the code.
ftest LGTM but will defer to Kris on the config/interface since ftest just mirrors that
Features: control telemetry Required-githooks: true Signed-off-by: Samir Raval <[email protected]>
5ab0cf4 to 3b9be62
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15216/9/execution/node/1520/log
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15216/9/execution/node/1566/log
Other than change requests from kjacque this LGTM. So once those concerns have been addressed I'm happy to approve.
Features: control telemetry Required-githooks: true Signed-off-by: Samir Raval <[email protected]>
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15216/10/testReport/
Features: control telemetry Required-githooks: true Signed-off-by: Samir Raval <[email protected]>
Features: control telemetry Required-githooks: true Signed-off-by: Samir Raval <[email protected]>
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15216/12/execution/node/1210/log
Features: control telemetry Required-githooks: true Signed-off-by: Samir Raval <[email protected]>
Features: control telemetry Required-githooks: true Signed-off-by: Samir Raval <[email protected]>
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15216/16/execution/node/1211/log
Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15216/18/display/redirect
Features: control telemetry Required-githooks: true Signed-off-by: Samir Raval <[email protected]>
Features: control telemetry Required-githooks: true Signed-off-by: Samir Raval <[email protected]>
Features: control telemetry Required-githooks: true Signed-off-by: Samir Raval <[email protected]>
Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15216/24/testReport/
This reverts commit f67aed2. Signed-off-by: Samir Raval <[email protected]>
Features: control telemetry Required-githooks: true Signed-off-by: Samir Raval <[email protected]>
Features: control telemetry Required-githooks: true Signed-off-by: Samir Raval <[email protected]>
3e5f1e0 to 834e532
Need to resolve conflicts.
Features: control telemetry Required-githooks: true Signed-off-by: Samir Raval <[email protected]>
Features: control telemetry Required-githooks: true Signed-off-by: Samir Raval <[email protected]>
Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15216/28/testReport/
LGTM
Before requesting gatekeeper:
- Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: