DAOS-9825 control: Update Telemetry Endpoint to use HTTPS #15216

Open
wants to merge 30 commits into base: master
Changes from 6 commits
7cf9875
DAOS-9825 control:Update Telemetry Endpoint to use HTTPS
Sep 27, 2024
a3bcfc9
Fixed Test code.
Sep 30, 2024
d3f9941
Spell correction and script fix.
Sep 30, 2024
d9860a4
Few minor fix and Test case fix.
Oct 1, 2024
3c3bbc3
Code modified based on review comments.
Oct 1, 2024
3b9be62
Code updated based on review comments.
Oct 3, 2024
82b37d1
Code updated based on review comments.
Nov 14, 2024
f1d463a
Merge branch 'master' into samirrav/DAOS-9825-Final
Nov 19, 2024
1329d1e
Code updated based on review comments.
Dec 2, 2024
49d5062
Merge branch 'master' into samirrav/DAOS-9825-Final
Dec 4, 2024
29db84d
Merge branch 'master' into samirrav/DAOS-9825-Final
Dec 5, 2024
ca672e7
Merge branch 'master' into samirrav/DAOS-9825-Final
Dec 11, 2024
f67aed2
Code updated based on review comments.
Dec 13, 2024
b3cc471
Merge branch 'master' into samirrav/DAOS-9825-Final
Dec 13, 2024
599ec1e
Revert "Code updated based on review comments."
Dec 13, 2024
e7833f8
Code modified based on review comments
Dec 13, 2024
834e532
Merge branch 'master' into samirrav/DAOS-9825-Final
Dec 16, 2024
5b5f7a7
Merge branch 'master' into samirrav/DAOS-9825-Final
Dec 18, 2024
662baa9
Merge branch 'master' into samirrav/DAOS-9825-Final
Dec 19, 2024
8186949
Updated based on review comments.
Jan 7, 2025
474c6f3
Merge branch 'master' into samirrav/DAOS-9825-Final
Jan 7, 2025
17800cb
Copyright fix.
Jan 7, 2025
5abf1aa
Code modified based on review comments
Feb 22, 2025
45dd685
Few minor update based on review comments.
Feb 22, 2025
587354d
Updated Copyright on files based on Ci warning.
Feb 22, 2025
f841535
Merge branch 'master' into samirrav/DAOS-9825-Final
ravalsam Feb 22, 2025
fbfa2ed
Code fixed based on Ci result.
Feb 23, 2025
e44a626
Test Fixed based on Ci results and running again.
Feb 23, 2025
5e9d8f5
Merge branch 'master' into samirrav/DAOS-9825-Final
Feb 25, 2025
22d7f3d
Code modified based on review comments.
Feb 25, 2025
2 changes: 1 addition & 1 deletion docs/admin/administration.md
@@ -286,7 +286,7 @@ written to `$HOME/.prometheus.yml`.
To start the Prometheus server with the configuration file generated by `dmg`:

```
-prometheus --config-file=$HOME/.prometheus.yml
+prometheus --config.file=$HOME/.prometheus.yml
```

## Storage Operations
156 changes: 156 additions & 0 deletions docs/admin/deployment.md
@@ -759,6 +759,162 @@ transport_config:
key: /etc/daos/certs/admin.key
```

#### Telemetry Certificate Configuration
Contributor

Suggested change
-#### Telemetry Certificate Configuration
+#### Telemetry Endpoint Configuration

IMO we need to disconnect this as much as possible from the DAOS control plane transport_config to avoid confusion. We can't reuse those certs for telemetry, and the purpose is entirely different. It may be better to move this into its own section of the doc for that reason.


The DAOS telemetry framework can optionally use certificates to authenticate
communication between server/client nodes and the admin node. If there is no
existing TLS certificate infrastructure, a set of certificates for a given DAOS
system may be generated by running the `gen_telemetry_admin_certificate.sh` and
`gen_telemetry_server_certificate.sh` scripts provided with the DAOS software.
Both scripts use the `openssl` tool to generate all of the necessary files.

##### Telemetry Admin script

The `gen_telemetry_admin_certificate.sh` script must be run on the system where the `dmg telemetry metrics` command will be run, or on the system where Prometheus will be set up to collect metrics.

```bash
$ cd /tmp/
$ gen_telemetry_admin_certificate.sh
Generating Private CA Root Certificate
Generating RSA private key, 3072 bit long modulus (2 primes)
............................................................................................................++++
.............++++
e is 65537 (0x010001)
Private CA Root Certificate for Telemetry created in ./daosTelemetryCA
```

This creates the CA key and certificate files:

```bash
$ ls -l /tmp/daosTelemetryCA/
total 12
-rw-r--r-- 1 root daos_daemons 1460 Sep 27 17:06 daosTelemetryCA.crt
-r-------- 1 root root 2455 Sep 27 17:06 daosTelemetryCA.key
-rw-r--r-- 1 root root 0 Sep 27 17:06 index.txt
-rw-r--r-- 1 root root 3 Sep 27 17:06 serial.txt
```

The generated keys and certificates must then be securely distributed to all nodes from which DAOS metrics will be collected.

You can copy these certificates to `/etc/daos/certs/` or some other secure location.

##### Telemetry Server script

The `gen_telemetry_server_certificate.sh` script must be run on each DAOS server or client node from which DAOS metrics will be gathered.

The files below were copied from the admin node in the previous step:

```bash
$ ls -l /tmp/daosTelemetryCA/
total 12
-rw-r--r-- 1 root daos_daemons 1460 Sep 27 17:06 daosTelemetryCA.crt
-r-------- 1 root root 2455 Sep 27 17:06 daosTelemetryCA.key
-rw-r--r-- 1 root root 0 Sep 27 17:06 index.txt
-rw-r--r-- 1 root root 3 Sep 27 17:06 serial.txt
```

Run the script with the following arguments.
The first argument sets the ownership of the generated certificate files; in the example below the script is run on a DAOS client, where the files need to be owned by the `daos_agent` user.
The second argument is optional and gives the certificate path (by default, the current directory).
For security reasons, the script deletes the CA key that was copied from the admin node once it has created the certificate and key for the local node.

```bash
$ cd daosTelemetryCA/
$ gen_telemetry_server_certificate.sh daos_agent
Generating Server Certificate
Generating RSA private key, 2048 bit long modulus (2 primes)
.......................+++++
......................................................................................................+++++
e is 65537 (0x010001)
Signature ok
subject=CN = wolf-170
Getting CA Private Key
Required Server Certificate Files:
.//daosTelemetryCA.crt
.//telemetry.key
.//telemetry.crt
$ ls -l
total 20
-rw-r--r-- 1 root daos_daemons 1460 Sep 27 17:18 daosTelemetryCA.crt
-rw-r--r-- 1 root root 41 Sep 27 17:19 daosTelemetryCA.srl
-rw-r--r-- 1 root root 0 Sep 27 17:18 index.txt
-rw-r--r-- 1 root root 3 Sep 27 17:18 serial.txt
-rw-r--r-- 1 daos_agent daos_agent 1302 Sep 27 17:19 telemetry.crt
-r-------- 1 daos_agent daos_agent 1675 Sep 27 17:19 telemetry.key
```

The example below is run with the `daos_server` user on a server node:

```bash
$ cd daosTelemetryCA/
$ gen_telemetry_server_certificate.sh daos_server
Generating Server Certificate
Generating RSA private key, 2048 bit long modulus (2 primes)
.................................................+++++
.+++++
e is 65537 (0x010001)
Signature ok
subject=CN = wolf-173
Getting CA Private Key
Required Server Certificate Files:
.//daosTelemetryCA.crt
.//telemetry.key
.//telemetry.crt
$ ls -l
total 20
-rw-r--r-- 1 root daos_daemons 1460 Sep 27 17:24 daosTelemetryCA.crt
-rw-r--r-- 1 root root 41 Sep 27 17:24 daosTelemetryCA.srl
-rw-r--r-- 1 root root 0 Sep 27 17:24 index.txt
-rw-r--r-- 1 root root 3 Sep 27 17:24 serial.txt
-rw-r--r-- 1 daos_server daos_server 1302 Sep 27 17:24 telemetry.crt
-r-------- 1 daos_server daos_server 1679 Sep 27 17:24 telemetry.key
```

You can copy these certificates to `/etc/daos/certs/` or some other secure location.
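Before distributing a node certificate, it can be useful to confirm that it really chains to the telemetry CA. The following is a minimal Go sketch of that check, not part of the PR. The function names `verifySignedBy` and `newDemoCert` are hypothetical, and the demo generates throwaway ECDSA certificates in memory (the scripts above produce RSA keys) so that it runs without any files on disk.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// verifySignedBy returns nil if the first certificate in certPEM chains to a
// CA certificate found in caPEM.
func verifySignedBy(caPEM, certPEM []byte) error {
	roots := x509.NewCertPool()
	if !roots.AppendCertsFromPEM(caPEM) {
		return errors.New("no CA certificate found in PEM data")
	}
	block, _ := pem.Decode(certPEM)
	if block == nil || block.Type != "CERTIFICATE" {
		return errors.New("no certificate found in PEM data")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return err
	}
	// ExtKeyUsageAny: check only the signature chain and validity dates.
	_, err = cert.Verify(x509.VerifyOptions{
		Roots:     roots,
		KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageAny},
	})
	return err
}

// newDemoCert creates a throwaway certificate so the example needs no files.
// A nil parent produces a self-signed certificate.
func newDemoCert(cn string, serial int64, isCA bool, parent *x509.Certificate,
	parentKey *ecdsa.PrivateKey) (*x509.Certificate, *ecdsa.PrivateKey, []byte, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(serial),
		Subject:               pkix.Name{CommonName: cn},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(time.Hour),
		KeyUsage:              x509.KeyUsageDigitalSignature,
		IsCA:                  isCA,
		BasicConstraintsValid: true,
	}
	if isCA {
		tmpl.KeyUsage |= x509.KeyUsageCertSign
	}
	if parent == nil {
		parent, parentKey = tmpl, key
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, parent, &key.PublicKey, parentKey)
	if err != nil {
		return nil, nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		return nil, nil, nil, err
	}
	return cert, key, pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der}), nil
}

func main() {
	ca, caKey, caPEM, err := newDemoCert("demo telemetry CA", 1, true, nil, nil)
	if err != nil {
		fmt.Println("create CA:", err)
		return
	}
	_, _, nodePEM, err := newDemoCert("demo-node", 2, false, ca, caKey)
	if err != nil {
		fmt.Println("create node certificate:", err)
		return
	}
	fmt.Println("verification error:", verifySignedBy(caPEM, nodePEM))
}
```

With real files, the same `verifySignedBy` call could be given the contents of `daosTelemetryCA.crt` and `telemetry.crt` read with `os.ReadFile`.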

#### Telemetry Yaml Example
Contributor

Suggested change
-#### Telemetry Yaml Example
+##### Examples

I think it's OK to simplify this and include it under the "Telemetry Endpoint Configuration" section.


Now that the certificates have been created, add their paths to the respective YAML files.

```yaml
# /etc/daos/daos_server.yml (servers)
telemetry_config:
  # To use telemetry in secure mode
  allow_insecure: false
  # Set the server telemetry endpoint port number
  port: 9191
  # Server certificate for use in TLS handshakes
  https_cert: /etc/daos/certs/telemetry.crt
  # Key portion of Server Certificate
  https_key: /etc/daos/certs/telemetry.key
```

Comment on lines +776 to +779
Contributor

I'm leery of calling it a "server certificate" to avoid confusion with the server component certificate -- how about this?

Suggested change
-  # Server certificate for use in TLS handshakes
-  https_cert: /etc/daos/certs/telemetry.crt
-  # Key portion of Server Certificate
-  https_key: /etc/daos/certs/telemetry.key
+  # Configure endpoint to use HTTPS with certificate and key
+  https_cert: /etc/daos/certs/telemetry.crt
+  https_key: /etc/daos/certs/telemetry.key

```yaml
# /etc/daos/daos_agent.yml (clients)
telemetry_config:
  # To use telemetry in secure mode
  allow_insecure: false
  # Enable client telemetry for all DAOS clients.
  enabled: true
  # Set the client telemetry endpoint port number
  port: 9192
  # Retain client telemetry for a period of time after the client process exits.
  retain: 30s
  # Server certificate for use in TLS handshakes
  https_cert: /etc/daos/certs/telemetry.crt
  # Key portion of Server Certificate
  https_key: /etc/daos/certs/telemetry.key
```

Comment on lines +791 to +794
Contributor

Similar to the server case.

Suggested change
-  # Server certificate for use in TLS handshakes
-  https_cert: /etc/daos/certs/telemetry.crt
-  # Key portion of Server Certificate
-  https_key: /etc/daos/certs/telemetry.key
+  # Configure endpoint to use HTTPS with certificate and key
+  https_cert: /etc/daos/certs/telemetry.crt
+  https_key: /etc/daos/certs/telemetry.key
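To illustrate what the `allow_insecure`, `port`, `https_cert` and `https_key` settings mean for the endpoint, here is a minimal Go sketch of a metrics endpoint that switches between HTTP and HTTPS. It is an assumption-laden illustration, not the DAOS exporter code: the `endpointConfig` type, the `serve` function and the `/metrics` path are hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"net/http"
)

// endpointConfig is a hypothetical stand-in for the telemetry_config keys.
type endpointConfig struct {
	Port          int
	AllowInsecure bool
	HTTPSCert     string
	HTTPSKey      string
}

// serve blocks while serving h on the configured port. It uses plain HTTP
// when AllowInsecure is set and HTTPS with the given certificate and key
// otherwise. It returns an error if the listener or the key pair cannot be
// set up.
func serve(c endpointConfig, h http.Handler) error {
	addr := fmt.Sprintf(":%d", c.Port)
	if c.AllowInsecure {
		return http.ListenAndServe(addr, h)
	}
	return http.ListenAndServeTLS(addr, c.HTTPSCert, c.HTTPSKey, h)
}

func main() {
	cfg := endpointConfig{
		Port:      9192,
		HTTPSCert: "/etc/daos/certs/telemetry.crt",
		HTTPSKey:  "/etc/daos/certs/telemetry.key",
	}
	mux := http.NewServeMux()
	mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "# placeholder metrics output")
	})
	if err := serve(cfg, mux); err != nil {
		fmt.Println("telemetry endpoint stopped:", err)
	}
}
```

The point of the sketch is that secure mode cannot start without a readable certificate and key, which is why the configuration loader in this PR rejects a secure configuration that omits either path.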

```yaml
# /etc/daos/daos_control.yml (dmg/admin)
telemetry_config:
  # To use telemetry in secure mode
  allow_insecure: false
  # Custom CA Root certificate for generated certs
  ca_cert: /etc/daos/certs/daosTelemetryCA.crt
```
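The `ca_cert` setting tells the collecting side which CA to trust when it connects to a telemetry endpoint. As a rough sketch of that idea in Go (not the `dmg` implementation), the client below trusts only the certificates in the given CA file. The function name `newTelemetryClient`, the host name `daos-server-1` and the `/metrics` path are assumptions made for the example.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// newTelemetryClient builds an HTTP client that trusts only the CA
// certificates contained in caPEM.
func newTelemetryClient(caPEM []byte) (*http.Client, error) {
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no PEM certificates found in CA data")
	}
	return &http.Client{
		Timeout: 10 * time.Second,
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				RootCAs:    pool,
				MinVersion: tls.VersionTLS12,
			},
		},
	}, nil
}

func main() {
	caPEM, err := os.ReadFile("/etc/daos/certs/daosTelemetryCA.crt")
	if err != nil {
		fmt.Println("read CA certificate:", err)
		return
	}
	client, err := newTelemetryClient(caPEM)
	if err != nil {
		fmt.Println("build client:", err)
		return
	}
	// Hypothetical host name; the port matches the server example above.
	resp, err := client.Get("https://daos-server-1:9191/metrics")
	if err != nil {
		fmt.Println("scrape failed:", err)
		return
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("read response:", err)
		return
	}
	fmt.Printf("status %s, %d bytes of metrics\n", resp.Status, len(body))
}
```

Because the client verifies the host name as well as the chain, the request only succeeds if the node certificate is valid for the name used in the URL.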

### Server Startup

The DAOS Server is started as a systemd service. The DAOS Server
34 changes: 28 additions & 6 deletions src/control/cmd/daos_agent/config.go
@@ -57,14 +57,16 @@ type Config struct {
 	ExcludeFabricIfaces common.StringSet          `yaml:"exclude_fabric_ifaces,omitempty"`
 	FabricInterfaces    []*NUMAFabricConfig       `yaml:"fabric_ifaces,omitempty"`
 	ProviderIdx         uint                      // TODO SRS-31: Enable with multiprovider functionality
-	TelemetryPort       int                       `yaml:"telemetry_port,omitempty"`
-	TelemetryEnabled    bool                      `yaml:"telemetry_enabled,omitempty"`
-	TelemetryRetain     time.Duration             `yaml:"telemetry_retain,omitempty"`
+	TelemetryConfig     *security.TelemetryConfig `yaml:"telemetry_config"`
+	// Support Old config options.
+	TelemetryPort    int           `yaml:"telemetry_port,omitempty"`
+	TelemetryEnabled bool          `yaml:"telemetry_enabled,omitempty"`
+	TelemetryRetain  time.Duration `yaml:"telemetry_retain,omitempty"`
 }

 // TelemetryExportEnabled returns true if client telemetry export is enabled.
 func (c *Config) TelemetryExportEnabled() bool {
-	return c.TelemetryPort > 0
+	return c.TelemetryConfig.Port > 0
 }

 // NUMAFabricConfig defines a list of fabric interfaces that belong to a NUMA
@@ -99,14 +101,33 @@ func LoadConfig(cfgPath string) (*Config, error) {
 		return nil, fmt.Errorf("invalid system name: %s", cfg.SystemName)
 	}

-	if cfg.TelemetryRetain > 0 && cfg.TelemetryPort == 0 {
+	// Support Old config options and copy it to the underline new structure value.
+	if cfg.TelemetryRetain > 0 {
+		cfg.TelemetryConfig.Retain = cfg.TelemetryRetain
+	}
+
+	if cfg.TelemetryPort != 0 {
+		cfg.TelemetryConfig.Port = cfg.TelemetryPort
+	}
+
+	if cfg.TelemetryEnabled {
+		cfg.TelemetryConfig.Enabled = cfg.TelemetryEnabled
+	}
+
+	if cfg.TelemetryConfig.Retain > 0 && cfg.TelemetryConfig.Port == 0 {
 		return nil, errors.New("telemetry_retain requires telemetry_port")
 	}

-	if cfg.TelemetryEnabled && cfg.TelemetryPort == 0 {
+	if cfg.TelemetryConfig.Enabled && cfg.TelemetryConfig.Port == 0 {
 		return nil, errors.New("telemetry_enabled requires telemetry_port")
 	}

+	if !cfg.TelemetryConfig.AllowInsecure {
+		if cfg.TelemetryConfig.HttpsCert == "" || cfg.TelemetryConfig.HttpsKey == "" {
+			return nil, errors.New("For secure mode, https_cert and https_key required under telemetry_config")
+		}
+	}
+
 	return cfg, nil
 }

@@ -121,5 +142,6 @@ func DefaultConfig() *Config {
 		LogLevel:         common.DefaultControlLogLevel,
 		TransportConfig:  security.DefaultAgentTransportConfig(),
 		CredentialConfig: &security.CredentialConfig{},
+		TelemetryConfig:  security.DefaultClientTelemetryConfig(),
 	}
 }
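The `LoadConfig` changes in this file do two things: they copy the legacy top-level `telemetry_*` options into the new `telemetry_config` structure, and they then validate the merged result. The following standalone Go sketch restates that flow so it can be run and tested in isolation. It is an illustration under stated assumptions: `telemetryConfig`, `legacyOptions`, `applyLegacy` and `validate` are hypothetical names, `telemetryConfig` is a simplified stand-in for `security.TelemetryConfig`, and the nil check is an extra guard that the diff does not contain.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// telemetryConfig is a simplified stand-in for security.TelemetryConfig.
type telemetryConfig struct {
	AllowInsecure bool
	Enabled       bool
	Port          int
	Retain        time.Duration
	HTTPSCert     string
	HTTPSKey      string
}

// legacyOptions holds the old top-level telemetry_* agent options.
type legacyOptions struct {
	Port    int
	Enabled bool
	Retain  time.Duration
}

// applyLegacy copies every legacy option that is set into tc, overriding the
// value already present there.
func applyLegacy(tc *telemetryConfig, old legacyOptions) {
	if old.Retain > 0 {
		tc.Retain = old.Retain
	}
	if old.Port != 0 {
		tc.Port = old.Port
	}
	if old.Enabled {
		tc.Enabled = true
	}
}

// validate applies the same checks, in the same order, as the diff above.
func validate(tc *telemetryConfig) error {
	if tc == nil {
		// Extra guard for this sketch; not part of the diff.
		return errors.New("telemetry_config is missing")
	}
	if tc.Retain > 0 && tc.Port == 0 {
		return errors.New("telemetry_retain requires telemetry_port")
	}
	if tc.Enabled && tc.Port == 0 {
		return errors.New("telemetry_enabled requires telemetry_port")
	}
	if !tc.AllowInsecure && (tc.HTTPSCert == "" || tc.HTTPSKey == "") {
		return errors.New("secure mode requires https_cert and https_key under telemetry_config")
	}
	return nil
}

func main() {
	// A legacy-style configuration with secure mode on but no certificate.
	tc := &telemetryConfig{AllowInsecure: false}
	applyLegacy(tc, legacyOptions{Port: 9192, Retain: 30 * time.Second})
	fmt.Println("validation result:", validate(tc))
}
```

One consequence worth noting is that a legacy option, when set, takes precedence over the corresponding value under `telemetry_config`, because the copy happens before validation.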
74 changes: 74 additions & 0 deletions src/control/cmd/daos_agent/config_test.go
@@ -88,6 +88,62 @@ transport_config:
allow_insecure: true
`)

telemetryRetainWithBadPort := test.CreateTestFile(t, dir, `
name: shire
access_points: ["one:10001", "two:10001"]
port: 4242
runtime_dir: /tmp/runtime
log_file: /home/frodo/logfile
control_log_mask: debug
transport_config:
allow_insecure: true
telemetry_config:
telemetry_retain: 1m
telemetry_port: 0
`)

telemetryEnabledWithBadPort := test.CreateTestFile(t, dir, `
name: shire
access_points: ["one:10001", "two:10001"]
port: 4242
runtime_dir: /tmp/runtime
log_file: /home/frodo/logfile
control_log_mask: debug
transport_config:
allow_insecure: true
telemetry_config:
telemetry_enabled: true
telemetry_port: 0
`)

telemetryWithoutHttpsCert := test.CreateTestFile(t, dir, `
name: shire
access_points: ["one:10001", "two:10001"]
port: 4242
runtime_dir: /tmp/runtime
log_file: /home/frodo/logfile
control_log_mask: debug
transport_config:
allow_insecure: true
telemetry_config:
allow_insecure: false
https_cert: ""
`)

telemetryWithoutHttpsKey := test.CreateTestFile(t, dir, `
name: shire
access_points: ["one:10001", "two:10001"]
port: 4242
runtime_dir: /tmp/runtime
log_file: /home/frodo/logfile
control_log_mask: debug
transport_config:
allow_insecure: true
telemetry_config:
allow_insecure: false
https_key: ""
`)

for name, tc := range map[string]struct {
path string
expResult *Config
@@ -108,6 +164,22 @@ transport_config:
path: emptyFile,
expResult: DefaultConfig(),
},
"telemetry retain with no port": {
path: telemetryRetainWithBadPort,
expErr: errors.New("telemetry_retain requires telemetry_port"),
},
"telemetry enabled with no port": {
path: telemetryEnabledWithBadPort,
expErr: errors.New("telemetry_enabled requires telemetry_port"),
},
"telemetry with secure mode with no server certificate": {
path: telemetryWithoutHttpsCert,
expErr: errors.New("For secure mode, https_cert and https_key required under telemetry_config"),
},
"telemetry with secure mode with no server key": {
path: telemetryWithoutHttpsKey,
expErr: errors.New("For secure mode, https_cert and https_key required under telemetry_config"),
},
"without optional items": {
path: withoutOptCfg,
expResult: &Config{
@@ -122,6 +194,7 @@ transport_config:
AllowInsecure: true,
CertificateConfig: DefaultConfig().TransportConfig.CertificateConfig,
},
TelemetryConfig: security.DefaultClientTelemetryConfig(),
},
},
"bad log mask": {
@@ -154,6 +227,7 @@ transport_config:
AllowInsecure: true,
CertificateConfig: DefaultConfig().TransportConfig.CertificateConfig,
},
TelemetryConfig: security.DefaultClientTelemetryConfig(),
ExcludeFabricIfaces: common.NewStringSet("ib3"),
FabricInterfaces: []*NUMAFabricConfig{
{
4 changes: 2 additions & 2 deletions src/control/cmd/daos_agent/infocache.go
@@ -49,8 +49,8 @@ func NewInfoCache(ctx context.Context, log logging.Logger, client control.UnaryI
 		devStateGetter: network.DefaultNetDevStateProvider(log),
 	}

-	ic.clientTelemetryEnabled.Store(cfg.TelemetryEnabled)
-	ic.clientTelemetryRetain.Store(cfg.TelemetryRetain > 0)
+	ic.clientTelemetryEnabled.Store(cfg.TelemetryConfig.Enabled)
+	ic.clientTelemetryRetain.Store(cfg.TelemetryConfig.Retain > 0)

 	if cfg.DisableCache {
 		ic.DisableAttachInfoCache()
3 changes: 2 additions & 1 deletion src/control/cmd/daos_agent/infocache_test.go
@@ -25,6 +25,7 @@ import (
 	"github.com/daos-stack/daos/src/control/lib/hardware"
 	"github.com/daos-stack/daos/src/control/lib/telemetry"
 	"github.com/daos-stack/daos/src/control/logging"
+	"github.com/daos-stack/daos/src/control/security"
 )

 type testInfoCacheParams struct {
@@ -539,7 +540,7 @@ func TestAgent_NewInfoCache(t *testing.T) {
 		t.Run(name, func(t *testing.T) {
 			log, buf := logging.NewTestLogger(t.Name())
 			defer test.ShowBufferOnFailure(t, buf)
-
+			tc.cfg.TelemetryConfig = security.DefaultClientTelemetryConfig()
 			ic := NewInfoCache(test.Context(t), log, nil, tc.cfg)

 			test.AssertEqual(t, tc.expEnabled, ic.IsAttachInfoCacheEnabled(), "")
9 changes: 6 additions & 3 deletions src/control/cmd/daos_agent/telemetry.go
@@ -17,11 +17,14 @@ import (

 func startPrometheusExporter(ctx context.Context, log logging.Logger, cs *promexp.ClientSource, cfg *Config) (func(), error) {
 	expCfg := &promexp.ExporterConfig{
-		Port:  cfg.TelemetryPort,
-		Title: "DAOS Client Telemetry",
+		Port:          cfg.TelemetryConfig.Port,
+		Title:         "DAOS Client Telemetry",
+		AllowInsecure: cfg.TelemetryConfig.AllowInsecure,
+		HttpsCert:     cfg.TelemetryConfig.HttpsCert,
+		HttpsKey:      cfg.TelemetryConfig.HttpsKey,
 		Register: func(ctx context.Context, log logging.Logger) error {
 			c, err := promexp.NewClientCollector(ctx, log, cs, &promexp.CollectorOpts{
-				RetainDuration: cfg.TelemetryRetain,
+				RetainDuration: cfg.TelemetryConfig.Retain,
 			})
 			if err != nil {
 				return err
5 changes: 5 additions & 0 deletions src/control/cmd/dmg/auto_test.go
@@ -592,6 +592,11 @@ system_ram_reserved: 16
disable_hugepages: false
control_log_mask: INFO
control_log_file: /tmp/daos_server.log
telemetry_config:
allow_insecure: true
https_cert: /etc/daos/certs/telemetry.crt
https_key: /etc/daos/certs/telemetry.key
ca_cert: /etc/daos/certs/daosTelemetryCA.crt
core_dump_filter: 19
name: daos_server
socket_dir: /var/run/daos_server