DAOS-9825 control: Update Telemetry Endpoint to use HTTPS #15216

Open: wants to merge 19 commits into base: master. Showing changes from 4 commits.

2 changes: 1 addition & 1 deletion docs/admin/administration.md
@@ -286,7 +286,7 @@ written to `$HOME/.prometheus.yml`.
To start the Prometheus server with the configuration file generated by `dmg`:

```
-prometheus --config-file=$HOME/.prometheus.yml
+prometheus --config.file=$HOME/.prometheus.yml
```

## Storage Operations
156 changes: 156 additions & 0 deletions docs/admin/deployment.md
@@ -759,6 +759,162 @@ transport_config:
key: /etc/daos/certs/admin.key
```

#### Telemetry Certificate Configuration

The DAOS telemetry framework can optionally use certificates to authenticate
communication between server or client nodes and the admin node. If there is
no existing TLS certificate infrastructure, a set of certificates for a given
DAOS system may be generated by running the `gen_telemetry_admin_certificate.sh`
and `gen_telemetry_server_certificate.sh` scripts provided with the DAOS
software. Both scripts use the `openssl` tool to generate all of the
necessary files.

##### Telemetry Admin script

Run the `gen_telemetry_admin_certificate.sh` script on the system where the `dmg telemetry metrics` command will be run, or on the system where Prometheus will be set up to collect metrics.

```bash
$ cd /tmp/
$ gen_telemetry_admin_certificate.sh
Generating Private CA Root Certificate
Generating RSA private key, 3072 bit long modulus (2 primes)
............................................................................................................++++
.............++++
e is 65537 (0x010001)
Private CA Root Certificate for Telemetry created in ./daosTelemetryCA
```

This creates the CA key and certificate files:

```bash
$ ls -l /tmp/daosTelemetryCA/
total 12
-rw-r--r-- 1 root daos_daemons 1460 Sep 27 17:06 daosTelemetryCA.crt
-r-------- 1 root root 2455 Sep 27 17:06 daosTelemetryCA.key
-rw-r--r-- 1 root root 0 Sep 27 17:06 index.txt
-rw-r--r-- 1 root root 3 Sep 27 17:06 serial.txt
```

The generated keys and certificates must then be securely distributed to all nodes from which DAOS metrics need to be collected.

You can copy these certificates to `/etc/daos/certs/` or some other secure location.

##### Telemetry Server script

Run the `gen_telemetry_server_certificate.sh` script on each DAOS server or client node from which DAOS metrics need to be gathered.

The files below were copied from the admin node in the previous step:

```bash
$ ls -l /tmp/daosTelemetryCA/
total 12
-rw-r--r-- 1 root daos_daemons 1460 Sep 27 17:06 daosTelemetryCA.crt
-r-------- 1 root root 2455 Sep 27 17:06 daosTelemetryCA.key
-rw-r--r-- 1 root root 0 Sep 27 17:06 index.txt
-rw-r--r-- 1 root root 3 Sep 27 17:06 serial.txt
```

Run the script with the following arguments.
The first argument is the user that should own the generated certificate and key. In the example below, the script is run on a DAOS client, where the files must be owned by the `daos_agent` user.
The second argument is optional and gives the certificate path (by default, the current directory).
For security reasons, once the script has created the local node certificate and key, it deletes the copy of the CA key that was brought over from the admin node.

```bash
$ cd daosTelemetryCA/
$ gen_telemetry_server_certificate.sh daos_agent
Generating Server Certificate
Generating RSA private key, 2048 bit long modulus (2 primes)
.......................+++++
......................................................................................................+++++
e is 65537 (0x010001)
Signature ok
subject=CN = wolf-170
Getting CA Private Key
Required Server Certificate Files:
.//daosTelemetryCA.crt
.//telemetryserver.key
.//telemetryserver.crt
$ ls -l
total 20
-rw-r--r-- 1 root daos_daemons 1460 Sep 27 17:18 daosTelemetryCA.crt
-rw-r--r-- 1 root root 41 Sep 27 17:19 daosTelemetryCA.srl
-rw-r--r-- 1 root root 0 Sep 27 17:18 index.txt
-rw-r--r-- 1 root root 3 Sep 27 17:18 serial.txt
-rw-r--r-- 1 daos_agent daos_agent 1302 Sep 27 17:19 telemetryserver.crt
-r-------- 1 daos_agent daos_agent 1675 Sep 27 17:19 telemetryserver.key
```

The example below is run with the `daos_server` user on a server node:

```bash
$ cd daosTelemetryCA/
$ gen_telemetry_server_certificate.sh daos_server
Generating Server Certificate
Generating RSA private key, 2048 bit long modulus (2 primes)
.................................................+++++
.+++++
e is 65537 (0x010001)
Signature ok
subject=CN = wolf-173
Getting CA Private Key
Required Server Certificate Files:
.//daosTelemetryCA.crt
.//telemetryserver.key
.//telemetryserver.crt
$ ls -l
total 20
-rw-r--r-- 1 root daos_daemons 1460 Sep 27 17:24 daosTelemetryCA.crt
-rw-r--r-- 1 root root 41 Sep 27 17:24 daosTelemetryCA.srl
-rw-r--r-- 1 root root 0 Sep 27 17:24 index.txt
-rw-r--r-- 1 root root 3 Sep 27 17:24 serial.txt
-rw-r--r-- 1 daos_server daos_server 1302 Sep 27 17:24 telemetryserver.crt
-r-------- 1 daos_server daos_server 1679 Sep 27 17:24 telemetryserver.key
```

You can copy these certificates to `/etc/daos/certs/` or some other secure location.
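
Before relying on the copied files, it can be useful to confirm that a node certificate really chains to the telemetry CA. The following is a minimal Go sketch using only the standard library. The helper names `parseCert` and `verifyAgainstCA` are hypothetical and not part of DAOS, and the paths assume the files were copied to `/etc/daos/certs/` as suggested above.

```go
package main

import (
	"crypto/x509"
	"encoding/pem"
	"errors"
	"fmt"
	"os"
)

// parseCert decodes the first PEM block in data as an X.509 certificate.
func parseCert(data []byte) (*x509.Certificate, error) {
	block, _ := pem.Decode(data)
	if block == nil || block.Type != "CERTIFICATE" {
		return nil, errors.New("no PEM certificate block found")
	}
	return x509.ParseCertificate(block.Bytes)
}

// verifyAgainstCA (hypothetical helper) returns nil if the certificate in
// certPEM chains to a CA certificate in caPEM.
func verifyAgainstCA(caPEM, certPEM []byte) error {
	roots := x509.NewCertPool()
	if !roots.AppendCertsFromPEM(caPEM) {
		return errors.New("CA data contains no usable certificate")
	}
	cert, err := parseCert(certPEM)
	if err != nil {
		return err
	}
	// ExtKeyUsageAny keeps the check focused on the chain itself rather
	// than on the extended key usage recorded in the certificate.
	_, err = cert.Verify(x509.VerifyOptions{
		Roots:     roots,
		KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageAny},
	})
	return err
}

func main() {
	// Assumed locations; adjust to wherever the files were copied.
	caPEM, err := os.ReadFile("/etc/daos/certs/daosTelemetryCA.crt")
	if err != nil {
		fmt.Println("cannot read CA certificate:", err)
		return
	}
	certPEM, err := os.ReadFile("/etc/daos/certs/telemetryserver.crt")
	if err != nil {
		fmt.Println("cannot read node certificate:", err)
		return
	}
	if err := verifyAgainstCA(caPEM, certPEM); err != nil {
		fmt.Println("verification failed:", err)
		return
	}
	fmt.Println("telemetryserver.crt chains to daosTelemetryCA.crt")
}
```

The check only covers the trust chain; it does not confirm that the key file matches the certificate or that file ownership is correct.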

#### Telemetry YAML Example

Once the certificates have been created, add their paths to the respective YAML configuration files.

```yaml
# /etc/daos/daos_server.yml (servers)
telemetry_config:
  # To use telemetry in secure mode
  allow_insecure: false
  # Set the server telemetry endpoint port number
  port: 9191
  # Server certificate for use in TLS handshakes
  server_cert: /etc/daos/certs/telemetryserver.crt
  # Key portion of Server Certificate
  server_key: /etc/daos/certs/telemetryserver.key
```

```yaml
# /etc/daos/daos_agent.yml (clients)
telemetry_config:
  # To use telemetry in secure mode
  allow_insecure: false
  # Enable client telemetry for all DAOS clients.
  enabled: true
  # Set the client telemetry endpoint port number
  port: 9192
  # Retain client telemetry for a period of time after the client process exits.
  retain: 30s
  # Server certificate for use in TLS handshakes
  server_cert: /etc/daos/certs/telemetryserver.crt
  # Key portion of Server Certificate
  server_key: /etc/daos/certs/telemetryserver.key
```

```yaml
# /etc/daos/daos_control.yml (dmg/admin)
telemetry_config:
  # To use telemetry in secure mode
  allow_insecure: false
  # Custom CA Root certificate for generated certs
  ca_cert: /etc/daos/certs/daosTelemetryCA.crt
```
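
To illustrate what the `ca_cert` setting implies on the collecting side, here is a hedged Go sketch of an HTTPS client that trusts only the telemetry CA and fetches metrics from a server endpoint. `newTelemetryClient` is a hypothetical helper, not DAOS code. The host `wolf-173` and port `9191` are taken from the examples above, and the `/metrics` path is an assumption based on Prometheus convention.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// newTelemetryClient (hypothetical helper) builds an HTTP client whose only
// trusted root is the CA certificate contained in caPEM.
func newTelemetryClient(caPEM []byte) (*http.Client, error) {
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no certificates found in CA PEM data")
	}
	return &http.Client{
		Timeout: 10 * time.Second,
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				RootCAs:    pool,
				MinVersion: tls.VersionTLS12,
			},
		},
	}, nil
}

func main() {
	caPEM, err := os.ReadFile("/etc/daos/certs/daosTelemetryCA.crt")
	if err != nil {
		fmt.Println("cannot read CA certificate:", err)
		return
	}
	client, err := newTelemetryClient(caPEM)
	if err != nil {
		fmt.Println("cannot build client:", err)
		return
	}
	// Assumed endpoint: example host, server port from the YAML above,
	// conventional Prometheus metrics path.
	resp, err := client.Get("https://wolf-173:9191/metrics")
	if err != nil {
		fmt.Println("scrape failed:", err)
		return
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(io.LimitReader(resp.Body, 1024))
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Printf("%s\n%s\n", resp.Status, body)
}
```

Go's TLS stack matches the requested host name against the certificate's subject alternative names, so the request succeeds only if the node certificate carries a SAN that matches the host used in the URL.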

### Server Startup

The DAOS Server is started as a systemd service. The DAOS Server
17 changes: 11 additions & 6 deletions src/control/cmd/daos_agent/config.go
@@ -57,14 +57,12 @@ type Config struct {
 	ExcludeFabricIfaces common.StringSet          `yaml:"exclude_fabric_ifaces,omitempty"`
 	FabricInterfaces    []*NUMAFabricConfig       `yaml:"fabric_ifaces,omitempty"`
 	ProviderIdx         uint                      // TODO SRS-31: Enable with multiprovider functionality
-	TelemetryPort       int                       `yaml:"telemetry_port,omitempty"`
-	TelemetryEnabled    bool                      `yaml:"telemetry_enabled,omitempty"`
-	TelemetryRetain     time.Duration             `yaml:"telemetry_retain,omitempty"`
+	TelemetryConfig     *security.TelemetryConfig `yaml:"telemetry_config"`

Contributor:
I thought we'd still want to support the old telemetry config format as well, for at least one version, so we don't force people to change config files without warning.

Contributor Author:
Right, it's just that those variables moved under the TelemetryConfig struct so from user point of view nothing has changed.

Contributor:
An old config file using telemetry_port no longer works, though. A previously-working config file will just stop working, and they will need to change it.

Contributor Author:
I see your point, let me change the name to support that old config too.

 }

 // TelemetryExportEnabled returns true if client telemetry export is enabled.
 func (c *Config) TelemetryExportEnabled() bool {
-	return c.TelemetryPort > 0
+	return c.TelemetryConfig.Port > 0
 }

 // NUMAFabricConfig defines a list of fabric interfaces that belong to a NUMA
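
The compatibility concern raised in the review thread on the `TelemetryConfig` field could be handled by keeping the deprecated flat keys for one release and folding them into the nested struct after parsing. The sketch below shows one possible shape for that, under stated assumptions: `telemetryConfig`, `agentConfig`, and `applyLegacyTelemetry` are simplified, hypothetical stand-ins and not the types or the approach actually adopted in this PR.

```go
package main

import (
	"fmt"
	"time"
)

// telemetryConfig is a simplified stand-in for security.TelemetryConfig.
type telemetryConfig struct {
	Port    int
	Enabled bool
	Retain  time.Duration
}

// agentConfig (hypothetical) keeps the deprecated flat settings next to the
// new nested config so that old files still parse.
type agentConfig struct {
	LegacyPort    int           // formerly telemetry_port
	LegacyEnabled bool          // formerly telemetry_enabled
	LegacyRetain  time.Duration // formerly telemetry_retain
	Telemetry     *telemetryConfig
}

// applyLegacyTelemetry copies deprecated settings into the nested config
// where the nested value is unset. It returns the deprecated key names that
// were used so the caller can log a deprecation warning.
func applyLegacyTelemetry(c *agentConfig) []string {
	if c.Telemetry == nil {
		c.Telemetry = &telemetryConfig{}
	}
	var used []string
	if c.LegacyPort != 0 && c.Telemetry.Port == 0 {
		c.Telemetry.Port = c.LegacyPort
		used = append(used, "telemetry_port")
	}
	if c.LegacyEnabled && !c.Telemetry.Enabled {
		c.Telemetry.Enabled = true
		used = append(used, "telemetry_enabled")
	}
	if c.LegacyRetain != 0 && c.Telemetry.Retain == 0 {
		c.Telemetry.Retain = c.LegacyRetain
		used = append(used, "telemetry_retain")
	}
	return used
}

func main() {
	cfg := &agentConfig{LegacyPort: 9192, LegacyRetain: 30 * time.Second}
	used := applyLegacyTelemetry(cfg)
	fmt.Println(cfg.Telemetry.Port, cfg.Telemetry.Retain, used)
}
```

In this sketch the nested value wins when both forms are present, which keeps the new format authoritative while old files continue to load.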
@@ -99,14 +97,20 @@ func LoadConfig(cfgPath string) (*Config, error) {
 		return nil, fmt.Errorf("invalid system name: %s", cfg.SystemName)
 	}

-	if cfg.TelemetryRetain > 0 && cfg.TelemetryPort == 0 {
+	if cfg.TelemetryConfig.Retain > 0 && cfg.TelemetryConfig.Port == 0 {
 		return nil, errors.New("telemetry_retain requires telemetry_port")
 	}

-	if cfg.TelemetryEnabled && cfg.TelemetryPort == 0 {
+	if cfg.TelemetryConfig.Enabled && cfg.TelemetryConfig.Port == 0 {
 		return nil, errors.New("telemetry_enabled requires telemetry_port")
 	}

+
+	if cfg.TelemetryConfig.AllowInsecure == false {

Contributor:
nit - slightly more idiomatic way of saying the same

Suggested change:
-	if cfg.TelemetryConfig.AllowInsecure == false {
+	if !cfg.TelemetryConfig.AllowInsecure {

Contributor Author:
Will update

+		if cfg.TelemetryConfig.ServerCert == "" || cfg.TelemetryConfig.ServerKey == "" {
+			return nil, errors.New("For secure mode, server_cert and server_key required under telemetry_config")
+		}
+	}

 	return cfg, nil
 }

@@ -121,5 +125,6 @@ func DefaultConfig() *Config {
 		LogLevel:         common.DefaultControlLogLevel,
 		TransportConfig:  security.DefaultAgentTransportConfig(),
 		CredentialConfig: &security.CredentialConfig{},
+		TelemetryConfig:  security.DefaultClientTelemetryConfig(),
 	}
 }
74 changes: 74 additions & 0 deletions src/control/cmd/daos_agent/config_test.go
@@ -88,6 +88,62 @@ transport_config:
  allow_insecure: true
`)

telemetryRetainWithBadPort := test.CreateTestFile(t, dir, `
name: shire
access_points: ["one:10001", "two:10001"]
port: 4242
runtime_dir: /tmp/runtime
log_file: /home/frodo/logfile
control_log_mask: debug
transport_config:
  allow_insecure: true
telemetry_config:
  retain: 1

Contributor:
In the config file example, this is shown as "1m". Is that valid?

Contributor Author:
yes as it's time.duration but will change it to 1m as better example

  port: 0
`)

telemetryEnabledWithBadPort := test.CreateTestFile(t, dir, `
name: shire
access_points: ["one:10001", "two:10001"]
port: 4242
runtime_dir: /tmp/runtime
log_file: /home/frodo/logfile
control_log_mask: debug
transport_config:
  allow_insecure: true
telemetry_config:
  enabled: true
  port: 0
`)

telemetryWithoutServerCert := test.CreateTestFile(t, dir, `
name: shire
access_points: ["one:10001", "two:10001"]
port: 4242
runtime_dir: /tmp/runtime
log_file: /home/frodo/logfile
control_log_mask: debug
transport_config:
  allow_insecure: true
telemetry_config:
  allow_insecure: false
  server_cert: ""
`)

telemetryWithoutServerKey := test.CreateTestFile(t, dir, `
name: shire
access_points: ["one:10001", "two:10001"]
port: 4242
runtime_dir: /tmp/runtime
log_file: /home/frodo/logfile
control_log_mask: debug
transport_config:
  allow_insecure: true
telemetry_config:
  allow_insecure: false
  server_key: ""
`)

for name, tc := range map[string]struct {
path string
expResult *Config
@@ -108,6 +164,22 @@ transport_config:
path: emptyFile,
expResult: DefaultConfig(),
},
"telemetry retain with no port": {
path: telemetryRetainWithBadPort,
expErr: errors.New("telemetry_retain requires telemetry_port"),
},
"telemetry enabled with no port": {
path: telemetryEnabledWithBadPort,
expErr: errors.New("telemetry_enabled requires telemetry_port"),
},
"telemetry with secure mode with no server certificate": {
path: telemetryWithoutServerCert,
expErr: errors.New("For secure mode, server_cert and server_key required under telemetry_config"),
},
"telemetry with secure mode with no server key": {
path: telemetryWithoutServerKey,
expErr: errors.New("For secure mode, server_cert and server_key required under telemetry_config"),
},
"without optional items": {
path: withoutOptCfg,
expResult: &Config{
@@ -122,6 +194,7 @@ transport_config:
AllowInsecure: true,
CertificateConfig: DefaultConfig().TransportConfig.CertificateConfig,
},
TelemetryConfig: security.DefaultClientTelemetryConfig(),
},
},
"bad log mask": {
@@ -154,6 +227,7 @@ transport_config:
AllowInsecure: true,
CertificateConfig: DefaultConfig().TransportConfig.CertificateConfig,
},
TelemetryConfig: security.DefaultClientTelemetryConfig(),
ExcludeFabricIfaces: common.NewStringSet("ib3"),
FabricInterfaces: []*NUMAFabricConfig{
{
4 changes: 2 additions & 2 deletions src/control/cmd/daos_agent/infocache.go
@@ -49,8 +49,8 @@ func NewInfoCache(ctx context.Context, log logging.Logger, client control.UnaryI
 		devStateGetter: network.DefaultNetDevStateProvider(log),
 	}

-	ic.clientTelemetryEnabled.Store(cfg.TelemetryEnabled)
-	ic.clientTelemetryRetain.Store(cfg.TelemetryRetain > 0)
+	ic.clientTelemetryEnabled.Store(cfg.TelemetryConfig.Enabled)
+	ic.clientTelemetryRetain.Store(cfg.TelemetryConfig.Retain > 0)

 	if cfg.DisableCache {
 		ic.DisableAttachInfoCache()
3 changes: 2 additions & 1 deletion src/control/cmd/daos_agent/infocache_test.go
@@ -25,6 +25,7 @@ import (
 	"github.com/daos-stack/daos/src/control/lib/hardware"
 	"github.com/daos-stack/daos/src/control/lib/telemetry"
 	"github.com/daos-stack/daos/src/control/logging"
+	"github.com/daos-stack/daos/src/control/security"
 )

 type testInfoCacheParams struct {
@@ -539,7 +540,7 @@ func TestAgent_NewInfoCache(t *testing.T) {
 		t.Run(name, func(t *testing.T) {
 			log, buf := logging.NewTestLogger(t.Name())
 			defer test.ShowBufferOnFailure(t, buf)
-
+			tc.cfg.TelemetryConfig = security.DefaultClientTelemetryConfig()
 			ic := NewInfoCache(test.Context(t), log, nil, tc.cfg)

 			test.AssertEqual(t, tc.expEnabled, ic.IsAttachInfoCacheEnabled(), "")
9 changes: 6 additions & 3 deletions src/control/cmd/daos_agent/telemetry.go
@@ -17,11 +17,14 @@ import (

 func startPrometheusExporter(ctx context.Context, log logging.Logger, cs *promexp.ClientSource, cfg *Config) (func(), error) {
 	expCfg := &promexp.ExporterConfig{
-		Port:  cfg.TelemetryPort,
-		Title: "DAOS Client Telemetry",
+		Port:          cfg.TelemetryConfig.Port,
+		Title:         "DAOS Client Telemetry",
+		AllowInsecure: cfg.TelemetryConfig.AllowInsecure,
+		HttpsCert:     cfg.TelemetryConfig.ServerCert,
+		HttpsKey:      cfg.TelemetryConfig.ServerKey,
 		Register: func(ctx context.Context, log logging.Logger) error {
 			c, err := promexp.NewClientCollector(ctx, log, cs, &promexp.CollectorOpts{
-				RetainDuration: cfg.TelemetryRetain,
+				RetainDuration: cfg.TelemetryConfig.Retain,
 			})
 			if err != nil {
 				return err
5 changes: 5 additions & 0 deletions src/control/cmd/dmg/auto_test.go
@@ -592,6 +592,11 @@ system_ram_reserved: 16
disable_hugepages: false
control_log_mask: INFO
control_log_file: /tmp/daos_server.log
telemetry_config:
  allow_insecure: false
  server_cert: /etc/daos/certs/telemetryserver.crt
  server_key: /etc/daos/certs/telemetryserver.key
  ca_cert: /etc/daos/certs/daosTelemetryCA.crt
core_dump_filter: 19
name: daos_server
socket_dir: /var/run/daos_server
1 change: 1 addition & 0 deletions src/control/cmd/dmg/main.go
@@ -262,6 +262,7 @@ and access control settings, along with system wide operations.`

	if opts.Insecure {
		ctlCfg.TransportConfig.AllowInsecure = true
		ctlCfg.TelemetryConfig.AllowInsecure = true
	}
	if err := ctlCfg.TransportConfig.PreLoadCertData(); err != nil {
		return errors.Wrap(err, "Unable to load Certificate Data")