Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-9825 control: Update Telemetry Endpoint to use HTTPS #15216

Open
wants to merge 19 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion docs/admin/administration.md
Original file line number Diff line number Diff line change
Expand Up @@ -286,7 +286,7 @@ written to `$HOME/.prometheus.yml`.
To start the Prometheus server with the configuration file generated by `dmg`:

```
prometheus --config-file=$HOME/.prometheus.yml
prometheus --config.file=$HOME/.prometheus.yml
```

## Storage Operations
Expand Down
50 changes: 50 additions & 0 deletions docs/admin/deployment.md
Original file line number Diff line number Diff line change
Expand Up @@ -759,6 +759,56 @@ transport_config:
key: /etc/daos/certs/admin.key
```

#### Telemetry Certificate Configuration

The DAOS Telemetry framework has option to use certificates to authenticate
between server/client and admin node.
Creating of certificate is not part of DAOS scope and it is up to Admin to
generate the certificate and add it to the DAOS server and client system.

#### Telemetry Yaml Example

Information on telelmetry config parameters in respective yaml file.

```yaml
# /etc/daos/daos_server.yml (servers)
telemetry_config:
# To use telemetry in secure mode
allow_insecure: false
# Set the server telemetry endpoint port number
port: 9191
# Server certificate for use in TLS handshakes
https_cert: /etc/daos/certs/telemetry.crt
# Key portion of Server Certificate
https_key: /etc/daos/certs/telemetry.key
```

```yaml
# /etc/daos/daos_agent.yml (clients)
telemetry_config:
# To use telemetry in secure mode
allow_insecure: false
# Enable client telemetry for all DAOS clients.
enabled: true
# Set the client telemetry endpoint port number
port: 9192
# Retain client telemetry for a period of time after the client process exits.
retain: 30s
# Server certificate for use in TLS handshakes
https_cert: /etc/daos/certs/telemetry.crt
# Key portion of Server Certificate
https_key: /etc/daos/certs/telemetry.key
```

```yaml
# /etc/daos/daos_control.yml (dmg/admin)
telemetry_config:
# To use telemetry in secure mode
allow_insecure: false
# Skip the Server certificate verification. Recomendate for testing purpose only.
https_exception: true
```

### Server Startup

The DAOS Server is started as a systemd service. The DAOS Server
Expand Down
34 changes: 28 additions & 6 deletions src/control/cmd/daos_agent/config.go
Original file line number Diff line number Diff line change
Expand Up @@ -58,9 +58,11 @@ type Config struct {
IncludeFabricIfaces common.StringSet `yaml:"include_fabric_ifaces,omitempty"`
FabricInterfaces []*NUMAFabricConfig `yaml:"fabric_ifaces,omitempty"`
ProviderIdx uint // TODO SRS-31: Enable with multiprovider functionality
TelemetryPort int `yaml:"telemetry_port,omitempty"`
TelemetryEnabled bool `yaml:"telemetry_enabled,omitempty"`
TelemetryRetain time.Duration `yaml:"telemetry_retain,omitempty"`
TelemetryConfig *security.TelemetryConfig `yaml:"telemetry_config"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we'd still want to support the old telemetry config format as well, for at least one version, so we don't force people to change config files without warning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, it's just that those variables moved under the TelemetryConfig struct so from user point of view nothing has changed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An old config file using telemetry_port no longer works, though. A previously-working config file will just stop working, and they will need to change it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point, let me change the name to support that old config too.

// Support Old config options.
TelemetryPort int `yaml:"telemetry_port,omitempty"`
TelemetryEnabled bool `yaml:"telemetry_enabled,omitempty"`
TelemetryRetain time.Duration `yaml:"telemetry_retain,omitempty"`
}

// Validate performs basic validation of the configuration.
Expand All @@ -73,11 +75,24 @@ func (c *Config) Validate() error {
return fmt.Errorf("invalid system name: %s", c.SystemName)
}

if c.TelemetryRetain > 0 && c.TelemetryPort == 0 {
// Support Old config options and copy it to the underline new structure value.
if c.TelemetryRetain > 0 {
c.TelemetryConfig.Retain = c.TelemetryRetain
}

if c.TelemetryPort != 0 {
c.TelemetryConfig.Port = c.TelemetryPort
}

if c.TelemetryEnabled {
c.TelemetryConfig.Enabled = c.TelemetryEnabled
}

if c.TelemetryConfig.Retain > 0 && c.TelemetryConfig.Port == 0 {
return errors.New("telemetry_retain requires telemetry_port")
}

if c.TelemetryEnabled && c.TelemetryPort == 0 {
if c.TelemetryConfig.Enabled && c.TelemetryConfig.Port == 0 {
return errors.New("telemetry_enabled requires telemetry_port")
}

Expand All @@ -90,7 +105,7 @@ func (c *Config) Validate() error {

// TelemetryExportEnabled returns true if client telemetry export is enabled.
func (c *Config) TelemetryExportEnabled() bool {
return c.TelemetryPort > 0
return c.TelemetryConfig.Port > 0
}

// NUMAFabricConfig defines a list of fabric interfaces that belong to a NUMA
Expand Down Expand Up @@ -125,6 +140,12 @@ func LoadConfig(cfgPath string) (*Config, error) {
return nil, errors.Wrap(err, "agent config validation failed")
}

if !cfg.TelemetryConfig.AllowInsecure {
if cfg.TelemetryConfig.HttpsCert == "" || cfg.TelemetryConfig.HttpsKey == "" {
return nil, errors.New("For secure mode, https_cert and https_key required under telemetry_config")
}
}

return cfg, nil
}

Expand All @@ -139,5 +160,6 @@ func DefaultConfig() *Config {
LogLevel: common.DefaultControlLogLevel,
TransportConfig: security.DefaultAgentTransportConfig(),
CredentialConfig: &security.CredentialConfig{},
TelemetryConfig: security.DefaultClientTelemetryConfig(),
}
}
70 changes: 70 additions & 0 deletions src/control/cmd/daos_agent/config_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -100,6 +100,58 @@ include_fabric_ifaces: ["ib0"]
exclude_fabric_ifaces: ["ib3"]
`)

telemetryRetainWithBadPort := test.CreateTestFile(t, dir, `

control_log_mask: debug
transport_config:
allow_insecure: true
telemetry_config:
telemetry_retain: 1m
telemetry_port: 0
`)

telemetryEnabledWithBadPort := test.CreateTestFile(t, dir, `
name: shire
access_points: ["one:10001", "two:10001"]
port: 4242
runtime_dir: /tmp/runtime
log_file: /home/frodo/logfile
control_log_mask: debug
transport_config:
allow_insecure: true
telemetry_config:
telemetry_enabled: true
telemetry_port: 0
`)

telemetryWithoutHttpsCert := test.CreateTestFile(t, dir, `
name: shire
access_points: ["one:10001", "two:10001"]
port: 4242
runtime_dir: /tmp/runtime
log_file: /home/frodo/logfile
control_log_mask: debug
transport_config:
allow_insecure: true
telemetry_config:
allow_insecure: false
https_cert: ""
`)

telemetryWithoutHttpsKey := test.CreateTestFile(t, dir, `
name: shire
access_points: ["one:10001", "two:10001"]
port: 4242
runtime_dir: /tmp/runtime
log_file: /home/frodo/logfile
control_log_mask: debug
transport_config:
allow_insecure: true
telemetry_config:
allow_insecure: false
https_key: ""
`)

for name, tc := range map[string]struct {
path string
expResult *Config
Expand All @@ -120,6 +172,22 @@ exclude_fabric_ifaces: ["ib3"]
path: emptyFile,
expResult: DefaultConfig(),
},
"telemetry retain with no port": {
path: telemetryRetainWithBadPort,
expErr: errors.New("telemetry_retain requires telemetry_port"),
},
"telemetry enabled with no port": {
path: telemetryEnabledWithBadPort,
expErr: errors.New("telemetry_enabled requires telemetry_port"),
},
"telemetry with secure mode with no server certificate": {
path: telemetryWithoutHttpsCert,
expErr: errors.New("For secure mode, https_cert and https_key required under telemetry_config"),
},
"telemetry with secure mode with no server key": {
path: telemetryWithoutHttpsKey,
expErr: errors.New("For secure mode, https_cert and https_key required under telemetry_config"),
},
"without optional items": {
path: withoutOptCfg,
expResult: &Config{
Expand All @@ -134,6 +202,7 @@ exclude_fabric_ifaces: ["ib3"]
AllowInsecure: true,
CertificateConfig: DefaultConfig().TransportConfig.CertificateConfig,
},
TelemetryConfig: security.DefaultClientTelemetryConfig(),
},
},
"bad log mask": {
Expand Down Expand Up @@ -170,6 +239,7 @@ exclude_fabric_ifaces: ["ib3"]
AllowInsecure: true,
CertificateConfig: DefaultConfig().TransportConfig.CertificateConfig,
},
TelemetryConfig: security.DefaultClientTelemetryConfig(),
ExcludeFabricIfaces: common.NewStringSet("ib3"),
FabricInterfaces: []*NUMAFabricConfig{
{
Expand Down
4 changes: 2 additions & 2 deletions src/control/cmd/daos_agent/infocache.go
Original file line number Diff line number Diff line change
Expand Up @@ -49,8 +49,8 @@ func NewInfoCache(ctx context.Context, log logging.Logger, client control.UnaryI
devStateGetter: network.DefaultNetDevStateProvider(log),
}

ic.clientTelemetryEnabled.Store(cfg.TelemetryEnabled)
ic.clientTelemetryRetain.Store(cfg.TelemetryRetain > 0)
ic.clientTelemetryEnabled.Store(cfg.TelemetryConfig.Enabled)
ic.clientTelemetryRetain.Store(cfg.TelemetryConfig.Retain > 0)

if cfg.DisableCache {
ic.DisableAttachInfoCache()
Expand Down
3 changes: 2 additions & 1 deletion src/control/cmd/daos_agent/infocache_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@ import (
"github.com/daos-stack/daos/src/control/lib/hardware"
"github.com/daos-stack/daos/src/control/lib/telemetry"
"github.com/daos-stack/daos/src/control/logging"
"github.com/daos-stack/daos/src/control/security"
)

type testInfoCacheParams struct {
Expand Down Expand Up @@ -539,7 +540,7 @@ func TestAgent_NewInfoCache(t *testing.T) {
t.Run(name, func(t *testing.T) {
log, buf := logging.NewTestLogger(t.Name())
defer test.ShowBufferOnFailure(t, buf)

tc.cfg.TelemetryConfig = security.DefaultClientTelemetryConfig()
ic := NewInfoCache(test.Context(t), log, nil, tc.cfg)

test.AssertEqual(t, tc.expEnabled, ic.IsAttachInfoCacheEnabled(), "")
Expand Down
9 changes: 6 additions & 3 deletions src/control/cmd/daos_agent/telemetry.go
Original file line number Diff line number Diff line change
Expand Up @@ -17,11 +17,14 @@ import (

func startPrometheusExporter(ctx context.Context, log logging.Logger, cs *promexp.ClientSource, cfg *Config) (func(), error) {
expCfg := &promexp.ExporterConfig{
Port: cfg.TelemetryPort,
Title: "DAOS Client Telemetry",
Port: cfg.TelemetryConfig.Port,
Title: "DAOS Client Telemetry",
AllowInsecure: cfg.TelemetryConfig.AllowInsecure,
HttpsCert: cfg.TelemetryConfig.HttpsCert,
HttpsKey: cfg.TelemetryConfig.HttpsKey,
Register: func(ctx context.Context, log logging.Logger) error {
c, err := promexp.NewClientCollector(ctx, log, cs, &promexp.CollectorOpts{
RetainDuration: cfg.TelemetryRetain,
RetainDuration: cfg.TelemetryConfig.Retain,
})
if err != nil {
return err
Expand Down
4 changes: 4 additions & 0 deletions src/control/cmd/dmg/auto_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -595,6 +595,10 @@ system_ram_reserved: 26
disable_hugepages: false
control_log_mask: INFO
control_log_file: /tmp/daos_server.log
telemetry_config:
allow_insecure: true
https_cert: /etc/daos/certs/telemetry.crt
https_key: /etc/daos/certs/telemetry.key
core_dump_filter: 19
name: daos_server
socket_dir: /var/run/daos_server
Expand Down
1 change: 1 addition & 0 deletions src/control/cmd/dmg/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -262,6 +262,7 @@ and access control settings, along with system wide operations.`

if opts.Insecure {
ctlCfg.TransportConfig.AllowInsecure = true
ctlCfg.TelemetryConfig.AllowInsecure = true
}
if err := ctlCfg.TransportConfig.PreLoadCertData(); err != nil {
return errors.Wrap(err, "Unable to load Certificate Data")
Expand Down
Loading
Loading