Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: fix master config and experiment config for log retention #9075

Merged
merged 1 commit into from
Apr 2, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
20 changes: 10 additions & 10 deletions docs/reference/deploy/master-config-reference.rst
Original file line number Diff line number Diff line change
Expand Up @@ -1542,14 +1542,14 @@ Security-related configuration settings.
certificate is not signed by a well-known CA; cannot be specified if ``skip_verify`` is
enabled.

***********************
``logging_retention``
***********************
**********************
``retention_policy``
**********************

Specifies configuration settings for the logging retention of trial logs.
Specifies configuration settings for the retention of trial logs.

``days``
========
``log_retention_days``
======================

Number of days to retain logs for by default. This can be overridden on a per-experiment basis in
the :ref:`experiment configuration <log-retention-days>`. Values should be between ``-1`` and
Expand All @@ -1566,16 +1566,16 @@ For example, to schedule cleanup for midnight every day:

.. code:: yaml

logging_retention:
days: 90
retention_policy:
log_retention_days: 90
schedule: "0 0 * * *"

or to schedule cleanup every 24 hours from start:

.. code:: yaml

logging_retention:
days: 90
retention_policy:
log_retention_days: 90
schedule: "24h"

**********
Expand Down
20 changes: 14 additions & 6 deletions docs/reference/experiment-config-reference.rst
Original file line number Diff line number Diff line change
Expand Up @@ -321,18 +321,26 @@ defaults.

.. _log-retention-days:

``log_retention_days``
======================
``retention_policy``
====================

Optional. Defines retention policies for logs related to all trials of a given experiment.
Parameters include:

Optional. Overrides the number of days to retain logs for a trial. Values should be between ``-1``
and ``32767``. If set to ``-1``, logs will be retained indefinitely. If set to ``0``, logs will be
deleted during the next cleanup.
- ``log_retention_days``: Optional. Overrides the number of days to retain logs for a trial.
Acceptable values range from ``-1`` to ``32767``. If set to ``-1``, logs will be retained
indefinitely. If set to ``0``, logs will be deleted during the next cleanup. To modify the
retention settings post-completion for a single trial or the entire experiment, you can use the
CLI command ``det t set log-retention <trial-id>`` or ``det e set log-retention <exp-id>``. Both
commands accept either the argument: ``--days``, which sets the number of days to retain logs
from the time of creation, or ``--forever`` which retains logs indefinitely.

Example configuration:

.. code:: yaml

log_retention_days: 90
retention_policy:
log_retention_days: 90

This setting can be defined as a default setting for the entire cluster.

Expand Down
32 changes: 21 additions & 11 deletions master/internal/api_logretention_intg_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,17 @@ import (
"github.com/determined-ai/determined/proto/pkg/utilv1"
)

const pgTimeFormat = "2006-01-02T15:04:05.888738 -07:00:00"
const (
pgTimeFormat = "2006-01-02T15:04:05.888738 -07:00:00"
logRetentionConfig1000days = `
retention_policy:
log_retention_days: 1000
`
logRetentionConfigForever = `
retention_policy:
log_retention_days: -1
`
)

func setRetentionTime(timestamp string) error {
_, err := db.Bun().NewRaw(fmt.Sprintf(`
Expand Down Expand Up @@ -121,13 +131,13 @@ func TestDeleteExpiredTaskLogs(t *testing.T) {
require.Len(t, taskIDs1, 5)

// Create an experiment1 with 5 trials and a config to expire in 1000 days.
experiment2, trialIDs2, taskIDs2 := createTestRetentionExperiment(ctx, t, api, "log_retention_days: 1000", 5)
experiment2, trialIDs2, taskIDs2 := createTestRetentionExperiment(ctx, t, api, logRetentionConfig1000days, 5)
require.Nil(t, experiment2.EndTime)
require.Len(t, trialIDs2, 5)
require.Len(t, taskIDs2, 5)

// Create an experiment1 with 5 trials and config to never expire.
experiment3, trialIDs3, taskIDs3 := createTestRetentionExperiment(ctx, t, api, "log_retention_days: -1", 5)
experiment3, trialIDs3, taskIDs3 := createTestRetentionExperiment(ctx, t, api, logRetentionConfigForever, 5)
require.Nil(t, experiment3.EndTime)
require.Len(t, trialIDs3, 5)
require.Len(t, taskIDs3, 5)
Expand Down Expand Up @@ -292,8 +302,8 @@ func TestScheduleRetentionNoConfig(t *testing.T) {
api, _, ctx := setupAPITest(t, nil)

err := logretention.Schedule(model.LogRetentionPolicy{
Days: ptrs.Ptr(int16(100)),
Schedule: ptrs.Ptr("0 0 * * *"),
LogRetentionDays: ptrs.Ptr(int16(100)),
Schedule: ptrs.Ptr("0 0 * * *"),
})
require.NoError(t, err)

Expand Down Expand Up @@ -389,8 +399,8 @@ func TestScheduleRetention1000days(t *testing.T) {
api, _, ctx := setupAPITest(t, nil)

err := logretention.Schedule(model.LogRetentionPolicy{
Days: ptrs.Ptr(int16(100)),
Schedule: ptrs.Ptr("0 0 * * *"),
LogRetentionDays: ptrs.Ptr(int16(100)),
Schedule: ptrs.Ptr("0 0 * * *"),
})
require.NoError(t, err)

Expand All @@ -399,7 +409,7 @@ func TestScheduleRetention1000days(t *testing.T) {
require.NoError(t, err)

// Create an experiment with 5 trials and a config to expire in 1000 days.
experiment, trialIDs, taskIDs := createTestRetentionExperiment(ctx, t, api, "log_retention_days: 1000", 5)
experiment, trialIDs, taskIDs := createTestRetentionExperiment(ctx, t, api, logRetentionConfig1000days, 5)
require.Nil(t, experiment.EndTime)
require.Len(t, trialIDs, 5)
require.Len(t, taskIDs, 5)
Expand Down Expand Up @@ -493,8 +503,8 @@ func TestScheduleRetentionNeverExpire(t *testing.T) {
api, _, ctx := setupAPITest(t, nil)

err := logretention.Schedule(model.LogRetentionPolicy{
Days: ptrs.Ptr(int16(100)),
Schedule: ptrs.Ptr("0 0 * * *"),
LogRetentionDays: ptrs.Ptr(int16(100)),
Schedule: ptrs.Ptr("0 0 * * *"),
})
require.NoError(t, err)

Expand All @@ -503,7 +513,7 @@ func TestScheduleRetentionNeverExpire(t *testing.T) {
require.NoError(t, err)

// Create an experiment with 5 trials and config to never expire.
experiment, trialIDs, taskIDs := createTestRetentionExperiment(ctx, t, api, "log_retention_days: -1", 5)
experiment, trialIDs, taskIDs := createTestRetentionExperiment(ctx, t, api, logRetentionConfigForever, 5)
require.Nil(t, experiment.EndTime)
require.Len(t, trialIDs, 5)
require.Len(t, taskIDs, 5)
Expand Down
2 changes: 1 addition & 1 deletion master/internal/config/config.go
Original file line number Diff line number Diff line change
Expand Up @@ -152,7 +152,7 @@ type Config struct {
LaunchError bool `json:"launch_error"`
ClusterName string `json:"cluster_name"`
Logging model.LoggingConfig `json:"logging"`
LoggingRetention model.LogRetentionPolicy `json:"logging_retention"`
RetentionPolicy model.LogRetentionPolicy `json:"retention_policy"`
Observability ObservabilityConfig `json:"observability"`
Cache CacheConfig `json:"cache"`
Webhooks WebhooksConfig `json:"webhooks"`
Expand Down
4 changes: 2 additions & 2 deletions master/internal/core.go
Original file line number Diff line number Diff line change
Expand Up @@ -1123,8 +1123,8 @@ func (m *Master) Run(ctx context.Context, gRPCLogInitDone chan struct{}) error {
SegmentEnabled: m.config.Telemetry.Enabled && m.config.Telemetry.SegmentMasterKey != "",
SegmentAPIKey: m.config.Telemetry.SegmentMasterKey,
}
if m.config.LoggingRetention.Schedule != nil {
if err := logretention.Schedule(m.config.LoggingRetention); err != nil {
if m.config.RetentionPolicy.Schedule != nil {
if err := logretention.Schedule(m.config.RetentionPolicy); err != nil {
return errors.Wrap(err, "initializing log retention")
}
}
Expand Down
4 changes: 2 additions & 2 deletions master/internal/core_experiment.go
Original file line number Diff line number Diff line change
Expand Up @@ -311,8 +311,8 @@ func (m *Master) parseCreateExperiment(ctx context.Context, req *apiv1.CreateExp
return nil, nil, config, nil, nil, errors.New("managed experiments require entrypoint")
}
// Merge log retention into the taskSpec.
if config.RawLogRetentionDays != nil {
taskSpec.LogRetentionDays = config.RawLogRetentionDays
if config.RawRetentionPolicy != nil {
taskSpec.LogRetentionDays = config.RawRetentionPolicy.RawLogRetentionDays
}

// Merge in workspace's checkpoint storage into the conifg.
Expand Down
3 changes: 2 additions & 1 deletion master/internal/experiment/bulk_action.go
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ import (
"github.com/determined-ai/determined/master/internal/db/bunutils"
"github.com/determined-ai/determined/master/internal/grpcutil"
"github.com/determined-ai/determined/master/pkg/model"
"github.com/determined-ai/determined/master/pkg/schemas/expconf"
"github.com/determined-ai/determined/proto/pkg/apiv1"
"github.com/determined-ai/determined/proto/pkg/experimentv1"
"github.com/determined-ai/determined/proto/pkg/rbacv1"
Expand Down Expand Up @@ -760,7 +761,7 @@ func changeExperimentConfigLogRetention(ctx context.Context, database db.DB,
err, "unable to load config for experiment %v", exp.ID,
)
}
activeConfig.SetLogRetentionDays(&numDays)
activeConfig.SetRetentionPolicy(&expconf.RetentionPolicyConfigV0{RawLogRetentionDays: &numDays})

if err := database.SaveExperimentConfig(exp.ID, activeConfig); err != nil {
return errors.Wrapf(err, "patching experiment %d", exp.ID)
Expand Down
2 changes: 1 addition & 1 deletion master/internal/logretention/logretention.go
Original file line number Diff line number Diff line change
Expand Up @@ -54,7 +54,7 @@ func Schedule(config model.LogRetentionPolicy) error {
TestingOnlySynchronizationHelper.Done()
}
}()
count, err := DeleteExpiredTaskLogs(context.Background(), config.Days)
count, err := DeleteExpiredTaskLogs(context.Background(), config.LogRetentionDays)
if err != nil {
log.WithError(err).Error("failed to delete expired task logs")
} else if count > 0 {
Expand Down
4 changes: 2 additions & 2 deletions master/internal/populate_metrics.go
Original file line number Diff line number Diff line change
Expand Up @@ -251,7 +251,7 @@ func PopulateExpTrialsMetrics(pgdb *db.PgDB, masterConfig *config.Config, trivia
JobID: &jID,
TaskType: model.TaskTypeTrial,
StartTime: time.Now().UTC().Truncate(time.Millisecond),
LogRetentionDays: masterConfig.LoggingRetention.Days,
LogRetentionDays: masterConfig.RetentionPolicy.LogRetentionDays,
}
if err = db.AddTask(ctx, tIn); err != nil {
return err
Expand All @@ -261,7 +261,7 @@ func PopulateExpTrialsMetrics(pgdb *db.PgDB, masterConfig *config.Config, trivia
ExperimentID: exp.ID,
State: model.CompletedState,
StartTime: time.Now(),
LogRetentionDays: masterConfig.LoggingRetention.Days,
LogRetentionDays: masterConfig.RetentionPolicy.LogRetentionDays,
}
if err = db.AddTrial(ctx, &tr, tID); err != nil {
return err
Expand Down
4 changes: 2 additions & 2 deletions master/pkg/model/logging_config.go
Original file line number Diff line number Diff line change
Expand Up @@ -87,7 +87,7 @@ func (o *ElasticSecurityConfig) Resolve() error {
// LogRetentionPolicy configures the default log retention policy for trials and tasks.
type LogRetentionPolicy struct {
// Days is the default number of days to retain logs for.
Days *int16 `json:"days"`
LogRetentionDays *int16 `json:"log_retention_days"`
// Schedule is a time duration or cron expression interval to cleanup logs.
Schedule *string `json:"schedule"`
}
Expand All @@ -100,7 +100,7 @@ var (
// Validate implements the check.Validatable interface.
func (p LogRetentionPolicy) Validate() []error {
var errs []error
if p.Days != nil && *p.Days < -1 {
if p.LogRetentionDays != nil && *p.LogRetentionDays < -1 {
errs = append(errs, errLogRetentionDaysParse)
}
if p.Schedule != nil {
Expand Down
9 changes: 8 additions & 1 deletion master/pkg/schemas/expconf/experiment_config.go
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,7 @@ type ExperimentConfigV0 struct {
RawHyperparameters HyperparametersV0 `json:"hyperparameters"`
RawLabels LabelsV0 `json:"labels"`
RawLogPolicies LogPoliciesConfigV0 `json:"log_policies"`
RawLogRetentionDays *int16 `json:"log_retention_days,omitempty"`
RawRetentionPolicy *RetentionPolicyConfigV0 `json:"retention_policy,omitempty"`
RawMaxRestarts *int `json:"max_restarts"`
RawMinCheckpointPeriod *LengthV0 `json:"min_checkpoint_period"`
RawMinValidationPeriod *LengthV0 `json:"min_validation_period"`
Expand Down Expand Up @@ -377,3 +377,10 @@ type SecurityConfigV0 struct {
type KerberosConfigV0 struct {
RawConfigFile string `json:"config_file"`
}

// RetentionPolicyConfigV0 has the retention policy.
//
//go:generate ../gen.sh
type RetentionPolicyConfigV0 struct {
RawLogRetentionDays *int16 `json:"log_retention_days,omitempty"`
}
1 change: 1 addition & 0 deletions master/pkg/schemas/expconf/latest.go
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,7 @@ type (
RandomConfig = RandomConfigV0
ReproducibilityConfig = ReproducibilityConfigV0
ResourcesConfig = ResourcesConfigV0
RetentionPolicy = RetentionPolicyConfigV0
S3Config = S3ConfigV0
SearcherConfig = SearcherConfigV0
SharedFSConfig = SharedFSConfigV0
Expand Down
8 changes: 4 additions & 4 deletions master/pkg/schemas/expconf/zgen_experiment_config_v0.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

32 changes: 32 additions & 0 deletions master/pkg/schemas/expconf/zgen_retention_policy_config_v0.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Loading
Loading