-
Notifications
You must be signed in to change notification settings - Fork 357
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support profiling metrics in generic metrics [MD-11] #8884
feat: support profiling metrics in generic metrics [MD-11] #8884
Conversation
✅ Deploy Preview for determined-ui canceled.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. Can you expand on your test plan and migration time expectations for customers?
// The number of batches trained on when these metrics were reported. | ||
int32 steps_completed = 3; | ||
optional int32 steps_completed = 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: changing the field order when not necessary is discouraged afaik but we don't have direct external consumers so it shouldn't matter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good to know, changed it back.
@@ -236,9 +239,9 @@ func (db *PgDB) addRawMetrics(ctx context.Context, tx *sqlx.Tx, mBody *metricsBo | |||
INSERT INTO metrics | |||
(trial_id, trial_run_id, end_time, metrics, total_batches, partition_type, metric_group) | |||
VALUES | |||
($1, $2, now(), $3, $4, $5, $6) | |||
($1, $2, COALESCE($3, now()), $4, $5, $6, $7) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would there be concerns in terms of reportedtime's tz and the servers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't think so. the client-reported timestamps are all converted to time.Time
, which the postgres driver automatically converts to the correct server timezone.
b3d5870
to
2e7974b
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## profiling-v2 #8884 +/- ##
================================================
+ Coverage 47.35% 47.36% +0.01%
================================================
Files 1162 1162
Lines 176133 176168 +35
Branches 2237 2236 -1
================================================
+ Hits 83402 83441 +39
+ Misses 92573 92569 -4
Partials 158 158
Flags with carried forward coverage won't be shown. Click here to find out more.
|
f1fac36
to
8975c11
Compare
generic_metrics: - DB schema changes - Changes to backend ReportTrialMetrics APIs
8975c11
to
c489f19
Compare
merging this to feature branch |
generic_metrics: - DB schema changes - Changes to backend ReportTrialMetrics APIs
generic_metrics: - DB schema changes - Changes to backend ReportTrialMetrics APIs
generic_metrics: - DB schema changes - Changes to backend ReportTrialMetrics APIs
generic_metrics: - DB schema changes - Changes to backend ReportTrialMetrics APIs
generic_metrics: - DB schema changes - Changes to backend ReportTrialMetrics APIs
generic_metrics: - DB schema changes - Changes to backend ReportTrialMetrics APIs
generic_metrics: - DB schema changes - Changes to backend ReportTrialMetrics APIs
Description
Backend changes to support profiling metrics (with the determined profiler) in generic metrics (project doc). Major changes here relate to the schema of the
metrics
table and backend API changes needed to support it:total_batches
on themetrics
table is no longer a required field. It's not always sensible for generic, non-training metrics (i.e. system metrics) to be associated with a batch. With respect to DB/query performance, existing code will not be impacted, because themetrics
table is partitioned into children tables depending on metric type (see migrations), so the existing partitioned tables will each specifyNOT NULL
constraints as they need to.partition_type
on themetrics
table is now a postgrestext
type instead of anenum
type. This is because a) partition keys cannot be easily changed in postgres -- you have to detach all the partitions, recreate the parent table with a different partition key, then reattach all partitions, and b) enums cannot be easily updated (they must be renamed, recreated, and dropped). So having an enum as a partition key makes adding/removing partition types formetrics
incredibly painful for both developers and users (each re-attach of a partition is a non-trivial migration cost). I could not detect a noticeable difference in query time when benchmarking this change.end_time
is now reportable by the client. This allows for a more accurate timestamp for certain metrics (i.e. system profiling). If it isn't reported by the client, the backend will use the same existing logic for default timestamping the added metrics.PROFILING
partition is added to generic metrics. Client-side reporting changes will be in a separate PR, and these changes should be safe to land without them.These migrations took ~5 minutes on a benchmarking database with a sizeable number of rows (~60M) in the
metric
table, so user impact should be less than or around the same time, which should be called out when this is released.Test Plan
This PR consists of:
Since this PR does not contain read/client-side changes, testing should be done manually. You'll need query access to the database for the master you're testing this on.
Verify that the database migrations have run successfully
Assuming DB migrations have been run on the testing database, there are 3 changes to the
metrics
table that should be verified:The
NOT NULL
requirement for thetotal_batches
column was dropped from the parentmetrics
table, and added back on the individual child partitions of themetrics
table.metrics
table schema and make sure thetotal_batches
column is nullable with no default set.metrics
(raw_steps
,raw_validations
,generic_metrics
) have theNULL
requirement ontotal_batches
and a default value of0
.The type of the
partition_type
column onmetrics
was changed fromENUM
toTEXT
.partition_type
column onmetrics
table and all children partition tables and make sure they haveTEXT
type with appropriate defaults.A new
PROFILING
partition was added along with itssystem_metrics
partition table.metrics
table contains thesystem_metrics
table as a partition.system_metrics
table has aNULLABLE
total_batches
column with no default value, and a default ofPROFILING
for thepartition_key
column.Test APIs that write to generic metrics
Verify that the API to add metrics supports the above schema changes. Since this PR does not contain read/client changes, this testing will be relatively manual.
PROFILING
partition.metrics
table for thePROFILING
partition.NULL
total_batches
column and that theend_time
timestamp is in the server's timezone (show timezone;
).Commentary (optional)
Checklist
docs/release-notes/
.See Release Note for details.
Ticket