feat: support profiling metrics in generic metrics [MD-11] #8884

Merged
azhou-determined merged 1 commit into profiling-v2 from profiling-generic-metrics-backend
Mar 5, 2024

@azhou-determined azhou-determined commented Feb 24, 2024

Description

Backend changes to support profiling metrics (with the Determined profiler) in generic metrics (project doc). The major changes relate to the schema of the metrics table and the backend API changes needed to support it:

  • total_batches on the metrics table is no longer a required field. Generic, non-training metrics (e.g. system metrics) cannot always be sensibly associated with a batch. DB/query performance for existing code should not be affected: the metrics table is partitioned into child tables by metric type (see migrations), and each existing partition specifies its own NOT NULL constraints as needed.
  • partition_type on the metrics table is now a Postgres text type instead of an enum type, for two reasons: a) partition keys cannot be changed easily in Postgres (you have to detach all the partitions, recreate the parent table with a different partition key, then reattach all partitions), and b) enums cannot be updated easily (they must be renamed, recreated, and dropped). Using an enum as the partition key therefore makes adding or removing metric partition types painful for both developers and users, since each re-attach of a partition carries a non-trivial migration cost. I could not detect a noticeable difference in query time when benchmarking this change.
  • end_time is now reportable by the client. This allows for a more accurate timestamp for certain metrics (e.g. system profiling). If the client does not report it, the backend falls back to the existing default timestamping logic for the added metrics.
  • A new PROFILING partition is added to generic metrics. Client-side reporting changes will be in a separate PR, and these changes should be safe to land without them.
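
The end_time fallback described in the list above can be sketched in Python. This is only an illustration of the behavior, not the backend's Go/SQL implementation: the function name resolve_end_time and its parameters are hypothetical, and it simply mirrors a "use the reported value, otherwise the current server time" default.

```python
from datetime import datetime, timezone
from typing import Optional


def resolve_end_time(
    reported: Optional[datetime],
    server_now: Optional[datetime] = None,
) -> datetime:
    """Return the client-reported timestamp if present, else the server's time.

    `server_now` is injectable only to make the sketch easy to test; when it is
    omitted, the current UTC time is used.
    """
    if reported is not None:
        return reported
    return server_now if server_now is not None else datetime.now(timezone.utc)


# A client-reported timestamp is kept as-is.
client_time = datetime(2024, 2, 29, 18, 4, 49, tzinfo=timezone.utc)
print(resolve_end_time(client_time) == client_time)

# With nothing reported, the server-side default is used.
fallback = datetime(2024, 3, 5, tzinfo=timezone.utc)
print(resolve_end_time(None, fallback) == fallback)
```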

These migrations took ~5 minutes on a benchmarking database with a sizeable number of rows (~60M) in the metrics table, so most users should see a migration time of roughly that or less. This should be called out when this is released.

Test Plan

This PR consists of:

  1. DB schema changes/migration
  2. Changes to backend write/persist APIs for generic metrics

Since this PR does not contain read/client-side changes, testing should be done manually. You'll need query access to the database for the master you're testing this on.

Verify that the database migrations have run successfully

Assuming DB migrations have been run on the testing database, there are 3 changes to the metrics table that should be verified:

  1. The NOT NULL requirement for the total_batches column was dropped from the parent metrics table, and added back on the individual child partitions of the metrics table.

    • Look at the metrics table schema and make sure the total_batches column is nullable with no default set.
      • Query:
        SELECT
          column_name,
          data_type,
          is_nullable,
          column_default
        FROM
          information_schema.columns
        WHERE
          table_name = 'metrics' and column_name='total_batches';
        
      • Expected output:
          column_name  | data_type | is_nullable | column_default
        ---------------+-----------+-------------+----------------
         total_batches | integer   | YES         |
        
    • Check that the 3 child partitions of metrics (raw_steps, raw_validations, generic_metrics) have the NOT NULL requirement on total_batches and a default value of 0.
      • Query:
        SELECT
          column_name,
          data_type,
          is_nullable,
          column_default
        FROM
          information_schema.columns
        WHERE
          table_name IN ('raw_steps', 'raw_validations', 'generic_metrics') AND column_name = 'total_batches';
        
      • Expected output:
          column_name  | data_type | is_nullable | column_default
        ---------------+-----------+-------------+----------------
         total_batches | integer   | NO          | 0
         total_batches | integer   | NO          | 0
         total_batches | integer   | NO          | 0
        
  2. The type of the partition_type column on metrics was changed from ENUM to TEXT.

    • Check the partition_type column on the metrics table and all child partition tables and make sure they have TEXT type with appropriate defaults.
      • Query:
        SELECT
          table_name,
          column_name,
          data_type,
          is_nullable,
          column_default
        FROM
          information_schema.columns
        WHERE
          table_name IN ('metrics', 'raw_steps', 'raw_validations', 'generic_metrics') AND column_name = 'partition_type';
        
      • Expected output:
           table_name    |  column_name   | data_type | is_nullable |   column_default
        -----------------+----------------+-----------+-------------+--------------------
         generic_metrics | partition_type | text      | NO          | 'GENERIC'::text
         raw_steps       | partition_type | text      | NO          | 'TRAINING'::text
         raw_validations | partition_type | text      | NO          | 'VALIDATION'::text
         metrics         | partition_type | text      | NO          | 'GENERIC'::text
        (4 rows)
        
  3. A new PROFILING partition was added along with its system_metrics partition table.

    • Check that the parent metrics table contains the system_metrics table as a partition.
      • Query:
        SELECT
          child.relname AS child_partition
        FROM pg_inherits
          JOIN pg_class parent ON pg_inherits.inhparent = parent.oid
          JOIN pg_class child ON pg_inherits.inhrelid   = child.oid
        WHERE parent.relname='metrics';

      • Expected output:
         child_partition
        -----------------
         generic_metrics
         raw_validations
         raw_steps
         system_metrics
        (4 rows)
      
    • Check that the system_metrics table has a nullable total_batches column with no default value, and a default of PROFILING for the partition_type column.
      • Query:
        SELECT
          table_name,
          column_name,
          data_type,
          is_nullable,
          column_default
        FROM
          information_schema.columns
        WHERE
          table_name = 'system_metrics' AND column_name IN ('total_batches', 'partition_type');
        
      • Expected output:
           table_name   |  column_name   | data_type | is_nullable |  column_default
        ----------------+----------------+-----------+-------------+-------------------
         system_metrics | total_batches  | integer   | YES         |
         system_metrics | partition_type | text      | NO          | 'PROFILING'::text
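
The total_batches checks above could also be scripted rather than compared by eye. The following is a minimal sketch under stated assumptions: the helper find_mismatches is hypothetical, rows are passed in as plain tuples of (table_name, is_nullable, column_default) rather than fetched from a live database, and the queries are assumed to also select table_name so each row can be matched to its table. The expected values are the ones listed in this test plan.

```python
from typing import Dict, List, Optional, Tuple

# (is_nullable, column_default) expected for total_batches, per the test plan.
# information_schema reports a missing default as NULL (None here) and a
# default of 0 as the string "0".
EXPECTED_TOTAL_BATCHES: Dict[str, Tuple[str, Optional[str]]] = {
    "metrics": ("YES", None),
    "raw_steps": ("NO", "0"),
    "raw_validations": ("NO", "0"),
    "generic_metrics": ("NO", "0"),
    "system_metrics": ("YES", None),
}

# (table_name, is_nullable, column_default)
Row = Tuple[str, str, Optional[str]]


def find_mismatches(rows: List[Row]) -> List[str]:
    """Return one message per table that is missing or deviates from the plan."""
    seen = {table: (nullable, default) for table, nullable, default in rows}
    problems = []
    for table, expected in EXPECTED_TOTAL_BATCHES.items():
        actual = seen.get(table)
        if actual is None:
            problems.append(f"{table}: no total_batches row returned")
        elif actual != expected:
            problems.append(f"{table}: expected {expected}, got {actual}")
    return problems


# Rows as they would look after a successful migration.
migrated = [
    ("metrics", "YES", None),
    ("raw_steps", "NO", "0"),
    ("raw_validations", "NO", "0"),
    ("generic_metrics", "NO", "0"),
    ("system_metrics", "YES", None),
]
print(find_mismatches(migrated))
```

An empty list means every table matches the expected nullability and default; any other result names the table that needs a closer look.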
        

Test APIs that write to generic metrics

Verify that the API to add metrics supports the above schema changes. Since this PR does not contain read/client changes, this testing will be relatively manual.

  1. Submit and run any example trial.
    • Make sure existing metrics functionality (training/validation metrics are reported and rendered in the UI) still works.
  2. The following code snippet will call the metrics API to add a test metric that should be inserted into the new PROFILING partition.
    from datetime import datetime
    import zoneinfo

    from determined import experimental
    from determined.common.api import bindings

    DET_MASTER = "localhost:8080"
    user = "user"
    password = "********"


    def main():
        # Log in to the master under test.
        client = experimental.Determined(
            master=DET_MASTER,
            user=user,
            password=password,
        )
        # Trial ID from previous test trial.
        trial_id = 363

        metrics = {
            "test_metric": 0.13,
        }
        # Metric group for the test metric.
        group = "cpu"

        # Pick a timezone that is NOT the same as the master's timezone
        # so we can verify the timestamp is converted to server's timezone.
        timezone = zoneinfo.ZoneInfo("America/New_York")
        now = datetime.now(tz=timezone).isoformat()
        v1metrics = bindings.v1Metrics(avgMetrics=metrics)
        v1TrialMetrics = bindings.v1TrialMetrics(
            metrics=v1metrics,
            trialId=trial_id,
            trialRunId=1,
            # Client-reported timestamp; stored as end_time on the metrics row.
            reportTime=now,
        )
        body = bindings.v1ReportTrialMetricsRequest(metrics=v1TrialMetrics, group=group)
        bindings.post_ReportTrialMetrics(client._session, body=body, metrics_trialId=trial_id)


    if __name__ == "__main__":
        main()
  • The above code should run without error and there should be a new row in the metrics table for the PROFILING partition.
    select * from metrics where partition_type='PROFILING';
    
     trial_id |           end_time            |        metrics        | total_batches | trial_run_id | archived |  id  | metric_group | partition_type
    ----------+-------------------------------+-----------------------+---------------+--------------+----------+------+--------------+----------------
          363 | 2024-02-29 10:04:49.388948-08 | {"test_metric": 0.13} |               |            1 | f        | 8444 | cpu          | PROFILING
    
  • Verify that the new metric row has a NULL total_batches column and that the end_time timestamp is in the server's timezone (show timezone;).
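
When checking the end_time value, keep in mind that a timestamp reported in one timezone and stored in another should still denote the same instant. The sketch below illustrates that comparison with fixed UTC offsets. The helper same_instant is hypothetical, and the reported value is an assumed client timestamp chosen to line up with the sample row above (New York is UTC-5 and the sample server UTC-8 on that date).

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets in effect on 2024-02-29 (no daylight saving time on that date).
NEW_YORK = timezone(timedelta(hours=-5))
SERVER = timezone(timedelta(hours=-8))


def same_instant(reported_iso: str, stored_iso: str) -> bool:
    """True if two ISO 8601 timestamps denote the same moment, whatever their offsets."""
    return datetime.fromisoformat(reported_iso) == datetime.fromisoformat(stored_iso)


# Hypothetical client-side timestamp, reported in New York time.
reported = datetime(2024, 2, 29, 13, 4, 49, 388948, tzinfo=NEW_YORK)

# The same instant rendered in the server's timezone.
stored = reported.astimezone(SERVER)
print(stored.isoformat(sep=" "))  # 2024-02-29 10:04:49.388948-08:00

print(same_instant(reported.isoformat(), stored.isoformat()))
```

If the stored value showed the client's wall-clock time with the server's offset instead (13:04 at -08), the two timestamps would differ by three hours and the comparison would fail.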

Commentary (optional)

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

Ticket

netlify bot commented Feb 24, 2024

Deploy Preview for determined-ui canceled.

Name Link
🔨 Latest commit 2e7974b
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/65e216e21e0e6d0008d90607

master/internal/db/postgres_trial_metrics.go (outdated, resolved)
master/pkg/model/metrics.go (resolved)
Contributor

@hamidzr hamidzr left a comment

lgtm. Can you expand on your test plan and migration time expectations for customers?

  // The number of batches trained on when these metrics were reported.
- int32 steps_completed = 3;
+ optional int32 steps_completed = 4;
Contributor

nit: changing the field order when not necessary is discouraged afaik but we don't have direct external consumers so it shouldn't matter

Contributor Author

good to know, changed it back.

@@ -236,9 +239,9 @@ func (db *PgDB) addRawMetrics(ctx context.Context, tx *sqlx.Tx, mBody *metricsBo
 INSERT INTO metrics
 (trial_id, trial_run_id, end_time, metrics, total_batches, partition_type, metric_group)
 VALUES
-($1, $2, now(), $3, $4, $5, $6)
+($1, $2, COALESCE($3, now()), $4, $5, $6, $7)
Contributor

would there be concerns in terms of the reported time's tz and the server's?

Contributor Author

i don't think so. the client-reported timestamps are all converted to time.Time, which the postgres driver automatically converts to the correct server timezone.

master/internal/db/postgres_trial.go (outdated, resolved)
@azhou-determined azhou-determined force-pushed the profiling-generic-metrics-backend branch 3 times, most recently from b3d5870 to 2e7974b Compare March 1, 2024 17:56
@azhou-determined azhou-determined changed the base branch from main to profiling-v2 March 5, 2024 19:04
codecov bot commented Mar 5, 2024

Codecov Report

Attention: Patch coverage is 46.15385%, with 42 lines in your changes missing coverage. Please review.

Project coverage is 47.36%. Comparing base (6ecd81e) to head (c489f19).

Additional details and impacted files
@@               Coverage Diff                @@
##           profiling-v2    #8884      +/-   ##
================================================
+ Coverage         47.35%   47.36%   +0.01%     
================================================
  Files              1162     1162              
  Lines            176133   176168      +35     
  Branches           2237     2236       -1     
================================================
+ Hits              83402    83441      +39     
+ Misses            92573    92569       -4     
  Partials            158      158              
Flag Coverage Δ
backend 42.70% <43.33%> (+0.06%) ⬆️
harness 63.92% <20.00%> (-0.03%) ⬇️
web 42.55% <60.60%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
master/pkg/model/metrics.go 66.66% <ø> (ø)
master/internal/db/postgres_test_utils.go 83.21% <0.00%> (ø)
master/internal/db/postgres_trial_metrics.go 91.53% <77.77%> (-0.94%) ⬇️
master/internal/populate_metrics.go 0.00% <0.00%> (ø)
master/internal/db/postgres_trial.go 67.44% <42.85%> (-0.93%) ⬇️
harness/determined/common/api/bindings.py 40.20% <20.00%> (-0.03%) ⬇️
webui/react/src/services/api-ts-sdk/api.ts 47.65% <60.60%> (+0.01%) ⬆️

... and 7 files with indirect coverage changes

@azhou-determined azhou-determined force-pushed the profiling-generic-metrics-backend branch 3 times, most recently from f1fac36 to 8975c11 Compare March 5, 2024 19:17
generic_metrics:
- DB schema changes
- Changes to backend ReportTrialMetrics APIs
@azhou-determined azhou-determined force-pushed the profiling-generic-metrics-backend branch from 8975c11 to c489f19 Compare March 5, 2024 19:22
@azhou-determined azhou-determined merged commit 9b94dfc into profiling-v2 Mar 5, 2024
63 of 80 checks passed
@azhou-determined azhou-determined deleted the profiling-generic-metrics-backend branch March 5, 2024 20:24
@azhou-determined
Contributor Author

merging this to feature branch

azhou-determined added commits that referenced this pull request on Mar 8 (twice), Mar 13, Mar 20, and Mar 26 (three times), 2024, each with the message:
generic_metrics:
- DB schema changes
- Changes to backend ReportTrialMetrics APIs