[ML] Add probability values in decision path visualization for classification data frame analytics #80229

qn895 · 2020-10-12T22:09:17Z

Summary

This PR is part of #77874. Now that the feature_importance_baseline is exposed as part of the trained model metadata (elastic/elasticsearch#63172), we can use the stored baseline to make the decision path in the data frame analytics exploration more complete. Changes include:

Regression

Removed the /api/ml/data_frame/analytics/{analyticsId}/baseline endpoint, which was previously used to calculate the baseline for regression jobs, and switched to using the baseline exposed by the trained_model metadata.

Binary classification

 // the sum of feature importance until this point in the decision path
logOddSoFar = baselineClassName + featureImportance0 + featureImportance1 + ...;
predictionProbabilitySoFar = exp(logOddSoFar)/(exp(logOddSoFar) + 1);
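
As a minimal TypeScript sketch of the calculation above: the running log-odds is converted to a probability after each feature is applied. The function names `probabilitiesAlongPath` and `logOddsToProbability` and the plain array-of-numbers input are assumptions for illustration, not the code in this PR.

```typescript
// Hypothetical helpers, not the PR's implementation: they only restate the formula above.

// exp(x) / (exp(x) + 1), written in the algebraically equivalent form 1 / (1 + exp(-x)).
function logOddsToProbability(logOdds: number): number {
  return 1 / (1 + Math.exp(-logOdds));
}

// baselineLogOdds: the stored baseline for the class, in log-odds space.
// importances: feature importance values in decision-path order.
// Returns one probability for the baseline, then one per feature applied.
function probabilitiesAlongPath(baselineLogOdds: number, importances: number[]): number[] {
  let logOddsSoFar = baselineLogOdds;
  const probabilities: number[] = [logOddsToProbability(logOddsSoFar)];
  for (const importance of importances) {
    // the sum of feature importance until this point in the decision path
    logOddsSoFar += importance;
    probabilities.push(logOddsToProbability(logOddsSoFar));
  }
  return probabilities;
}
```

A baseline of 0 in log-odds space corresponds to a probability of 0.5, and each positive importance value moves the running probability towards 1.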

Multi-class classification

The prediction probability for each feature is calculated as follows:

Checklist

Delete any items that are not applicable to this PR.

Any text added follows EUI's writing guidelines, uses sentence case text and includes i18n support
This was checked for cross-browser compatibility

peteharverson · 2020-10-13T12:03:35Z

.../public/application/components/data_grid/feature_importance/use_classification_path_data.tsx

  const filteredFeatureImportance = mappedFeatureImportance.filter(
    (f) => f !== undefined
  ) as ExtendedFeatureImportance[];

-  return buildDecisionPathData(filteredFeatureImportance);
+  const finalResult: DecisionPathPlotData = filteredFeatureImportance
+    // sort so absolute importance so it goes from bottom (baseline) to top


Nit - typo? Sort by absolute importance... ?

Thanks for catching that. Updated here: 8121fe6

peteharverson · 2020-10-13T12:15:20Z

Not directly related to the changes here, but I wonder if we should limit the precision for these columns for classification jobs? I only get this level of precision for some jobs - do you know why?

elasticmachine · 2020-10-13T20:12:01Z

Pinging @elastic/ml-ui (:ml)

x-pack/plugins/ml/common/types/feature_importance.ts

...gins/ml/public/application/components/data_grid/feature_importance/decision_path_popover.tsx

.../public/application/components/data_grid/feature_importance/use_classification_path_data.tsx

walterra

Regarding the test mock data, I wonder if we could live with some smaller and more artificial minimal dataset for the jest tests? On the other hand, could we do a test relying on a more real-world dataset using an API integration test (not necessarily in this PR)?

walterra · 2020-10-14T13:18:22Z

...lugins/ml/public/application/components/data_grid/feature_importance/decision_path_chart.tsx

-    ],
+  const baselineData: LineAnnotationDatum[] | undefined = useMemo(
+    () =>
+      baseline && isRegressionFeatureImportanceBaseline(baseline)


With Dima's suggestion making the type guard accept any, the baseline && part here might then no longer be necessary.

Updated here 10eae2d
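
To illustrate the suggestion in this thread: if the type guard accepts any value and handles `undefined`/`null` itself, the call site does not need a separate `baseline &&` check. The sketch below is only an illustration. The interface field `baseline: number`, the guard body, and the helper `getBaselineValue` are hypothetical; only the name `isRegressionFeatureImportanceBaseline` comes from the diff. It uses `unknown` rather than `any` as the parameter type, which is a stylistic choice here and not necessarily what the PR does.

```typescript
// Hypothetical shape, for illustration only.
interface RegressionFeatureImportanceBaseline {
  baseline: number;
}

// Accepting `unknown` lets the guard reject undefined/null on its own.
function isRegressionFeatureImportanceBaseline(
  arg: unknown
): arg is RegressionFeatureImportanceBaseline {
  return (
    typeof arg === 'object' &&
    arg !== null &&
    typeof (arg as { baseline?: unknown }).baseline === 'number'
  );
}

// Call site: no `baseline &&` needed in front of the guard.
function getBaselineValue(baseline: unknown): number | undefined {
  return isRegressionFeatureImportanceBaseline(baseline) ? baseline.baseline : undefined;
}
```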

qn895 · 2020-10-19T01:42:15Z

Regarding the test mock data, I wonder if we could live with some smaller and more artificial minimal dataset for the jest tests? On the other hand, could we do a test relying on a more real-world dataset using an API integration test (not necessarily in this PR)?

@walter That's a good point. I think it would be beneficial to also have a functional test to check that the decision path matches what we are showing in the other columns in the data grid. I'll add a follow-up PR for this.

.../public/application/components/data_grid/feature_importance/use_classification_path_data.tsx

darnautov

Latest edits LGTM

qn895 · 2020-10-27T23:40:21Z

@elasticmachine merge upstream

peteharverson

One minor comment on the code. Gave this a good test, and the baseline calculation looks correct for every classification and regression job I ran until this one on the mushroom data set:

Job config:

{
  "id": "mushroom_cap_color_class",
  "create_time": 1603980394530,
  "version": "8.0.0",
  "description": "",
  "source": {
    "index": [
      "mushroom"
    ],
    "query": {
      "match_all": {}
    }
  },
  "dest": {
    "index": "mushroom_cap_color_class",
    "results_field": "ml"
  },
  "analysis": {
    "classification": {
      "dependent_variable": "cap-color",
      "num_top_feature_importance_values": 6,
      "class_assignment_objective": "maximize_minimum_recall",
      "num_top_classes": -1,
      "prediction_field_name": "cap-color_prediction",
      "training_percent": 18,
      "randomize_seed": -5653826009821974000
    }
  },
  "analyzed_fields": {
    "includes": [
      "bruises",
      "cap-color",
      "cap-shape",
      "cap-surface",
      "edibility",
      "gill-attachment",
      "gill-color",
      "gill-size",
      "gill-spacing",
      "habitat",
      "odor",
      "population",
      "ring-number",
      "ring-type",
      "spore-print-color",
      "stalk-color-above-ring",
      "stalk-color-below-ring",
      "stalk-root",
      "stalk-shape",
      "stalk-surface-above-ring",
      "stalk-surface-below-ring",
      "veil-color",
      "veil-type"
    ],
    "excludes": []
  },
  "model_memory_limit": "70mb",
  "allow_lazy_start": false,
  "max_num_threads": 1
}

Tested latest update 191991e and the calculation is now looking correct for all my regression and classification jobs.

x-pack/plugins/ml/common/types/feature_importance.ts

qn895 · 2020-11-02T14:58:40Z

One minor comment on the code. Gave this a good test, and the baseline calculation looks correct for every classification and regression job I ran until this one on the mushroom data set:

@peteharverson Discussed with Valeriy and we decided to add an "other" row to the multiclass classification decision path to account for the remaining features when the number of analyzed fields is greater than num_top_feature_importance_values. I've updated it here: 191991e.
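
One way to picture the extra row, as a rough sketch rather than the code in 191991e: when only the top N feature importance values are reported, the listed contributions may not add up to the final predicted probability, and the remainder can be shown as a single "other" row. The names `DecisionPathRow` and `withOtherRow`, the tolerance value, and the idea of deriving the row as a residual against the predicted probability are all assumptions for illustration.

```typescript
// Hypothetical row shape, for illustration only.
interface DecisionPathRow {
  featureName: string;
  // cumulative probability after this row has been applied
  probabilitySoFar: number;
}

// Appends an "other" row when some analyzed fields are not among the reported
// top feature importance values and the path does not yet reach the prediction.
function withOtherRow(
  rows: DecisionPathRow[],
  predictedProbability: number,
  numAnalyzedFields: number,
  numTopFeatureImportanceValues: number,
  tolerance = 1e-6 // assumed threshold below which no extra row is shown
): DecisionPathRow[] {
  if (rows.length === 0) return rows;
  const hasUnreportedFeatures = numAnalyzedFields > numTopFeatureImportanceValues;
  if (!hasUnreportedFeatures) return rows;
  const last = rows[rows.length - 1].probabilitySoFar;
  if (Math.abs(predictedProbability - last) <= tolerance) return rows;
  return [...rows, { featureName: 'other', probabilitySoFar: predictedProbability }];
}
```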

peteharverson

Looks like this needs rebasing against #82334, but otherwise tested latest edits and the baseline calculations LGTM.

peteharverson · 2020-11-03T12:17:27Z

x-pack/plugins/ml/public/application/components/data_grid/common.ts

@@ -415,11 +415,20 @@ export const showDataGridColumnChartErrorMessageToast = (
 // helper function to transform { [key]: [val] } => { [key]: val }
 // for when `fields` is used in es.search since response is always an array of values
 // since response always returns an array of values for each field
-export const getProcessedFields = (originalObj: object) => {
+export const getProcessedFields = (originalObj: object, omitBy?: (key: string) => boolean) => {


Guess this can be removed now that #82334 is merged?

peteharverson · 2020-11-03T12:17:52Z

x-pack/plugins/ml/public/application/data_frame_analytics/common/get_index_data.ts

@@ -63,7 +63,13 @@ export const getIndexData = async (

      if (!options.didCancel) {
        setRowCount(resp.hits.total.value);
-        setTableItems(resp.hits.hits.map((d) => getProcessedFields(d.fields)));
+        setTableItems(


Guess this is not needed now that #82334 is merged?

kibanamachine · 2020-11-03T23:33:12Z

💚 Build Succeeded

continuous-integration/kibana-ci/pull-request
Commit: 84e8ed1

Metrics [docs]

async chunks size

id	before	after	diff
`ml`	6.6MB	6.6MB	+6.7KB

distributable file count

id	before	after	diff
`default`	42720	42719	-1

History

💔 Build #85649 failed 71e8549
💚 Build #85312 succeeded 845339e
💚 Build #85202 succeeded 19ff877
💔 Build #85181 failed ecf321e
💚 Build #84413 succeeded ed538ff

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

qn895 added 6 commits October 12, 2020 16:27

[ML] Update baseline calc to use trained_models info

4ff0a24

[ML] Update baseline calc for classification

b615eac

[ML] Update baseline calc for multi-class

d0d807b

[ML] Safeguard for hypothetically when for some reasons there's only one class

ae105a5

[ML] Remove now unused analyticsFeatureImportanceProvider

218ddad

[ML] Update proper type for InferenceQueryParams.include

dfba16d

qn895 added the :ml and Feature:Data Frame Analytics labels Oct 12, 2020

qn895 requested review from walterra, darnautov and peteharverson October 12, 2020 22:09

qn895 self-assigned this Oct 12, 2020

peteharverson reviewed Oct 13, 2020

qn895 added 5 commits October 13, 2020 13:47

[ML] Fix results inconsistent for multi class due to different types

8121fe6

[ML] Add unit test

01a1961

[ML] Add unit test

9c6f79a

Merge remote-tracking branch 'upstream/master' into ml-new-baseline-path

1f63405

[ML] Add unit test

59b43e8

qn895 marked this pull request as ready for review October 13, 2020 20:11

qn895 requested a review from a team as a code owner October 13, 2020 20:11

qn895 added v7.11.0 v8.0.0 release_note:enhancement labels Oct 13, 2020

qn895 added 3 commits October 13, 2020 18:29

[ML] Change to using formatSingleValue instead

8cc0fcc

Merge remote-tracking branch 'upstream/master' into ml-new-baseline-path

5452e04

[ML] Fix missing baseline

e338028

darnautov reviewed Oct 14, 2020

walterra reviewed Oct 14, 2020

[ML] Remove !

e5f33b3

darnautov reviewed Oct 19, 2020

.../public/application/components/data_grid/feature_importance/use_classification_path_data.tsx

[ML] Rename functions to start with process for clarity

7526127

darnautov approved these changes Oct 19, 2020

qn895 added 2 commits October 22, 2020 14:40

[ML] Add extra other row to binary classification if num features > num_top_feature_importance

b123e61

Merge remote-tracking branch 'upstream/master' into ml-new-baseline-path

fd4672d

lcawl mentioned this pull request Oct 26, 2020

[DOCS] Augment feature importance details for classification lcawl/stack-docs#4

Closed

kibanamachine and others added 2 commits October 27, 2020 19:40

Merge branch 'master' into ml-new-baseline-path

ed9376e

[ML] Fix broken feature importance fields

ed538ff

peteharverson reviewed Oct 29, 2020

x-pack/plugins/ml/common/types/feature_importance.ts

qn895 added 6 commits October 29, 2020 16:37

Merge remote-tracking branch 'upstream/master' into ml-new-baseline-path

5459fbf

Merge upstream/master into origin/ml-new-baseline-path

56764d4

[ML] Adjust for multiclass

191991e

[ML] Fix typo in import type

631fa87

[ML] Rename FeatureImportanceClassName

ecf321e

[ML] Fix FeatureImportanceClassName

19ff877

[ML] Fix fi broken if result is an array with only one element

845339e

peteharverson approved these changes Nov 3, 2020

qn895 added 2 commits November 3, 2020 14:36

Merge upstream/master into ml-new-baseline-path

71e8549

[ML] Remove analyticsFeatureImportanceProvider

84e8ed1

qn895 merged commit b8307b4 into elastic:master Nov 4, 2020

qn895 deleted the ml-new-baseline-path branch November 4, 2020 00:47

qn895 added a commit to qn895/kibana that referenced this pull request Nov 4, 2020

[ML] Add probability values in decision path visualization for classification data frame analytics (elastic#80229) Co-authored-by: Kibana Machine <[email protected]>

552a3a6

qn895 mentioned this pull request Nov 4, 2020

[7.x] [ML] Add probability values in decision path visualization for classification data frame analytics (#80229) #82551

Merged

lcawl mentioned this pull request Nov 21, 2020

[DOCS] Augment feature importance details for classification elastic/stack-docs#1469

Merged
