Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Add probability values in decision path visualization for classification data frame analytics #80229

Merged
merged 33 commits into from
Nov 4, 2020

Conversation

qn895
Copy link
Member

@qn895 qn895 commented Oct 12, 2020

Summary

This PR is part of #77874. Now that we have the feature_importance_baseline exposed as part of the trained model metadata elastic/elasticsearch#63172, we can now use the stored baseline to make the decision path in the data frame analytics exploration more complete. Changes include:

Regression

Removed the /api/ml/data_frame/analytics/{analyticsId}/baseline endpoint which was previously used to calculate the baseline for regression jobs and switch the use the baseline exposed by the trained_model metadata.

Screen Shot 2020-10-12 at 4 38 03 PM

Binary classification

 // the sum of feature importance until this point in the decision path
logOddSoFar = baselineClassName + featureImportance0 + featureImportance1 + ...;
predictionProbabilitySoFar = exp(logOddSoFar)/(exp(logOddSoFar) + 1);

Screen Shot 2020-10-12 at 4 42 18 PM

Multi-class classification

The prediction probability calculated for feature is calculated as following:
CodeCogsEqn

Screen Shot 2020-10-12 at 4 35 35 PM

Checklist

Delete any items that are not applicable to this PR.

@qn895 qn895 added :ml Feature:Data Frame Analytics ML data frame analytics features labels Oct 12, 2020
@qn895 qn895 self-assigned this Oct 12, 2020
const filteredFeatureImportance = mappedFeatureImportance.filter(
(f) => f !== undefined
) as ExtendedFeatureImportance[];

return buildDecisionPathData(filteredFeatureImportance);
const finalResult: DecisionPathPlotData = filteredFeatureImportance
// sort so absolute importance so it goes from bottom (baseline) to top
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit - typo? Sort by absolute importance... ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching that. Updated here: 8121fe6

@peteharverson
Copy link
Contributor

Not directly related to the changes here, but I wonder if we should limit the precision for these columns for classification jobs? I only get this level of precision for some jobs - do you know why?

image

image

@qn895 qn895 marked this pull request as ready for review October 13, 2020 20:11
@qn895 qn895 requested a review from a team as a code owner October 13, 2020 20:11
@elasticmachine
Copy link
Contributor

Pinging @elastic/ml-ui (:ml)

Copy link
Contributor

@walterra walterra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the test mock data, I wonder if we could live with some smaller and more artificial minimal dataset for the jest tests? On the other hand, could we do a test relying on a more real-world dataset using an API integration test (not necessarily in this PR)?

],
const baselineData: LineAnnotationDatum[] | undefined = useMemo(
() =>
baseline && isRegressionFeatureImportanceBaseline(baseline)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With Dima's suggestion making the type guard accept any, the baseline && part here might then no longer be necessary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated here 10eae2d

@qn895
Copy link
Member Author

qn895 commented Oct 19, 2020

Regarding the test mock data, I wonder if we could live with some smaller and more artificial minimal dataset for the jest tests? On the other hand, could we do a test relying on a more real-world dataset using an API integration test (not necessarily in this PR)?

@walter That's a good point. I think it will be beneficial to also have functional test to see if the decision path is matching up with what we are showing in the other columns in the data grid. I'll add a follow up PR to this.

Copy link
Contributor

@darnautov darnautov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Latest edits LGTM

@qn895
Copy link
Member Author

qn895 commented Oct 27, 2020

@elasticmachine merge upstream

Copy link
Contributor

@peteharverson peteharverson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor comment on the code. Gave this a good test, and the baseline calculation looks correct for every classification and regression job I ran until this one on the mushroom data set:

image

Job config:

{
  "id": "mushroom_cap_color_class",
  "create_time": 1603980394530,
  "version": "8.0.0",
  "description": "",
  "source": {
    "index": [
      "mushroom"
    ],
    "query": {
      "match_all": {}
    }
  },
  "dest": {
    "index": "mushroom_cap_color_class",
    "results_field": "ml"
  },
  "analysis": {
    "classification": {
      "dependent_variable": "cap-color",
      "num_top_feature_importance_values": 6,
      "class_assignment_objective": "maximize_minimum_recall",
      "num_top_classes": -1,
      "prediction_field_name": "cap-color_prediction",
      "training_percent": 18,
      "randomize_seed": -5653826009821974000
    }
  },
  "analyzed_fields": {
    "includes": [
      "bruises",
      "cap-color",
      "cap-shape",
      "cap-surface",
      "edibility",
      "gill-attachment",
      "gill-color",
      "gill-size",
      "gill-spacing",
      "habitat",
      "odor",
      "population",
      "ring-number",
      "ring-type",
      "spore-print-color",
      "stalk-color-above-ring",
      "stalk-color-below-ring",
      "stalk-root",
      "stalk-shape",
      "stalk-surface-above-ring",
      "stalk-surface-below-ring",
      "veil-color",
      "veil-type"
    ],
    "excludes": []
  },
  "model_memory_limit": "70mb",
  "allow_lazy_start": false,
  "max_num_threads": 1
}

Tested latest update 191991e and the calculation is now looking correct for all my regression and classification jobs.

x-pack/plugins/ml/common/types/feature_importance.ts Outdated Show resolved Hide resolved
@qn895
Copy link
Member Author

qn895 commented Nov 2, 2020

One minor comment on the code. Gave this a good test, and the baseline calculation looks correct for every classification and regression job I ran until this one on the mushroom data set:

@peteharverson Discussed with Valeriy and we decided to add an other row to the multiclass classification decision path to account for the other features for when num analyzed fields > num_top_feature_importance_values. I've updated it here 191991e.

Copy link
Contributor

@peteharverson peteharverson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this needs rebasing against #82334, but otherwise tested latest edits and the baseline calculations LGTM.

@@ -415,11 +415,20 @@ export const showDataGridColumnChartErrorMessageToast = (
// helper function to transform { [key]: [val] } => { [key]: val }
// for when `fields` is used in es.search since response is always an array of values
// since response always returns an array of values for each field
export const getProcessedFields = (originalObj: object) => {
export const getProcessedFields = (originalObj: object, omitBy?: (key: string) => boolean) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guess this can be removed now that #82334 is merged?

@@ -63,7 +63,13 @@ export const getIndexData = async (

if (!options.didCancel) {
setRowCount(resp.hits.total.value);
setTableItems(resp.hits.hits.map((d) => getProcessedFields(d.fields)));
setTableItems(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guess this is not needed now that #82334 is merged?

@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Metrics [docs]

async chunks size

id before after diff
ml 6.6MB 6.6MB +6.7KB

distributable file count

id before after diff
default 42720 42719 -1

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@qn895 qn895 merged commit b8307b4 into elastic:master Nov 4, 2020
@qn895 qn895 deleted the ml-new-baseline-path branch November 4, 2020 00:47
qn895 added a commit to qn895/kibana that referenced this pull request Nov 4, 2020
…fication data frame analytics (elastic#80229)

Co-authored-by: Kibana Machine <[email protected]>
qn895 added a commit that referenced this pull request Nov 4, 2020
…fication data frame analytics (#80229) (#82551)

Co-authored-by: Kibana Machine <[email protected]>

Co-authored-by: Kibana Machine <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants