diff --git a/docs/en/stack/ml/df-analytics/flightdata-classification.asciidoc b/docs/en/stack/ml/df-analytics/flightdata-classification.asciidoc index be8679f95..b962e5cdd 100644 --- a/docs/en/stack/ml/df-analytics/flightdata-classification.asciidoc +++ b/docs/en/stack/ml/df-analytics/flightdata-classification.asciidoc @@ -346,7 +346,8 @@ GET model-flight-delay-classification/_search -------------------------------------------------- // TEST[skip:TBD] -The snippet below shows a part of a document with the annotated results: +The snippet below shows the probability and score details for a document in the +destination index: [source,console-result] -------------------------------------------------- @@ -369,37 +370,10 @@ The snippet below shows a part of a document with the annotated results: ], "prediction_probability" : 0.9427605087816684, "prediction_score" : 0.3462468700158476, - "feature_importance" : [ - { - "feature_name" : "DistanceMiles", - "classes" : [ - { - "class_name" : false, - "importance" : -1.4766536146534828 - }, - { - "class_name" : true, - "importance" : 1.4766536146534828 - } - ] - }, - { - "feature_name" : "FlightTimeMin", - "classes" : [ - { - "class_name" : false, - "importance" : 1.0919201754729184 - }, - { - "class_name" : true, - "importance" : -1.0919201754729184 - } - ] - }, - ... + ... -------------------------------------------------- <1> An array of values specifying the probability of the prediction and the -score for each class. +score for each class. The class with the highest score is the prediction. In this example, `false` has a `class_score` of 0.35 while `true` has only 0.06, so the prediction will be @@ -427,15 +401,18 @@ form of a decision plot: [role="screenshot"] image::images/flights-classification-importance.png["A decision plot for {feat-imp} values in {kib}"] -The features with the most significant positive or negative impact appear at the -top of the decision plot. 
Thus in this example, the features related to flight -time and distance had the most significant influence on this prediction. This -type of information can help you to understand how models arrive at their -predictions. It can also indicate which aspects of your data set are most -influential or least useful when you are training and tuning your model. +In {kib}, the decision path shows the relative impact of each feature on the +probability of the prediction. The features with the most significant positive +or negative impact appear at the top of the decision plot. Thus in this example, +the features related to flight time and distance had the most significant +influence on the probability value for this prediction. This type of information +can help you to understand how models arrive at their predictions. It can also +indicate which aspects of your data set are most influential or least useful +when you are training and tuning your model. -If you do not use {kib}, you can see summarized {feat-imp} values by using the -{ref}/get-inference.html[get trained model API]. +If you do not use {kib}, you can see the summarized {feat-imp} values by using +the {ref}/get-inference.html[get trained model API] and the individual values by +searching the destination index. .API example [%collapsible] @@ -446,8 +423,8 @@ GET _ml/trained_models/model-flight-delay-classification*?include=total_feature_ -------------------------------------------------- // TEST[skip:TBD] -The snippet below shows an example of the total {feat-imp} details in the -trained model metadata: +The snippet below shows an example of the total and baseline {feat-imp} details +in the trained model metadata: [source,console-result] -------------------------------------------------- @@ -459,6 +436,18 @@ trained model metadata: ... "metadata" : { ... 
+ "feature_importance_baseline" : { <1> + "classes" : [ + { + "class_name" : true, + "baseline" : -1.5869016940485443 + }, + { + "class_name" : false, + "baseline" : 1.5869016940485443 + } + ] + }, "total_feature_importance" : [ { "feature_name" : "dayOfWeek", @@ -466,9 +455,9 @@ trained model metadata: { "class_name" : false, "importance" : { - "mean_magnitude" : 0.037513174351966404, <1> - "min" : -0.20132653028125566, <2> - "max" : 0.20132653028125566 <3> + "mean_magnitude" : 0.037513174351966404, <2> + "min" : -0.20132653028125566, <3> + "max" : 0.20132653028125566 <4> } }, { @@ -504,14 +493,71 @@ trained model metadata: }, ... -------------------------------------------------- -<1> This value is the average of the absolute {feat-imp} values for the +<1> This object contains the baselines that are used to calculate the {feat-imp} +decision paths in {kib}. +<2> This value is the average of the absolute {feat-imp} values for the `dayOfWeek` field across all the training data when the predicted class is `false`. -<2> This value is the minimum {feat-imp} value across all the training data for +<3> This value is the minimum {feat-imp} value across all the training data for this field when the predicted class is `false`. -<3> This value is the maximum {feat-imp} value across all the training data for +<4> This value is the maximum {feat-imp} value across all the training data for this field when the predicted class is `false`. +To see the top {feat-imp} values for each prediction, search the destination +index. For example: + +[source,console] +-------------------------------------------------- +GET model-flight-delay-classification/_search +-------------------------------------------------- +// TEST[skip:TBD] + +The snippet below shows an example of the {feat-imp} details for a document in +the search results: + +[source,console-result] +-------------------------------------------------- + ... + "FlightDelay" : false, + ... 
+ "ml" : { + "FlightDelay_prediction" : false, + ... + "prediction_probability" : 0.9427605087816684, + "prediction_score" : 0.3462468700158476, + "feature_importance" : [ + { + "feature_name" : "DistanceMiles", + "classes" : [ + { + "class_name" : false, + "importance" : -1.4766536146534828 + }, + { + "class_name" : true, + "importance" : 1.4766536146534828 + } + ] + }, + { + "feature_name" : "FlightTimeMin", + "classes" : [ + { + "class_name" : false, + "importance" : 1.0919201754729184 + }, + { + "class_name" : true, + "importance" : -1.0919201754729184 + } + ] + }, + ... +-------------------------------------------------- + +The sum of the {feat-imp} values for each class in this data point approximates +the logarithm of its odds. + ==== [[flightdata-classification-evaluate]] diff --git a/docs/en/stack/ml/df-analytics/flightdata-regression.asciidoc b/docs/en/stack/ml/df-analytics/flightdata-regression.asciidoc index e8d991cb5..70cc96273 100644 --- a/docs/en/stack/ml/df-analytics/flightdata-regression.asciidoc +++ b/docs/en/stack/ml/df-analytics/flightdata-regression.asciidoc @@ -360,7 +360,7 @@ searching the destination index. ==== [source,console] -------------------------------------------------- -GET _ml/inference/model-flight-delays*?include=total_feature_importance +GET _ml/inference/model-flight-delays*?include=total_feature_importance,feature_importance_baseline -------------------------------------------------- // TEST[skip:TBD] @@ -377,13 +377,16 @@ the trained model metadata: ... "metadata" : { ... + "feature_importance_baseline" : { + "baseline" : 47.43643652716527 <1> + }, "total_feature_importance" : [ { "feature_name" : "dayOfWeek", "importance" : { - "mean_magnitude" : 0.38674590521018903, <1> - "min" : -9.42823116446923, <2> - "max" : 8.707461689065173 <3> + "mean_magnitude" : 0.38674590521018903, <2> + "min" : -9.42823116446923, <3> + "max" : 8.707461689065173 <4> } }, { @@ -395,11 +398,13 @@ the trained model metadata: } ... 
---- -<1> This value is the average of the absolute {feat-imp} values for the +<1> This value is the baseline for the {feat-imp} decision path. It is the +average of the prediction values across all the training data. +<2> This value is the average of the absolute {feat-imp} values for the `dayOfWeek` field across all the training data. -<2> This value is the minimum {feat-imp} value across all the training data for +<3> This value is the minimum {feat-imp} value across all the training data for this field. -<3> This value is the maximum {feat-imp} value across all the training data for +<4> This value is the maximum {feat-imp} value across all the training data for this field. To see the top {feat-imp} values for each prediction, search the destination diff --git a/docs/en/stack/ml/df-analytics/images/flights-classification-decision-plot.png b/docs/en/stack/ml/df-analytics/images/flights-classification-decision-plot.png new file mode 100644 index 000000000..55cef1b13 Binary files /dev/null and b/docs/en/stack/ml/df-analytics/images/flights-classification-decision-plot.png differ diff --git a/docs/en/stack/ml/df-analytics/images/flights-classification-importance.jpg b/docs/en/stack/ml/df-analytics/images/flights-classification-importance.jpg new file mode 100644 index 000000000..1d705faa6 Binary files /dev/null and b/docs/en/stack/ml/df-analytics/images/flights-classification-importance.jpg differ diff --git a/docs/en/stack/ml/df-analytics/images/flights-regression-decision-plot.png b/docs/en/stack/ml/df-analytics/images/flights-regression-decision-plot.png index 840742463..88d72d3bd 100644 Binary files a/docs/en/stack/ml/df-analytics/images/flights-regression-decision-plot.png and b/docs/en/stack/ml/df-analytics/images/flights-regression-decision-plot.png differ diff --git a/docs/en/stack/ml/df-analytics/ml-feature-importance.asciidoc b/docs/en/stack/ml/df-analytics/ml-feature-importance.asciidoc index ac0a1850e..198e3de14 100644 --- 
a/docs/en/stack/ml/df-analytics/ml-feature-importance.asciidoc +++ b/docs/en/stack/ml/df-analytics/ml-feature-importance.asciidoc @@ -32,24 +32,27 @@ how the impact of each field varies by class. For example: image::images/diamonds-classification-total-importance.png["Total {feat-imp} values for a {classification} {dfanalytics-job} in {kib}"] You can also examine the feature importance values for each individual -prediction. In {kib}, you can see these values in JSON objects or decision plots: - -[role="screenshot"] -image::images/flights-regression-decision-plot.png["Feature importance values for a {regression} {dfanalytics-job} in {kib}"] - +prediction. In {kib}, you can see these values in JSON objects or decision plots. For {reganalysis}, each decision plot starts at a shared baseline, which is the average of the prediction values for all the data points in the training data set. When you add all of the feature importance values for a particular data point to that baseline, you arrive at the numeric prediction value. If a {feat-imp} value is negative, it reduces the prediction value. If a {feat-imp} -value is positive, it increases the prediction value. - -For {classanalysis}, the baseline is the average of the probability values for a -specific class across all the data points in the training data set. When you add -the feature importance values for a particular data point to that baseline, you -arrive at the prediction probability for that class. If a {feat-imp} value is -negative, it reduces the prediction probability. If a {feat-imp} value is -positive, it increases the prediction probability. +value is positive, it increases the prediction value. For example: + +[role="screenshot"] +image::images/flights-regression-decision-plot.png["Feature importance values for a {regression} {dfanalytics-job} in {kib}"] + +For {classanalysis}, the sum of the {feat-imp} values approximates the predicted +logarithm of odds for each data point. 
The simplest way to understand {feat-imp} +in the context of {classanalysis} is to look at the decision plots in {kib}. For +each data point, there is a chart which shows the relative impact of each +feature on the prediction probability for that class. This information helps you +to understand which features reduce or increase the prediction probability. For +example: + +[role="screenshot"] +image::images/flights-classification-decision-plot.png["A decision plot in {kib} for a {classification} {dfanalytics-job}"] By default, {feat-imp} values are not calculated. To generate this information, when you create a {dfanalytics-job} you must specify the