[DOCS] Augment feature importance details for classification #4

Closed
wants to merge 6 commits into from
138 changes: 92 additions & 46 deletions docs/en/stack/ml/df-analytics/flightdata-classification.asciidoc
@@ -346,7 +346,8 @@ GET model-flight-delay-classification/_search
--------------------------------------------------
// TEST[skip:TBD]

The snippet below shows a part of a document with the annotated results:
The snippet below shows the probability and score details for a document in the
destination index:

[source,console-result]
--------------------------------------------------
@@ -369,37 +370,10 @@ The snippet below shows a part of a document with the annotated results:
],
"prediction_probability" : 0.9427605087816684,
"prediction_score" : 0.3462468700158476,
"feature_importance" : [
{
"feature_name" : "DistanceMiles",
"classes" : [
{
"class_name" : false,
"importance" : -1.4766536146534828
},
{
"class_name" : true,
"importance" : 1.4766536146534828
}
]
},
{
"feature_name" : "FlightTimeMin",
"classes" : [
{
"class_name" : false,
"importance" : 1.0919201754729184
},
{
"class_name" : true,
"importance" : -1.0919201754729184
}
]
},
...
...
--------------------------------------------------
<1> An array of values specifying the probability of the prediction and the
score for each class.
score for each class.

The class with the highest score is the prediction. In this example, `false` has
a `class_score` of 0.35 while `true` has only 0.06, so the prediction will be
@@ -427,15 +401,18 @@ form of a decision plot:
[role="screenshot"]
image::images/flights-classification-importance.png["A decision plot for {feat-imp} values in {kib}"]

The features with the most significant positive or negative impact appear at the
top of the decision plot. Thus in this example, the features related to flight
time and distance had the most significant influence on this prediction. This
type of information can help you to understand how models arrive at their
predictions. It can also indicate which aspects of your data set are most
influential or least useful when you are training and tuning your model.
In {kib}, the decision path shows the relative impact of each feature on the
probability of the prediction. The features with the most significant positive
or negative impact appear at the top of the decision plot. Thus in this example,
the features related to flight time and distance had the most significant
influence on the probability value for this prediction. This type of information
can help you to understand how models arrive at their predictions. It can also
indicate which aspects of your data set are most influential or least useful
when you are training and tuning your model.
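The ordering described above can be sketched in a few lines of Python. This is only an illustration, assuming that "most significant" means largest absolute {feat-imp} value for the class of interest; the `rank_features` helper is hypothetical and is not part of {kib} or any Elastic client. The input uses the two features shown in the example document in this pull request, where the full array is elided.

```python
from typing import Any


def rank_features(
    feature_importance: list[dict[str, Any]], class_name: Any
) -> list[tuple[str, float]]:
    """Return (feature_name, importance) pairs for one class, largest magnitude first.

    Hypothetical helper: it mimics the ordering of a decision plot under the
    assumption that features are ranked by absolute importance.
    """
    pairs = []
    for feature in feature_importance:
        for entry in feature["classes"]:
            if entry["class_name"] == class_name:
                pairs.append((feature["feature_name"], entry["importance"]))
    return sorted(pairs, key=lambda pair: abs(pair[1]), reverse=True)


# Only the two features visible in the example document; the real array is longer.
example = [
    {"feature_name": "FlightTimeMin", "classes": [
        {"class_name": False, "importance": 1.0919201754729184},
        {"class_name": True, "importance": -1.0919201754729184}]},
    {"feature_name": "DistanceMiles", "classes": [
        {"class_name": False, "importance": -1.4766536146534828},
        {"class_name": True, "importance": 1.4766536146534828}]},
]

ranked = rank_features(example, class_name=False)
for name, importance in ranked:
    print(f"{name}: {importance:+.4f}")
```

A negative value still ranks high here because the sort key is the magnitude, which matches the "positive or negative impact" wording in the paragraph.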

If you do not use {kib}, you can see summarized {feat-imp} values by using the
{ref}/get-inference.html[get trained model API].
If you do not use {kib}, you can see the summarized {feat-imp} values by using
the {ref}/get-inference.html[get trained model API] and the individual values by
searching the destination index.

.API example
[%collapsible]
@@ -446,8 +423,8 @@ GET _ml/trained_models/model-flight-delay-classification*?include=total_feature_
--------------------------------------------------
// TEST[skip:TBD]

The snippet below shows an example of the total {feat-imp} details in the
trained model metadata:
The snippet below shows an example of the total and baseline {feat-imp} details
in the trained model metadata:

[source,console-result]
--------------------------------------------------
@@ -459,16 +436,28 @@ trained model metadata:
...
"metadata" : {
...
"feature_importance_baseline" : { <1>
"classes" : [
{
"class_name" : true,
"baseline" : -1.5869016940485443
},
{
"class_name" : false,
"baseline" : 1.5869016940485443
}
]
},
"total_feature_importance" : [
{
"feature_name" : "dayOfWeek",
"classes" : [
{
"class_name" : false,
"importance" : {
"mean_magnitude" : 0.037513174351966404, <1>
"min" : -0.20132653028125566, <2>
"max" : 0.20132653028125566 <3>
"mean_magnitude" : 0.037513174351966404, <2>
"min" : -0.20132653028125566, <3>
"max" : 0.20132653028125566 <4>
}
},
{
@@ -504,14 +493,71 @@ trained model metadata:
},
...
--------------------------------------------------
<1> This value is the average of the absolute {feat-imp} values for the
<1> This object contains the baselines that are used to calculate the {feat-imp}
decision paths in {kib}.
<2> This value is the average of the absolute {feat-imp} values for the
`dayOfWeek` field across all the training data when the predicted class is
`false`.
<2> This value is the minimum {feat-imp} value across all the training data for
<3> This value is the minimum {feat-imp} value across all the training data for
this field when the predicted class is `false`.
<3> This value is the maximum {feat-imp} value across all the training data for
<4> This value is the maximum {feat-imp} value across all the training data for
this field when the predicted class is `false`.
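As a rough illustration of the three summary statistics in the callouts, the sketch below computes them from a list of per-document {feat-imp} values for one field and one class. The `summarize_importance` function and the input numbers are hypothetical; the actual aggregation happens inside the {dfanalytics-job} and may differ in detail.

```python
def summarize_importance(values: list[float]) -> dict[str, float]:
    """Summarize per-document importance values for a single field and class.

    Hypothetical helper mirroring the field names in the trained model metadata:
    mean_magnitude is the mean of the absolute values, min and max are signed.
    """
    if not values:
        raise ValueError("values must not be empty")
    return {
        "mean_magnitude": sum(abs(v) for v in values) / len(values),
        "min": min(values),
        "max": max(values),
    }


# Made-up importance values for four training documents.
summary = summarize_importance([-0.5, 0.25, 0.125, -0.125])
print(summary)
```

Because `mean_magnitude` averages absolute values, it is never negative, while `min` and `max` keep their sign.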

To see the top {feat-imp} values for each prediction, search the destination
index. For example:

[source,console]
--------------------------------------------------
GET model-flight-delay-classification/_search
--------------------------------------------------
// TEST[skip:TBD]

The snippet below shows an example of the {feat-imp} details for a document in
the search results:

[source,console-result]
--------------------------------------------------
...
"FlightDelay" : false,
...
"ml" : {
"FlightDelay_prediction" : false,
...
"prediction_probability" : 0.9427605087816684,
"prediction_score" : 0.3462468700158476,
"feature_importance" : [
{
"feature_name" : "DistanceMiles",
"classes" : [
{
"class_name" : false,
"importance" : -1.4766536146534828
},
{
"class_name" : true,
"importance" : 1.4766536146534828
}
]
},
{
"feature_name" : "FlightTimeMin",
"classes" : [
{
"class_name" : false,
"importance" : 1.0919201754729184
},
{
"class_name" : true,
"importance" : -1.0919201754729184
}
]
},
...
--------------------------------------------------

The sum of the {feat-imp} values for each class in this data point approximates
the logarithm of its odds.
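The relationship between log-odds and probability can be sketched as follows. This is a minimal sketch and makes an assumption that the text above does not state: that the class baseline from `feature_importance_baseline` is added to the sum of the {feat-imp} values before converting to a probability. The function names are hypothetical. Only the two importance values visible in the example are used, so the result does not reproduce the `prediction_probability` in the document, whose remaining features are elided.

```python
import math


def class_log_odds(baseline: float, importances: list[float]) -> float:
    """Assumed relationship: log-odds is the class baseline plus the importances."""
    return baseline + sum(importances)


def probability_from_log_odds(log_odds: float) -> float:
    """Logistic function, written to avoid overflow for large negative inputs."""
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    exp_value = math.exp(log_odds)
    return exp_value / (1.0 + exp_value)


# Baseline for class `false` and the two visible importance values for that class.
partial_log_odds = class_log_odds(
    baseline=1.5869016940485443,
    importances=[-1.4766536146534828, 1.0919201754729184],
)
partial_probability = probability_from_log_odds(partial_log_odds)
print(partial_log_odds, partial_probability)
```

Log-odds of zero correspond to a probability of 0.5, so positive totals favor the class and negative totals count against it.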

====

[[flightdata-classification-evaluate]]
19 changes: 12 additions & 7 deletions docs/en/stack/ml/df-analytics/flightdata-regression.asciidoc
@@ -360,7 +360,7 @@ searching the destination index.
====
[source,console]
--------------------------------------------------
GET _ml/inference/model-flight-delays*?include=total_feature_importance
GET _ml/inference/model-flight-delays*?include=total_feature_importance,feature_importance_baseline
--------------------------------------------------
// TEST[skip:TBD]

@@ -377,13 +377,16 @@ the trained model metadata:
...
"metadata" : {
...
"feature_importance_baseline" : {
"baseline" : 47.43643652716527 <1>
},
"total_feature_importance" : [
{
"feature_name" : "dayOfWeek",
"importance" : {
"mean_magnitude" : 0.38674590521018903, <1>
"min" : -9.42823116446923, <2>
"max" : 8.707461689065173 <3>
"mean_magnitude" : 0.38674590521018903, <2>
"min" : -9.42823116446923, <3>
"max" : 8.707461689065173 <4>
}
},
{
@@ -395,11 +398,13 @@ the trained model metadata:
}
...
----
<1> This value is the average of the absolute {feat-imp} values for the
<1> This value is the baseline for the {feat-imp} decision path. It is the
average of the prediction values across all the training data.
<2> This value is the average of the absolute {feat-imp} values for the
`dayOfWeek` field across all the training data.
<2> This value is the minimum {feat-imp} value across all the training data for
<3> This value is the minimum {feat-imp} value across all the training data for
this field.
<3> This value is the maximum {feat-imp} value across all the training data for
<4> This value is the maximum {feat-imp} value across all the training data for
this field.

To see the top {feat-imp} values for each prediction, search the destination
22 changes: 15 additions & 7 deletions docs/en/stack/ml/df-analytics/ml-feature-importance.asciidoc
@@ -32,19 +32,27 @@ how the impact of each field varies by class. For example:
image::images/diamonds-classification-total-importance.png["Total {feat-imp} values for a {classification} {dfanalytics-job} in {kib}"]

You can also examine the feature importance values for each individual
prediction. In {kib}, you can see these values in JSON objects or decision plots:

[role="screenshot"]
image::images/flights-regression-decision-plot.png["Feature importance values for a {regression} {dfanalytics-job} in {kib}"]

prediction. In {kib}, you can see these values in JSON objects or decision plots.
For {reganalysis}, each decision plot starts at a shared baseline, which is
the average of the prediction values for all the data points in the training
data set. When you add all of the feature importance values for a particular
data point to that baseline, you arrive at the numeric prediction value. If a
{feat-imp} value is negative, it reduces the prediction value. If a {feat-imp}
value is positive, it increases the prediction value.
value is positive, it increases the prediction value. For example:

//TBD: Add section about classification analysis.
[role="screenshot"]
image::images/flights-regression-decision-plot.png["Feature importance values for a {regression} {dfanalytics-job} in {kib}"]
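The arithmetic behind a {regression} decision plot can be sketched as below. The baseline is the value shown in the {regression} example elsewhere in this pull request; the three {feat-imp} values and the `regression_prediction` helper are made up for illustration.

```python
def regression_prediction(baseline: float, importances: list[float]) -> float:
    """Add every feature importance value for one data point to the shared baseline."""
    return baseline + sum(importances)


baseline = 47.43643652716527      # average prediction across the training data
importances = [-9.0, 3.5, 1.25]   # hypothetical values for one data point

predicted = regression_prediction(baseline, importances)
print(predicted)
```

The negative value pulls the prediction below the baseline and the positive values push it back up, which is the path a decision plot traces from the baseline to the final prediction.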

For {classanalysis}, the sum of the {feat-imp} values approximates the predicted
logarithm of odds for each data point. The simplest way to understand {feat-imp}
in the context of {classanalysis} is to look at the decision plots in {kib}. For
each data point, there is a chart which shows the relative impact of each
feature on the prediction probability for that class. This information helps you
to understand which features reduce or increase the prediction probability. For
example:

[role="screenshot"]
image::images/flights-classification-decision-plot.png["A decision plot in {kib} for a {classification} {dfanalytics-job}"]

By default, {feat-imp} values are not calculated. To generate this information,
when you create a {dfanalytics-job} you must specify the