[DOCS] Add feature importance examples
lcawl committed Sep 10, 2020
1 parent bc1a639 commit 31110ae
Showing 7 changed files with 116 additions and 74 deletions.
9 changes: 4 additions & 5 deletions docs/en/stack/ml/df-analytics/dfa-classification.asciidoc
@@ -114,7 +114,8 @@ in the `analyzed_fields` object when configuring the {dfanalytics-job}.
== Interpreting {classification} results

The following sections help you understand and interpret the results of a
{classanalysis}.
{classanalysis}. To see example results, refer to
<<flightdata-classification-results>>.

[[dfa-classification-class-probability]]
=== `class_probability`
@@ -123,15 +124,13 @@ The value of `class_probability` shows how likely it is that a given data point
belongs to a certain class. It is a value between 0 and 1. The higher the
number, the higher the probability that the data point belongs to the named
class. This information is stored in the `top_classes` array for each document
in your destination index. See the
{ml-docs}/flightdata-classification.html#flightdata-classification-results[Viewing {classification} results]
section in the {classification} example.
in your destination index.

[[dfa-classification-class-score]]
=== `class_score`

The value of `class_score` controls the probability at which a class label is
assigned to a data point. In normal case – that you maximize the number of
assigned to a data point. In the normal case – that you maximize the number of
correct labels – a class label is assigned when its predicted probability is
greater than 0.5. The `class_score` makes it possible to change this behavior,
so it can be less than or greater than 0.5. For example, suppose our two classes
181 changes: 112 additions & 69 deletions docs/en/stack/ml/df-analytics/flightdata-classification.asciidoc
@@ -104,6 +104,9 @@ image::images/flights-classification-job-1.png["Creating a {dfanalytics-job} in
[role="screenshot"]
image::images/flights-classification-job-2.png["Creating a {dfanalytics-job} in {kib} – continued"]

[role="screenshot"]
image::images/flights-classification-job-3.png["Creating a {dfanalytics-job} in {kib} – advanced options"]

.. Choose `kibana_sample_data_flights` as the source index.
.. Choose `classification` as the job type.
.. Choose `FlightDelay` as the dependent variable, which is the field that we
@@ -116,15 +119,18 @@ recommended to exclude fields that either contain erroneous data or describe the
source data for training. While that value is low for this example, for many
large data sets using a small training sample greatly reduces runtime without
impacting accuracy.
.. Use the default feature importance values.
.. If you want to experiment with <<ml-feature-importance,feature importance>>,
specify a value in the advanced configuration options. In this example, we
choose to return a maximum of 10 feature importance values per document. This
option slows down the analysis, however, so it is disabled by default.
.. Use the default memory limit for the job. If the job requires more than this
amount of memory, it fails to start. If the available memory on the node is
limited, this setting makes it possible to prevent job execution.
.. Add a job ID and optionally a job description.
.. Add the name of the destination index that will contain the results of the
analysis. It will contain a copy of the source index data where each document is
annotated with the results. If the index does not exist, it will be created
automatically.
analysis. In {kib}, the index name matches the job ID by default. It will
contain a copy of the source index data where each document is annotated with
the results. If the index does not exist, it will be created automatically.


.API example
@@ -140,13 +146,15 @@ PUT _ml/data_frame/analytics/model-flight-delay-classification
]
},
"dest": {
"index": "df-flight-delayed",
"index": "model-flight-delay-classification",
"results_field": "ml" <1>
},
"analysis": {
"classification": {
"dependent_variable": "FlightDelay",
"training_percent": 10
"training_percent": 10,
"num_top_classes": 10,
"num_top_feature_importance_values": 10 <2>
}
},
"analyzed_fields": {
@@ -160,7 +168,8 @@ PUT _ml/data_frame/analytics/model-flight-delay-classification
}
--------------------------------------------------
// TEST[skip:setup kibana sample data]
<1> The field name in the `dest` index that contains the analysis results.
<1> The field name in the `dest` index that contains the analysis results.
<2> To disable feature importance calculations, omit this option.
====
--
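As a rough illustration of how these options fit together, the following Python sketch assembles a request body for the create {dfanalytics-job} call. The helper name `build_classification_job` and its parameters are hypothetical and not part of any client library. The `analyzed_fields` section and `num_top_classes` are left out for brevity; only field names that appear in the API example are used.

```python
import json
from typing import Optional


def build_classification_job(
    job_id: str,
    source_index: str,
    dependent_variable: str,
    training_percent: int,
    num_top_feature_importance_values: Optional[int] = None,
) -> dict:
    """Assemble a request body (hypothetical helper for illustration only)."""
    classification = {
        "dependent_variable": dependent_variable,
        "training_percent": training_percent,
    }
    # Feature importance is only requested when this option is present;
    # omitting it leaves the calculation disabled.
    if num_top_feature_importance_values is not None:
        classification["num_top_feature_importance_values"] = (
            num_top_feature_importance_values
        )
    return {
        "source": {"index": [source_index]},
        # Mirror the Kibana default: name the destination index after the job ID.
        "dest": {"index": job_id, "results_field": "ml"},
        "analysis": {"classification": classification},
    }


body = build_classification_job(
    job_id="model-flight-delay-classification",
    source_index="kibana_sample_data_flights",
    dependent_variable="FlightDelay",
    training_percent=10,
    num_top_feature_importance_values=10,
)
print(json.dumps(body, indent=2))
```

Leaving `num_top_feature_importance_values` at its default of `None` produces a body without that option, which corresponds to the disabled-by-default behavior described in the steps.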

@@ -259,32 +268,31 @@ The API call returns the following response:
},
"analysis_stats" : {
"classification_stats" : {
"timestamp" : 1597182490577,
"timestamp" : 1599684771114,
"iteration" : 18,
"hyperparameters" : {
"class_assignment_objective" : "maximize_minimum_recall",
"alpha" : 11.630957564710283,
"downsample_factor" : 0.9418550623091531,
"eta" : 0.032382816833064335,
"eta_growth_rate_per_tree" : 1.0198807182688074,
"feature_bag_fraction" : 0.5504020748926737,
"gamma" : 0.08388388780939579,
"lambda" : 0.08628826657684924,
"alpha" : 6.648298686326093,
"downsample_factor" : 0.7435400845721971,
"eta" : 0.039957516522980074,
"eta_growth_rate_per_tree" : 1.0168333294220058,
"feature_bag_fraction" : 0.49761652263010625,
"gamma" : 0.21224183609258152,
"lambda" : 0.2572621613644672,
"max_attempts_to_add_tree" : 3,
"max_optimization_rounds_per_hyperparameter" : 2,
"max_trees" : 644,
"max_trees" : 590,
"num_folds" : 5,
"num_splits_per_feature" : 75,
"soft_tree_depth_limit" : 7.550606337307592,
"soft_tree_depth_tolerance" : 0.13448633124842999
"soft_tree_depth_limit" : 3.2719032647442443,
"soft_tree_depth_tolerance" : 0.14970565884872958
},
"timing_stats" : {
"elapsed_time" : 44206,
"iteration_time" : 1884
"elapsed_time" : 37915,
"iteration_time" : 2552
},
"validation_loss" : {
"loss_type" : "binomial_logistic",
"fold_values" : [ ]
"loss_type" : "binomial_logistic"
}
}
}
@@ -330,7 +338,7 @@ values. A higher number means that the model is more confident.
====
[source,console]
--------------------------------------------------
GET df-flight-delayed/_search
GET model-flight-delay-classification/_search
--------------------------------------------------
// TEST[skip:TBD]
@@ -343,51 +351,85 @@ The snippet below shows a part of a document with the annotated results:
"FlightDelay" : false,
...
"ml" : {
"FlightDelay_prediction" : false,
"top_classes" : [ <1>
{
"class_probability" : 0.9198146781161334,
"class_score" : 0.36964390728677926,
"class_name" : false
"class_name" : false,
"class_probability" : 0.3933807062505216,
"class_score" : 0.3933807062505216
},
{
"class_probability" : 0.08018532188386665,
"class_score" : 0.08018532188386665,
"class_name" : true
"class_name" : true,
"class_probability" : 0.6066192937494784,
"class_score" : 0.22857258275913037
}
],
"prediction_score" : 0.36964390728677926,
"FlightDelay_prediction" : false,
"prediction_probability" : 0.9198146781161334,
"prediction_probability" : 0.3933807062505216,
"prediction_score" : 0.3933807062505216,
"feature_importance" : [
{
"feature_name" : "DistanceMiles",
"importance" : -3.039025449178423
"feature_name" : "FlightTimeMin",
"importance" : -2.823868829093038,
"classes" : [
{
"class_name" : false,
"importance" : -2.823868829093038
},
{
"class_name" : true,
"importance" : 2.823868829093038
}
]
},
{
"feature_name" : "FlightTimeMin",
"importance" : 2.4980756273399045
}
"feature_name" : "DistanceMiles",
"importance" : 0.9872151818111125,
"classes" : [
{
"class_name" : false,
"importance" : 0.9872151818111125
},
{
"class_name" : true,
"importance" : -0.9872151818111125
}
]
},
...
],
"is_training" : false
}
----
<1> An array of values specifying the probability of the prediction and the
`class_score` for each class.
The `top_classes` object contains the predicted classes with the highest
scores. The `class_probability` is a value between 0 and 1. The higher the
number, the more confident the model is that the data point belongs to the named
class. In the example above, `false` has a `class_probability` of 0.91 while
`true` has only 0.08, so the prediction will be `false`. The `class_score` is a
function of the probability.
////
It is chosen so that the decision to assign the
data point to the class with the highest score maximizes the minimum recall of
any class.
////
<1> An array of values specifying the probability of the prediction and the
score for each class.
The `top_classes` array contains the predicted classes with the highest scores.
Each score has a value between 0 and 1. The higher the number, the more
confident the model is that the data point belongs to the named class. In the
example above, `false` has a `class_score` of 0.39 while `true` has only 0.22,
so the prediction will be `false`. The score and probability for the chosen
class also appear in the `prediction_probability` and `prediction_score` fields.
For more details about these values, see <<dfa-classification-interpret>>.
====
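The relationship between `top_classes` and the prediction fields can be sketched in a few lines of Python. This is an illustration only, assuming the label is simply the entry with the highest `class_score`, as the text above describes; the function name `pick_prediction` is hypothetical. The input values are copied from the updated example document.

```python
from typing import Any, Dict, List


def pick_prediction(top_classes: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Return the entry with the highest class_score as the prediction."""
    best = max(top_classes, key=lambda c: c["class_score"])
    return {
        "prediction": best["class_name"],
        "prediction_probability": best["class_probability"],
        "prediction_score": best["class_score"],
    }


# Values taken from the example document above.
top_classes = [
    {
        "class_name": False,
        "class_probability": 0.3933807062505216,
        "class_score": 0.3933807062505216,
    },
    {
        "class_name": True,
        "class_probability": 0.6066192937494784,
        "class_score": 0.22857258275913037,
    },
]

result = pick_prediction(top_classes)
print(result)
```

Note that in this document `true` has the higher `class_probability` but the lower `class_score`, so ranking by score rather than by probability yields `false`, which matches `FlightDelay_prediction` in the example.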

If you chose to calculate feature importance, the destination index also
contains `feature_importance` objects. This information indicates which fields
(also known as _features_ of a data point) had the biggest impact on each
prediction. In {kib}, you can see this information displayed in the form of a
decision plot:

[role="screenshot"]
image::images/flights-classification-importance.png["A decision plot for feature importance values in {kib}"]

The values in the decision plot are sorted in descending order, such that the
features with the most significant positive or negative impact appear at the top.
Each feature importance value represents its effect on the decision-making of
the model, increasing or decreasing the likelihood of the prediction.
//Thus in this example, the `FlightTimeMin` and `DistanceMiles` had the largest
//negative and positive feature importance values respectively and had the most
//significant influence on the prediction for this particular data point.
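To make the ordering concrete, here is a small Python sketch that ranks `feature_importance` entries by magnitude. It assumes that "most significant positive or negative impact" means sorting by absolute value, which is an interpretation of the text above and not a statement about how {kib} implements the plot. The function name `rank_by_magnitude` is hypothetical; the values come from the example document.

```python
from typing import Any, Dict, List


def rank_by_magnitude(feature_importance: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Sort entries so the largest absolute importance comes first (assumed ordering)."""
    return sorted(
        feature_importance,
        key=lambda entry: abs(entry["importance"]),
        reverse=True,
    )


# Top-level importance values taken from the example document.
feature_importance = [
    {"feature_name": "DistanceMiles", "importance": 0.9872151818111125},
    {"feature_name": "FlightTimeMin", "importance": -2.823868829093038},
]

ranked = rank_by_magnitude(feature_importance)
for entry in ranked:
    sign = "positive" if entry["importance"] > 0 else "negative"
    print(f'{entry["feature_name"]}: {entry["importance"]:.3f} ({sign})')
```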


[[flightdata-classification-evaluate]]
== Evaluating {classification} results

@@ -411,18 +453,18 @@ own results.
If you want to see the exact number of occurrences, select a quadrant in the
matrix. You can optionally filter the table to contain only testing data so you
can see how well the model performs on previously unseen data. In this example,
there are 2952 documents in the testing data that have the `true` class. 1893 of
them are predicted as `false`; this is called a _false negative_. 1059 are
there are 2952 documents in the testing data that have the `true` class. 2109 of
them are predicted as `false`; this is called a _false negative_. 843 are
predicted correctly as `true`; this is called a _true positive_. The confusion
matrix therefore shows us that 36% of the actual `true` values were correctly
predicted and 64% were incorrectly predicted in the test data set.
matrix therefore shows us that 29% of the actual `true` values were correctly
predicted and 71% were incorrectly predicted in the test data set.

Likewise if you select other quadrants in the matrix, it shows the number of
documents that have the `false` class as their actual value in the testing data
set. In this example, the model labeled 1033 documents out of 8802 correctly as
`false`; this is called a _true negative_. 7769 documents are predicted
incorrectly as `true`; this is called a _false positive_. Thus 12% of the actual
`false` values were correctly predicted and 88% were incorrectly predicted in
set. In this example, the model labeled 1544 documents out of 8802 correctly as
`false`; this is called a _true negative_. 7258 documents are predicted
incorrectly as `true`; this is called a _false positive_. Thus 18% of the actual
`false` values were correctly predicted and 82% were incorrectly predicted in
the test data set. When you perform {classanalysis} on your own data, it might
take multiple iterations before you are satisfied with the results and ready to
deploy the model.
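The percentages quoted above follow directly from the confusion matrix counts. The Python sketch below reproduces the arithmetic for the updated numbers; the helper name `class_rates` is hypothetical and rounding to whole percentages is an assumption made to match the prose.

```python
from typing import Tuple


def class_rates(correct: int, incorrect: int) -> Tuple[int, int]:
    """Return (percent correct, percent incorrect) for one actual class."""
    total = correct + incorrect
    return round(100 * correct / total), round(100 * incorrect / total)


# Actual class `true`: 843 true positives, 2109 false negatives (2952 documents).
true_rates = class_rates(correct=843, incorrect=2109)

# Actual class `false`: 1544 true negatives, 7258 false positives (8802 documents).
false_rates = class_rates(correct=1544, incorrect=7258)

print("true class:", true_rates)
print("false class:", false_rates)
```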
@@ -441,7 +483,7 @@ performed on the training data set.
--------------------------------------------------
POST _ml/data_frame/_evaluate
{
"index": "df-flight-delayed",
"index": "model-flight-delay-classification",
"query": {
"term": {
"ml.is_training": {
@@ -470,7 +512,7 @@ performed on previously unseen data:
--------------------------------------------------
POST _ml/data_frame/_evaluate
{
"index": "df-flight-delayed",
"index": "model-flight-delay-classification",
"query": {
"term": {
"ml.is_training": {
@@ -509,11 +551,11 @@ were misclassified (`actual_class` does not match `predicted_class`):
"predicted_classes" : [
{
"predicted_class" : "false", <3>
"count" : 1033 <4>
"count" : 1544 <4>
},
{
"predicted_class" : "true",
"count" : 7769
"count" : 7258
}
],
"other_predicted_class_doc_count" : 0
@@ -524,11 +566,11 @@ were misclassified (`actual_class` does not match `predicted_class`):
"predicted_classes" : [
{
"predicted_class" : "false",
"count" : 1893
"count" : 2109
},
{
"predicted_class" : "true",
"count" : 1059
"count" : 843
}
],
"other_predicted_class_doc_count" : 0
@@ -551,6 +593,7 @@ When you have trained a satisfactory model, you can deploy it to make prediction
about new data. Those steps are not covered in this example. See
<<ml-inference>>.

If you don't want to keep the {dfanalytics-job}, you can delete it by using the
{ref}/delete-dfanalytics.html[delete {dfanalytics-job} API]. When you delete
{dfanalytics-jobs}, the destination indices remain intact.
If you don't want to keep the {dfanalytics-job}, you can delete it in {kib} or
by using the {ref}/delete-dfanalytics.html[delete {dfanalytics-job} API]. When
you delete {dfanalytics-jobs} in {kib}, you have the option to also remove the
destination indices and index patterns.
