[DOCS] Add feature importance examples

elastic · Sep 10, 2020 · commit 31110ae
1 parent bc1a639

Showing 7 changed files with 116 additions and 74 deletions.
diff --git a/docs/en/stack/ml/df-analytics/dfa-classification.asciidoc b/docs/en/stack/ml/df-analytics/dfa-classification.asciidoc
@@ -114,7 +114,8 @@ in the `analyzed_fields` object when configuring the {dfanalytics-job}.
 == Interpreting {classification} results
 
 The following sections help you understand and interpret the results of a 
-{classanalysis}.
+{classanalysis}. To see example results, refer to
+<<flightdata-classification-results>>.
 
 [[dfa-classification-class-probability]]
 === `class_probability`
@@ -123,15 +124,13 @@ The value of `class_probability` shows how likely it is that a given data point
 belongs to a certain class. It is a value between 0 and 1. The higher the 
 number, the higher the probability that the data point belongs to the named 
 class. This information is stored in the `top_classes` array for each document 
-in your destination index. See the
-{ml-docs}/flightdata-classification.html#flightdata-classification-results[Viewing {classification} results]
-section in the {classification} example.
+in your destination index.
 
 [[dfa-classification-class-score]]
 === `class_score`
 
 The value of `class_score` controls the probability at which a class label is 
-assigned to a data point. In normal case – that you maximize the number of 
+assigned to a data point. In the normal case – that you maximize the number of 
 correct labels – a class label is assigned when its predicted probability is 
 greater than 0.5. The `class_score` makes it possible to change this behavior, 
 so it can be less than or greater than 0.5. For example, suppose our two classes 

diff --git a/docs/en/stack/ml/df-analytics/flightdata-classification.asciidoc b/docs/en/stack/ml/df-analytics/flightdata-classification.asciidoc
@@ -104,6 +104,9 @@ image::images/flights-classification-job-1.png["Creating a {dfanalytics-job} in
 [role="screenshot"]
 image::images/flights-classification-job-2.png["Creating a {dfanalytics-job} in {kib} – continued"]
 
+[role="screenshot"]
+image::images/flights-classification-job-3.png["Creating a {dfanalytics-job} in {kib} – advanced options"]
+
 .. Choose `kibana_sample_data_flights` as the source index.
 .. Choose `classification` as the job type.
 .. Choose `FlightDelay` as the dependent variable, which is the field that we
@@ -116,15 +119,18 @@ recommended to exclude fields that either contain erroneous data or describe the
 source data for training. While that value is low for this example, for many
 large data sets using a small training sample greatly reduces runtime without 
 impacting accuracy.
-.. Use the default feature importance values.
+.. If you want to experiment with <<ml-feature-importance,feature importance>>,
+specify a value in the advanced configuration options. In this example, we
+choose to return a maximum of 10 feature importance values per document. This
+option affects the speed of the analysis, however, so by default it is disabled. 
 .. Use the default memory limit for the job. If the job requires more than this 
 amount of memory, it fails to start. If the available memory on the node is
 limited, this setting makes it possible to prevent job execution.
 .. Add a job ID and optionally a job description.
 .. Add the name of the destination index that will contain the results of the
-analysis. It will contain a copy of the source index data where each document is
-annotated with the results. If the index does not exist, it will be created
-automatically.
+analysis. In {kib}, the index name matches the job ID by default. It will
+contain a copy of the source index data where each document is annotated with
+the results. If the index does not exist, it will be created automatically.
 
 
 .API example
@@ -140,13 +146,15 @@ PUT _ml/data_frame/analytics/model-flight-delay-classification
     ]
   },
   "dest": {
-    "index": "df-flight-delayed",
+    "index": "model-flight-delay-classification",
     "results_field": "ml" <1>
   },
   "analysis": {
     "classification": {
       "dependent_variable": "FlightDelay",
-      "training_percent": 10
+      "training_percent": 10,
+      "num_top_classes": 10,
+      "num_top_feature_importance_values": 10 <2>
     }
   },
   "analyzed_fields": {
@@ -160,7 +168,8 @@ PUT _ml/data_frame/analytics/model-flight-delay-classification
 }
 --------------------------------------------------
 // TEST[skip:setup kibana sample data]
-<1> The field name in the `dest` index that contains the analysis results. 
+<1> The field name in the `dest` index that contains the analysis results.
+<2> To disable feature importance calculations, omit this option.
 ====
 --
 
@@ -259,32 +268,31 @@ The API call returns the following response:
       },
       "analysis_stats" : {
         "classification_stats" : {
-          "timestamp" : 1597182490577,
+          "timestamp" : 1599684771114,
           "iteration" : 18,
           "hyperparameters" : {
             "class_assignment_objective" : "maximize_minimum_recall",
-            "alpha" : 11.630957564710283,
-            "downsample_factor" : 0.9418550623091531,
-            "eta" : 0.032382816833064335,
-            "eta_growth_rate_per_tree" : 1.0198807182688074,
-            "feature_bag_fraction" : 0.5504020748926737,
-            "gamma" : 0.08388388780939579,
-            "lambda" : 0.08628826657684924,
+            "alpha" : 6.648298686326093,
+            "downsample_factor" : 0.7435400845721971,
+            "eta" : 0.039957516522980074,
+            "eta_growth_rate_per_tree" : 1.0168333294220058,
+            "feature_bag_fraction" : 0.49761652263010625,
+            "gamma" : 0.21224183609258152,
+            "lambda" : 0.2572621613644672,
             "max_attempts_to_add_tree" : 3,
             "max_optimization_rounds_per_hyperparameter" : 2,
-            "max_trees" : 644,
+            "max_trees" : 590,
             "num_folds" : 5,
             "num_splits_per_feature" : 75,
-            "soft_tree_depth_limit" : 7.550606337307592,
-            "soft_tree_depth_tolerance" : 0.13448633124842999
+            "soft_tree_depth_limit" : 3.2719032647442443,
+            "soft_tree_depth_tolerance" : 0.14970565884872958
           },
           "timing_stats" : {
-            "elapsed_time" : 44206,
-            "iteration_time" : 1884
+            "elapsed_time" : 37915,
+            "iteration_time" : 2552
           },
           "validation_loss" : {
-            "loss_type" : "binomial_logistic",
-            "fold_values" : [ ]
+            "loss_type" : "binomial_logistic"
           }
         }
       }
@@ -330,7 +338,7 @@ values. A higher number means that the model is more confident.
 ====
 [source,console]
 --------------------------------------------------
-GET df-flight-delayed/_search
+GET model-flight-delay-classification/_search
 --------------------------------------------------
 // TEST[skip:TBD]
 
@@ -343,51 +351,85 @@ The snippet below shows a part of a document with the annotated results:
           "FlightDelay" : false,
           ...
           "ml" : {
+            "FlightDelay_prediction" : false,
             "top_classes" : [ <1>
               {
-                "class_probability" : 0.9198146781161334, 
-               "class_score" : 0.36964390728677926, 
-               "class_name" : false
+                "class_name" : false,
+                "class_probability" : 0.3933807062505216,
+                "class_score" : 0.3933807062505216
               },
               {
-                "class_probability" : 0.08018532188386665,
-                 "class_score" : 0.08018532188386665,
-                 "class_name" : true
+                "class_name" : true,
+                "class_probability" : 0.6066192937494784,
+                "class_score" : 0.22857258275913037
               }
             ],
-            "prediction_score" : 0.36964390728677926,
-            "FlightDelay_prediction" : false,
-            "prediction_probability" : 0.9198146781161334,
+            "prediction_probability" : 0.3933807062505216,
+            "prediction_score" : 0.3933807062505216,
             "feature_importance" : [
               {
-                "feature_name" : "DistanceMiles",
-                "importance" : -3.039025449178423
+                "feature_name" : "FlightTimeMin",
+                "importance" : -2.823868829093038,
+                "classes" : [
+                  {
+                    "class_name" : false,
+                    "importance" : -2.823868829093038
+                  },
+                  {
+                    "class_name" : true,
+                    "importance" : 2.823868829093038
+                  }
+                ]
               },
               {
-                "feature_name" : "FlightTimeMin",
-                "importance" : 2.4980756273399045
-              }
+                "feature_name" : "DistanceMiles",
+                "importance" : 0.9872151818111125,
+                "classes" : [
+                  {
+                    "class_name" : false,
+                    "importance" : 0.9872151818111125
+                  },
+                  {
+                    "class_name" : true,
+                    "importance" : -0.9872151818111125
+                  }
+                ]
+              },
+              ...
             ],
             "is_training" : false
           }
 ----
-<1> An array of values specifying the probability of the prediction and the 
-`class_score` for each class. 
-
-The `top_classes` object contains the predicted classes with the highest 
-scores. The `class_probability` is a value between 0 and 1. The higher the 
-number, the more confident the model is that the data point belongs to the named 
-class. In the example above, `false` has a `class_probability` of 0.91 while 
-`true` has only 0.08, so the prediction will be `false`. The `class_score` is a 
-function of the probability.
-
-////
-It is chosen so that the decision to assign the 
-data point to the class with the highest score maximizes the minimum recall of 
-any class.
-////
+<1> An array of values specifying the probability of the prediction and the
+score for each class. 
+
+The `top_classes` object contains the predicted classes with the highest scores.
+Each score has a value between 0 and 1. The higher the number, the more
+confident the model is that the data point belongs to the named class. In the
+example above, `false` has a `class_score` of 0.39 while `true` has only 0.22,
+so the prediction will be `false`. The score and probability for the chosen
+class also appear in the `prediction_probability` and `prediction_score` fields. 
+For more details about these values, see <<dfa-classification-interpret>>.
 ====
 
+If you chose to calculate feature importance, the destination index also
+contains `feature_importance` objects. This information indicates which fields
+(also known as _features_ of a data point) had the biggest impact on each
+prediction. In {kib}, you can see this information displayed in the form of a
+decision plot:
+
+[role="screenshot"]
+image::images/flights-classification-importance.png["A decision plot for feature importance values in {kib}"]
+
+The values in the decision plot are sorted in descending order, such that the
+features with the most significant positive or negative impact appear at the top.
+Each feature importance value represents its effect on the decision-making of
+the model, increasing or decreasing the likelihood of the prediction.
+//Thus in this example, the `FlightTimeMin` and `DistanceMiles` had the largest
+//negative and positive feature importance values respectively and had the most
+//significant influence on the prediction for this particular data point.
+
+
 [[flightdata-classification-evaluate]]
 == Evaluating {classification} results
 
@@ -411,18 +453,18 @@ own results.
 If you want to see the exact number of occurrences, select a quadrant in the
 matrix. You can optionally filter the table to contain only testing data so you
 can see how well the model performs on previously unseen data. In this example,
-there are 2952 documents in the testing data that have the `true` class. 1893 of
-them are predicted as `false`; this is called a _false negative_. 1059 are
+there are 2952 documents in the testing data that have the `true` class. 2109 of
+them are predicted as `false`; this is called a _false negative_. 843 are
 predicted correctly as `true`; this is called a _true positive_. The confusion
-matrix therefore shows us that 36% of the actual `true` values were correctly
-predicted and 64% were incorrectly predicted in the test data set.
+matrix therefore shows us that 29% of the actual `true` values were correctly
+predicted and 71% were incorrectly predicted in the test data set.
 
 Likewise if you select other quadrants in the matrix, it shows the number of
 documents that have the `false` class as their actual value in the testing data
-set. In this example, the model labeled 1033 documents out of 8802 correctly as
-`false`; this is called a _true negative_. 7769 documents are predicted
-incorrectly as `true`; this is called a _false positive_. Thus 12% of the actual
-`false` values were correctly predicted and 88% were incorrectly predicted in
+set. In this example, the model labeled 1544 documents out of 8802 correctly as
+`false`; this is called a _true negative_. 7258 documents are predicted
+incorrectly as `true`; this is called a _false positive_. Thus 18% of the actual
+`false` values were correctly predicted and 82% were incorrectly predicted in
 the test data set. When you perform {classanalysis} on your own data, it might
 take multiple iterations before you are satisfied with the results and ready to
 deploy the model.
@@ -441,7 +483,7 @@ performed on the training data set.
 --------------------------------------------------
 POST _ml/data_frame/_evaluate
 {
- "index": "df-flight-delayed",
+ "index": "model-flight-delay-classification",
    "query": {
     "term": {
       "ml.is_training": {
@@ -470,7 +512,7 @@ performed on previously unseen data:
 --------------------------------------------------
 POST _ml/data_frame/_evaluate
 {
- "index": "df-flight-delayed",
+ "index": "model-flight-delay-classification",
    "query": {
     "term": {
       "ml.is_training": {
@@ -509,11 +551,11 @@ were misclassified (`actual_class` does not match `predicted_class`):
           "predicted_classes" : [
             {
               "predicted_class" : "false", <3>
-              "count" : 1033 <4>
+              "count" : 1544 <4>
             },
             {
               "predicted_class" : "true",
-              "count" : 7769
+              "count" : 7258
             }
           ],
           "other_predicted_class_doc_count" : 0
@@ -524,11 +566,11 @@ were misclassified (`actual_class` does not match `predicted_class`):
           "predicted_classes" : [
             {
               "predicted_class" : "false",
-              "count" : 1893
+              "count" : 2109
             },
             {
               "predicted_class" : "true",
-              "count" : 1059
+              "count" : 843
             }
           ],
           "other_predicted_class_doc_count" : 0
@@ -551,6 +593,7 @@ When you have trained a satisfactory model, you can deploy it to make prediction
 about new data. Those steps are not covered in this example. See
 <<ml-inference>>.
 
-If you don't want to keep the {dfanalytics-job}, you can delete it by using the 
-{ref}/delete-dfanalytics.html[delete {dfanalytics-job} API]. When you delete 
-{dfanalytics-jobs}, the destination indices remain intact.
+If you don't want to keep the {dfanalytics-job}, you can delete it in {kib} or
+by using the {ref}/delete-dfanalytics.html[delete {dfanalytics-job} API]. When
+you delete {dfanalytics-jobs} in {kib}, you have the option to also remove the 
+destination indices and index patterns.
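
Note on the updated `top_classes` example in `flightdata-classification.asciidoc`: the new text says the prediction is `false` because it has the higher `class_score`, even though `true` has the higher `class_probability`. The Python sketch below reproduces that outcome from the values shown in the example document. The helper name `pick_by_score` is hypothetical, and this only illustrates the documented rule of choosing the class with the highest score. It is not how {es} computes the prediction internally.

```python
# Values copied from the updated example document in this commit.
top_classes = [
    {
        "class_name": False,
        "class_probability": 0.3933807062505216,
        "class_score": 0.3933807062505216,
    },
    {
        "class_name": True,
        "class_probability": 0.6066192937494784,
        "class_score": 0.22857258275913037,
    },
]


def pick_by_score(classes):
    """Return the entry with the highest class_score (hypothetical helper)."""
    return max(classes, key=lambda c: c["class_score"])


chosen = pick_by_score(top_classes)

# The chosen entry should line up with FlightDelay_prediction,
# prediction_probability, and prediction_score in the example document.
print(chosen["class_name"], chosen["class_probability"], chosen["class_score"])
```

The selected entry matches the `FlightDelay_prediction`, `prediction_probability`, and `prediction_score` fields in the same document, which is a quick way to sanity-check that the example values are internally consistent.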
diff --git a/docs/en/stack/ml/df-analytics/images/flights-classification-details.png b/docs/en/stack/ml/df-analytics/images/flights-classification-details.png
diff --git a/docs/en/stack/ml/df-analytics/images/flights-classification-evaluation.png b/docs/en/stack/ml/df-analytics/images/flights-classification-evaluation.png
diff --git a/docs/en/stack/ml/df-analytics/images/flights-classification-importance.png b/docs/en/stack/ml/df-analytics/images/flights-classification-importance.png
diff --git a/docs/en/stack/ml/df-analytics/images/flights-classification-job-3.png b/docs/en/stack/ml/df-analytics/images/flights-classification-job-3.png
diff --git a/docs/en/stack/ml/df-analytics/images/flights-classification-results.png b/docs/en/stack/ml/df-analytics/images/flights-classification-results.png
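
Note on the updated confusion matrix figures: the percentages quoted in the revised paragraph (29% and 71% for the `true` class, 18% and 82% for the `false` class) can be cross-checked against the counts in the updated `_evaluate` response. The following Python sketch does that arithmetic. The dictionary layout and the function name `percent_correct` are assumptions made for illustration, not part of any {es} client API.

```python
# Counts copied from the updated _evaluate response in this commit.
# Outer key: actual class. Inner key: predicted class.
confusion = {
    "false": {"false": 1544, "true": 7258},
    "true": {"false": 2109, "true": 843},
}


def percent_correct(matrix, actual_class):
    """Percentage of documents with `actual_class` that were predicted as that class."""
    row = matrix[actual_class]
    return 100.0 * row[actual_class] / sum(row.values())


for actual in ("true", "false"):
    correct = percent_correct(confusion, actual)
    print(f"actual={actual}: {correct:.0f}% correct, {100 - correct:.0f}% incorrect")
```

The row totals (2952 and 8802) also match the document counts stated in the prose, so the text and the API response in this commit agree with each other.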