Commit 8960791: [DOCS] More edits
lcawl committed Sep 14, 2020 (1 parent 31110ae)
Showing 1 changed file: docs/en/stack/ml/df-analytics/flightdata-classification.asciidoc
.. If you want to experiment with <<ml-feature-importance,feature importance>>,
specify a value in the advanced configuration options. In this example, we
choose to return a maximum of 10 feature importance values per document. This
option affects the speed of the analysis, so by default it is disabled.
.. Use the default memory limit for the job. If the job requires more than this
amount of memory, it fails to start. If the available memory on the node is
limited, this setting makes it possible to prevent job execution.
or testing data set. You can filter the table and the confusion matrix such that
they contain only testing or training data. You can also enable histogram charts
to get a better understanding of the distribution of values in your data.

If you want to understand how certain the model is about each prediction, you
can examine its probability and score (`ml.prediction_probability` and
`ml.prediction_score`). These values range between 0 and 1; the higher the
number, the more confident the model is that the data point belongs to the named
class. If you examine this destination index more closely in the *Discover* app
in {kib} or use the standard {es} search command, you can see that the analysis
predicts the probability of all possible classes for the dependent variable. The `top_classes` object contains the predicted classes with the highest scores.
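The structure described above can be sketched in a few lines of Python. This is a hypothetical, hand-built document shaped like the results fields the text names (`ml.top_classes`, `ml.prediction_probability`, `ml.prediction_score`); the numeric values are illustrative, not taken from a real analysis:

```python
# A hypothetical destination-index document, shaped like the results
# described above. Field names follow the text; values are illustrative.
doc = {
    "ml": {
        "FlightDelay_prediction": "false",
        "prediction_probability": 0.61,
        "prediction_score": 0.39,
        "top_classes": [
            {"class_name": "false", "class_probability": 0.61, "class_score": 0.39},
            {"class_name": "true", "class_probability": 0.39, "class_score": 0.22},
        ],
    }
}

# The class with the highest score is the prediction.
top = max(doc["ml"]["top_classes"], key=lambda c: c["class_score"])
print(top["class_name"])         # -> false
print(top["class_probability"])  # how confident the model is in that class
```

The chosen class's probability and score also appear at the top level of the `ml` object, which is what the results table surfaces.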

.API example
[%collapsible]
The snippet below shows a part of a document with the annotated results:
<1> An array of values specifying the probability of the prediction and the
score for each class.
The `top_classes` object contains the predicted classes with the highest scores.
The class with the highest score is the prediction. In this example, `false` has
a `class_score` of 0.39 while `true` has only 0.22, so the prediction will be
`false`. For more details about these values, see
<<dfa-classification-interpret>>.
====

contains `ml.feature_importance` objects. Every field that is included in the
{classanalysis} (known as a _feature_ of the data point) is assigned a feature
importance value. However, only the most significant values (in this case, the
top 10) are stored in the index. These values indicate which features had the
biggest impact (positive or negative) on each prediction. In {kib}, you can see
this information displayed in the form of a decision plot:

[role="screenshot"]
image::images/flights-classification-importance.png["A decision plot for feature importance values in {kib}"]

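The "only the most significant values are stored" behavior described above can be sketched as a top-N selection by absolute magnitude. The feature names and numbers here are illustrative, and `N` stands in for the job's configured maximum (10 in this example):

```python
# Hypothetical feature importance values for one prediction; every feature
# gets a value, but only the N largest by absolute magnitude are stored.
all_importance = {
    "FlightTimeMin": 0.45,
    "DistanceMiles": -0.18,
    "OriginWeather": 0.09,
    "DestCityName": -0.02,
}

N = 3  # stands in for the configured maximum number of values per document
stored = sorted(all_importance.items(), key=lambda kv: abs(kv[1]), reverse=True)[:N]
print([name for name, _ in stored])  # -> ['FlightTimeMin', 'DistanceMiles', 'OriginWeather']
```

Note that the sign is kept: a large negative value is just as significant as a large positive one, which is why the selection ranks by absolute value.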
The sum of the feature importance values for a class (in this example, `false`)
in this data point approximates the logarithm of its odds
(or {wikipedia}/Logit[log-odds]).
////
Does this mean the sum of the feature importance values for false in this
example should equal the logit(p), where p is the class_probability for false?
Does this imply that the feature importance value itself is the result of a
logit function? Or that we use the function to merely represent the
distribution of feature importance values?
////
While the probability of a class ranges between 0 and 1, its log-odds range
between negative and positive infinity. In {kib}, the decision path for each
class starts near zero, which represents a class probability of 0.5.
// Is this true for multi-class classification or just binary classification?
From there, the feature importance values are added to the decision path. The
features with the most significant positive or negative impact appear at the top.
Thus in this example, the features related to flight time and distance had the
most significant influence on this prediction. This type of information can
help you to understand how models arrive at their predictions. It can also
indicate which aspects of your data set are most influential or least useful
when you are training and tuning your model.
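The log-odds relationship above can be worked through numerically. This sketch uses illustrative feature importance values (not real analysis output) for a single document and class: their sum approximates the log-odds of that class, and the standard sigmoid converts log-odds back to a probability, with a sum of zero corresponding to a probability of 0.5:

```python
import math

# Illustrative feature importance values for one class on one document;
# the names and numbers are made up for this sketch.
feature_importance = {
    "FlightTimeMin": 0.45,
    "DistanceMiles": -0.18,
    "OriginWeather": 0.09,
}

# The sum approximates the log-odds (logit) of the class probability.
log_odds = sum(feature_importance.values())        # 0.36
probability = 1 / (1 + math.exp(-log_odds))        # sigmoid: back to [0, 1]

print(round(log_odds, 2))     # -> 0.36
print(round(probability, 2))  # -> 0.59

# The decision path starts near zero, i.e. a class probability of 0.5:
assert 1 / (1 + math.exp(-0.0)) == 0.5
```

This is why the decision path can start "near zero" even though probabilities live in [0, 1]: the path is drawn in log-odds space, where zero is the neutral point.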


[[flightdata-classification-evaluate]]