diff --git a/docs/en/stack/ml/df-analytics/flightdata-classification.asciidoc b/docs/en/stack/ml/df-analytics/flightdata-classification.asciidoc
index 4d2306d09..0ad36963b 100644
--- a/docs/en/stack/ml/df-analytics/flightdata-classification.asciidoc
+++ b/docs/en/stack/ml/df-analytics/flightdata-classification.asciidoc
@@ -10,7 +10,7 @@ destination, and whether or not the flight was delayed. When you create a
 {dfanalytics-job} for {classanalysis}, it learns the relationships between the
 fields in your data in order to predict the value of the _dependent variable_,
 which in this case is the boolean `FlightDelay` field. For an overview of these
-concepts, see <>.
+concepts, see <> and <>.
 
 TIP: If you want to view this example in a Jupyter notebook,
 https://github.com/elastic/examples/tree/master/Machine%20Learning/Analytics%20Jupyter%20Notebooks[click here].
@@ -95,7 +95,7 @@ To predict whether a specific flight is delayed:
 . Create a {dfanalytics-job}.
 +
 --
-You can use the wizard on the *Machine Learning* > *Data Frame Analaytics* tab
+You can use the wizard on the *{ml-app}* > *Data Frame Analytics* tab
 in {kib} or the {ref}/put-dfanalytics.html[create {dfanalytics-jobs}] API.
 
 [role="screenshot"]
@@ -188,11 +188,10 @@ POST _ml/data_frame/analytics/model-flight-delay-classification/_start
 +
 --
 [role="screenshot"]
-image::images/flights-classification-details.jpg["Statistics for a {dfanalytics-job} in {kib}"]
+image::images/flights-classification-details.png["Statistics for a {dfanalytics-job} in {kib}"]
 
-The job has four phases (reindexing, loading data, analyzing, and writing
-results). When all the phases have completed, the job stops and the results are
-ready to view and evaluate.
+When the job stops, the results are ready to view and evaluate. To learn more
+about the job phases, see <>.
 
 .API example
 [%collapsible]
@@ -224,47 +223,64 @@ The API call returns the following response:
           "progress_percent" : 100
         },
         {
-          "phase" : "analyzing",
+          "phase" : "feature_selection",
+          "progress_percent" : 100
+        },
+        {
+          "phase" : "coarse_parameter_search",
+          "progress_percent" : 100
+        },
+        {
+          "phase" : "fine_tuning_parameters",
+          "progress_percent" : 100
+        },
+        {
+          "phase" : "final_training",
           "progress_percent" : 100
         },
         {
           "phase" : "writing_results",
           "progress_percent" : 100
+        },
+        {
+          "phase" : "inference",
+          "progress_percent" : 100
         }
       ],
       "data_counts" : {
-        "training_docs_count" : 1306,
-        "test_docs_count" : 11753,
+        "training_docs_count" : 1305,
+        "test_docs_count" : 11754,
         "skipped_docs_count" : 0
       },
       "memory_usage" : {
-        "timestamp" : 1587424103000,
-        "peak_usage_bytes" : 923471
+        "timestamp" : 1597182490577,
+        "peak_usage_bytes" : 316613,
+        "status" : "ok"
       },
       "analysis_stats" : {
         "classification_stats" : {
-          "timestamp" : 1587424103000,
+          "timestamp" : 1597182490577,
           "iteration" : 18,
           "hyperparameters" : {
             "class_assignment_objective" : "maximize_minimum_recall",
-            "alpha" : 1.4193562525205259,
-            "downsample_factor" : 0.9351209341515412,
-            "eta" : 0.02331774683318904,
-            "eta_growth_rate_per_tree" : 1.0143154178910303,
+            "alpha" : 11.630957564710283,
+            "downsample_factor" : 0.9418550623091531,
+            "eta" : 0.032382816833064335,
+            "eta_growth_rate_per_tree" : 1.0198807182688074,
             "feature_bag_fraction" : 0.5504020748926737,
-            "gamma" : 0.08856070622714199,
-            "lambda" : 0.09965307629033043,
+            "gamma" : 0.08388388780939579,
+            "lambda" : 0.08628826657684924,
             "max_attempts_to_add_tree" : 3,
             "max_optimization_rounds_per_hyperparameter" : 2,
-            "max_trees" : 894,
+            "max_trees" : 644,
             "num_folds" : 5,
             "num_splits_per_feature" : 75,
-            "soft_tree_depth_limit" : 1.2312092443493399,
+            "soft_tree_depth_limit" : 7.550606337307592,
             "soft_tree_depth_tolerance" : 0.13448633124842999
           },
           "timing_stats" : {
-            "elapsed_time" : 71060,
-            "iteration_time" : 4513
+            "elapsed_time" : 44206,
+            "iteration_time" : 1884
           },
           "validation_loss" : {
             "loss_type" : "binomial_logistic",
@@ -289,15 +305,16 @@ When you view the {classification} results in {kib}, it shows contents of
 the destination index in a tabular format:
 
 [role="screenshot"]
-image::images/flights-classification-results.jpg["Results for a {dfanalytics-job} in {kib}"]
+image::images/flights-classification-results.png["Results for a {dfanalytics-job} in {kib}"]
 
 In this example, the table shows a column for the dependent variable
 (`FlightDelay`), which contains the ground truth values that you are trying to
 predict. It also shows a column for the predicted values
 (`ml.FlightDelay_prediction`), which were generated by the {classanalysis}. The
 `ml.is_training` column indicates whether the document was used in the training
-or testing data set. You can use this information to filter the table and the
-confusion matrix such that they contain only testing or training data.
+or testing data set. You can filter the table and the confusion matrix such that
+they contain only testing or training data. You can also enable histogram charts
+to get a better understanding of the distribution of values in your data.
 
 If you examine this destination index more closely in the *Discover* app in
 {kib} or use the standard {es} search command, you can see that the analysis
@@ -384,7 +401,7 @@ occurrences where the analysis classified data points correctly with their
 actual class and the percentage of occurrences where it misclassified them.
 
 [role="screenshot"]
-image::images/flights-classification-evaluation.jpg["Evaluation of a {dfanalytics-job} in {kib}"]
+image::images/flights-classification-evaluation.png["Evaluation of a {dfanalytics-job} in {kib}"]
 
 NOTE: As the sample data may change when it is loaded into {kib}, the results of
 the {classanalysis} can vary even if you use the same configuration as the
@@ -394,25 +411,26 @@ own results.
 If you want to see the exact number of occurrences, select a quadrant in the
 matrix. You can optionally filter the table to contain only testing data so you
 can see how well the model performs on previously unseen data. In this example,
-there are 2952 documents in the testing data that have the `true` class. 914 of
-them are predicted as `false`; this is called a _false negative_. 2038 are
+there are 2952 documents in the testing data that have the `true` class. 1893 of
+them are predicted as `false`; this is called a _false negative_. 1059 are
 predicted correctly as `true`; this is called a _true positive_. The confusion
-matrix therefore shows us that 69% of the actual `true` values were correctly
-predicted and 31% were incorrectly predicted in the test data set.
+matrix therefore shows us that 36% of the actual `true` values were correctly
+predicted and 64% were incorrectly predicted in the test data set.
 
 Likewise if you select other quadrants in the matrix, it shows the number of
 documents that have the `false` class as their actual value in the testing data
-set. In this example, the model labeled 7035 documents out of 8801 correctly as
-`false`; this is called a _true negative_. 1766 documents are predicted
-incorrectly as `true`; this is called a _false positive_. Thus 80% of the actual
-`false` values were correctly predicted and 20% were incorrectly predicted in
-the test data set.
-
-For more information about interpreting the evaluation metrics, see
-<>.
+set. In this example, the model labeled 1033 documents out of 8802 correctly as
+`false`; this is called a _true negative_. 7769 documents are predicted
+incorrectly as `true`; this is called a _false positive_. Thus 12% of the actual
+`false` values were correctly predicted and 88% were incorrectly predicted in
+the test data set. When you perform {classanalysis} on your own data, it might
+take multiple iterations before you are satisfied with the results and ready to
+deploy the model.
 
 You can also generate these metrics with the
-{ref}/evaluate-dfanalytics.html[{dfanalytics} evaluate API].
+{ref}/evaluate-dfanalytics.html[{dfanalytics} evaluate API]. For more
+information about interpreting the evaluation metrics, see
+<>.
 
 .API example
 [%collapsible]
@@ -487,15 +505,15 @@ were misclassified (`actual_class` does not match `predicted_class`):
       "confusion_matrix" : [
         {
           "actual_class" : "false", <1>
-          "actual_class_doc_count" : 8801, <2>
+          "actual_class_doc_count" : 8802, <2>
           "predicted_classes" : [
             {
               "predicted_class" : "false", <3>
-              "count" : 7035 <4>
+              "count" : 1033 <4>
             },
             {
               "predicted_class" : "true",
-              "count" : 1766
+              "count" : 7769
             }
           ],
           "other_predicted_class_doc_count" : 0
@@ -506,11 +524,11 @@ were misclassified (`actual_class` does not match `predicted_class`):
           "predicted_classes" : [
             {
               "predicted_class" : "false",
-              "count" : 914
+              "count" : 1893
             },
             {
               "predicted_class" : "true",
-              "count" : 2038
+              "count" : 1059
             }
           ],
           "other_predicted_class_doc_count" : 0
@@ -529,6 +547,10 @@ were misclassified (`actual_class` does not match `predicted_class`):
 predicted class.
 ====
 
+When you have trained a satisfactory model, you can deploy it to make predictions
+about new data. Those steps are not covered in this example. See
+<>.
+
 If you don't want to keep the {dfanalytics-job}, you can delete it by using the
 {ref}/delete-dfanalytics.html[delete {dfanalytics-job} API]. When you delete
 {dfanalytics-jobs}, the destination indices remain intact.
diff --git a/docs/en/stack/ml/df-analytics/flightdata-regression.asciidoc b/docs/en/stack/ml/df-analytics/flightdata-regression.asciidoc
index a86ccc981..7a4383761 100644
--- a/docs/en/stack/ml/df-analytics/flightdata-regression.asciidoc
+++ b/docs/en/stack/ml/df-analytics/flightdata-regression.asciidoc
@@ -10,7 +10,8 @@ distances, carriers, and the number of minutes each flight was delayed.
 When you create a {dfanalytics-job} for {reganalysis}, it learns the
 relationships between the fields in your data in order to predict the value of a
 _dependent variable_, which in this case is the numeric `FlightDelayMins` field.
-For an overview of these concepts, see <>.
+For an overview of these concepts, see <> and
+<>.
 
 [[flightdata-regression-data]]
 == Preparing your data
@@ -453,6 +454,9 @@ POST _ml/data_frame/_evaluate
 <1> Evaluate only the documents that are not part of the training data.
 ====
 
+When you have trained a satisfactory model, you can deploy it to make predictions
+about new data. Those steps are not covered in this example. See
+<>.
 If you don't want to keep the {dfanalytics-job}, you can delete it. For example,
 use {kib} or the
 {ref}/delete-dfanalytics.html[delete {dfanalytics-job} API].
diff --git a/docs/en/stack/ml/df-analytics/images/flights-classification-details.jpg b/docs/en/stack/ml/df-analytics/images/flights-classification-details.jpg
deleted file mode 100644
index 9c2c7e11c..000000000
Binary files a/docs/en/stack/ml/df-analytics/images/flights-classification-details.jpg and /dev/null differ
diff --git a/docs/en/stack/ml/df-analytics/images/flights-classification-details.png b/docs/en/stack/ml/df-analytics/images/flights-classification-details.png
new file mode 100644
index 000000000..08cc74bc1
Binary files /dev/null and b/docs/en/stack/ml/df-analytics/images/flights-classification-details.png differ
diff --git a/docs/en/stack/ml/df-analytics/images/flights-classification-evaluation.jpg b/docs/en/stack/ml/df-analytics/images/flights-classification-evaluation.jpg
deleted file mode 100644
index 3d0c056dd..000000000
Binary files a/docs/en/stack/ml/df-analytics/images/flights-classification-evaluation.jpg and /dev/null differ
diff --git a/docs/en/stack/ml/df-analytics/images/flights-classification-evaluation.png b/docs/en/stack/ml/df-analytics/images/flights-classification-evaluation.png
new file mode 100644
index 000000000..d6c8a2a4e
Binary files /dev/null and b/docs/en/stack/ml/df-analytics/images/flights-classification-evaluation.png differ
diff --git a/docs/en/stack/ml/df-analytics/images/flights-classification-job-1.png b/docs/en/stack/ml/df-analytics/images/flights-classification-job-1.png
index 328f575b4..ebca49508 100644
Binary files a/docs/en/stack/ml/df-analytics/images/flights-classification-job-1.png and b/docs/en/stack/ml/df-analytics/images/flights-classification-job-1.png differ
diff --git a/docs/en/stack/ml/df-analytics/images/flights-classification-job-2.png b/docs/en/stack/ml/df-analytics/images/flights-classification-job-2.png
index 8ecbde697..1a77e8190 100644
Binary files a/docs/en/stack/ml/df-analytics/images/flights-classification-job-2.png and b/docs/en/stack/ml/df-analytics/images/flights-classification-job-2.png differ
diff --git a/docs/en/stack/ml/df-analytics/images/flights-classification-results.jpg b/docs/en/stack/ml/df-analytics/images/flights-classification-results.jpg
deleted file mode 100644
index 059b08151..000000000
Binary files a/docs/en/stack/ml/df-analytics/images/flights-classification-results.jpg and /dev/null differ
diff --git a/docs/en/stack/ml/df-analytics/images/flights-classification-results.png b/docs/en/stack/ml/df-analytics/images/flights-classification-results.png
new file mode 100644
index 000000000..31fe7a807
Binary files /dev/null and b/docs/en/stack/ml/df-analytics/images/flights-classification-results.png differ
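
The updated prose percentages (36%/64% for the `true` class, 12%/88% for the `false` class) can be cross-checked against the counts in the updated evaluate API response. The following is a minimal sketch of that arithmetic, not part of the patch. The dictionary layout and the helper names `row_totals` and `row_percentages` are assumptions made for illustration; only the four counts and the `test_docs_count` value come from the diff.

```python
# Counts copied from the updated confusion matrix in the diff above.
# Layout (an assumption of this sketch): {actual_class: {predicted_class: count}}.
confusion_matrix = {
    "false": {"false": 1033, "true": 7769},
    "true": {"false": 1893, "true": 1059},
}

# Value of "test_docs_count" in the updated job stats response.
TEST_DOCS_COUNT = 11754


def row_totals(matrix):
    """Return the number of documents per actual class."""
    return {actual: sum(predicted.values()) for actual, predicted in matrix.items()}


def row_percentages(matrix):
    """Return, per actual class, each predicted class as a rounded percentage."""
    percentages = {}
    for actual, predicted in matrix.items():
        total = sum(predicted.values())
        percentages[actual] = {
            label: round(100 * count / total) for label, count in predicted.items()
        }
    return percentages


if __name__ == "__main__":
    totals = row_totals(confusion_matrix)
    print("documents per actual class:", totals)
    print("matches test_docs_count:", sum(totals.values()) == TEST_DOCS_COUNT)
    print("row percentages:", row_percentages(confusion_matrix))
```

Under these assumptions the per-class totals agree with `actual_class_doc_count` (8802 for `false`, 2952 for `true`), their sum agrees with `test_docs_count`, and the rounded row percentages agree with the figures quoted in the updated paragraphs.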