Add Kmeans and AD command documentation (#493) (#497)

Signed-off-by: jackieyanghan <[email protected]> (cherry picked from commit ee4bce0)
opensearch-project · Mar 16, 2022 · be42907
1 parent 31ee54b
commit be42907
Showing 9 changed files with 2,082 additions and 2 deletions.
diff --git a/docs/category.json b/docs/category.json
@@ -7,6 +7,7 @@
     "user/admin/settings.rst"
   ],
   "ppl_cli": [
+    "user/ppl/cmd/ad.rst",
     "user/ppl/cmd/dedup.rst",
     "user/ppl/cmd/eval.rst",
     "user/ppl/cmd/fields.rst",

diff --git a/docs/user/ppl/cmd/ad.rst b/docs/user/ppl/cmd/ad.rst
@@ -0,0 +1,61 @@
+=============
+ad
+=============
+
+.. rubric:: Table of contents
+
+.. contents::
+   :local:
+   :depth: 2
+
+
+Description
+============
+| The ``ad`` command applies the Random Cut Forest (RCF) algorithm in the ml-commons plugin to the search result returned by a PPL command. Depending on the input, one of two RCF variants is used: fixed-in-time RCF for time-series data, or batch RCF for non-time-series data.
+
+
+Fixed In Time RCF For Time-series Data Command Syntax
+=====================================================
+ad <shingle_size> <time_decay> <time_field>
+
+* shingle_size: optional. A shingle is a consecutive sequence of the most recent records. The default value is 8.
+* time_decay: optional. It specifies how much of the recent past to consider when computing an anomaly score. The default value is 0.001.
+* time_field: mandatory. It specifies the time field for RCF to use as time-series data.
+
+
+Batch RCF for Non-time-series Data Command Syntax
+=================================================
+ad <shingle_size> <time_decay>
+
+* shingle_size: optional. A shingle is a consecutive sequence of the most recent records. The default value is 8.
+* time_decay: optional. It specifies how much of the recent past to consider when computing an anomaly score. The default value is 0.001.
+
+
+Example1: Detecting events in New York City from taxi ridership data with time-series data
+==========================================================================================
+
+The example trains an RCF model and uses the model to detect anomalies in the time-series ridership data.
+
+PPL query::
+
+    os> source=nyc_taxi | fields value, timestamp | AD time_field='timestamp' | where value=10844.0
+    +----------+---------------+-------+---------------+
+    | value    | timestamp     | score | anomaly_grade |
+    |----------+---------------+-------+---------------|
+    | 10844.0  | 1404172800000 | 0.0   |  0.0          |
+    +----------+---------------+-------+---------------+
+
+
+Example2: Detecting events in New York City from taxi ridership data with non-time-series data
+==============================================================================================
+
+The example trains an RCF model and uses the model to detect anomalies in the non-time-series ridership data.
+
+PPL query::
+
+    os> source=nyc_taxi | fields value | AD | where value=10844.0
+    +----------+--------+-----------+
+    | value    | score  | anomalous |
+    |----------+--------+-----------|
+    | 10844.0  | 0.0    | false     |
+    +----------+--------+-----------+
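
Note on ``shingle_size`` as described in ad.rst above: a shingle is a sliding window over the most recent records. The sketch below is only an illustration of that windowing idea in plain Python. It is not the ml-commons or RCF implementation, and the function name ``shingles`` is hypothetical. It assumes that no shingle is produced until enough records have arrived to fill the window, which may differ from how the plugin handles the first records.

```python
from collections import deque
from typing import Iterable, Iterator, List


def shingles(values: Iterable[float], shingle_size: int = 8) -> Iterator[List[float]]:
    """Yield each window holding the `shingle_size` most recent values.

    Hypothetical helper for illustration; the default of 8 mirrors the
    documented default of the `ad` command.
    """
    if shingle_size < 1:
        raise ValueError("shingle_size must be at least 1")
    window = deque(maxlen=shingle_size)  # the oldest value drops out automatically
    for value in values:
        window.append(value)
        # Assumption: emit only full windows; earlier records lack enough history.
        if len(window) == shingle_size:
            yield list(window)


if __name__ == "__main__":
    # Made-up sample values, not taken from the nyc_taxi index.
    sample = [10.0, 12.0, 11.0, 50.0, 13.0]
    for shingle in shingles(sample, shingle_size=3):
        print(shingle)
```

A larger ``shingle_size`` gives each scored point more surrounding context, at the cost of needing more records before the first full window is available.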
diff --git a/docs/user/ppl/cmd/kmeans.rst b/docs/user/ppl/cmd/kmeans.rst
@@ -0,0 +1,38 @@
+=============
+kmeans
+=============
+
+.. rubric:: Table of contents
+
+.. contents::
+   :local:
+   :depth: 2
+
+
+Description
+============
+| The ``kmeans`` command applies the k-means algorithm in the ml-commons plugin to the search result returned by a PPL command.
+
+
+Syntax
+======
+kmeans <cluster-number>
+
+* cluster-number: mandatory. The number of clusters you want to group your data points into.
+
+
+Example: Clustering of Iris Dataset
+===================================
+
+The example shows how to group samples of three Iris species (Iris setosa, Iris virginica and Iris versicolor) into clusters based on four features measured from each sample: the length and width of the sepals and petals.
+
+PPL query::
+
+    os> source=iris_data | fields sepal_length_in_cm, sepal_width_in_cm, petal_length_in_cm, petal_width_in_cm | kmeans 3
+    +--------------------+-------------------+--------------------+-------------------+-----------+
+    | sepal_length_in_cm | sepal_width_in_cm | petal_length_in_cm | petal_width_in_cm | ClusterID |
+    |--------------------+-------------------+--------------------+-------------------+-----------|
+    | 5.1                | 3.5               | 1.4                | 0.2               | 1         |
+    | 5.6                | 3.0               | 4.1                | 1.3               | 0         |
+    | 6.7                | 2.5               | 5.8                | 1.8               | 2         |
+    +--------------------+-------------------+--------------------+-------------------+-----------+
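
Note on what ``kmeans <cluster-number>`` computes: the sketch below shows the standard k-means iteration (Lloyd's algorithm) in plain Python. It assigns each point to its nearest centroid, then moves each centroid to the mean of its members. It is a minimal illustration only, not the ml-commons implementation. The function names, the random initialization, the ``iterations`` and ``seed`` parameters, and the stopping rule are all assumptions made for this example.

```python
import random
from typing import List, Sequence, Tuple

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(
    points: List[Point], k: int, iterations: int = 100, seed: int = 0
) -> Tuple[List[int], List[List[float]]]:
    """Return (cluster id per point, centroids) using Lloyd's algorithm."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    # Assumption: start from k distinct input points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [-1] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids will not move
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids
```

As with the ``ClusterID`` column in the query output above, the cluster numbers are arbitrary identifiers: a different initialization can assign the same group of points a different number.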
diff --git a/docs/user/ppl/index.rst b/docs/user/ppl/index.rst
@@ -36,12 +36,16 @@ The query start with search command and then flowing a set of command delimited
 
   - `Syntax <cmd/syntax.rst>`_
 
+  - `ad command <cmd/ad.rst>`_
+
   - `dedup command <cmd/dedup.rst>`_
 
   - `eval command <cmd/eval.rst>`_
 
   - `fields command <cmd/fields.rst>`_
 
+  - `kmeans command <cmd/kmeans.rst>`_
+
   - `parse command <cmd/parse.rst>`_
 
   - `rename command <cmd/rename.rst>`_

diff --git a/doctest/build.gradle b/doctest/build.gradle
@@ -3,6 +3,7 @@
  * SPDX-License-Identifier: Apache-2.0
  */
 
+import java.util.concurrent.Callable
 import org.opensearch.gradle.testclusters.RunTask
 
 plugins {
@@ -49,7 +50,18 @@ clean.dependsOn(cleanBootstrap)
 
 testClusters {
     docTestCluster {
-        plugin ':plugin'
+        plugin(provider(new Callable<RegularFile>(){
+            @Override
+            RegularFile call() throws Exception {
+                return new RegularFile() {
+                    @Override
+                    File getAsFile() {
+                        return fileTree("resources/ml-commons").getSingleFile()
+                    }
+                }
+            }
+        }))
+
         testDistribution = 'integ_test'
     }
 }

diff --git a/doctest/resources/ml-commons/opensearch-ml-1.3.0.0-SNAPSHOT.zip b/doctest/resources/ml-commons/opensearch-ml-1.3.0.0-SNAPSHOT.zip