From be6a058a67e3976c806e73f5bc3412e83a1e03d5 Mon Sep 17 00:00:00 2001 From: Liam Thompson <32779855+leemthompo@users.noreply.github.com> Date: Fri, 10 Jan 2025 18:17:15 +0100 Subject: [PATCH] =?UTF-8?q?[DOCS]=C2=A0Improve/fix=20documentation=20on=20?= =?UTF-8?q?stored=20scripts=20(#119921)=20(#119971)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * Improve/fix documentation on stored scripts * Update docs/reference/scripting/using.asciidoc Co-authored-by: Liam Thompson <32779855+leemthompo@users.noreply.github.com> * Update docs/reference/scripting/using.asciidoc Co-authored-by: Liam Thompson <32779855+leemthompo@users.noreply.github.com> * Update docs/reference/transform/painless-examples.asciidoc Co-authored-by: Liam Thompson <32779855+leemthompo@users.noreply.github.com> --------- Co-authored-by: Liam Thompson <32779855+leemthompo@users.noreply.github.com> (cherry picked from commit 1e608dc2236d38c09a1ae9a7e3d3741d6aa9ae72) Co-authored-by: Valentin Crettaz --- docs/reference/scripting/using.asciidoc | 9 +- .../transform/painless-examples.asciidoc | 206 +++++++++--------- 2 files changed, 110 insertions(+), 105 deletions(-) diff --git a/docs/reference/scripting/using.asciidoc b/docs/reference/scripting/using.asciidoc index d4b4fd91e3e37..7dc1e38c62e78 100644 --- a/docs/reference/scripting/using.asciidoc +++ b/docs/reference/scripting/using.asciidoc @@ -201,8 +201,13 @@ when you're creating <>. [[script-stored-scripts]] === Store and retrieve scripts You can store and retrieve scripts from the cluster state using the -<>. Stored scripts reduce compilation -time and make searches faster. +<>. Stored scripts allow you to reference +shared scripts for operations like scoring, aggregating, filtering, and +reindexing. Instead of embedding scripts inline in each query, you can reference +these shared operations. + +Stored scripts can also reduce request payload size. 
Depending on script size +and request frequency, this can help lower latency and data transfer costs. NOTE: Unlike regular scripts, stored scripts require that you specify a script language using the `lang` parameter. diff --git a/docs/reference/transform/painless-examples.asciidoc b/docs/reference/transform/painless-examples.asciidoc index 4b0802c79a340..3b4dd9bdb631d 100644 --- a/docs/reference/transform/painless-examples.asciidoc +++ b/docs/reference/transform/painless-examples.asciidoc @@ -8,8 +8,8 @@ IMPORTANT: The examples that use the `scripted_metric` aggregation are not supported on {es} Serverless. -These examples demonstrate how to use Painless in {transforms}. You can learn -more about the Painless scripting language in the +These examples demonstrate how to use Painless in {transforms}. You can learn +more about the Painless scripting language in the {painless}/painless-guide.html[Painless guide]. * <> @@ -20,24 +20,24 @@ more about the Painless scripting language in the * <> * <> -[NOTE] +[NOTE] -- -* While the context of the following examples is the {transform} use case, -the Painless scripts in the snippets below can be used in other {es} search +* While the context of the following examples is the {transform} use case, +the Painless scripts in the snippets below can be used in other {es} search aggregations, too. -* All the following examples use scripts, {transforms} cannot deduce mappings of -output fields when the fields are created by a script. {transforms-cap} don't -create any mappings in the destination index for these fields, which means they -get dynamically mapped. Create the destination index prior to starting the +* All the following examples use scripts; {transforms} cannot deduce mappings of +output fields when the fields are created by a script. {transforms-cap} don't +create any mappings in the destination index for these fields, which means they +get dynamically mapped. 
Create the destination index prior to starting the {transform} in case you want explicit mappings. -- [[painless-top-hits]] == Getting top hits by using scripted metric aggregation -This snippet shows how to find the latest document, in other words the document -with the latest timestamp. From a technical perspective, it helps to achieve -the function of a <> by using +This snippet shows how to find the latest document, in other words the document +with the latest timestamp. From a technical perspective, it helps to achieve +the function of a <> by using scripted metric aggregation in a {transform}, which provides a metric output. IMPORTANT: This example uses a `scripted_metric` aggregation which is not supported on {es} Serverless. @@ -45,12 +45,12 @@ IMPORTANT: This example uses a `scripted_metric` aggregation which is not suppor [source,js] -------------------------------------------------- "aggregations": { - "latest_doc": { + "latest_doc": { "scripted_metric": { "init_script": "state.timestamp_latest = 0L; state.last_doc = ''", <1> "map_script": """ <2> - def current_date = doc['@timestamp'].getValue().toInstant().toEpochMilli(); - if (current_date > state.timestamp_latest) + def current_date = doc['@timestamp'].getValue().toInstant().toEpochMilli(); + if (current_date > state.timestamp_latest) {state.timestamp_latest = current_date; state.last_doc = new HashMap(params['_source']);} """, @@ -59,7 +59,7 @@ IMPORTANT: This example uses a `scripted_metric` aggregation which is not suppor def last_doc = ''; def timestamp_latest = 0L; for (s in states) {if (s.timestamp_latest > (timestamp_latest)) - {timestamp_latest = s.timestamp_latest; last_doc = s.last_doc;}} + {timestamp_latest = s.timestamp_latest; last_doc = s.last_doc;}} return last_doc """ } @@ -68,23 +68,23 @@ IMPORTANT: This example uses a `scripted_metric` aggregation which is not suppor -------------------------------------------------- // NOTCONSOLE -<1> The `init_script` creates a long type 
`timestamp_latest` and a string type +<1> The `init_script` creates a long type `timestamp_latest` and a string type `last_doc` in the `state` object. -<2> The `map_script` defines `current_date` based on the timestamp of the -document, then compares `current_date` with `state.timestamp_latest`, finally -returns `state.last_doc` from the shard. By using `new HashMap(...)` you copy -the source document, this is important whenever you want to pass the full source +<2> The `map_script` defines `current_date` based on the timestamp of the +document, then compares `current_date` with `state.timestamp_latest`, and finally +returns `state.last_doc` from the shard. By using `new HashMap(...)` you copy +the source document; this is important whenever you want to pass the full source object from one phase to the next. <3> The `combine_script` returns `state` from each shard. -<4> The `reduce_script` iterates through the value of `s.timestamp_latest` -returned by each shard and returns the document with the latest timestamp -(`last_doc`). In the response, the top hit (in other words, the `latest_doc`) is +<4> The `reduce_script` iterates through the value of `s.timestamp_latest` +returned by each shard and returns the document with the latest timestamp +(`last_doc`). In the response, the top hit (in other words, the `latest_doc`) is nested below the `latest_doc` field. -Check the <> for detailed +Check the <> for detailed explanation on the respective scripts. 
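The init/map/combine/reduce flow described in the callouts above can be simulated outside {es} with plain functions. The following Python sketch is illustrative only -- it is not Painless, not part of this patch, and the shard lists and field values are hypothetical -- but it mirrors how each shard tracks its own latest document and how the reduce step picks the overall winner:

```python
# Illustrative simulation of the scripted_metric phases above (not Elasticsearch code).
# Each "shard" runs init + map over its documents, combine returns its state,
# and reduce picks the document with the latest timestamp across all shards.

def init_script():
    # state.timestamp_latest = 0L; state.last_doc = ''
    return {"timestamp_latest": 0, "last_doc": ""}

def map_script(state, doc):
    # Compare each document's timestamp with the shard's current latest.
    current_date = doc["@timestamp"]  # epoch milliseconds
    if current_date > state["timestamp_latest"]:
        state["timestamp_latest"] = current_date
        state["last_doc"] = dict(doc)  # copy, like new HashMap(params['_source'])

def combine_script(state):
    # Each shard hands back its whole state.
    return state

def reduce_script(states):
    # Keep only the shard candidate with the latest timestamp.
    last_doc, timestamp_latest = "", 0
    for s in states:
        if s["timestamp_latest"] > timestamp_latest:
            timestamp_latest = s["timestamp_latest"]
            last_doc = s["last_doc"]
    return last_doc

# Two hypothetical "shards" worth of documents:
shard1 = [{"@timestamp": 1000, "value": "a"}, {"@timestamp": 3000, "value": "b"}]
shard2 = [{"@timestamp": 2000, "value": "c"}]

states = []
for shard in (shard1, shard2):
    state = init_script()
    for doc in shard:
        map_script(state, doc)
    states.append(combine_script(state))

latest = reduce_script(states)  # the document with @timestamp 3000
```

Because each shard forwards only its single best candidate, the cross-shard reduce step stays cheap no matter how many documents each shard scanned.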
-You can retrieve the last value in a similar way: +You can retrieve the last value in a similar way: [source,js] -------------------------------------------------- @@ -93,17 +93,17 @@ You can retrieve the last value in a similar way: "latest_value": { "scripted_metric": { "init_script": "state.timestamp_latest = 0L; state.last_value = ''", "map_script": """ - def current_date = doc['@timestamp'].getValue().toInstant().toEpochMilli(); - if (current_date > state.timestamp_latest) + def current_date = doc['@timestamp'].getValue().toInstant().toEpochMilli(); + if (current_date > state.timestamp_latest) {state.timestamp_latest = current_date; state.last_value = params['_source']['value'];} """, "combine_script": "return state", "reduce_script": """ def last_value = ''; - def timestamp_latest = 0L; - for (s in states) {if (s.timestamp_latest > (timestamp_latest)) - {timestamp_latest = s.timestamp_latest; last_value = s.last_value;}} + def timestamp_latest = 0L; + for (s in states) {if (s.timestamp_latest > (timestamp_latest)) + {timestamp_latest = s.timestamp_latest; last_value = s.last_value;}} return last_value """ } } } -------------------------------------------------- // NOTCONSOLE @@ -117,10 +117,10 @@ You can retrieve the last value in a similar way: [[top-hits-stored-scripts]] === Getting top hits by using stored scripts -You can also use the power of -{ref}/create-stored-script-api.html[stored scripts] to get the latest value. -Stored scripts reduce compilation time, make searches faster, and are -updatable. +You can also use the power of +{ref}/create-stored-script-api.html[stored scripts] to get the latest value. +Stored scripts are updatable, enable collaboration, and avoid duplication across +queries. 1. Create the stored scripts: + @@ -202,7 +202,7 @@ POST _scripts/last-value-reduce } -------------------------------------------------- // NOTCONSOLE -<1> The parameter `field_with_last_value` can be set any field that you want the +<1> The parameter `field_with_last_value` can be set to any field that you want the latest value for. 
-- @@ -210,8 +210,8 @@ latest value for. [[painless-time-features]] == Getting time features by using aggregations -This snippet shows how to extract time based features by using Painless in a -{transform}. The snippet uses an index where `@timestamp` is defined as a `date` +This snippet shows how to extract time based features by using Painless in a +{transform}. The snippet uses an index where `@timestamp` is defined as a `date` type field. [source,js] @@ -225,11 +225,11 @@ type field. return date.getHour(); <4> """ } - } + } }, "avg_month_of_year": { <5> "avg":{ - "script": { <6> + "script": { <6> "source": """ ZonedDateTime date = doc['@timestamp'].value; <7> return date.getMonthValue(); <8> @@ -255,9 +255,9 @@ type field. [[painless-group-by]] == Using Painless in `group_by` -It is possible to base the `group_by` property of a {transform} on the output of -a script. The following example uses the {kib} sample web logs dataset. The goal -here is to make the {transform} output easier to understand through normalizing +It is possible to base the `group_by` property of a {transform} on the output of +a script. The following example uses the {kib} sample web logs dataset. The goal +here is to make the {transform} output easier to understand through normalizing the value of the fields that the data is grouped by. 
[source,console] -------------------------------------------------- POST _transform/_preview { "source": { <1> "index": [ "kibana_sample_data_logs" ] }, "pivot": { "group_by": { @@ -274,12 +274,12 @@ POST _transform/_preview "agent": { "terms": { "script": { <2> - "source": """String agent = doc['agent.keyword'].value; - if (agent.contains("MSIE")) { + "source": """String agent = doc['agent.keyword'].value; + if (agent.contains("MSIE")) { return "internet explorer"; - } else if (agent.contains("AppleWebKit")) { - return "safari"; - } else if (agent.contains('Firefox')) { + } else if (agent.contains("AppleWebKit")) { + return "safari"; + } else if (agent.contains('Firefox')) { return "firefox"; } else { return agent }""", "lang": "painless" } } } }, @@ -314,18 +314,18 @@ POST _transform/_preview "dest": { <4> "index": "pivot_logs" } -} +} -------------------------------------------------- // TEST[skip:setup kibana sample data] <1> Specifies the source index or indices. -<2> The script defines an `agent` string based on the `agent` field of the -documents, then iterates through the values. If an `agent` field contains -"MSIE", than the script returns "Internet Explorer". If it contains -`AppleWebKit`, it returns "safari". It returns "firefox" if the field value -contains "Firefox". Finally, in every other case, the value of the field is +<2> The script defines an `agent` string based on the `agent` field of the +documents, then iterates through the values. If an `agent` field contains +"MSIE", then the script returns "internet explorer". If it contains +`AppleWebKit`, it returns "safari". It returns "firefox" if the field value +contains "Firefox". Finally, in every other case, the value of the field is returned. -<3> The aggregations object contains filters that narrow down the results to +<3> The aggregations object contains filters that narrow down the results to documents that contains `200`, `404`, or `503` values in the `response` field. <4> Specifies the destination index of the {transform}. 
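The branching in the `group_by` script above maps raw user-agent strings onto a small set of labels. As an illustrative sketch -- plain Python with hypothetical example strings, not Painless and not part of this patch -- the same normalization logic looks like this:

```python
# Illustrative Python equivalent of the Painless group_by script above
# (the real script runs inside Elasticsearch; this only mirrors its branching).

def normalize_agent(agent: str) -> str:
    if "MSIE" in agent:
        return "internet explorer"
    elif "AppleWebKit" in agent:
        return "safari"
    elif "Firefox" in agent:
        return "firefox"
    else:
        return agent  # every other value passes through unchanged

# Hypothetical user-agent strings for demonstration:
examples = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.24 Safari/534.24",
    "Mozilla/5.0 (X11; Linux x86_64; rv:6.0a1) Gecko/20110421 Firefox/6.0a1",
    "curl/7.88.1",
]
labels = [normalize_agent(a) for a in examples]
# labels == ["internet explorer", "safari", "firefox", "curl/7.88.1"]
```

Grouping on the normalized label rather than the raw string is what collapses thousands of distinct user-agent values into a handful of readable buckets.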
@@ -374,14 +374,14 @@ The API returns the following result: -------------------------------------------------- // NOTCONSOLE -You can see that the `agent` values are simplified so it is easier to interpret -them. The table below shows how normalization modifies the output of the +You can see that the `agent` values are simplified so it is easier to interpret +them. The table below shows how normalization modifies the output of the {transform} in our example compared to the non-normalized values. [width="50%"] |=== -| Non-normalized `agent` value | Normalized `agent` value +| Non-normalized `agent` value | Normalized `agent` value | "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)" | "internet explorer" | "Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.50 Safari/534.24" | "safari" @@ -393,9 +393,9 @@ them. The table below shows how normalization modifies the output of the [[painless-bucket-script]] == Getting duration by using bucket script -This example shows you how to get the duration of a session by client IP from a -data log by using -<>. +This example shows you how to get the duration of a session by client IP from a +data log by using +<>. The example uses the {kib} sample web logs dataset. [source,console] @@ -440,22 +440,22 @@ PUT _transform/data_log // TEST[skip:setup kibana sample data] <1> To define the length of the sessions, we use a bucket script. -<2> The bucket path is a map of script variables and their associated path to -the buckets you want to use for the variable. In this particular case, `min` and +<2> The bucket path is a map of script variables and their associated path to +the buckets you want to use for the variable. In this particular case, `min` and `max` are variables mapped to `time_frame.gte.value` and `time_frame.lte.value`. 
-<3> Finally, the script substracts the start date of the session from the end +<3> Finally, the script subtracts the start date of the session from the end date which results in the duration of the session. [[painless-count-http]] == Counting HTTP responses by using scripted metric aggregation -You can count the different HTTP response types in a web log data set by using -scripted metric aggregation as part of the {transform}. You can achieve a -similar function with filter aggregations, check the -{ref}/transform-examples.html#example-clientips[Finding suspicious client IPs] +You can count the different HTTP response types in a web log data set by using +scripted metric aggregation as part of the {transform}. You can achieve a +similar function with filter aggregations; check the +{ref}/transform-examples.html#example-clientips[Finding suspicious client IPs] example for details. -The example below assumes that the HTTP response codes are stored as keywords in +The example below assumes that the HTTP response codes are stored as keywords in the `response` field of the documents. IMPORTANT: This example uses a `scripted_metric` aggregation which is not supported on {es} Serverless. @@ -488,32 +488,32 @@ IMPORTANT: This example uses a `scripted_metric` aggregation which is not suppor """ } }, - ... + ... } -------------------------------------------------- // NOTCONSOLE <1> The `aggregations` object of the {transform} that contains all aggregations. <2> Object of the `scripted_metric` aggregation. -<3> This `scripted_metric` performs a distributed operation on the web log data +<3> This `scripted_metric` performs a distributed operation on the web log data to count specific types of HTTP responses (error, success, and other). -<4> The `init_script` creates a `responses` array in the `state` object with +<4> The `init_script` creates a `responses` array in the `state` object with three properties (`error`, `success`, `other`) with long data type. 
-<5> The `map_script` defines `code` based on the `response.keyword` value of the -document, then it counts the errors, successes, and other responses based on the +<5> The `map_script` defines `code` based on the `response.keyword` value of the +document, then it counts the errors, successes, and other responses based on the first digit of the responses. <6> The `combine_script` returns `state.responses` from each shard. -<7> The `reduce_script` creates a `counts` array with the `error`, `success`, -and `other` properties, then iterates through the value of `responses` returned -by each shard and assigns the different response types to the appropriate -properties of the `counts` object; error responses to the error counts, success -responses to the success counts, and other responses to the other counts. +<7> The `reduce_script` creates a `counts` array with the `error`, `success`, +and `other` properties, then iterates through the value of `responses` returned +by each shard and assigns the different response types to the appropriate +properties of the `counts` object; error responses to the error counts, success +responses to the success counts, and other responses to the other counts. Finally, returns the `counts` array with the response counts. [[painless-compare]] == Comparing indices by using scripted metric aggregations -This example shows how to compare the content of two indices by a {transform} +This example shows how to compare the content of two indices by a {transform} that uses a scripted metric aggregation. IMPORTANT: This example uses a `scripted_metric` aggregation which is not supported on {es} Serverless. @@ -570,19 +570,19 @@ POST _transform/_preview <2> The `dest` index contains the results of the comparison. <3> The `group_by` field needs to be a unique identifier for each document. <4> Object of the `scripted_metric` aggregation. -<5> The `map_script` defines `doc` in the state object. 
By using -`new HashMap(...)` you copy the source document, this is important whenever you +<5> The `map_script` defines `doc` in the state object. By using +`new HashMap(...)` you copy the source document; this is important whenever you want to pass the full source object from one phase to the next. <6> The `combine_script` returns `state` from each shard. -<7> The `reduce_script` checks if the size of the indices are equal. If they are -not equal, than it reports back a `count_mismatch`. Then it iterates through all -the values of the two indices and compare them. If the values are equal, then it +<7> The `reduce_script` checks whether the sizes of the indices are equal. If they are +not equal, then it reports back a `count_mismatch`. Then it iterates through all +the values of the two indices and compares them. If the values are equal, then it returns a `match`, otherwise returns a `mismatch`. [[painless-web-session]] == Getting web session details by using scripted metric aggregation -This example shows how to derive multiple features from a single transaction. +This example shows how to derive multiple features from a single transaction. Let's take a look on the example source document from the data: .Source document [%collapsible] ===== @@ -628,8 +628,8 @@ Let's take a look on the example source document from the data: ===== -By using the `sessionid` as a group-by field, you are able to enumerate events -through the session and get more details of the session by using scripted metric +By using the `sessionid` as a group-by field, you are able to enumerate events +through the session and get more details of the session by using scripted metric aggregation. IMPORTANT: This example uses a `scripted_metric` aggregation which is not supported on {es} Serverless. 
@@ -650,7 +650,7 @@ POST _transform/_preview } }, "aggregations": { <2> - "distinct_paths": { + "distinct_paths": { "cardinality": { "field": "apache.access.path" } @@ -665,21 +665,21 @@ POST _transform/_preview "init_script": "state.docs = []", <3> "map_script": """ <4> Map span = [ - '@timestamp':doc['@timestamp'].value, + '@timestamp':doc['@timestamp'].value, 'url':doc['apache.access.url'].value, 'referrer':doc['apache.access.referrer'].value - ]; + ]; state.docs.add(span) """, "combine_script": "return state.docs;", <5> "reduce_script": """ <6> - def all_docs = []; - for (s in states) { - for (span in s) { - all_docs.add(span); + def all_docs = []; + for (s in states) { + for (span in s) { + all_docs.add(span); } } - all_docs.sort((HashMap o1, HashMap o2)->o1['@timestamp'].toEpochMilli().compareTo(o2['@timestamp'].toEpochMilli())); + all_docs.sort((HashMap o1, HashMap o2)->o1['@timestamp'].toEpochMilli().compareTo(o2['@timestamp'].toEpochMilli())); def size = all_docs.size(); def min_time = all_docs[0]['@timestamp']; def max_time = all_docs[size-1]['@timestamp']; @@ -705,17 +705,17 @@ POST _transform/_preview // NOTCONSOLE <1> The data is grouped by `sessionid`. -<2> The aggregations counts the number of paths and enumerate the viewed pages +<2> The aggregations count the number of paths and enumerate the viewed pages during the session. <3> The `init_script` creates an array type `doc` in the `state` object. -<4> The `map_script` defines a `span` array with a timestamp, a URL, and a -referrer value which are based on the corresponding values of the document, then +<4> The `map_script` defines a `span` array with a timestamp, a URL, and a +referrer value which are based on the corresponding values of the document, then adds the value of the `span` array to the `doc` object. <5> The `combine_script` returns `state.docs` from each shard. 
-<6> The `reduce_script` defines various objects like `min_time`, `max_time`, and -`duration` based on the document fields, then declares a `ret` object, and -copies the source document by using `new HashMap ()`. Next, the script defines -`first_time`, `last_time`, `duration` and other fields inside the `ret` object +<6> The `reduce_script` defines various objects like `min_time`, `max_time`, and +`duration` based on the document fields, then declares a `ret` object, and +copies the source document by using `new HashMap()`. Next, the script defines +`first_time`, `last_time`, `duration`, and other fields inside the `ret` object based on the corresponding object defined earlier, finally returns `ret`. The API call results in a similar response: