diff --git a/docs/reference/transform/api-quickref.asciidoc b/docs/reference/transform/api-quickref.asciidoc new file mode 100644 index 0000000000000..7750331a0273d --- /dev/null +++ b/docs/reference/transform/api-quickref.asciidoc @@ -0,0 +1,21 @@ +[role="xpack"] +[[df-api-quickref]] +== API quick reference + +All {dataframe-transform} endpoints have the following base: + +[source,js] +---- +/_data_frame/transforms/ +---- +// NOTCONSOLE + +* {ref}/put-data-frame-transform.html[Create {dataframe-transforms}] +* {ref}/delete-data-frame-transform.html[Delete {dataframe-transforms}] +* {ref}/get-data-frame-transform.html[Get {dataframe-transforms}] +* {ref}/get-data-frame-transform-stats.html[Get {dataframe-transforms} statistics] +* {ref}/preview-data-frame-transform.html[Preview {dataframe-transforms}] +* {ref}/start-data-frame-transform.html[Start {dataframe-transforms}] +* {ref}/stop-data-frame-transform.html[Stop {dataframe-transforms}] + +For the full list, see {ref}/data-frame-apis.html[{dataframe-transform-cap} APIs]. diff --git a/docs/reference/transform/checkpoints.asciidoc b/docs/reference/transform/checkpoints.asciidoc new file mode 100644 index 0000000000000..808ce071ede7d --- /dev/null +++ b/docs/reference/transform/checkpoints.asciidoc @@ -0,0 +1,88 @@ +[role="xpack"] +[[ml-transform-checkpoints]] +== How {dataframe-transform} checkpoints work +++++ +How checkpoints work +++++ + +beta[] + +Each time a {dataframe-transform} examines the source indices and creates or +updates the destination index, it generates a _checkpoint_. + +If your {dataframe-transform} runs only once, there is logically only one +checkpoint. If your {dataframe-transform} runs continuously, however, it creates +checkpoints as it ingests and transforms new source data. + +To create a checkpoint, the {cdataframe-transform}: + +. Checks for changes to source indices. ++ +Using a simple periodic timer, the {dataframe-transform} checks for changes to +the source indices. This check is done based on the interval defined in the +transform's `frequency` property. ++ +If the source indices remain unchanged or if a checkpoint is already in progress +then it waits for the next timer. + +. Identifies which entities have changed. ++ +The {dataframe-transform} searches to see which entities have changed since the +last time it checked. The transform's `sync` configuration object identifies a +time field in the source indices. The transform uses the values in that field to +synchronize the source and destination indices. + +. Updates the destination index (the {dataframe}) with the changed entities. ++ +-- +The {dataframe-transform} applies changes related to either new or changed +entities to the destination index. The set of changed entities is paginated. For +each page, the {dataframe-transform} performs a composite aggregation using a +`terms` query. After all the pages of changes have been applied, the checkpoint +is complete. +-- + +This checkpoint process involves both search and indexing activity on the +cluster. We have attempted to favor control over performance while developing +{dataframe-transforms}. We decided it was preferable for the +{dataframe-transform} to take longer to complete, rather than to finish quickly +and take precedence in resource consumption. That being said, the cluster still +requires enough resources to support both the composite aggregation search and +the indexing of its results. 
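+
+For reference, the check interval and the time field described earlier in this
+section correspond to the `frequency` and `sync` properties of the transform
+configuration. The following is a minimal sketch of a {cdataframe-transform}
+that checks for changes once a minute; the transform ID, the index names, and
+the `timestamp` field are placeholders:
+
+[source,js]
+----
+PUT _data_frame/transforms/example-continuous-transform
+{
+  "source": { "index": "example-source-index" },
+  "dest": { "index": "example-dest-index" },
+  "frequency": "1m",
+  "sync": {
+    "time": {
+      "field": "timestamp",
+      "delay": "60s"
+    }
+  },
+  "pivot": {
+    "group_by": {
+      "user": { "terms": { "field": "user.keyword" } }
+    },
+    "aggregations": {
+      "event_count": { "value_count": { "field": "timestamp" } }
+    }
+  }
+}
+----
+// NOTCONSOLE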
+
+TIP: If the cluster experiences unacceptable performance degradation due to the
+{dataframe-transform}, stop the transform. Consider whether you can apply a
+source query to the {dataframe-transform} to reduce the scope of data it
+processes. Also consider whether the cluster has sufficient resources in place
+to support both the composite aggregation search and the indexing of its
+results.
+
+[discrete]
+[[ml-transform-checkpoint-errors]]
+==== Error handling
+
+Failures in {dataframe-transforms} tend to be related to searching or indexing.
+To increase the resiliency of {dataframe-transforms}, the cursor positions of
+the aggregated search and the changed entities search are tracked in memory and
+persisted periodically.
+
+Checkpoint failures can be categorized as follows:
+
+* Temporary failures: The checkpoint is retried. If 10 consecutive failures
+occur, the {dataframe-transform} has a failed status. For example, this
+situation might occur when there are shard failures and queries return only
+partial results.
+* Irrecoverable failures: The {dataframe-transform} immediately fails. For
+example, this situation occurs when the source index is not found.
+* Adjustment failures: The {dataframe-transform} retries with adjusted settings.
+For example, if parent circuit breaker memory errors occur during the
+composite aggregation, the transform receives partial results. The aggregated
+search is retried with a smaller number of buckets. This retry is performed at
+the interval defined in the transform's `frequency` property. If the search
+is retried to the point where it reaches a minimal number of buckets, an
+irrecoverable failure occurs.
+
+If the node running the {dataframe-transform} fails, the transform restarts
+from the most recent persisted cursor position. This recovery process might
+repeat some of the work the transform had already done, but it ensures data
+consistency.
diff --git a/docs/reference/transform/dataframe-examples.asciidoc b/docs/reference/transform/dataframe-examples.asciidoc
new file mode 100644
index 0000000000000..31abb4787a9b5
--- /dev/null
+++ b/docs/reference/transform/dataframe-examples.asciidoc
@@ -0,0 +1,335 @@
+[role="xpack"]
+[testenv="basic"]
+[[dataframe-examples]]
+== {dataframe-transform-cap} examples
+++++
+Examples
+++++
+
+beta[]
+
+These examples demonstrate how to use {dataframe-transforms} to derive useful
+insights from your data. All the examples use one of the
+{kibana-ref}/add-sample-data.html[{kib} sample datasets]. For a more detailed,
+step-by-step example, see
+<<ecommerce-dataframes>>.
+
+* <<ecommerce-dataframes>>
+* <<example-best-customers>>
+* <<example-airline>>
+* <<example-clientips>>
+
+include::ecommerce-example.asciidoc[]
+
+[[example-best-customers]]
+=== Finding your best customers
+
+In this example, we use the eCommerce orders sample dataset to find the customers
+who spent the most in our hypothetical webshop. Let's transform the data such
+that the destination index contains the number of orders, the total price of
+the orders, the average price per order, the average number of unique products
+per order, and the total number of unique products for each customer.
+ +[source,console] +---------------------------------- +POST _data_frame/transforms/_preview +{ + "source": { + "index": "kibana_sample_data_ecommerce" + }, + "dest" : { <1> + "index" : "sample_ecommerce_orders_by_customer" + }, + "pivot": { + "group_by": { <2> + "user": { "terms": { "field": "user" }}, + "customer_id": { "terms": { "field": "customer_id" }} + }, + "aggregations": { + "order_count": { "value_count": { "field": "order_id" }}, + "total_order_amt": { "sum": { "field": "taxful_total_price" }}, + "avg_amt_per_order": { "avg": { "field": "taxful_total_price" }}, + "avg_unique_products_per_order": { "avg": { "field": "total_unique_products" }}, + "total_unique_products": { "cardinality": { "field": "products.product_id" }} + } + } +} +---------------------------------- +// TEST[skip:setup kibana sample data] + +<1> This is the destination index for the {dataframe}. It is ignored by +`_preview`. +<2> Two `group_by` fields have been selected. This means the {dataframe} will +contain a unique row per `user` and `customer_id` combination. Within this +dataset both these fields are unique. By including both in the {dataframe} it +gives more context to the final results. + +NOTE: In the example above, condensed JSON formatting has been used for easier +readability of the pivot object. + +The preview {dataframe-transforms} API enables you to see the layout of the +{dataframe} in advance, populated with some sample values. For example: + +[source,js] +---------------------------------- +{ + "preview" : [ + { + "total_order_amt" : 3946.9765625, + "order_count" : 59.0, + "total_unique_products" : 116.0, + "avg_unique_products_per_order" : 2.0, + "customer_id" : "10", + "user" : "recip", + "avg_amt_per_order" : 66.89790783898304 + }, + ... + ] + } +---------------------------------- +// NOTCONSOLE + +This {dataframe} makes it easier to answer questions such as: + +* Which customers spend the most? + +* Which customers spend the most per order? + +* Which customers order most often? + +* Which customers ordered the least number of different products? + +It's possible to answer these questions using aggregations alone, however +{dataframes} allow us to persist this data as a customer centric index. This +enables us to analyze data at scale and gives more flexibility to explore and +navigate data from a customer centric perspective. In some cases, it can even +make creating visualizations much simpler. + +[[example-airline]] +=== Finding air carriers with the most delays + +In this example, we use the Flights sample dataset to find out which air carrier +had the most delays. First, we filter the source data such that it excludes all +the cancelled flights by using a query filter. Then we transform the data to +contain the distinct number of flights, the sum of delayed minutes, and the sum +of the flight minutes by air carrier. Finally, we use a +{ref}/search-aggregations-pipeline-bucket-script-aggregation.html[`bucket_script`] +to determine what percentage of the flight time was actually delay. 
+ +[source,console] +---------------------------------- +POST _data_frame/transforms/_preview +{ + "source": { + "index": "kibana_sample_data_flights", + "query": { <1> + "bool": { + "filter": [ + { "term": { "Cancelled": false } } + ] + } + } + }, + "dest" : { <2> + "index" : "sample_flight_delays_by_carrier" + }, + "pivot": { + "group_by": { <3> + "carrier": { "terms": { "field": "Carrier" }} + }, + "aggregations": { + "flights_count": { "value_count": { "field": "FlightNum" }}, + "delay_mins_total": { "sum": { "field": "FlightDelayMin" }}, + "flight_mins_total": { "sum": { "field": "FlightTimeMin" }}, + "delay_time_percentage": { <4> + "bucket_script": { + "buckets_path": { + "delay_time": "delay_mins_total.value", + "flight_time": "flight_mins_total.value" + }, + "script": "(params.delay_time / params.flight_time) * 100" + } + } + } + } +} +---------------------------------- +// TEST[skip:setup kibana sample data] + +<1> Filter the source data to select only flights that were not cancelled. +<2> This is the destination index for the {dataframe}. It is ignored by +`_preview`. +<3> The data is grouped by the `Carrier` field which contains the airline name. +<4> This `bucket_script` performs calculations on the results that are returned +by the aggregation. In this particular example, it calculates what percentage of +travel time was taken up by delays. + +The preview shows you that the new index would contain data like this for each +carrier: + +[source,js] +---------------------------------- +{ + "preview" : [ + { + "carrier" : "ES-Air", + "flights_count" : 2802.0, + "flight_mins_total" : 1436927.5130677223, + "delay_time_percentage" : 9.335543983955839, + "delay_mins_total" : 134145.0 + }, + ... + ] +} +---------------------------------- +// NOTCONSOLE + +This {dataframe} makes it easier to answer questions such as: + +* Which air carrier has the most delays as a percentage of flight time? + +NOTE: This data is fictional and does not reflect actual delays +or flight stats for any of the featured destination or origin airports. + + +[[example-clientips]] +=== Finding suspicious client IPs by using scripted metrics + +With {dataframe-transforms}, you can use +{ref}/search-aggregations-metrics-scripted-metric-aggregation.html[scripted +metric aggregations] on your data. These aggregations are flexible and make +it possible to perform very complex processing. Let's use scripted metrics to +identify suspicious client IPs in the web log sample dataset. + +We transform the data such that the new index contains the sum of bytes and the +number of distinct URLs, agents, incoming requests by location, and geographic +destinations for each client IP. We also use a scripted field to count the +specific types of HTTP responses that each client IP receives. Ultimately, the +example below transforms web log data into an entity centric index where the +entity is `clientip`. 
+ +[source,console] +---------------------------------- +POST _data_frame/transforms/_preview +{ + "source": { + "index": "kibana_sample_data_logs", + "query": { <1> + "range" : { + "timestamp" : { + "gte" : "now-30d/d" + } + } + } + }, + "dest" : { <2> + "index" : "sample_weblogs_by_clientip" + }, + "pivot": { + "group_by": { <3> + "clientip": { "terms": { "field": "clientip" } } + }, + "aggregations": { + "url_dc": { "cardinality": { "field": "url.keyword" }}, + "bytes_sum": { "sum": { "field": "bytes" }}, + "geo.src_dc": { "cardinality": { "field": "geo.src" }}, + "agent_dc": { "cardinality": { "field": "agent.keyword" }}, + "geo.dest_dc": { "cardinality": { "field": "geo.dest" }}, + "responses.total": { "value_count": { "field": "timestamp" }}, + "responses.counts": { <4> + "scripted_metric": { + "init_script": "state.responses = ['error':0L,'success':0L,'other':0L]", + "map_script": """ + def code = doc['response.keyword'].value; + if (code.startsWith('5') || code.startsWith('4')) { + state.responses.error += 1 ; + } else if(code.startsWith('2')) { + state.responses.success += 1; + } else { + state.responses.other += 1; + } + """, + "combine_script": "state.responses", + "reduce_script": """ + def counts = ['error': 0L, 'success': 0L, 'other': 0L]; + for (responses in states) { + counts.error += responses['error']; + counts.success += responses['success']; + counts.other += responses['other']; + } + return counts; + """ + } + }, + "timestamp.min": { "min": { "field": "timestamp" }}, + "timestamp.max": { "max": { "field": "timestamp" }}, + "timestamp.duration_ms": { <5> + "bucket_script": { + "buckets_path": { + "min_time": "timestamp.min.value", + "max_time": "timestamp.max.value" + }, + "script": "(params.max_time - params.min_time)" + } + } + } + } +} +---------------------------------- +// TEST[skip:setup kibana sample data] + +<1> This range query limits the transform to documents that are within the last +30 days at the point in time the {dataframe-transform} checkpoint is processed. +For batch {dataframes} this occurs once. +<2> This is the destination index for the {dataframe}. It is ignored by +`_preview`. +<3> The data is grouped by the `clientip` field. +<4> This `scripted_metric` performs a distributed operation on the web log data +to count specific types of HTTP responses (error, success, and other). +<5> This `bucket_script` calculates the duration of the `clientip` access based +on the results of the aggregation. + +The preview shows you that the new index would contain data like this for each +client IP: + +[source,js] +---------------------------------- +{ + "preview" : [ + { + "geo" : { + "src_dc" : 12.0, + "dest_dc" : 9.0 + }, + "clientip" : "0.72.176.46", + "agent_dc" : 3.0, + "responses" : { + "total" : 14.0, + "counts" : { + "other" : 0, + "success" : 14, + "error" : 0 + } + }, + "bytes_sum" : 74808.0, + "timestamp" : { + "duration_ms" : 4.919943239E9, + "min" : "2019-06-17T07:51:57.333Z", + "max" : "2019-08-13T06:31:00.572Z" + }, + "url_dc" : 11.0 + }, + ... + } +---------------------------------- +// NOTCONSOLE + +This {dataframe} makes it easier to answer questions such as: + +* Which client IPs are transferring the most amounts of data? + +* Which client IPs are interacting with a high number of different URLs? + +* Which client IPs have high error rates? + +* Which client IPs are interacting with a high number of destination countries? 
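+
+The three examples above only preview the transforms, so no destination index is
+created. To materialize one of these {dataframes}, create and then start a
+transform with the same `source` and `pivot`. The following is a sketch based on
+the first example; the transform ID is arbitrary and only two of its
+aggregations are shown:
+
+[source,js]
+----------------------------------
+PUT _data_frame/transforms/ecommerce-best-customers
+{
+  "source": { "index": "kibana_sample_data_ecommerce" },
+  "dest": { "index": "sample_ecommerce_orders_by_customer" },
+  "pivot": {
+    "group_by": {
+      "user": { "terms": { "field": "user" }},
+      "customer_id": { "terms": { "field": "customer_id" }}
+    },
+    "aggregations": {
+      "order_count": { "value_count": { "field": "order_id" }},
+      "total_order_amt": { "sum": { "field": "taxful_total_price" }}
+    }
+  }
+}
+
+POST _data_frame/transforms/ecommerce-best-customers/_start
+----------------------------------
+// NOTCONSOLE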
\ No newline at end of file diff --git a/docs/reference/transform/ecommerce-example.asciidoc b/docs/reference/transform/ecommerce-example.asciidoc new file mode 100644 index 0000000000000..ce4193aa66584 --- /dev/null +++ b/docs/reference/transform/ecommerce-example.asciidoc @@ -0,0 +1,262 @@ +[role="xpack"] +[testenv="basic"] +[[ecommerce-dataframes]] +=== Transforming the eCommerce sample data + +beta[] + +<> enable you to retrieve information +from an {es} index, transform it, and store it in another index. Let's use the +{kibana-ref}/add-sample-data.html[{kib} sample data] to demonstrate how you can +pivot and summarize your data with {dataframe-transforms}. + + +. If the {es} {security-features} are enabled, obtain a user ID with sufficient +privileges to complete these steps. ++ +-- +You need `manage_data_frame_transforms` cluster privileges to preview and create +{dataframe-transforms}. Members of the built-in `data_frame_transforms_admin` +role have these privileges. + +You also need `read` and `view_index_metadata` index privileges on the source +index and `read`, `create_index`, and `index` privileges on the destination +index. + +For more information, see <> and <>. +-- + +. Choose your _source index_. ++ +-- +In this example, we'll use the eCommerce orders sample data. If you're not +already familiar with the `kibana_sample_data_ecommerce` index, use the +*Revenue* dashboard in {kib} to explore the data. Consider what insights you +might want to derive from this eCommerce data. +-- + +. Play with various options for grouping and aggregating the data. ++ +-- +For example, you might want to group the data by product ID and calculate the +total number of sales for each product and its average price. Alternatively, you +might want to look at the behavior of individual customers and calculate how +much each customer spent in total and how many different categories of products +they purchased. Or you might want to take the currencies or geographies into +consideration. What are the most interesting ways you can transform and +interpret this data? + +_Pivoting_ your data involves using at least one field to group it and applying +at least one aggregation. You can preview what the transformed data will look +like, so go ahead and play with it! + +For example, go to *Machine Learning* > *Data Frames* in {kib} and use the +wizard to create a {dataframe-transform}: + +[role="screenshot"] +image::images/ecommerce-pivot1.jpg["Creating a simple {dataframe-transform} in {kib}"] + +In this case, we grouped the data by customer ID and calculated the sum of +products each customer purchased. + +Let's add some more aggregations to learn more about our customers' orders. For +example, let's calculate the total sum of their purchases, the maximum number of +products that they purchased in a single order, and their total number of orders. +We'll accomplish this by using the +{ref}/search-aggregations-metrics-sum-aggregation.html[`sum` aggregation] on the +`taxless_total_price` field, the +{ref}/search-aggregations-metrics-max-aggregation.html[`max` aggregation] on the +`total_quantity` field, and the +{ref}/search-aggregations-metrics-cardinality-aggregation.html[`cardinality` aggregation] +on the `order_id` field: + +[role="screenshot"] +image::images/ecommerce-pivot2.jpg["Adding multiple aggregations to a {dataframe-transform} in {kib}"] + +TIP: If you're interested in a subset of the data, you can optionally include a +{ref}/search-request-body.html#request-body-search-query[query] element. 
In this +example, we've filtered the data so that we're only looking at orders with a +`currency` of `EUR`. Alternatively, we could group the data by that field too. +If you want to use more complex queries, you can create your {dataframe} from a +{kibana-ref}/save-open-search.html[saved search]. + +If you prefer, you can use the +{ref}/preview-data-frame-transform.html[preview {dataframe-transforms} API]: + +[source,js] +-------------------------------------------------- +POST _data_frame/transforms/_preview +{ + "source": { + "index": "kibana_sample_data_ecommerce", + "query": { + "bool": { + "filter": { + "term": {"currency": "EUR"} + } + } + } + }, + "pivot": { + "group_by": { + "customer_id": { + "terms": { + "field": "customer_id" + } + } + }, + "aggregations": { + "total_quantity.sum": { + "sum": { + "field": "total_quantity" + } + }, + "taxless_total_price.sum": { + "sum": { + "field": "taxless_total_price" + } + }, + "total_quantity.max": { + "max": { + "field": "total_quantity" + } + }, + "order_id.cardinality": { + "cardinality": { + "field": "order_id" + } + } + } + } +} +-------------------------------------------------- +// CONSOLE +// TEST[skip:set up sample data] +-- + +. When you are satisfied with what you see in the preview, create the +{dataframe-transform}. ++ +-- +.. Supply a job ID and the name of the target (or _destination_) index. + +.. Decide whether you want the {dataframe-transform} to run once or continuously. +-- ++ +-- +Since this sample data index is unchanging, let's use the default behavior and +just run the {dataframe-transform} once. + +[role="screenshot"] +image::images/ecommerce-batch.jpg["Specifying the {dataframe-transform} options in {kib}"] + +If you want to try it out, however, go ahead and click on *Continuous mode*. +You must choose a field that the {dataframe-transform} can use to check which +entities have changed. In general, it's a good idea to use the ingest timestamp +field. In this example, however, you can use the `order_date` field. + +If you prefer, you can use the +{ref}/put-data-frame-transform.html[create {dataframe-transforms} API]. For +example: + +[source,js] +-------------------------------------------------- +PUT _data_frame/transforms/ecommerce-customer-transform +{ + "source": { + "index": [ + "kibana_sample_data_ecommerce" + ], + "query": { + "bool": { + "filter": { + "term": { + "currency": "EUR" + } + } + } + } + }, + "pivot": { + "group_by": { + "customer_id": { + "terms": { + "field": "customer_id" + } + } + }, + "aggregations": { + "total_quantity.sum": { + "sum": { + "field": "total_quantity" + } + }, + "taxless_total_price.sum": { + "sum": { + "field": "taxless_total_price" + } + }, + "total_quantity.max": { + "max": { + "field": "total_quantity" + } + }, + "order_id.cardinality": { + "cardinality": { + "field": "order_id" + } + } + } + }, + "dest": { + "index": "ecommerce-customers" + } +} +-------------------------------------------------- +// CONSOLE +// TEST[skip:setup kibana sample data] +-- + +. Start the {dataframe-transform}. ++ +-- + +TIP: Even though resource utilization is automatically adjusted based on the +cluster load, a {dataframe-transform} increases search and indexing load on your +cluster while it runs. If you're experiencing an excessive load, however, you +can stop it. 
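+
+For example, the following request is a sketch of how you could stop the
+transform created in this tutorial by using the API instead of {kib}:
+
+[source,js]
+--------------------------------------------------
+POST _data_frame/transforms/ecommerce-customer-transform/_stop
+--------------------------------------------------
+// NOTCONSOLE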
+ +You can start, stop, and manage {dataframe-transforms} in {kib}: + +[role="screenshot"] +image::images/dataframe-transforms.jpg["Managing {dataframe-transforms} in {kib}"] + +Alternatively, you can use the +{ref}/start-data-frame-transform.html[start {dataframe-transforms}] and +{ref}/stop-data-frame-transform.html[stop {dataframe-transforms}] APIs. For +example: + +[source,js] +-------------------------------------------------- +POST _data_frame/transforms/ecommerce-customer-transform/_start +-------------------------------------------------- +// CONSOLE +// TEST[skip:setup kibana sample data] + +-- + +. Explore the data in your new index. ++ +-- +For example, use the *Discover* application in {kib}: + +[role="screenshot"] +image::images/ecommerce-results.jpg["Exploring the new index in {kib}"] + +-- + +TIP: If you do not want to keep the {dataframe-transform}, you can delete it in +{kib} or use the +{ref}/delete-data-frame-transform.html[delete {dataframe-transform} API]. When +you delete a {dataframe-transform}, its destination index and {kib} index +patterns remain. diff --git a/docs/reference/transform/images/dataframe-transforms.jpg b/docs/reference/transform/images/dataframe-transforms.jpg new file mode 100644 index 0000000000000..927678f894d4b Binary files /dev/null and b/docs/reference/transform/images/dataframe-transforms.jpg differ diff --git a/docs/reference/transform/images/ecommerce-batch.jpg b/docs/reference/transform/images/ecommerce-batch.jpg new file mode 100644 index 0000000000000..bed3fedd4cf01 Binary files /dev/null and b/docs/reference/transform/images/ecommerce-batch.jpg differ diff --git a/docs/reference/transform/images/ecommerce-continuous.jpg b/docs/reference/transform/images/ecommerce-continuous.jpg new file mode 100644 index 0000000000000..f144fc8cb9541 Binary files /dev/null and b/docs/reference/transform/images/ecommerce-continuous.jpg differ diff --git a/docs/reference/transform/images/ecommerce-pivot1.jpg b/docs/reference/transform/images/ecommerce-pivot1.jpg new file mode 100644 index 0000000000000..b55b88b8acfb0 Binary files /dev/null and b/docs/reference/transform/images/ecommerce-pivot1.jpg differ diff --git a/docs/reference/transform/images/ecommerce-pivot2.jpg b/docs/reference/transform/images/ecommerce-pivot2.jpg new file mode 100644 index 0000000000000..9af5a3c46b740 Binary files /dev/null and b/docs/reference/transform/images/ecommerce-pivot2.jpg differ diff --git a/docs/reference/transform/images/ecommerce-results.jpg b/docs/reference/transform/images/ecommerce-results.jpg new file mode 100644 index 0000000000000..f483c3b3c3627 Binary files /dev/null and b/docs/reference/transform/images/ecommerce-results.jpg differ diff --git a/docs/reference/transform/images/ml-dataframepivot.jpg b/docs/reference/transform/images/ml-dataframepivot.jpg new file mode 100644 index 0000000000000..c0c7946cf4441 Binary files /dev/null and b/docs/reference/transform/images/ml-dataframepivot.jpg differ diff --git a/docs/reference/transform/index.asciidoc b/docs/reference/transform/index.asciidoc new file mode 100644 index 0000000000000..2a45e1709dd01 --- /dev/null +++ b/docs/reference/transform/index.asciidoc @@ -0,0 +1,82 @@ +[role="xpack"] +[[ml-dataframes]] += {dataframe-transforms-cap} + +[partintro] +-- + +beta[] + +{es} aggregations are a powerful and flexible feature that enable you to +summarize and retrieve complex insights about your data. 
You can summarize +complex things like the number of web requests per day on a busy website, broken +down by geography and browser type. If you use the same data set to try to +calculate something as simple as a single number for the average duration of +visitor web sessions, however, you can quickly run out of memory. + +Why does this occur? A web session duration is an example of a behavioral +attribute not held on any one log record; it has to be derived by finding the +first and last records for each session in our weblogs. This derivation requires +some complex query expressions and a lot of memory to connect all the data +points. If you have an ongoing background process that fuses related events from +one index into entity-centric summaries in another index, you get a more useful, +joined-up picture--this is essentially what _{dataframes}_ are. + + +[discrete] +[[ml-dataframes-usage]] +== When to use {dataframes} + +You might want to consider using {dataframes} instead of aggregations when: + +* You need a complete _feature index_ rather than a top-N set of items. ++ +In {ml}, you often need a complete set of behavioral features rather just the +top-N. For example, if you are predicting customer churn, you might look at +features such as the number of website visits in the last week, the total number +of sales, or the number of emails sent. The {stack} {ml-features} create models +based on this multi-dimensional feature space, so they benefit from full feature +indices ({dataframes}). ++ +This scenario also applies when you are trying to search across the results of +an aggregation or multiple aggregations. Aggregation results can be ordered or +filtered, but there are +{ref}/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-order[limitations to ordering] +and +{ref}/search-aggregations-pipeline-bucket-selector-aggregation.html[filtering by bucket selector] +is constrained by the maximum number of buckets returned. If you want to search +all aggregation results, you need to create the complete {dataframe}. If you +need to sort or filter the aggregation results by multiple fields, {dataframes} +are particularly useful. + +* You need to sort aggregation results by a pipeline aggregation. ++ +{ref}/search-aggregations-pipeline.html[Pipeline aggregations] cannot be used +for sorting. Technically, this is because pipeline aggregations are run during +the reduce phase after all other aggregations have already completed. If you +create a {dataframe}, you can effectively perform multiple passes over the data. + +* You want to create summary tables to optimize queries. ++ +For example, if you +have a high level dashboard that is accessed by a large number of users and it +uses a complex aggregation over a large dataset, it may be more efficient to +create a {dataframe} to cache results. Thus, each user doesn't need to run the +aggregation query. + +Though there are multiple ways to create {dataframes}, this content pertains +to one specific method: _{dataframe-transforms}_. 
+ +* <> +* <> +* <> +* <> +* <> +-- + +include::overview.asciidoc[] +include::checkpoints.asciidoc[] +include::api-quickref.asciidoc[] +include::dataframe-examples.asciidoc[] +include::troubleshooting.asciidoc[] +include::limitations.asciidoc[] \ No newline at end of file diff --git a/docs/reference/transform/limitations.asciidoc b/docs/reference/transform/limitations.asciidoc new file mode 100644 index 0000000000000..c019018bce509 --- /dev/null +++ b/docs/reference/transform/limitations.asciidoc @@ -0,0 +1,219 @@ +[role="xpack"] +[[dataframe-limitations]] +== {dataframe-transform-cap} limitations +[subs="attributes"] +++++ +Limitations +++++ + +beta[] + +The following limitations and known problems apply to the 7.4 release of +the Elastic {dataframe} feature: + +[float] +[[df-compatibility-limitations]] +=== Beta {dataframe-transforms} do not have guaranteed backwards or forwards compatibility + +Whilst {dataframe-transforms} are beta, it is not guaranteed that a +{dataframe-transform} created in a previous version of the {stack} will be able +to start and operate in a future version. Neither can support be provided for +{dataframe-transform} tasks to be able to operate in a cluster with mixed node +versions. +Please note that the output of a {dataframe-transform} is persisted to a +destination index. This is a normal {es} index and is not affected by the beta +status. + +[float] +[[df-ui-limitation]] +=== {dataframe-cap} UI will not work during a rolling upgrade from 7.2 + +If your cluster contains mixed version nodes, for example during a rolling +upgrade from 7.2 to a newer version, and {dataframe-transforms} have been +created in 7.2, the {dataframe} UI will not work. Please wait until all nodes +have been upgraded to the newer version before using the {dataframe} UI. + + +[float] +[[df-datatype-limitations]] +=== {dataframe-cap} data type limitation + +{dataframes-cap} do not (yet) support fields containing arrays – in the UI or +the API. If you try to create one, the UI will fail to show the source index +table. + +[float] +[[df-ccs-limitations]] +=== {ccs-cap} is not supported + +{ccs-cap} is not supported for {dataframe-transforms}. + +[float] +[[df-kibana-limitations]] +=== Up to 1,000 {dataframe-transforms} are supported + +A single cluster will support up to 1,000 {dataframe-transforms}. +When using the +{ref}/get-data-frame-transform.html[GET {dataframe-transforms} API] a total +`count` of transforms is returned. Use the `size` and `from` parameters to +enumerate through the full list. + +[float] +[[df-aggresponse-limitations]] +=== Aggregation responses may be incompatible with destination index mappings + +When a {dataframe-transform} is first started, it will deduce the mappings +required for the destination index. This process is based on the field types of +the source index and the aggregations used. If the fields are derived from +{ref}/search-aggregations-metrics-scripted-metric-aggregation.html[`scripted_metrics`] +or {ref}/search-aggregations-pipeline-bucket-script-aggregation.html[`bucket_scripts`], +{ref}/dynamic-mapping.html[dynamic mappings] will be used. In some instances the +deduced mappings may be incompatible with the actual data. For example, numeric +overflows might occur or dynamically mapped fields might contain both numbers +and strings. Please check {es} logs if you think this may have occurred. As a +workaround, you may define custom mappings prior to starting the +{dataframe-transform}. 
For example, +{ref}/indices-create-index.html[create a custom destination index] or +{ref}/indices-templates.html[define an index template]. + +[float] +[[df-batch-limitations]] +=== Batch {dataframe-transforms} may not account for changed documents + +A batch {dataframe-transform} uses a +{ref}/search-aggregations-bucket-composite-aggregation.html[composite aggregation] +which allows efficient pagination through all buckets. Composite aggregations +do not yet support a search context, therefore if the source data is changed +(deleted, updated, added) while the batch {dataframe} is in progress, then the +results may not include these changes. + +[float] +[[df-consistency-limitations]] +=== {cdataframe-cap} consistency does not account for deleted or updated documents + +While the process for {cdataframe-transforms} allows the continual recalculation +of the {dataframe-transform} as new data is being ingested, it does also have +some limitations. + +Changed entities will only be identified if their time field +has also been updated and falls within the range of the action to check for +changes. This has been designed in principle for, and is suited to, the use case +where new data is given a timestamp for the time of ingest. + +If the indices that fall within the scope of the source index pattern are +removed, for example when deleting historical time-based indices, then the +composite aggregation performed in consecutive checkpoint processing will search +over different source data, and entities that only existed in the deleted index +will not be removed from the {dataframe} destination index. + +Depending on your use case, you may wish to recreate the {dataframe-transform} +entirely after deletions. Alternatively, if your use case is tolerant to +historical archiving, you may wish to include a max ingest timestamp in your +aggregation. This will allow you to exclude results that have not been recently +updated when viewing the {dataframe} destination index. + + +[float] +[[df-deletion-limitations]] +=== Deleting a {dataframe-transform} does not delete the {dataframe} destination index or {kib} index pattern + +When deleting a {dataframe-transform} using `DELETE _data_frame/transforms/index` +neither the {dataframe} destination index nor the {kib} index pattern, should +one have been created, are deleted. These objects must be deleted separately. + +[float] +[[df-aggregation-page-limitations]] +=== Handling dynamic adjustment of aggregation page size + +During the development of {dataframe-transforms}, control was favoured over +performance. In the design considerations, it is preferred for the +{dataframe-transform} to take longer to complete quietly in the background +rather than to finish quickly and take precedence in resource consumption. + +Composite aggregations are well suited for high cardinality data enabling +pagination through results. If a {ref}/circuit-breaker.html[circuit breaker] +memory exception occurs when performing the composite aggregated search then we +try again reducing the number of buckets requested. This circuit breaker is +calculated based upon all activity within the cluster, not just activity from +{dataframe-transforms}, so it therefore may only be a temporary resource +availability issue. + +For a batch {dataframe-transform}, the number of buckets requested is only ever +adjusted downwards. The lowering of value may result in a longer duration for the +transform checkpoint to complete. 
For {cdataframes}, the number of
+buckets requested is reset back to its default at the start of every checkpoint
+and it is possible for circuit breaker exceptions to occur repeatedly in the
+{es} logs.
+
+The {dataframe-transform} retrieves data in batches, which means it calculates
+several buckets at once. By default, this is 500 buckets per search/index
+operation. The default can be changed using `max_page_search_size` and the
+minimum value is 10. If failures still occur once the number of buckets
+requested has been reduced to its minimum, then the {dataframe-transform} will
+be set to a failed state.
+
+[float]
+[[df-dynamic-adjustments-limitations]]
+=== Handling dynamic adjustments for many terms
+
+For each checkpoint, entities are identified that have changed since the last
+time the check was performed. This list of changed entities is supplied as a
+{ref}/query-dsl-terms-query.html[terms query] to the {dataframe-transform}
+composite aggregation, one page at a time. Then updates are applied to the
+destination index for each page of entities.
+
+The page `size` is defined by `max_page_search_size`, which is also used to
+define the number of buckets returned by the composite aggregation search. The
+default value is 500, the minimum is 10.
+
+The index setting
+{ref}/index-modules.html#dynamic-index-settings[`index.max_terms_count`] defines
+the maximum number of terms that can be used in a terms query. The default value
+is 65536. If `max_page_search_size` exceeds `index.max_terms_count`, the
+transform will fail.
+
+Using smaller values for `max_page_search_size` may result in a longer duration
+for the transform checkpoint to complete.
+
+[float]
+[[df-scheduling-limitations]]
+=== {cdataframe-cap} scheduling limitations
+
+A {cdataframe} periodically checks for changes to source data. The functionality
+of the scheduler is currently limited to a basic periodic timer which can be
+set within the `frequency` range from `1s` to `1h`. The default is `1m`. This is
+designed to run little and often. When choosing a `frequency` for this timer,
+consider your ingest rate along with the impact that the {dataframe-transform}
+search/index operations have on other users in your cluster. Also note that
+retries occur at the `frequency` interval.
+
+[float]
+[[df-failed-limitations]]
+=== Handling of failed {dataframe-transforms}
+
+A failed {dataframe-transform} remains as a persistent task and should be handled
+appropriately, either by deleting it or by resolving the root cause of the
+failure and restarting it.
+
+When using the API to delete a failed {dataframe-transform}, first stop it using
+`_stop?force=true`, then delete it.
+
+To start a failed {dataframe-transform} after the root cause has been resolved,
+you must specify the `_start?force=true` parameter.
+
+[float]
+[[df-availability-limitations]]
+=== {cdataframes-cap} may give incorrect results if documents are not yet available to search
+
+After a document is indexed, there is a very small delay until it is available
+to search.
+
+A {cdataframe-transform} periodically checks for changed entities between the
+last time it checked and `now` minus `sync.time.delay`. This time window moves
+without overlapping. If the timestamp of a recently indexed document falls
+within this time window but this document is not yet available to search, then
+this entity will not be updated.
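+
+A larger `sync.time.delay` gives recently indexed documents time to become
+searchable before the checkpoint window that covers them is processed, at the
+cost of the destination index lagging slightly further behind the source. For
+example, a sketch of a `sync` block with a two-minute delay; the
+`ingest_timestamp` field name is a placeholder:
+
+[source,js]
+----
+"sync": {
+  "time": {
+    "field": "ingest_timestamp",
+    "delay": "120s"
+  }
+}
+----
+// NOTCONSOLE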
+ +If using a `sync.time.field` that represents the data ingest time and using a +zero second or very small `sync.time.delay`, then it is more likely that this +issue will occur. \ No newline at end of file diff --git a/docs/reference/transform/overview.asciidoc b/docs/reference/transform/overview.asciidoc new file mode 100644 index 0000000000000..c0a7856e28314 --- /dev/null +++ b/docs/reference/transform/overview.asciidoc @@ -0,0 +1,71 @@ +[role="xpack"] +[[ml-transform-overview]] +== {dataframe-transform-cap} overview +++++ +Overview +++++ + +beta[] + +A _{dataframe}_ is a two-dimensional tabular data structure. In the context of +the {stack}, it is a transformation of data that is indexed in {es}. For +example, you can use {dataframes} to _pivot_ your data into a new entity-centric +index. By transforming and summarizing your data, it becomes possible to +visualize and analyze it in alternative and interesting ways. + +A lot of {es} indices are organized as a stream of events: each event is an +individual document, for example a single item purchase. {dataframes-cap} enable +you to summarize this data, bringing it into an organized, more +analysis-friendly format. For example, you can summarize all the purchases of a +single customer. + +You can create {dataframes} by using {dataframe-transforms}. +{dataframe-transforms-cap} enable you to define a pivot, which is a set of +features that transform the index into a different, more digestible format. +Pivoting results in a summary of your data, which is the {dataframe}. + +To define a pivot, first you select one or more fields that you will use to +group your data. You can select categorical fields (terms) and numerical fields +for grouping. If you use numerical fields, the field values are bucketed using +an interval that you specify. + +The second step is deciding how you want to aggregate the grouped data. When +using aggregations, you practically ask questions about the index. There are +different types of aggregations, each with its own purpose and output. To learn +more about the supported aggregations and group-by fields, see +{ref}/data-frame-transform-resource.html[{dataframe-transform-cap} resources]. + +As an optional step, you can also add a query to further limit the scope of the +aggregation. + +The {dataframe-transform} performs a composite aggregation that +paginates through all the data defined by the source index query. The output of +the aggregation is stored in a destination index. Each time the +{dataframe-transform} queries the source index, it creates a _checkpoint_. You +can decide whether you want the {dataframe-transform} to run once (batch +{dataframe-transform}) or continuously ({cdataframe-transform}). A batch +{dataframe-transform} is a single operation that has a single checkpoint. +{cdataframe-transforms-cap} continually increment and process checkpoints as new +source data is ingested. + +.Example + +Imagine that you run a webshop that sells clothes. Every order creates a document +that contains a unique order ID, the name and the category of the ordered product, +its price, the ordered quantity, the exact date of the order, and some customer +information (name, gender, location, etc). Your dataset contains all the transactions +from last year. + +If you want to check the sales in the different categories in your last fiscal +year, define a {dataframe-transform} that groups the data by the product +categories (women's shoes, men's clothing, etc.) and the order date. 
Use the +last year as the interval for the order date. Then add a sum aggregation on the +ordered quantity. The result is a {dataframe} that shows the number of sold +items in every product category in the last year. + +[role="screenshot"] +image::images/ml-dataframepivot.jpg["Example of a data frame pivot in {kib}"] + +IMPORTANT: The {dataframe-transform} leaves your source index intact. It +creates a new index that is dedicated to the {dataframe}. + diff --git a/docs/reference/transform/troubleshooting.asciidoc b/docs/reference/transform/troubleshooting.asciidoc new file mode 100644 index 0000000000000..4ea0dd8cc830d --- /dev/null +++ b/docs/reference/transform/troubleshooting.asciidoc @@ -0,0 +1,29 @@ +[[dataframe-troubleshooting]] +== Troubleshooting {dataframe-transforms} +[subs="attributes"] +++++ +Troubleshooting +++++ + +Use the information in this section to troubleshoot common problems. + +include::{stack-repo-dir}/help.asciidoc[tag=get-help] + +If you encounter problems with your {dataframe-transforms}, you can gather more +information from the following files and APIs: + +* Lightweight audit messages are stored in `.data-frame-notifications-*`. Search +by your `transform_id`. +* The +{ref}/get-data-frame-transform-stats.html[get {dataframe-transform} statistics API] +provides information about the transform status and failures. +* If the {dataframe-transform} exists as a task, you can use the +{ref}/tasks.html[task management API] to gather task information. For example: +`GET _tasks?actions=data_frame/transforms*&detailed`. Typically, the task exists +when the transform is in a started or failed state. +* The {es} logs from the node that was running the {dataframe-transform} might +also contain useful information. You can identify the node from the notification +messages. Alternatively, if the task still exists, you can get that information +from the get {dataframe-transform} statistics API. For more information, see +{ref}/logging.html[Logging configuration]. +