Modify tsdb track to allow ingesting into a data stream #275
Conversation
oh! I thought you'd make this
This is just how it started out, because initially the benchmark looked very different (with ilm and all).
That'd be lovely!
This probably wants an update to the README with instructions on how to run the dedup process.
tsdb/_tools/dedupe.py
key = parsed_line['kubernetes']['container']['name']
key += parsed_line['kubernetes']['pod']['name']
key += parsed_line['kubernetes']['node']['name']
container_id = safeget(parsed_line, 'kubernetes', 'container', 'id')
I wonder if parsed_line.get('kubernetes', {}).get('container', {}).get('id') is more normal here. Someone from the perf team will want to review all of this anyway and they should know better than me.
this is what I was looking for
pushed: c0e211e
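For context, a `safeget`-style helper for nested lookups could look roughly like the sketch below; this is an illustration only, and the actual helper in `dedupe.py` may differ.

```python
def safeget(mapping, *keys):
    # Walk nested dicts, returning None as soon as a key is missing,
    # instead of raising a KeyError like chained [...] access would.
    current = mapping
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current

# Roughly equivalent in spirit to chained .get() calls with dict defaults:
#   parsed_line.get('kubernetes', {}).get('container', {}).get('id')
parsed_line = {'kubernetes': {'container': {'id': 'abc123', 'name': 'proxy'}}}
container_id = safeget(parsed_line, 'kubernetes', 'container', 'id')  # 'abc123'
```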
"corpora": [ | ||
{%- if ingest_order is defined and ingest_order == "sorted" %} | ||
{%- if ingest_mode is defined and ingest_mode == "data_stream" %} |
For those following along, we'll likely switch everything to the deduped version in a follow up.
Added docs: fc67579
LGTM. I'd wait for a review from one of the perf folks, but I like it.
Thanks for putting this together @martijnvg and @nik9000! This all worked for me locally and generally looks good. If you could fix the typo and indentation in the README, we'll be good to merge. I did leave a minor suggestion regarding the dedupe script that you can take or leave. Not a big deal.
tsdb/README.md
@@ -120,12 +120,31 @@ rm -rf tmp
head -n 1000 documents-sorted.json > documents-sorted-1k.json
```
Finally, you'll also need a deduped version of the data in order to support the `ingest_mode` that benchmarks ingesting into a tsdb data stream (`data_stream`). Use the `dedupe.py` tool in the `_tools` directory. This tool needs `documents-sorted.json` as input via standard input and generates a deduped …
A few nitpicky things:
- Can you make the indentation/line length consistent with the rest of the file?
- Typo: `varians` -> `variant`
def generate_pod_key(parsed_line):
    return parsed_line['kubernetes']['pod']['name'] + parsed_line['kubernetes']['node']['name']

def generate_node_key(parsed_line):
A minor thing, but this function and `generate_state_node_key` are identical. There are also several places where you could replace `parsed_line['kubernetes']['node']['name']` with `generate_node_key(parsed_line)` just to cut down on duplication if you'd like. Not a big deal, though.
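For illustration, the suggested cleanup could look roughly like this (a sketch, not the exact code that landed):

```python
def generate_node_key(parsed_line):
    return parsed_line['kubernetes']['node']['name']

def generate_pod_key(parsed_line):
    # Reuse generate_node_key instead of repeating the nested lookup.
    return parsed_line['kubernetes']['pod']['name'] + generate_node_key(parsed_line)
```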
Thanks @michaelbaamonde for reviewing! I made the suggested changes.
LGTM, thanks @martijnvg!
This change adds an `ingest_mode` track option to the tsdb track. If this track option is set to `data_stream`, the track sets up a composable index template that allows ingesting into a tsdb data stream. If the `ingest_mode` track param isn't set, or is set to any value other than `data_stream`, the track falls back to ingesting into a tsdb index, which is what it does today. By default the `ingest_mode` track option is not specified, so ingesting into a tsdb index is the default behaviour.

Note that the mappings between `index.json` and `index-template.json` are identical. Only the settings section differs slightly. In the case of a tsdb data stream, the following settings get generated upon data stream creation:
- `index.time_series.start_time` - based on the current time and the `index.look_ahead_time` index setting
- `index.time_series.end_time` - based on the current time and the `index.look_ahead_time` index setting
- `index.routing_path` - derived from the mapping, essentially all fields with the `time_series_dimension=true` attribute
Another difference between the data stream and index ingest modes is that `ingest_mode=data_stream` also uses a pipeline. This pipeline updates the `@timestamp` field of documents to be close to the current time. A Painless script computes the number of days between `2021-04-28T17:18:23.410Z` (the first timestamp in the data set) and the current time. The pipeline then adds that number of days to the `@timestamp` of each document, which makes the timestamps close to the current time while maintaining the original timestamp distribution. The data set contains roughly a day's worth of metrics.
Also, a `dedupe.py` tool is added to the `_tools` directory. When ingesting into a tsdb data stream, `op_type=create` is required, so when a document has the same dimension field values and the same timestamp (dimension fields and timestamp are used to create the `_id` of a tsdb document), a version conflict (409) occurs. This tool removes the duplicates, so that no 409s are returned when ingesting the data via the bulk API. Note that when indexing into a tsdb index, the default `op_type=index` is used and duplicates just get overwritten; Rally doesn't return an error, so previously this wasn't an issue from a benchmark perspective.
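As a rough illustration of what the dedupe step does (a hypothetical sketch; the actual `dedupe.py` builds its keys with a safeget-style helper over the kubernetes container/pod/node fields and may differ in detail):

```python
import json
import sys

def dedupe(lines):
    # Drop documents whose dimension key and @timestamp were already seen, so
    # that bulk ingestion with op_type=create does not produce 409 conflicts.
    seen = set()
    for line in lines:
        doc = json.loads(line)
        # Hypothetical key: dimension fields plus timestamp, mirroring how tsdb
        # derives a document's _id. The real tool tolerates missing fields.
        key = (
            doc['kubernetes']['container']['name'],
            doc['kubernetes']['pod']['name'],
            doc['kubernetes']['node']['name'],
            doc['@timestamp'],
        )
        if key not in seen:
            seen.add(key)
            yield line

if __name__ == '__main__':
    # Reads documents-sorted.json on standard input, writes deduped lines to stdout.
    for kept in dedupe(sys.stdin):
        sys.stdout.write(kept)
```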