
Modify tsdb track to allow ingesting into a data stream #275

Merged (22 commits) on Jun 29, 2022

Conversation

martijnvg
Member

@martijnvg martijnvg commented Jun 23, 2022

This change adds an ingest_mode track option to the tsdb track. If this track option is set to data_stream,
then the track sets up a composable index template that allows ingesting into a tsdb data stream.

If the ingest_mode track param isn't set, or is set to any value other than data_stream, then the track falls back to ingesting into a tsdb index, which is what it does today. Because the ingest_mode track option is not specified by default, ingesting into a tsdb index remains the default behaviour.

Note that the mappings in index.json and index-template.json are identical. Only the settings
sections differ slightly. In the case of a tsdb data stream, the following settings get generated upon data stream creation (a rough sketch of the time bounds follows the list):

  • index.time_series.start_time - based on the current time and the index.look_ahead_time index setting
  • index.time_series.end_time - based on the current time and the index.look_ahead_time index setting
  • index.routing_path - based on the mapping, essentially all fields with the time_series_dimension=true attribute
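
For intuition, here is a minimal Python sketch of how those time bounds relate to the current time. The 2h look_ahead_time is only an assumed value for illustration; the real values are derived by Elasticsearch from the index settings.

```python
from datetime import datetime, timedelta, timezone

# Assumed value for illustration only; the real bound comes from the
# index.look_ahead_time index setting.
look_ahead = timedelta(hours=2)

now = datetime.now(timezone.utc)
start_time = now - look_ahead  # rough analogue of index.time_series.start_time
end_time = now + look_ahead    # rough analogue of index.time_series.end_time
print(start_time.isoformat(), end_time.isoformat())
```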

Another difference between the data_stream and index ingest modes is that ingest_mode=data_stream also
uses an ingest pipeline. This is to update the @timestamp field of documents so that it is close to the current time. A painless script
is used to compute the number of days between 2021-04-28T17:18:23.410Z (the first timestamp in the data set) and the current time. The pipeline then adds that number of days to the @timestamp of each document, which brings the timestamps close to the current time while maintaining the original timestamp distribution. The data set contains roughly a day's worth of metrics.
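
The following Python snippet sketches that offset computation. The track itself does this with a painless script inside the ingest pipeline, so treat this only as an illustration of the idea:

```python
from datetime import datetime, timedelta, timezone

# First timestamp in the data set, per the description above.
FIRST_TIMESTAMP = datetime(2021, 4, 28, 17, 18, 23, 410000, tzinfo=timezone.utc)

def shift_timestamp(original: datetime) -> datetime:
    # Whole-day offset between the data set's first timestamp and now;
    # adding whole days keeps the original intra-day distribution intact.
    days = (datetime.now(timezone.utc) - FIRST_TIMESTAMP).days
    return original + timedelta(days=days)

print(shift_timestamp(FIRST_TIMESTAMP).isoformat())
```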

Also, a dedupe.py tool is added to the _tools directory. When ingesting into a tsdb data stream, op_type=create is required, so when a document has the same dimension field values and the same timestamp as another document (dimension fields and timestamp are used to create the _id of a tsdb document) a version conflict (409) occurs. This tool removes the duplicates, so that no 409s are returned when ingesting the data via the bulk api. Note that when indexing into a tsdb index the default op_type=index is used, in which case duplicates just get overwritten and Rally doesn't return any errors; that is why this wasn't an issue from a benchmark perspective before.
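
The gist of the dedupe step, as a hedged Python sketch (not the actual _tools/dedupe.py, which builds its keys per metric type): documents that share the same dimension values and @timestamp are collapsed to the first occurrence.

```python
import json
import sys

# Illustration only: keep the first document for each (timestamp, container,
# pod, node) combination and drop later duplicates, so that bulk ingest with
# op_type=create does not hit 409 version conflicts.
seen = set()
for line in sys.stdin:
    doc = json.loads(line)
    k8s = doc.get('kubernetes', {})
    key = (
        doc.get('@timestamp'),
        k8s.get('container', {}).get('name'),
        k8s.get('pod', {}).get('name'),
        k8s.get('node', {}).get('name'),
    )
    if key in seen:
        continue
    seen.add(key)
    sys.stdout.write(line)
```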

@nik9000
Member

nik9000 commented Jun 23, 2022

oh! I thought you'd make this a --track-option on the tsdb track.

@martijnvg
Member Author

oh! I thought you'd make this a --track-option on the tsdb track.

This is just how it started out, because initially the benchmark looked very different (with ilm and all).
I can change this PR to embed the data stream specific bits into the tsdb track and add a track option to ingest
into a data stream.

@nik9000
Member

nik9000 commented Jun 23, 2022

I can change this PR to embed the data stream specific bits into the tsdb track and add a track option to ingest
into a data stream.

That'd be lovely!

@martijnvg
Member Author

@nik9000 I've added an ingest_mode param to the tsdb track and inlined the tsdb data stream track into the tsdb track: 9c5077e

@martijnvg martijnvg marked this pull request as ready for review June 27, 2022 08:59
@martijnvg martijnvg requested a review from nik9000 June 27, 2022 08:59
Member

@nik9000 nik9000 left a comment


This probably wants an update to the README with instructions on how to run the dedupe process.

key = parsed_line['kubernetes']['container']['name']
key += parsed_line['kubernetes']['pod']['name']
key += parsed_line['kubernetes']['node']['name']
container_id = safeget(parsed_line, 'kubernetes', 'container', 'id')
Member


I wonder if parsed_line.get('kubernetes', {}).get('container', {}).get('id') is more normal here. Someone from the perf team will want to review all of this anyway and they should know better than me.
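
For reference, a minimal sketch of the suggested chained-lookup style (the example document is hypothetical):

```python
# Example document that has no 'id' under kubernetes.container.
parsed_line = {'kubernetes': {'container': {'name': 'c1'}}}

# Chained .get() calls with empty-dict defaults: a missing 'kubernetes' or
# 'container' key yields None instead of raising a KeyError.
container_id = (
    parsed_line.get('kubernetes', {})
               .get('container', {})
               .get('id')
)
print(container_id)  # None
```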

Member Author


this is what I was looking for

Member Author


pushed: c0e211e

"corpora": [
{%- if ingest_order is defined and ingest_order == "sorted" %}
{%- if ingest_mode is defined and ingest_mode == "data_stream" %}
Member


For those following along, we'll likely switch everything to the deduped version in a follow up.

@martijnvg
Member Author

This probably wants an update to the README with instructions on how to run the dedupe process.

Added docs: fc67579

Member

@nik9000 nik9000 left a comment


LGTM. I'd wait for a review from one of the perf folks, but I like it.

@martijnvg martijnvg changed the title Add data stream tsdb track Modify tsdb track to ingest into a data stream Jun 28, 2022
@martijnvg martijnvg changed the title Modify tsdb track to ingest into a data stream Modify tsdb track to allow ingesting into a data stream Jun 28, 2022
Contributor

@michaelbaamonde michaelbaamonde left a comment


Thanks for putting this together @martijnvg and @nik9000! This all worked for me locally and generally looks good. If you could fix the typo and indentation in the README, we'll be good to merge. I did leave a minor suggestion regarding the dedupe script that you can take or leave. Not a big deal.

tsdb/README.md Outdated
@@ -120,12 +120,31 @@ rm -rf tmp
head -n 1000 documents-sorted.json > documents-sorted-1k.json
```

Finally you'll also need a deduped version of the data in order to support the `ingest_mode` that benchmarks ingesting into a tsdb data stream (`data_stream`). Use the `dedupe.py` tool in the `_tools` directory. This tool needs `documents-sorted.json` as input via standard input and generates a deduped variant.
Contributor


A few nitpicky things:

  • Can you make the indentation/line length consistent with the rest of the file?
  • Typo: varians -> variant

def generate_pod_key(parsed_line):
return parsed_line['kubernetes']['pod']['name'] + parsed_line['kubernetes']['node']['name']

def generate_node_key(parsed_line):
Contributor

@michaelbaamonde michaelbaamonde Jun 28, 2022


A minor thing, but this function and generate_state_node_key are identical. There are also several places where you could replace parsed_line['kubernetes']['node']['name'] with generate_node_key(parsed_line) just to cut down on duplication, if you'd like. Not a big deal, though.
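
As a quick sketch of the suggested reuse (hypothetical, since the body of generate_node_key is not shown above; it is assumed to return the node name):

```python
def generate_node_key(parsed_line):
    return parsed_line['kubernetes']['node']['name']

def generate_pod_key(parsed_line):
    # Reuse generate_node_key instead of repeating the nested lookup.
    return parsed_line['kubernetes']['pod']['name'] + generate_node_key(parsed_line)
```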

@martijnvg
Member Author

Thanks @michaelbaamonde for reviewing! I made the suggested changes.

Contributor

@michaelbaamonde michaelbaamonde left a comment


LGTM, thanks @martijnvg!

@martijnvg martijnvg merged commit 592651b into elastic:master Jun 29, 2022