Modify tsdb track to allow ingesting into a data stream #275
Conversation
oh! I thought you'd make this
This is just how it started out, because initially the benchmark looked very different (with ilm and all).
That'd be lovely!
This probably wants an update to the README with instructions on how to run the dedup process.
tsdb/_tools/dedupe.py
key = parsed_line['kubernetes']['container']['name']
key += parsed_line['kubernetes']['pod']['name']
key += parsed_line['kubernetes']['node']['name']
container_id = safeget(parsed_line, 'kubernetes', 'container', 'id')
I wonder if parsed_line.get('kubernetes', {}).get('container', {}).get('id') is more normal here. Someone from the perf team will want to review all of this anyway and they should know better than me.
this is what I was looking for
pushed: c0e211e
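For context, a `safeget`-style helper for nested lookups could look roughly like the sketch below; this is an illustration only, and the actual helper in `dedupe.py` may differ.

```python
def safeget(mapping, *keys):
    # Walk nested dicts, returning None as soon as a key is missing,
    # instead of raising a KeyError like chained [...] access would.
    current = mapping
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current

# Roughly equivalent in spirit to chained .get() calls with dict defaults:
#   parsed_line.get('kubernetes', {}).get('container', {}).get('id')
parsed_line = {'kubernetes': {'container': {'id': 'abc123', 'name': 'proxy'}}}
container_id = safeget(parsed_line, 'kubernetes', 'container', 'id')  # 'abc123'
```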
"corpora": [ | ||
{%- if ingest_order is defined and ingest_order == "sorted" %} | ||
{%- if ingest_mode is defined and ingest_mode == "data_stream" %} |
For those following along, we'll likely switch everything to the deduped version in a follow up.
Added docs: fc67579
LGTM. I'd wait for a review from one of the perf folks, but I like it.
Thanks for putting this together @martijnvg and @nik9000! This all worked for me locally and generally looks good. If you could fix the typo and indentation in the README, we'll be good to merge. I did leave a minor suggestion regarding the dedupe script that you can take or leave. Not a big deal.
tsdb/README.md
@@ -120,12 +120,31 @@ rm -rf tmp
head -n 1000 documents-sorted.json > documents-sorted-1k.json
```
Finally, you'll also need a deduped version of the data in order to support the `ingest_mode` that benchmarks ingesting into a tsdb data stream (`data_stream`). Use the `dedupe.py` tool in the `_tools` directory. This tool needs `documents-sorted.json` as input via standard input and generates a deduped …
A few nitpicky things:
- Can you make the indentation/line length consistent with the rest of the file?
- Typo: `varians` -> `variant`
def generate_pod_key(parsed_line):
    return parsed_line['kubernetes']['pod']['name'] + parsed_line['kubernetes']['node']['name']

def generate_node_key(parsed_line):
A minor thing, but this function and `generate_state_node_key` are identical. There are also several places where you could replace `parsed_line['kubernetes']['node']['name']` with `generate_node_key(parsed_line)` just to cut down on duplication if you'd like. Not a big deal, though.
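For illustration, the suggested cleanup could look roughly like this (a sketch, not the exact code that landed):

```python
def generate_node_key(parsed_line):
    return parsed_line['kubernetes']['node']['name']

def generate_pod_key(parsed_line):
    # Reuse generate_node_key instead of repeating the nested lookup.
    return parsed_line['kubernetes']['pod']['name'] + generate_node_key(parsed_line)
```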
Thanks @michaelbaamonde for reviewing! I made the suggested changes.
LGTM, thanks @martijnvg!
This change adds an `ingest_mode` track option to the tsdb track. If this track option is set to `data_stream`, the track sets up a composable index template that allows ingesting into a tsdb data stream. If the `ingest_mode` track param isn't set, or is set to any value other than `data_stream`, the track falls back to ingesting into a tsdb index, which is what it does today. By default the `ingest_mode` track option is not specified, so ingesting into a tsdb index is the default behaviour.

Note that the mappings between `index.json` and `index-template.json` are identical. Only the settings section differs slightly. In the case of a tsdb data stream, the following settings get generated upon data stream creation:
- `index.time_series.start_time` - based on the current time and the `index.look_ahead_time` index setting
- `index.time_series.end_time` - based on the current time and the `index.look_ahead_time` index setting
- `index.routing_path` - derived from the mapping, essentially all fields with the `time_series_dimension=true` attribute
Another difference between the data stream and index ingest modes is that `ingest_mode=data_stream` also uses a pipeline. This pipeline updates the `@timestamp` field of documents to be close to the current time. A Painless script computes the number of days between `2021-04-28T17:18:23.410Z` (the first timestamp in the data set) and the current time. The pipeline then adds that number of days to the `@timestamp` of each document, which makes the timestamps close to the current time while maintaining the original timestamp distribution. The data set contains roughly a day's worth of metrics.
Also, a `dedupe.py` tool is added to the `_tools` directory. When ingesting into a tsdb data stream, `op_type=create` is required, so when a document has the same dimension field values and the same timestamp (dimension fields and timestamp are used to create the `_id` of a tsdb document), a version conflict (409) occurs. This tool removes the duplicates, so that no 409s are returned when ingesting the data via the bulk API. Note that when indexing into a tsdb index, the default `op_type=index` is used and duplicates just get overwritten; Rally doesn't return an error, so previously this wasn't an issue from a benchmark perspective.
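As a rough illustration of what the dedupe step does (a hypothetical sketch; the actual `dedupe.py` builds its keys with a safeget-style helper over the kubernetes container/pod/node fields and may differ in detail):

```python
import json
import sys

def dedupe(lines):
    # Drop documents whose dimension key and @timestamp were already seen, so
    # that bulk ingestion with op_type=create does not produce 409 conflicts.
    seen = set()
    for line in lines:
        doc = json.loads(line)
        # Hypothetical key: dimension fields plus timestamp, mirroring how tsdb
        # derives a document's _id. The real tool tolerates missing fields.
        key = (
            doc['kubernetes']['container']['name'],
            doc['kubernetes']['pod']['name'],
            doc['kubernetes']['node']['name'],
            doc['@timestamp'],
        )
        if key not in seen:
            seen.add(key)
            yield line

if __name__ == '__main__':
    # Reads documents-sorted.json on standard input, writes deduped lines to stdout.
    for kept in dedupe(sys.stdin):
        sys.stdout.write(kept)
```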