Sample data generation #984

Closed
ruflin opened this issue Sep 20, 2022 · 48 comments
Labels
Team:Cloud-Monitoring Label for the Cloud Monitoring team Team:Ecosystem Label for the Packages Ecosystem team

Comments

@ruflin
Member

ruflin commented Sep 20, 2022

When building integration packages, sample data is important to develop ingest pipelines and build dashboards. Unfortunately, in most cases real sample data is limited and often tricky to produce. This issue proposes a tool as part of elastic-package that can generate and load sample data.

Important: The following is only an initial proposal to better explain the problem and share existing ideas. A proper design is still required.

Why part of elastic-package

Generating sample data is not a new problem and there are several tools which already provide partial solutions. A tool to generate sample data in elastic-package is needed to make it available in a simple way to every package developer. How sample data should look and be generated becomes part of the package spec. This way, someone building a package directly also gets the possibility of generating sample data and using it as part of the developer experience.

Data generation - metrics / logs

For the data generation, two different types of data exist. Metrics and traces are mostly already in the format that will be ingested into Elasticsearch and require very little processing. Logs on the other hand often come as raw messages and require ingest pipelines or runtime fields to structure the data. The goal is that the tool can generate both types of data but it can happen in iterations.

Metrics generation

For the generation of metrics, I suggest taking strong inspiration from the elastic-integration-corpus-generator-tool built by @aspacca. Instead of having to build separate config files, the config params for each field would be directly in the fields.yml of each data stream. The definition could look similar to the following:

- name: kubernetes.pod.network.rx.bytes
  type: long
  format: bytes
  unit: byte
  metric_type: counter
  description: |
    Received bytes
  _data_generation: # Discuss exact name
    fuzziness: 1000
    range: 10000

The exact syntax for each field type needs definition.
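
To make the idea more tangible, here is a minimal sketch of how a counter field could be generated from such a definition. The semantics are an assumption, since the proposal leaves the syntax open: `range` is read as the upper bound of the value and `fuzziness` as the maximum increase between two consecutive values. The function name `generate_counter` is hypothetical.

```python
import random


def generate_counter(count, value_range, fuzziness, seed=None):
    """Return `count` non-decreasing values in [0, value_range].

    Assumed semantics (not defined in the proposal):
    - value_range: upper bound for any generated value
    - fuzziness:   maximum increase between two consecutive values
    """
    rng = random.Random(seed)
    values = []
    current = rng.randint(0, fuzziness)
    for _ in range(count):
        # A counter never decreases, so only non-negative steps are added.
        current = min(value_range, current + rng.randint(0, fuzziness))
        values.append(current)
    return values


# Values for kubernetes.pod.network.rx.bytes with the settings shown above.
samples = generate_counter(count=5, value_range=10000, fuzziness=1000, seed=42)
```

Passing a seed keeps the output reproducible, which helps when the same sample set should be regenerated later.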

Logs generation

For logs generation, inspiration can be taken from the tool spigot by @leehinman. Ideally we could simplify this by allowing users to specify message patterns, something like {@timestamp} {source.ip}, and then specify for these fields what the values should be. The tool would then take over the generation of sample logs.

It is important that the log generation outputs message fields as they look before the ingest pipeline runs.
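
As an illustration only, the pattern idea could be sketched as follows. The placeholder syntax is taken from the example above; the `render` and `make_generators` helpers and the way values are produced are assumptions, not spigot's API.

```python
import ipaddress
import random
import re
from datetime import datetime, timedelta, timezone

# Matches {field.name} placeholders in a message pattern.
PLACEHOLDER = re.compile(r"\{([^{}]+)\}")


def render(pattern, generators, rng):
    """Replace every {field} placeholder with a freshly generated value."""
    return PLACEHOLDER.sub(lambda m: str(generators[m.group(1)](rng)), pattern)


def make_generators(start):
    """Hypothetical per-field value generators, keyed by field name."""
    return {
        "@timestamp": lambda rng: (start + timedelta(seconds=rng.randint(0, 3600))).isoformat(),
        "source.ip": lambda rng: ipaddress.IPv4Address(rng.getrandbits(32)),
    }


rng = random.Random(7)
start = datetime(2022, 9, 20, tzinfo=timezone.utc)
lines = [render("{@timestamp} {source.ip}", make_generators(start), rng) for _ in range(3)]
```

Each rendered line is a raw message, so it still has to pass through the ingest pipeline before it matches the fields defined in the package.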

Generated data format

The proposed data structure generated by the tool is the one used by esrally. It contains 1 JSON doc per line with all the fields inside. This makes it simple to just deliver the data to Elasticsearch and makes it possible to potentially reuse some of this generated data with rally tracks.
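
A minimal sketch of writing this one-document-per-line format is shown below. The helper name `write_ndjson` is hypothetical; it also returns the document count and byte size, on the assumption that such metadata is useful when the file is later described as a corpus.

```python
import io
import json


def write_ndjson(docs, stream):
    """Write one compact JSON document per line; return (doc_count, byte_count)."""
    count = 0
    size = 0
    for doc in docs:
        line = json.dumps(doc, separators=(",", ":")) + "\n"
        stream.write(line)
        count += 1
        size += len(line.encode("utf-8"))
    return count, size


buffer = io.StringIO()
docs = [
    {"@timestamp": "2022-09-20T00:00:00Z", "kubernetes": {"pod": {"network": {"rx": {"bytes": n}}}}}
    for n in (100, 250)
]
count, size = write_ndjson(docs, buffer)
```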

Non goals

A non-goal of the data generation and loading is to replace rally. Rally measures exact performance and builds reproducible benchmarks. Generating and loading data with elastic-package is about testing ingest pipelines, dashboards and queries on larger sets of data in an easy way. The focus is on package development.

Another non goal is to generate events that are related to each other. For some solutions it is important that if a host.name shows up other parts of the data contain the same host.name to be able to browse through the solution. This might be added at a later point but is not part of the scope.

Sample data storage

As the sample data can always be generated on the fly, it is not required to store it. If some of the sample data sets should be stored for later use, package-spec should provide a schema to reference sample datasets.

Command line

Command line arguments must be available to generate sample data for a dataset or a package and load it into Elasticsearch. Ideally package-spec allows to store some config files around which data sets can be generated so a package developer can share these configs as part of the package.

Initial packages to start with

I recommend picking 2 initial packages to start with for the data generation. As k8s and AWS are both more complex packages that also generate lots of data, they could be a good start, focusing on the metrics part.

Future ideas

  • Use the data generation to test expected storage use per dataset. This can be used to compare storage use across versions but also help users predict how much storage will be needed.
  • Load sample data, run a report on the dashboards and export performance metrics as part of a pull request. The report will also help to see if some parts of a dashboard are broken.
  • Real time event generation: Instead of pre-generating sample data, elastic-package could keep shipping events to Elasticsearch.
@jlind23 jlind23 added the Team:Ecosystem Label for the Packages Ecosystem team label Sep 20, 2022
@jsoriano
Member

Pinging @marc-gr as he was thinking also about this problem in the context of pipelines benchmarking.

@aspacca
Contributor

aspacca commented Sep 21, 2022

For the generation of metrics, I suggest to take strong inspiration from the elastic-integration-corpus-generator-tool tool

I'd like to give some context about the rationale when developing the tool.
I had two main goals:

  1. relying on a source of truth (that's why it is based on fields.yml)
  2. being able to define cardinality and fuzziness

Satisfying point 1., at the current status, implies the data generated are post-ingest, but nothing prevents us from expanding package-spec to include a definition of the pre-ingest schema.
I've considered alternatives to this, from "playing backwards" the ingest pipelines to analysing a sample of pre-ingest data, and I'm more and more convinced that making space in package-spec for such a schema is the preferable solution.

I still have to look in the tool spigot by @leehinman.
I imagine the pre-ingest schema as a common solution for both metrics and logs, borrowing (as it is or from scratch) the possibility to define cardinality and fuzziness.

Ideally a third "tool", part of elastic package, can be developed to extract cardinality and fuzziness from existing ingested data in a cluster in order to initially feed the two above.
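
A rough sketch of what such an extraction could compute from a sample of already ingested documents is shown below. It works on plain dictionaries rather than on a live cluster, and the helper names (`get_path`, `field_stats`) as well as the choice of "largest jump between consecutive values" as a stand-in for fuzziness are assumptions.

```python
import json


def get_path(doc, dotted):
    """Resolve a dotted field name like 'host.name' in a nested dict."""
    for key in dotted.split("."):
        if not isinstance(doc, dict) or key not in doc:
            return None
        doc = doc[key]
    return doc


def field_stats(docs, field):
    """Return count and cardinality, plus range and max_delta for numeric fields."""
    values = [v for v in (get_path(d, field) for d in docs) if v is not None]
    # Serialise values so that unhashable ones (lists, dicts) can be counted too.
    distinct = {json.dumps(v, sort_keys=True) for v in values}
    stats = {"count": len(values), "cardinality": len(distinct)}
    numeric = [v for v in values if isinstance(v, (int, float)) and not isinstance(v, bool)]
    if len(numeric) > 1:
        stats["range"] = max(numeric) - min(numeric)
        stats["max_delta"] = max(abs(b - a) for a, b in zip(numeric, numeric[1:]))
    return stats
```

The resulting numbers could then seed the cardinality and fuzziness settings of a generator config instead of being guessed by hand.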

@ruflin
Member Author

ruflin commented Sep 22, 2022

Satisfying point 1., at the current status, implies the data generated are post-ingest, but nothing prevents us to expand package-spec to include a definition of the pre-ingest schema.

I was initially thinking that especially in the metrics case, this should not be too big of an issue. But we are also moving more processing around metrics to ingest pipelines instead of Elastic Agent so the metrics data coming in might look quite a bit different. Something to investigate further.

@leehinman

I don't know if this needs to be part of elastic-package code base. I'm wondering if it could be something like https://github.com/elastic/stream/ where we supply a config and it is spun up as a docker instance during elastic-package test.

Ideally it would be nice if the sample generation tool could be run several ways:

  1. in elastic-package test environment
  2. as an "input" that elastic-agent could deploy, so customers could generate data for integrations (demo or testing use case)
  3. stand alone, so you could generate rally tracks, local testing, etc.

@aspacca
Contributor

aspacca commented Sep 23, 2022

I don't know if this needs to be part of elastic-package code base.

I agree with this, in terms of separate repos that provide both standalone CLI commands and packages to be consumed directly from elastic-package, without the need to wrap them in a separate process.

@leehinman

With vpcflow we found a "problem" with the logs that spigot generated. The vpcflow data was fine, but if we made a rally track out of that data, it was missing fields that the filebeat awss3 input adds to the event. Without those fields we couldn't drive the issue.

So I think we really want to have one or more things that can generate logs, and then for both filebeat & logstash have outputs that make rally tracks. That way we don't have to duplicate the fields added by inputs and we don't have to duplicate any user defined processors either; filebeat or logstash will do that like they do in production. This also has another benefit: if we don't have a log generation tool, we can capture real data and make a rally track.

@aspacca
Contributor

aspacca commented Nov 3, 2022

The vpcflow data was fine, but if we made a rally track out of that data, it was missing fields that the filebeat awss3 input adds to the event.
So I think we really want to have one or more things that can generate logs, and then for both filebeat & logstash have outputs that make rally tracks.

I guess the missing fields could be generated by https://github.com/elastic/elastic-integration-corpus-generator-tool, since they should be part of "fields.yml", but for _id

the tool at the moment can only generate post-ingestion pipeline documents, but with some jq post-processing I was able to generate the source vpcflow logs. if we defined some specs for generating back to pre-ingestion pipeline from post one it should not be a big deal to incorporate this feature in the tool instead of relying on external post-processing

please let me know if you'd like to work together on the topic :)

@leehinman

I guess the missing fields could be generated by https://github.com/elastic/elastic-integration-corpus-generator-tool, since they should be part of "fields.yml", but for _id

the fields.yml files are getting more complete, but there will always be gaps. maybe if we could mix in additional fields that we find are missing that might take care of the gaps.

the tool at the moment can only generate post-ingestion pipeline documents, but with some jq post-processing I was able to generate the source vpcflow logs. if we defined some specs for generating back to pre-ingestion pipeline from post one it should not be a big deal to incorporate this feature in the tool instead of relying on external post-processing

vpcflow processing is pretty minimal, some other integrations will make putting the original back together from the result very complicated. Others like Cloudtrail would be pretty easy to re-assemble from the results.

please let me know if you'd like to work together on the topic :)

I'm definitely interested in making sure we can generate source documents for all of our integrations.

Even if we get the corpus generator to be nearly perfect, I'm still in favor of providing a turn key way for customers to generate rally tracks from their own data. That way we can run tests with the exact data that is causing the problem.

@cmacknz
Member

cmacknz commented Nov 4, 2022

I'm still in favor of providing a turn key way for customers to generate rally tracks from their own data

Do we need them to generate rally tracks, or is just capturing the raw events enough and we could post-process them into a rally track ourselves? I wonder if adding the ability to tee events to both a file and the actual target output would help here.

I definitely like the idea of having something built into Agent that users can enable in production to give us the exact events they are experiencing issues with. I think this will eliminate a lot of back and forth and wasted time because we could have exactly the data that is causing problems, with possible caveats like having to sanitize out personally identifiable information or credentials.

@leehinman

Do we need them to generate rally tracks, or is just capturing the raw events enough and we could post-process them into a rally track ourselves? I wonder if adding the ability to tee events to both a file and the actual target output would help here.

Doesn't have to be rally, but we should have a simple way of converting to rally. We could probably have a small utility that takes the existing file output and turns that into rally tracks.

I really like the tee idea. We would need to ignore ACKs from secondary output, and make sure that a slow secondary output doesn't slow down the primary output.
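
The two requirements can be sketched as follows. This is only an illustration of the idea, not how Beats or Elastic Agent outputs are implemented: the class name `TeeOutput` is hypothetical, the primary output is modelled as a callable returning an ACK, and the secondary output is a bounded queue that drops events when full instead of blocking.

```python
import queue


class TeeOutput:
    """Publish to a primary output and copy events to a best-effort secondary."""

    def __init__(self, primary, secondary_capacity=1000):
        self.primary = primary
        self.secondary_queue = queue.Queue(maxsize=secondary_capacity)
        self.dropped = 0

    def publish(self, event):
        # Never block on the secondary: if its queue is full, drop the copy.
        try:
            self.secondary_queue.put_nowait(event)
        except queue.Full:
            self.dropped += 1
        # Only the primary's ACK is reported back to the caller.
        return self.primary(event)

    def drain_secondary(self, sink):
        """Hand all queued copies to `sink` (e.g. a file writer); return how many."""
        drained = 0
        while True:
            try:
                event = self.secondary_queue.get_nowait()
            except queue.Empty:
                return drained
            sink(event)
            drained += 1
```

In a real output the queue would be drained by a background worker; here `drain_secondary` is called explicitly to keep the sketch small.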

@cconboy

cconboy commented Nov 22, 2022

Not certain if it offers anything beyond other tools mentioned above, but there also exists Logen for generating logs
https://github.com/elastic/logen
https://docs.google.com/presentation/d/1I2ZKQo-Rbr18l05Lrp-lnUk-vIM3ZRpnb-xjaPdxShQ/edit#slide=id.p1

@cavokz

cavokz commented Nov 23, 2022

Talking of tools, there is also elastic/geneve. We don't have a good summary of what it does though there are some technical docs at https://github.com/elastic/geneve/tree/main/docs and https://github.com/elastic/geneve/tree/main/tests/reports.

The juice is that you describe (in a so-called data model) what kind of documents you need and then Geneve will generate as many as you want. A data model can be as simple as "these fields need to be present" or something more complex like "the documents need to have this relation: the first doc has some content in process.name, the second doc has process.parent.name set to whatever was generated for the first one".

Geneve was born to generate documents that would trigger detection rules in the Security app, but nothing prevents describing other kinds of field/document relations and using the generated documents for other purposes. Indeed, we are currently working with the Analyst Experience team to help them fill their stack with data in a flexible way; this would allow them to use and develop Kibana in ways that are not easily feasible now.

An example of data model is:

sequence by host.id with maxspan=1m
 [file where event.type != "deletion" and file.path in ("/System/Library/LaunchDaemons/*", "/Library/LaunchDaemons/*")]
 [process where event.type in ("start", "process_started") and process.name == "launchctl" and process.args == "load"]

Example of the four (*) pairs of documents that can be generated:

[{'event': {'type': ['ZFy'], 'category': ['file']}, 'file': {'path': '/System/Library/LaunchDaemons/UyyFjSvILOoOHmx'}, 'host': {'id': 'BnL'}, '@timestamp': 0},
 {'event': {'type': ['start'], 'category': ['process']}, 'process': {'name': 'launchctl', 'args': ['load']}, 'host': {'id': 'BnL'}, '@timestamp': 1},
 {'event': {'type': ['eOA'], 'category': ['file']}, 'file': {'path': '/System/Library/LaunchDaemons/gaiFqsyzKNyyQ'}, 'host': {'id': 'DpU'}, '@timestamp': 2},
 {'event': {'type': ['process_started'], 'category': ['process']}, 'process': {'name': 'launchctl', 'args': ['load']}, 'host': {'id': 'DpU'}, '@timestamp': 3},
 {'event': {'type': ['EUD'], 'category': ['file']}, 'file': {'path': '/Library/LaunchDaemons/xVTOLWtimrFgT'}, 'host': {'id': 'msh'}, '@timestamp': 4},
 {'event': {'type': ['start'], 'category': ['process']}, 'process': {'name': 'launchctl', 'args': ['load']}, 'host': {'id': 'msh'}, '@timestamp': 5},
 {'event': {'type': ['CeL'], 'category': ['file']}, 'file': {'path': '/Library/LaunchDaemons/L'}, 'host': {'id': 'Sjo'}, '@timestamp': 6},
 {'event': {'type': ['process_started'], 'category': ['process']}, 'process': {'name': 'launchctl', 'args': ['load']}, 'host': {'id': 'Sjo'}, '@timestamp': 7}]

* Why four pairs? Because the model above has four branches and Geneve can explore all of them individually.

In principle Geneve is a "constraints solver"; the data model is indeed a way to describe constraints to the data generation process. Relations between fields/documents are constraints on the otherwise completely open solution space from which Geneve draws its "solutions".

When the solution space is empty, some conflicting constraints are present (e.g. destination.port == 22 and destination.port in (80, 443)), no solution can be found and an error is reported. This is a very useful way to detect queries that can never be satisfied by any dataset.
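
The empty-solution-space case can be illustrated with a toy example for a single port field. This is not Geneve's implementation or API, only a sketch of the principle: each constraint narrows a set of candidate values, and an empty set means the constraints conflict. The function `solve_port` and the tuple-based constraint format are hypothetical.

```python
def solve_port(constraints, domain=range(0, 65536)):
    """Return the set of ports satisfying all constraints; raise if none do."""
    candidates = set(domain)
    for op, operand in constraints:
        if op == "==":
            candidates &= {operand}
        elif op == "in":
            candidates &= set(operand)
        elif op == "!=":
            candidates -= {operand}
        else:
            raise ValueError(f"unsupported operator: {op}")
    if not candidates:
        # Empty solution space: the constraints contradict each other.
        raise ValueError(f"unsolvable constraints: {constraints}")
    return candidates
```

Calling `solve_port([("==", 22), ("in", (80, 443))])` raises, mirroring the conflicting example above, while compatible constraints return the remaining candidate values to draw from.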

TBC (in some more suitable place)

@ruflin ruflin self-assigned this Dec 1, 2022
@aspacca aspacca self-assigned this Dec 1, 2022
@ruflin
Member Author

ruflin commented Dec 14, 2022

In the last few weeks, work has been done on improving the elastic-integration-corpus-generator-tool and trying out multiple template approaches: elastic/elastic-integration-corpus-generator-tool#39. Even though this work is not completed, I'm putting together here a more concrete proposal on how all the pieces could work together to build an end-to-end experience around elastic-package. Below I bring up very specific examples but the exact names are less important than the concepts. If we go down the path of implementation, the details will likely change.

Data schemas

When collecting data with Elastic Agent and shipping to Elasticsearch, there are 4 different data schemas. This is important as the data schemas look different and we must align during generation on what data schemas we are talking about. In the diagram below, the schemas A, B, C and D are shown:

flowchart LR
    D[Data/Endpoint]
    EA[Elastic Agent]
    ES[Elasticsearch]
    IP[Ingest Pipeline]
    D -->|Schema A| EA -->|Schema B| IP -->|Schema C| ES -->|Schema D| Query;
  • Schema A: This is the schema the Elastic Agent collects. It could be a line in a log file, response of an http request, syslog event etc. The Elastic Agent input knows how to handle this structure.
  • Schema B: This is the schema the Elastic Agent ships to Elasticsearch (to the ingest pipeline). This is a JSON document which contains all the processing the Elastic Agent did on schema A. For example, the content of the log line is in the message field and meta information about the host was added to the event.
  • Schema C: In case an ingest pipeline exists, the ingest pipeline converts schema B to schema C. This can be taking apart a log message with grok or enriching data with geoip. If there is no ingest pipeline, schema B and C are equal.
  • Schema D: This is the schema users write queries on in Elasticsearch. Schema C can be different from D in the scenario of runtime fields, otherwise C and D are equal.

Schema C is the one that is defined in integration packages in the fields.yml files for each data stream. In the following, our focus is on Schema B with mentions of schema A.

Schema B generation

Schema B is always in JSON format and is the output generated by Elastic Agent. It contains the meta information about the event itself like host or k8s metadata. As schema B is in JSON format, shipping it to Elasticsearch could in theory be done by a curl request taking the JSON doc as the body of the request. When it is sent to the correct data stream, ingest processing also happens and the data is persisted.
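
For more than a handful of documents, the body of such a request would typically be a bulk payload. The sketch below only builds the payload and the target name; sending it is left out. It assumes the usual `{type}-{dataset}-{namespace}` data stream naming and uses the `create` action, which is the one data streams accept for new documents. The helper names are hypothetical.

```python
import json


def bulk_body(docs):
    """Build a newline-delimited bulk body: one action line, then one source line."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"create": {}}))
        lines.append(json.dumps(doc))
    # The bulk body has to end with a newline.
    return "\n".join(lines) + "\n"


def data_stream_name(stream_type, dataset, namespace="default"):
    """Compose the target data stream name from its three parts."""
    return f"{stream_type}-{dataset}-{namespace}"


body = bulk_body([{"message": "first log line"}, {"message": "second log line"}])
target = data_stream_name("logs", "foo.bar")
```

The payload would then be posted to the `_bulk` endpoint of the target data stream, so that the data stream's default ingest pipeline turns schema B into schema C.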

The elastic-integration-corpus-generator-tool has ways to generate data based on some config options and templates. In elastic/elastic-integration-corpus-generator-tool#39 multiple approaches for different templating are discussed. What all have in common are:

  • Event template: Template for the event to be generated with variables inside
  • Fields Configuration: Definition of each field on how it should be generated

What I skipped above is the fields definition for Elasticsearch, which is contained in the tool but is not needed in the context of packages, as Schema C is already defined as part of the package. What is needed in addition is a configuration file for the data generator to decide how much data should be generated, the time range etc. In the tool this is currently done through command line parameters.

The assumption is that for a single dataset in an integration package, different scenarios could be generated. Let's take package foo with dataset bar as an example. The following files would exist:

  • foo/data_stream/bar/_dev/data_generation/config.yml
  • foo/data_stream/bar/_dev/data_generation/template1.tmpl
  • foo/data_stream/bar/_dev/data_generation/template2.tmpl
  • foo/data_stream/bar/_dev/data_generation/template1-config.yml
  • foo/data_stream/bar/_dev/data_generation/template2-config.yml

In the example above, 2 templates each with a config file are used. The .tmpl file contains the json template for the event and template1-config.yml contains the definition of the fields for the template. It would be possible to have just one definition for each template.

The config.yml contains a list of scenarios that should be generated. It could look similar to:

data_generation:
- name: short-sample
  timerange: 2d
  events: 1000
  template: template1
  # See spigot for more options https://github.com/leehinman/spigot
  output: elasticsearch
- name: middle-sample
  timerange: 10d
  events: 10000
  template: template1
  output: elasticsearch
- name: large-sample
  timerange: 2d
  events: 1000000
  template: template2
  output: rally

More config options could be added. The goal is to show that multiple data generations can be configured. Having all the setup done, elastic-package can be used to generate the data:

elastic-package data generate --package=foo --dataset=bar --name=large-sample

The parameters are optional. If the command is run inside a package, it would apply to all datasets and all scenarios by default, or one can be selected. As can be seen in the above example, an output format can also be specified. The data can be stored in rally track format or sent to Elasticsearch directly.

Behind the scenes, elastic-integration-corpus-generator-tool is used to generate the events out of the templates and spigot to generate the relevant outputs.
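
To show how a scenario entry could drive the generation, here is a minimal sketch that turns `timerange` and `events` into evenly spaced event timestamps. The even spacing, the supported units (m, h, d) and the function names are assumptions for illustration; the scenario is written as a plain dictionary mirroring the config.yml entry above.

```python
import re
from datetime import datetime, timedelta, timezone

UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def parse_timerange(text):
    """Parse values like '2d' or '12h' into a timedelta."""
    match = re.fullmatch(r"(\d+)([mhd])", text)
    if match is None:
        raise ValueError(f"unsupported timerange: {text!r}")
    return timedelta(**{UNITS[match.group(2)]: int(match.group(1))})


def event_timestamps(scenario, end):
    """Spread the scenario's events evenly over the time range ending at `end`."""
    span = parse_timerange(scenario["timerange"])
    step = span / scenario["events"]
    start = end - span
    return [start + step * i for i in range(scenario["events"])]


scenarios = {
    "short-sample": {"timerange": "2d", "events": 1000, "template": "template1", "output": "elasticsearch"},
}
end = datetime(2022, 12, 14, tzinfo=timezone.utc)
timestamps = event_timestamps(scenarios["short-sample"], end)
```

Each timestamp would then be passed into the selected template as the value for @timestamp.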

Schema A generation

The generation of schema A would look very similar. But to ship schema A to Elasticsearch, a running Elastic Agent is needed. Similar to schema B, package-spec could contain config options on how to generate it. It would require in addition some logic around how schema A is collected to run an Elastic Agent for collection of it. There is a good chance, all these configs already exist in the data stream and can be used.

I see the generation of schema A as only the second priority of this project.

Rally track generation

One of the goals of the generation of schema B is to be able to create rally track compatible data. To create a full rally track, it is also required to export templates and ingest pipelines from the package. elastic-package already has an export command that can be used for this.

In an ideal scenario, a user could run elastic-package benchmark --dataset=foo and behind the scenes, data would be generated, rally track created, setup is done with esbench, pipelines and templates are loaded and data is ingested to Elasticsearch through the pipeline and measurements are provided on the performance for this run. At first, there might be some additional manual steps required.

@alexsapran
Contributor

Some thoughts on the benchmark topics, but don't want to sidetrack the discussion, maybe we can sync offline.

In an ideal scenario, a user could run elastic-package benchmark --dataset=foo and behind the scenes, data would be generated, rally track created, setup is done with esbench, pipelines and templates are loaded and data is ingested to Elasticsearch through the pipeline and measurements are provided on the performance for this run. At first, there might be some additional manual steps required.

It would be nice to have elastic-package orchestrate whatever is required to execute a benchmark in the context of the integration itself, meaning start agent, configure agent, configure remote ES (index-template, pipelines, ....), start the agent, monitor the agent, tear down the agent, collect results. Having elastic-package focused only on the component and not orchestrating a more complicated full environment would be a good starting point.

@marc-gr
Contributor

marc-gr commented Dec 14, 2022

Some thoughts on the benchmark topics, but don't want to sidetrack the discussion, maybe we can sync offline.

In an ideal scenario, a user could run elastic-package benchmark --dataset=foo and behind the scenes, data would be generated, rally track created, setup is done with esbench, pipelines and templates are loaded and data is ingested to Elasticsearch through the pipeline and measurements are provided on the performance for this run. At first, there might be some additional manual steps required.

It would be nice to have elastic-package orchestrate whatever is required to execute a benchmark in the context of the integration itself, meaning start agent, configure agent, configure remote ES (index-template, pipelines, ....), start the agent, monitor the agent, tear down the agent, collect results. Having elastic-package focused only on the component and not orchestrating a more complicated full environment would be a good starting point.

In these lines, I drafted this for my own reference, so is not intended to be complete but just a bigger picture of how this would be for cases such as the Schema A scenarios mentioned above

image

So ideally we would have all required things to run benchmarks self contained in elastic-package and as part of the integration definitions, and esbench is going to be more a description for a more permanent benchmark setup as it is today for other cases if I am not mistaken.

@endorama
Member

Following up on what @marc-gr wrote, we can also consider that data generation and data usage may not necessarily happen consecutively, and keep this decoupling in mind. The benefit would be that generator tools could generate and store data in generic storage, and tools that leverage the data could "replay" it without the generation step, which may be compute intensive (thus either being bounded by compute resources or requiring extensive resources to run at the desired scale). (As discussed with @ruflin, some tools, like https://github.com/elastic/rally, already support loading from S3.)

@ruflin
Member Author

ruflin commented Dec 15, 2022

can also consider that data generation and data usage may not necessarily happen consecutively and keep this decoupling in mind

++, we should not only consider it but make sure it is decoupled. I expect that by default, when data generation is used, the data is written to disk (in some format, maybe rally format?). That doesn't mean there can't eventually be commands that bring it all together in one flow.

@ruflin ruflin removed their assignment Jan 5, 2023
@ruflin ruflin added the Team:Cloud-Monitoring Label for the Cloud Monitoring team label Jan 5, 2023
@aspacca
Contributor

aspacca commented Jan 9, 2023

Status update:

  • PR introducing the possibility to generate data based on a template (full go text/template package plus sprig functions, stripped go text/template package supporting only placeholders): this unlocks the possibility to generate any Schema X data

Next steps (priority order to be defined):

@susan-shu-c
Member

Thank you @aspacca !

I have an additional question about elastic-integration-corpus-generator-tool: could it be used to resolve an issue where in a fresh cluster install, before the first alert is generated, the .alerts-security.alerts-default index doesn't exist? For example, an integration package that has a transform that reads from that index will error with no such index [.alerts-security.alerts-default] until we go in and make sure an alert is created. Example: the CI job failing on this PR

@aspacca
Contributor

aspacca commented Jan 10, 2023

hi @susan-shu-c

I have an additional question about elastic-integration-corpus-generator-tool: could it be used to resolve an issue where in a fresh cluster install, before the first alert is generated, the .alerts-security.alerts-default index doesn't exist? For example, an integration package that has a transform that reads from that index will error with no such index [.alerts-security.alerts-default]

elastic-integration-corpus-generator-tool just generates data on the machine you run it on: nothing prevents you from generating a template that contains the payload of a bulk request for .alerts-security.alerts-default, but you still have to send it to the cluster. you have to add this in your CI/wherever as a preparatory step.

unless you need to generate a lot of data with random content, there's probably no need to make use of elastic-integration-corpus-generator-tool, since what would solve the issue in CI (as far as I understand the issue) is ingesting documents in .alerts-security.alerts-default.

on the broader scope of having an elastic-package benchmark command your case is anyway something that has to be addressed, because potentially you should be able to run something like elastic-package benchmark host_risk_score

I'm not familiar with security integrations: I see for example that host_risk_score does not have a ./data_stream/billing/fields folder, that's where elastic-integration-corpus-generator-tool gets the information in order to generate "Schema C" data

@aspacca
Contributor

aspacca commented Jan 12, 2023

cc @leehinman

@endorama
Member

endorama commented Feb 14, 2023

We are still in the first iteration, with the same goals. Progress has been slow in the last week due to SDH duties.

Week 6 (Feb 6-10):
PR opened:

Week 7 (Feb 13-17):
PR Merged:

The week is still in progress; we expect to open a new PR with the aws.ec2_logs template and merge the aws.billing template.

@endorama
Member

endorama commented Mar 7, 2023

Week 9 (Feb 27 - Mar 2):

New PRs:

Merged PRs:

  • Various updates to dependencies

@endorama
Member

endorama commented Mar 7, 2023

The first iteration is still ongoing. Progress has been made in elastic-package and templates.

Week 10 (Mar 6 - 10):

Merged:

@endorama
Member

endorama commented Mar 23, 2023

Today we conclude the first iteration. All planned templates are available.

Week 12 (Mar 20 - 24):

Merged:

Releases: v0.5.0

Related:

@bturquet

bturquet commented Jun 15, 2023

2023-06-15 Update on the project

1st iteration, Creating templates (complete)

Done

During the first iteration we created schema B templates for aws.ec2_logs, aws.ec2_metrics, aws.billing, aws.sqs and k8s

2nd iteration, Adding commands to elastic-package (in progress)

Done

Library refactoring

In progress

Adding benchmark rally command

Next (dependency with the previous one)

  • Duplicate the content directly to elastic-package (PR needs to be created, est. 1 day)
  • Update the benchmark generate-data command to use assets from package instead of generator’s repo (PR needs to be created, est. a few days)

@bturquet

Regarding

The package-spec PR has been merged and we are now waiting for a new release. Before doing the release we have this pending PR (under review).

@bturquet

bturquet commented Jun 22, 2023

All the PR dependencies have been merged. @aspacca, could you launch the next steps please?

@bturquet

@aspacca Could we share an update on where we are today and what the next and remaining steps are, please?

@aspacca
Contributor

aspacca commented Sep 25, 2023

2023-09-25 Update on the project

2nd iteration

In progress

  • Change size param to totEvents param in generator's API [PR]
  • Fix system benchmark in elastic-package according to new generator's API [package-spec PR, elastic-package issue]
  • Add support to rally benchmark in elastic-package [issue]
  • Duplicate the schema-b content from the generator's repo directly to elastic-package [issue]
  • Update the benchmark generate-data command to use assets from package instead of generator’s repo [issue]

@aspacca
Contributor

aspacca commented Oct 12, 2023

2023-10-12 Update on the project

2nd iteration

In progress

  • Change size param to totEvents param in generator's API [PR]
  • Fix system benchmark in elastic-package according to new generator's API [package-spec PR, elastic-package issue]. no work left, blocked by package-spec@v3
  • Add support to rally benchmark in elastic-package [issue]. est. 5 days of coding, with potential external dependencies
  • Duplicate the schema-b content from the generator's repo directly to elastic-package [issue]. est. 1 day of coding for each dataset's integration. no external dependencies
  • Update the benchmark generate-data command to use assets from package instead of generator’s repo [issue]. est. 5 days of coding, external dependency (task above) can be mocked.

@ruflin
Member Author

ruflin commented Oct 17, 2023

no work left, blocked by package-spec@v3

@aspacca Can you share some more details on what part of v3 this is blocked on? Anything we can do on our end to get this unblocked?

Duplicating schemas

Will the schemas change in some way when moving to elastic-package? The context for my question: after moving 1-2 datasets as examples, could the teams themselves move the assets over?

@aspacca
Contributor

aspacca commented Oct 17, 2023

Can you share some more details on what part of v3 this is blocked on? Anything we can do on our end to get this unblocked?

We dropped a deprecated field and renamed the new one, taking advantage of the breaking change in v3. No one except package-spec and the tests in elastic-package is using the spec for system/rally benchmarks yet, so we didn't need to support a migration path.
The PR is blocked until the spec changes described above are merged in v3, which is required for CI to pass.

Duplicating schemas
Will the schemas change in some way when moving to elastic-package? The context for my question: after moving 1-2 datasets as examples, could the teams themselves move the assets over?

No, they won't change, apart from a review of the flattened-object notation, which I personally forgot to consider when creating the schemas in the first place.
The teams would then be able to duplicate/migrate the schemas independently.

@aspacca
Contributor

aspacca commented Oct 17, 2023

2023-10-17 Update on the project

2nd iteration

In progress

  • Change size param to totEvents param in generator's API [PR]
  • Fix system benchmark in elastic-package according to new generator's API [package-spec PR #1, package-spec PR #2, elastic-package issue, elastic-package PR]. no work left, waiting for CI to be green with the release of package-spec@v3
  • Add support to rally benchmark in elastic-package [issue]. est. 5 days of coding, with potential external dependencies
  • Duplicate the schema-b content from the generator's repo directly to elastic-package [issue]. est. 1 day of coding for each dataset's integration. no external dependencies
  • Update the benchmark generate-data command to use assets from package instead of generator’s repo [issue]. est. 5 days of coding, external dependency (task above) can be mocked.

@aspacca
Contributor

aspacca commented Oct 24, 2023

2023-10-24 Update on the project

2nd iteration

In progress

  • Change size param to totEvents param in generator's API [PR]
  • Fix system benchmark in elastic-package according to new generator's API [package-spec PR, elastic-package issue, elastic-package PR].
  • Add support to rally benchmark in elastic-package [issue, PR]. PR waiting for review.
  • Duplicate the schema-b content from the generator's repo directly to elastic-package [issue]. est. 1 day of coding for each dataset's integration. no external dependencies
  • Update the benchmark generate-data command to use assets from package instead of generator’s repo [issue]. est. 5 days of coding, external dependency (task above) can be mocked.

@aspacca
Contributor

aspacca commented Oct 31, 2023

2023-10-31 Update on the project

2nd iteration

In progress

  • Change size param to totEvents param in generator's API [PR]
  • Fix system benchmark in elastic-package according to new generator's API [package-spec PR, elastic-package issue, elastic-package PR].
  • Add support to rally benchmark in elastic-package [issue, PR]. PR in review.
  • Duplicate the schema-b content from the generator's repo directly to elastic-package [issue]. est. 1 day of coding for each dataset's integration. no external dependencies
  • Remove benchmark generate-data command [issue]. est. 1 day.

@aspacca
Contributor

aspacca commented Nov 7, 2023

2023-11-07 Update on the project

2nd iteration

In progress

  • Change size param to totEvents param in generator's API [PR]
  • Fix system benchmark in elastic-package according to new generator's API [package-spec PR, elastic-package issue, elastic-package PR].
  • Add support to rally benchmark in elastic-package [issue, PR].
  • Duplicate the schema-b content from the generator's repo directly to elastic-package [issue]. est. 1 day of coding for each dataset's integration. 2/6 through
  • Remove benchmark generate-data command [issue]. est. 1 day.

@aspacca
Contributor

aspacca commented Nov 10, 2023

2023-11-10 Update on the project

3rd iteration

In progress

  • Remove benchmark generate-data command [issue]. waiting for review
  • Add benchmark stream command [issue]. est. 5 days.
  • Add loading a custom template for TSDB range in benchmark rally command [issue]. est. 3 days.

@aspacca
Contributor

aspacca commented Nov 14, 2023

2023-11-14 Update on the project

2nd iteration

  • Duplicate the schema-b content from the generator's repo directly to elastic-package [issue]. est. 1 day of coding for each dataset's integration. 6/6 through, waiting for review

3rd iteration

In progress

  • Remove benchmark generate-data command [issue, PR]. merged
  • Add benchmark stream command [issue]. est. 5 days
  • Add loading a custom template for TSDB range in benchmark rally command [issue, PR]. waiting for review
  • Make the generator tool derive event timestamps as (now - period) + ((period / tot events) * nth event) [PR]
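
The timestamp formula in the last item can be illustrated with a small sketch. This is not the generator's actual code: the function name `event_timestamp` and its parameters are hypothetical, and the sketch assumes events are counted from 0.

```python
from datetime import datetime, timedelta, timezone


def event_timestamp(now: datetime, period: timedelta, tot_events: int, nth: int) -> datetime:
    """Return (now - period) + ((period / tot_events) * nth).

    Hypothetical helper: spreads tot_events evenly over the window
    starting at now - period, assuming nth counts from 0.
    """
    if tot_events <= 0:
        raise ValueError("tot_events must be positive")
    step = period / tot_events  # time distance between two consecutive events
    return (now - period) + step * nth


# Example: 4 events spread over the hour that ends at `now`.
now = datetime(2023, 11, 14, 12, 0, 0, tzinfo=timezone.utc)
stamps = [event_timestamp(now, timedelta(hours=1), 4, n) for n in range(4)]
for ts in stamps:
    print(ts.isoformat())
```

With these inputs the events land 15 minutes apart, starting one hour before `now`, so the timestamps always increase with the event index.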

@aspacca
Contributor

aspacca commented Nov 22, 2023

2023-11-23 Update on the project

2nd iteration

  • Duplicate the schema-b content from the generator's repo directly to elastic-package [issue]. est. 1 day of coding for each dataset's integration. 3/6 merged, 3/6 waiting for review

3rd iteration

In progress

  • Remove benchmark generate-data command [issue, PR]. merged
  • Add benchmark stream command [issue]. est. 5 days
  • Add loading a custom template for TSDB range in benchmark rally command [issue, PR]. ready to merge
  • Make the generator tool derive event timestamps as (now - period) + ((period / tot events) * nth event) [PR]
  • Add range.from/to for type date in the generator tool [PR]. waiting for review

@aspacca
Contributor

aspacca commented Nov 24, 2023

2023-11-24 Update on the project

2nd iteration

  • Duplicate the schema-b content from the generator's repo directly to elastic-package [issue]. est. 1 day of coding for each dataset's integration. 5/6 merged, 1/6 waiting for review

3rd iteration

In progress

  • Remove benchmark generate-data command [issue, PR]. merged
  • Add benchmark stream command [issue]. est. 5 days
  • Add loading a custom template for TSDB range in benchmark rally command [issue, PR].
  • Make the generator tool derive event timestamps as (now - period) + ((period / tot events) * nth event) [PR]
  • Add range.from/to for type date in the generator tool [PR]. waiting for review
  • elastic-package benchmark rally: support install package from registry and local corpus. est. 1 day

@aspacca
Contributor

aspacca commented Nov 30, 2023

2023-11-30 Update on the project

2nd iteration

  • Duplicate the schema-b content from the generator's repo directly to elastic-package [issue]. est. 1 day of coding for each dataset's integration. 5/6 merged, 1/6 waiting for review

3rd iteration

In progress

  • Remove benchmark generate-data command [issue, PR]. merged
  • Add benchmark stream command [issue]. est. 5 days
  • Add loading a custom template for TSDB range in benchmark rally command [issue, PR].
  • Make the generator tool derive event timestamps as (now - period) + ((period / tot events) * nth event) [PR]
  • Add range.from/to for type date in the generator tool [PR]. waiting for review
  • elastic-package benchmark rally: support install package from registry and local corpus. [PR]. waiting for review

@aspacca
Contributor

aspacca commented Dec 4, 2023

2023-12-04 Update on the project

2nd iteration

  • Duplicate the schema-b content from the generator's repo directly to elastic-package [issue].

3rd iteration

In progress

  • Remove benchmark generate-data command [issue, PR]. merged
  • Add benchmark stream command [issue, PR]. waiting for review
  • Add loading a custom template for TSDB range in benchmark rally command [issue, PR].
  • Make the generator tool derive event timestamps as (now - period) + ((period / tot events) * nth event) [PR]
  • Add range.from/to for type date in the generator tool [PR]
  • elastic-package benchmark rally: support install package from registry and local corpus. [PR]. waiting for review
  • Dump all the integrations' ES assets for the benchmark rally command [issue]

@aspacca
Contributor

aspacca commented Jan 9, 2024

2024-01-09 Update on the project

3rd iteration

In progress

  • Remove benchmark generate-data command [issue, PR]. merged
  • Add benchmark stream command [issue, PR]. waiting for review
  • Add loading a custom template for TSDB range in benchmark rally command [issue, PR].
  • Make the generator tool derive event timestamps as (now - period) + ((period / tot events) * nth event) [PR]
  • Add range.from/to for type date in the generator tool [PR]
  • elastic-package benchmark rally: support install package from registry and local corpus. [PR]. waiting for review
  • Dump all the integrations' ES assets for the benchmark rally command [issue]

@aspacca
Contributor

aspacca commented Jan 16, 2024

2024-01-16 Update on the project

3rd iteration

In progress

  • Remove benchmark generate-data command [issue, PR]. merged
  • Add benchmark stream command [issue, PR]. waiting for review
  • Add loading a custom template for TSDB range in benchmark rally command [issue, PR].
  • Make the generator tool derive event timestamps as (now - period) + ((period / tot events) * nth event) [PR]
  • Add range.from/to for type date in the generator tool [PR]
  • elastic-package benchmark rally: support install package from registry and local corpus. [PR]. waiting for review
  • Dump all the integrations' ES assets for the benchmark rally command [issue]

@aspacca
Contributor

aspacca commented Feb 13, 2024

2024-02-13: Final update on the issue.

integrations repo

The assets (templates, fields.yml and config.yml) were generated for the following datasets in the integrations repo:

  • aws.ec2_logs
  • aws.ec2_metrics
  • aws.billing
  • aws.sqs
  • kubernetes.state_container
  • kubernetes.node
  • nginx.access
  • mysql.slowlog
  • mysql.error
  • mysql.galera_status
  • mysql.status
  • nginx.stubstatus
  • nginx.error
  • mysql.performance
  • kubernetes.container
  • kubernetes.pod

Refinement of some existing assets is ongoing, and assets for new datasets are continuously being added.

elastic-package repo

  • Added elastic-package benchmark rally to generate and run a rally track from the root folder of an integration for a specific dataset. Several options are provided, such as only generating the rally track with the related corpus, persisting the rally track and the related corpus, or replaying an existing generated rally track with the related corpus.
  • Added elastic-package benchmark stream to stream ingestion to an ES cluster from the root folder of an integration for one or multiple datasets at once. An option is provided to backfill events for a configurable amount of time before the command was run.

Further enhancements are already planned, including (but not limited to) decoupling the commands from the location they must be launched from (the root folder of an integration), improving the automation experience for the relevant audiences, and an internal refactoring of the existing duplicated code.

elastic-integration-corpus-generator-tool repo

  • Added the possibility to define a specific seed for the rand package and a fixed time to be used in place of time.Now(), in order to generate reproducible content
  • The total content to be generated is now specified as a number of events instead of a content size
  • Added range.from/to for date fields in order to set time bounds on the generated values (similar to the numeric range.min/max)
  • Enforced progressive and ordered generation of the @timestamp field
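
To illustrate how these features combine, here is a minimal sketch in Python. It is not the generator tool's implementation or API: `generate_events`, its parameters, and the `created` and `value` field names are all hypothetical. It only shows the ideas from the list: a fixed seed plus a fixed "now" give reproducible output, date values stay inside a from/to range, and `@timestamp` grows with each event.

```python
import random
from datetime import datetime, timedelta, timezone


def generate_events(seed, now, period, tot_events, date_from, date_to):
    """Hypothetical sketch of reproducible event generation."""
    rng = random.Random(seed)  # private generator: same seed, same sequence
    step = period / tot_events  # spacing between consecutive @timestamp values
    span_seconds = (date_to - date_from).total_seconds()
    events = []
    for nth in range(tot_events):
        events.append({
            # Progressive and ordered: grows with the event index.
            "@timestamp": (now - period) + step * nth,
            # Bounded date value, analogous to range.from/to.
            "created": date_from + timedelta(seconds=rng.uniform(0, span_seconds)),
            # Bounded numeric value, analogous to range.min/max.
            "value": rng.randint(0, 100),
        })
    return events


fixed_now = datetime(2024, 2, 13, tzinfo=timezone.utc)
date_from = datetime(2024, 1, 1, tzinfo=timezone.utc)
date_to = datetime(2024, 1, 31, tzinfo=timezone.utc)

first = generate_events(42, fixed_now, timedelta(minutes=10), 5, date_from, date_to)
second = generate_events(42, fixed_now, timedelta(minutes=10), 5, date_from, date_to)
print(first == second)
```

Because both the seed and the reference time are passed in explicitly rather than read from the environment, two runs with the same arguments produce identical events, which is what makes a generated corpus comparable across benchmark runs.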

Support for counter numeric fields is already in progress. A big refactoring of the configuration around cardinality is planned and already designed, and it will be the next thing implemented. This refactoring is a breaking change that we deemed necessary; it has the highest priority precisely because, if we proceed with it now, the impact will be fairly small.

We identified the following areas of ownership:
