Sample data generation #984
Comments
Pinging @marc-gr as he was thinking also about this problem in the context of pipelines benchmarking. |
I'd like to give some context about the rationale when developing the tool.
Satisfying point 1., in the current state, implies the generated data is post-ingest, but nothing prevents us from expanding package-spec to include a definition of the pre-ingest schema. I still have to look into the tool spigot by @leehinman. Ideally a third "tool", part of elastic package, can be developed to extract cardinality and fuzziness from existing ingested data in a cluster in order to initially feed the two above. |
I was initially thinking that especially in the metrics case, this should not be too big of an issue. But we are also moving more processing around metrics to ingest pipelines instead of Elastic Agent so the metrics data coming in might look quite a bit different. Something to investigate further. |
I don't know if this needs to be part of Ideally it would be nice if the sample generation tool could be run several ways:
|
I agree with this in terms of separate repos that provide both standalone CLI commands and packages to be consumed directly from |
With vpcflow we found a "problem" with the logs that spigot generated. The vpcflow data was fine, but if we made a rally track out of that data, it was missing fields that the filebeat awss3 input adds to the event. Without those fields we couldn't drive the issue. So I think we really want to have one or more things that can generate logs, and then for both filebeat & logstash have outputs that make rally tracks. That way we don't have to duplicate the fields added by inputs and we don't have to duplicate any user defined processors either, filebeat or logstash will do that like they do in production. This also has another benefit, if we don't have a log generation tool, we can capture real data and make a rally track. |
I guess the missing fields could be generated by https://github.com/elastic/elastic-integration-corpus-generator-tool, since they should be part of "fields.yml", but the tool at the moment can only generate post-ingestion pipeline documents, but with some please let me know if you'd like to work together on the topic :) |
the
vpcflow processing is pretty minimal, some other integrations will make putting the original back together from the result very complicated. Others like Cloudtrail would be pretty easy to re-assemble from the results.
I'm definitely interested in making sure we can generate source documents for all of our integrations. Even if we get the corpus generator to be nearly perfect, I'm still in favor of providing a turn key way for customers to generate rally tracks from their own data. That way we can run tests with the exact data that is causing the problem. |
Do we need them to generate rally tracks, or is just capturing the raw events enough and we could post-process them into a rally track ourselves? I wonder if adding the ability to tee events to both a file and the actual target output would help here. I definitely like the idea of having something built into Agent that users can enable in production to give us the exact events they are experiencing issues with. I think this will eliminate a lot of back and forth and wasted time because we could have exactly the data that is causing problems, with possible caveats like having to sanitize out personally identifiable information or credentials. |
Doesn't have to be rally, but we should have a simple way of converting to rally. We could probably have a small utility that takes the existing file output and turns that into rally tracks. I really like the tee idea. We would need to ignore ACKs from secondary output, and make sure that a slow secondary output doesn't slow down the primary output. |
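To make the tee idea a bit more concrete, here is a minimal Python sketch. It is not Beats or Elastic Agent code; `TeeOutput`, `primary` and `secondary` are hypothetical names. The assumption is that only the primary output decides the ACK, while the secondary output is fed through a bounded queue and simply drops events when it cannot keep up.

```python
import queue
import threading
from typing import Callable, Dict

Event = Dict[str, object]


class TeeOutput:
    """Send each event to a primary sink and, best effort, to a secondary sink."""

    def __init__(self, primary: Callable[[Event], bool],
                 secondary: Callable[[Event], None], buffer_size: int = 1000):
        self._primary = primary
        self._secondary = secondary
        # Bounded buffer: when it is full, events for the secondary are dropped.
        self._queue: "queue.Queue" = queue.Queue(maxsize=buffer_size)
        self.dropped = 0
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def publish(self, event: Event) -> bool:
        """Return the primary's ACK; the secondary never influences it."""
        try:
            self._queue.put_nowait(event)  # never blocks the caller
        except queue.Full:
            self.dropped += 1
        return self._primary(event)

    def _drain(self) -> None:
        while True:
            event = self._queue.get()
            if event is None:  # sentinel pushed by close()
                break
            try:
                self._secondary(event)
            except Exception:
                pass  # secondary failures are ignored on purpose

    def close(self) -> None:
        self._queue.put(None)
        self._worker.join()
```

The `dropped` counter is there so a user can tell whether the captured file is complete before turning it into a rally track.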
Not certain if it offers anything beyond other tools mentioned above, but there also exists Logen for generating logs |
Talking of tools, there is also elastic/geneve. We don't have a good summary of what it does, though there are some technical docs at https://github.com/elastic/geneve/tree/main/docs and https://github.com/elastic/geneve/tree/main/tests/reports. The gist is that you describe (in a so-called data model) what kind of documents you need and then Geneve will generate as many as you want. A data model can be as simple as "these fields need to be present" to something more complex like "the documents need to have this relation: first doc has some content in Geneve was born to generate documents that would trigger detection rules in the Security app, but nothing forbids describing other kinds of field/document relations and using the generated documents for other purposes. Indeed we are currently working with the Analyst Experience team to help them fill their stack with data in a flexible way; this would allow them to use and develop Kibana in ways that are not easily feasible now. An example of a data model is:
Example of the four (*) pairs of documents that can be generated:
* Why four pairs? Because the model above has four branches and Geneve can explore all of them individually. In principle Geneve is a "constraints solver"; the data model is a way to describe constraints on the data generation process. Relations between fields/documents are constraints on the otherwise completely open solution space from which Geneve draws its "solutions". When the solution space is empty, some conflicting constraints are present (eg: TBC (in some more suitable place) |
In the last few weeks, work has been done on improving the elastic-integration-corpus-generator-tool and trying out multiple template approaches: elastic/elastic-integration-corpus-generator-tool#39
Even though this work is not completed, I'm putting together here a more concrete proposal on how all the pieces could work together to build an end-to-end experience around elastic-package. Below I bring up very specific examples but the exact names are less important than the concepts. If we go down the path of implementation, the details will likely change.
Data schemas
When collecting data with Elastic Agent and shipping to Elasticsearch, there are 4 different data schemas. This is important as the data schemas look different and we must align during generation on which data schemas we are talking about. In the diagram below, the schemas A, B, C and D are shown:
flowchart LR
D[Data/Endpoint]
EA[Elastic Agent]
ES[Elasticsearch]
IP[Ingest Pipeline]
D -->|Schema A| EA -->|Schema B| IP -->|Schema C| ES -->|Schema D| Query;
Schema C is the one that is defined in integration packages in the
Schema B generation
Schema B is always in JSON format and is the output generated by Elastic Agent. It contains the meta information about the event itself like host or k8s metadata. As schema B is in JSON format, shipping it to Elasticsearch could in theory be done with a curl request that takes the JSON doc as the request body. When it is sent to the correct data stream, processing also happens and the data is persisted. The elastic-integration-corpus-generator-tool has ways to generate data based on some config options and templates. In elastic/elastic-integration-corpus-generator-tool#39 multiple approaches for different templating are discussed. What all have in common are:
What I skipped above is the fields definition for Elasticsearch, which is contained in the tool but is not needed in the context of packages, as this is Schema C, which is already defined as part of the package. What is needed in addition is a configuration file for the data generator to decide how much data should be generated, the time range etc. In the tool this is currently done through command line parameters. The assumption is that for a single dataset in an integration package, different scenarios could be generated. Let's take package
In the example above, 2 templates each with a config file are used. The The
data_generation:
  - name: short-sample
    timerange: 2d
    events: 1000
    template: template1
    # See spigot for more options https://github.com/leehinman/spigot
    output: elasticsearch
  - name: middle-sample
    timerange: 10d
    events: 10000
    template: template1
    output: elasticsearch
  - name: large-sample
    timerange: 2d
    events: 1000000
    template: template2
    output: rally
More config options could be added. The goal is to show that multiple data generations can be configured. Having all the setup done, elastic-package can be used to generate the data:
The parameters are optional. If the command is run inside a package, it would apply to all datasets and all tasks by default, or one can be selected. As can be seen in the above example, an output format can also be specified. The data can be stored in rally track format or sent to Elasticsearch directly. Behind the scenes, elastic-integration-corpus-generator-tool is used to generate the events out of the templates and spigot to generate the relevant outputs.
Schema A generation
The generation of schema A would look very similar. But to ship schema A to Elasticsearch, a running Elastic Agent is needed. Similar to schema B, package-spec could contain config options on how to generate it. In addition, it would require some logic around how schema A is collected in order to run an Elastic Agent that collects it. There is a good chance that all these configs already exist in the data stream and can be used. Generation of schema B I see as only second priority of this project.
Rally track generation
One of the goals of the generation of In an ideal scenario, a user could run |
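As an illustration of how a `data_generation` task like the ones in this comment could be interpreted, here is a small Python sketch. It assumes, and this is only an assumption, that `timerange` is a number followed by a unit (`s`, `m`, `h`, `d`) and that `events` documents are spread evenly over that window ending at a given point in time. The function names are hypothetical and not part of elastic-package or the corpus generator.

```python
from datetime import datetime, timedelta, timezone
from typing import List

# Assumed unit suffixes; the proposal only shows values such as "2d" and "10d".
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_timerange(value: str) -> timedelta:
    """Turn a string like '2d' into a timedelta."""
    number, unit = value[:-1], value[-1:]
    if unit not in _UNITS or not number.isdigit():
        raise ValueError(f"unsupported timerange: {value!r}")
    return timedelta(**{_UNITS[unit]: int(number)})


def event_timestamps(timerange: str, events: int, end: datetime) -> List[datetime]:
    """Return `events` evenly spaced timestamps in the window [end - timerange, end)."""
    if events <= 0:
        raise ValueError("events must be positive")
    window = parse_timerange(timerange)
    start = end - window
    step = window / events
    return [start + step * i for i in range(events)]


# Hypothetical usage with the "short-sample" task from the config.
task = {"name": "short-sample", "timerange": "2d", "events": 1000}
stamps = event_timestamps(task["timerange"], task["events"],
                          end=datetime(2023, 1, 3, tzinfo=timezone.utc))
```

Even spacing is the simplest possible choice; a real generator would probably want jitter or bursts, which is exactly the kind of option the config file could carry.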
Some thoughts on the benchmark topics, but don't want to sidetrack the discussion, maybe we can sync offline.
It would be nice to have |
Along these lines, I drafted this for my own reference, so it is not intended to be complete but just a bigger picture of how this would work for cases such as the Schema A scenarios mentioned above. So ideally we would have everything required to run benchmarks self-contained in elastic-package and as part of the integration definitions, and esbench is going to be more a description for a more permanent benchmark setup, as it is today for other cases if I am not mistaken. |
Following up on what @marc-gr wrote, we can also consider that data generation and data usage may not necessarily happen consecutively, and keep this decoupling in mind. The benefit would be that generator tools could generate and store data in a generic storage, and tools that leverage that data could "replay" it without the generation step, which may be compute intensive (thus either being bounded by compute resources or requiring extensive resources to run at the desired scale). (As discussed with @ruflin, some tools, like https://github.com/elastic/rally, already support loading from S3). |
++, we should not only consider it but make sure it is decoupled. I expect by default when data generation is used, the data is written to disk (in some format, maybe rally format?). That doesn't mean there can't eventually be commands that bring it all together in one flow. |
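A rough sketch of what the "replay" half could look like once the data is on disk, assuming the stored file has one JSON document per line. It only builds request bodies for the Elasticsearch `_bulk` API and does not send anything; `bulk_bodies` is a hypothetical helper name. The `create` action is used because data streams only accept `create` operations in bulk requests.

```python
import json
from typing import Iterable, Iterator, List


def bulk_bodies(lines: Iterable[str], batch_size: int = 500) -> Iterator[str]:
    """Group stored NDJSON documents into `_bulk` request bodies."""
    action = json.dumps({"create": {}})
    batch: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the stored file
        json.loads(line)  # fail early on a corrupt document
        batch.append(action)
        batch.append(line)
        if len(batch) >= 2 * batch_size:
            # A bulk body must end with a newline.
            yield "\n".join(batch) + "\n"
            batch = []
    if batch:
        yield "\n".join(batch) + "\n"
```

Each yielded string could then be POSTed to `/<data-stream>/_bulk` with the `application/x-ndjson` content type, by whatever client the replay tool uses.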
Status update:
Next steps (priority order to be defined):
|
Thank you @aspacca ! I have an additional question about |
hi @susan-shu-c
unless you need to generate a lot of data with random content, there's probably no need to make use of on the broader scope of having an I'm not familiar with security integrations: I see for example that |
cc @leehinman |
We are still in the first iteration, with the same goals. Progress has been slow in the last week due to SDH duties. Week 6 (Feb 6-10): Week 7 (Feb 13-17):
Week still in progress, we expect to open a new PR with |
Week 9 (Feb 27 - Mar 2): New PRs: Merged PRs:
|
First iteration is still ongoing. Progress has been made in Week 10 (Mar 6 - 10): Merged:
|
Today we conclude the first iteration. All planned templates are available. Week 12 (Mar 20 - 24): Merged: Releases: v0.5.0 Related: |
2023-06-15 Update on the project
1st iteration, Creating templates (complete)
Done
During the first iteration we created schema B templates for
2nd iteration, Adding commands to elastic-package (in progress)
Done
Library refactoring
In progress
Adding benchmark rally command
Next (dependency with the previous one)
|
Regarding The package-spec PR has been merged, we are now waiting for a new release, before doing it we have this pending PR (under review) |
All the PR dependencies have been merged, @aspacca could you launch the next steps please ? |
@aspacca Could we share an update of where we are today and what are the next and remaining steps please ? |
2023-09-25 Update on the project
2nd iteration
In progress
|
2023-10-12 Update on the project
2nd iteration
In progress
|
@aspacca Can you share some more details on what part of v3 this is blocked on? Anything we can do on our end to get this unblocked?
Will the schemas change in some ways when moving to elastic-package? Context for why I'm asking: After moving 1-2 datasets as examples, could the teams themselves move the assets over? |
we dropped a deprecated field and renamed the new one, taking the occasion of the breaking change in v3 (no one but for
no, they won't change, but of reviewing flattened objects notation that I personally forgot to consider when creating the schemas in the first place. |
2023-10-17 Update on the project
2nd iteration
In progress
|
2023-10-24 Update on the project
2nd iteration
In progress
|
2023-10-31 Update on the project
2nd iteration
In progress
|
2023-11-07 Update on the project
2nd iteration
In progress
|
2023-11-14 Update on the project
2nd iteration
3rd iteration
In progress
|
2023-11-23 Update on the project
2nd iteration
3rd iteration
In progress
|
2023-11-24 Update on the project
2nd iteration
3rd iteration
In progress
|
2023-11-30 Update on the project
2nd iteration
3rd iteration
In progress
|
2023-12-04 Update on the project
2nd iteration
3rd iteration
In progress
|
2024-01-09 Update on the project
3rd iteration
In progress
|
2024-01-16 Update on the project
3rd iteration
In progress
|
2023-02-13: Final update on the issue.
integrations repo
The assets (templates,
Continuous refinement is ongoing on some existing assets, and new assets for new datasets are continuously added.
|
When building integration packages, sample data is important to develop ingest pipelines and build dashboards. Unfortunately in most cases, real sample data is limited and often tricky to produce. This issue proposes a tool as part of elastic-package that can generate and load sample data.
Important: The following is only an initial proposal to better explain the problem and share existing ideas. A proper design is still required.
Why part of elastic-package
Generating sample data is not a new problem and there are several tools which already provide partial solutions to it. A tool to generate sample data in elastic-package is needed to make it available in a simple way to every package developer. How sample data should look and be generated becomes part of the package spec. This way, someone building a package also gets the possibility of generating sample data and using it as part of the developer experience.
Data generation - metrics / logs
For the data generation, two different types of data exist. Metrics and traces are mostly already in the format that will be ingested into Elasticsearch and require very little processing. Logs on the other hand often come as raw messages and require ingest pipelines or runtime fields to structure the data. The goal is that the tool can generate both types of data but it can happen in iterations.
Metrics generation
For the generation of metrics, I suggest taking strong inspiration from the elastic-integration-corpus-generator-tool built by @aspacca. Instead of having to build separate config files, the config params for each field would be directly in the fields.yml of each data stream. The definition could look similar to the following:
The exact syntax for each field type needs definition.
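To give a feel for what per-field config params could drive, here is a Python sketch. The parameter names `cardinality`, `min`, `max` and `fuzziness` are assumptions made for this example and are not the syntax of fields.yml or of the corpus generator. Here `cardinality` caps the number of distinct keyword values, and `fuzziness` limits how far a numeric value may move between two consecutive documents, as a fraction of its range.

```python
import random
from typing import Callable, Dict, List


def keyword_generator(name: str, cardinality: int, rng: random.Random) -> Callable[[], str]:
    """Pick from a fixed pool so at most `cardinality` distinct values appear."""
    values = [f"{name}-{i}" for i in range(cardinality)]
    return lambda: rng.choice(values)


def long_generator(minimum: int, maximum: int, fuzziness: float,
                   rng: random.Random) -> Callable[[], int]:
    """Random walk inside [minimum, maximum] with a bounded step size."""
    state = {"value": rng.randint(minimum, maximum)}
    span = int((maximum - minimum) * fuzziness)

    def next_value() -> int:
        delta = rng.randint(-span, span)
        state["value"] = max(minimum, min(maximum, state["value"] + delta))
        return state["value"]

    return next_value


def generate(field_config: Dict[str, dict], count: int, seed: int = 0) -> List[dict]:
    """Generate `count` documents from a hypothetical per-field configuration."""
    rng = random.Random(seed)
    generators = {}
    for name, cfg in field_config.items():
        if cfg["type"] == "keyword":
            generators[name] = keyword_generator(name, cfg["cardinality"], rng)
        elif cfg["type"] == "long":
            generators[name] = long_generator(cfg["min"], cfg["max"], cfg["fuzziness"], rng)
        else:
            raise ValueError(f"unsupported field type: {cfg['type']}")
    return [{name: gen() for name, gen in generators.items()} for _ in range(count)]
```

A fixed seed keeps the output reproducible, which helps when the same sample data should back a dashboard screenshot or a test.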
Logs generation
For logs generation, inspiration can be taken from the tool spigot by @leehinman. Ideally we could simplify this by allowing users to specify the message patterns, something like {@timestamp} {source.ip}, and then specify for these fields what the values should be. Then the tool would take over the generation of sample logs.
It is important that the log generation outputs message fields as they are before the ingest pipeline.
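A minimal Python sketch of the pattern idea, assuming placeholders are field names wrapped in curly braces. `render` and the generator functions are hypothetical and unrelated to how spigot works internally. A regular expression is used instead of `str.format` because field names such as `@timestamp` and `source.ip` are not valid format-string fields.

```python
import ipaddress
import random
import re
from datetime import datetime, timezone
from typing import Callable, Dict

_PLACEHOLDER = re.compile(r"\{([^{}]+)\}")


def render(pattern: str, generators: Dict[str, Callable[[], object]]) -> str:
    """Replace every {field} placeholder with a freshly generated value."""
    def replace(match: "re.Match") -> str:
        field = match.group(1)
        if field not in generators:
            raise KeyError(f"no generator configured for field {field!r}")
        return str(generators[field]())
    return _PLACEHOLDER.sub(replace, pattern)


rng = random.Random(42)  # seeded so runs are reproducible

generators = {
    "@timestamp": lambda: datetime.now(timezone.utc).isoformat(),
    "source.ip": lambda: str(ipaddress.IPv4Address(rng.getrandbits(32))),
}

# The generated document only carries the raw message, i.e. pre ingest pipeline.
event = {"message": render("{@timestamp} {source.ip}", generators)}
```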
Generated data format
The proposed data structure generated by the tool is the one used by esrally. It contains 1 JSON doc per line with all the fields inside. This makes it simple to just deliver the data to Elasticsearch and makes it possible to potentially reuse some of this generated data with rally tracks.
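For reference, the "1 JSON doc per line" layout can be written and read back with very little code. The Python sketch below is only an illustration; the function names are made up for this example and the sample fields are arbitrary.

```python
import io
import json
from typing import Iterable, Iterator, TextIO


def write_ndjson(docs: Iterable[dict], stream: TextIO) -> int:
    """Write one compact JSON document per line and return the number written."""
    count = 0
    for doc in docs:
        # json.dumps escapes newlines inside strings, so each doc stays on one line.
        stream.write(json.dumps(doc, separators=(",", ":")) + "\n")
        count += 1
    return count


def read_ndjson(stream: TextIO) -> Iterator[dict]:
    """Yield documents back from a one-JSON-doc-per-line stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Round trip through an in-memory buffer; a real run would use a file on disk.
buffer = io.StringIO()
write_ndjson([{"@timestamp": "2023-01-01T00:00:00Z", "message": "line one\nline two"}], buffer)
buffer.seek(0)
documents = list(read_ndjson(buffer))
```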
Non goals
A non goal of the data generation and loading of data is to replace rally. Rally measures the exact performance and builds reproducible benchmarks. Generating and loading data with elastic-package is about testing ingest pipelines, testing dashboards and testing queries on larger sets of data in an easy way. The focus is on package development.
Another non goal is to generate events that are related to each other. For some solutions it is important that if a host.name shows up other parts of the data contain the same host.name to be able to browse through the solution. This might be added at a later point but is not part of the scope.
Sample data storage
As the sample data can always be generated on the fly, it is not required to store it. If some of the sample data sets should be stored for later use, package-spec should provide a schema to reference sample datasets.
Command line
Command line arguments must be available to generate sample data for a dataset or a package and load it into Elasticsearch. Ideally package-spec allows storing config files that describe which data sets can be generated, so a package developer can share these configs as part of the package.
Initial packages to start with
I recommend picking 2 initial packages to start with around the data generation. As k8s and AWS are both more complex packages that also generate lots of data, they could be a good start, focusing on the metrics part.
Future ideas