Add "data" asset type #37

Open
ruflin opened this issue Jun 22, 2020 · 18 comments

@ruflin
Contributor

ruflin commented Jun 22, 2020

Add support for packages to include data.

It still needs to be defined in which format this data is stored in the package.

@mtojek
Contributor

mtojek commented Jun 23, 2020

The description is too vague. Could you please add more details and a use case for such assets?

@ruflin
Contributor Author

ruflin commented Jun 24, 2020

We currently have 3 proposed use cases:

  • Install data for lookups
  • Geo data for maps
  • Example data

The issue will be filled in with more details as soon as we pick it up. At the moment it's more of a placeholder.

@P1llus
Member

P1llus commented Jun 25, 2020

Adding the details and requirements, at least from my personal perspective, when it comes to allowing modules to include enrichment policies and the data to be used for the enrichment indices.

First let's describe exactly what the use case is and why it can make a big impact:

Current behaviour:
Each module and its related dataset(s) include at least one pipeline. This pipeline is used to parse, map and normalise the ingested events.
After normalisation is done, we usually set all our ECS fields based on certain logic. For example: if field X = Y, then set event.action = Z.

This creates certain issues and restrictions on how much we can actually enrich incoming events, because if the logic is simple we add a simple "if" condition to a pipeline, and if it's complex we use a script processor. This requires frequent script compilations, the scripts need to be cached, and it has a fair impact on performance in smaller scenarios and a larger impact in bigger implementations.

Example condition:

  - set:
      if: "ctx._temp_.cisco.message_id == '338301'"
      field: "client.port"
      value: "{{destination.port}}"

Example script (lines 503 to 926 could be removed):
https://github.com/elastic/beats/blob/bcce25750afc2f205abd8cfbaf810fe8389e2c62/x-pack/filebeat/module/cisco/shared/ingest/asa-ftd-pipeline.yml#L503

Now, if I were a regular user who parses, maps and normalises the events myself, using an enrich index makes much more sense: it removes all the logic needed and drills it down to just this (as an example):

PUT /_enrich/policy/users-policy
{
    "match": {
        "indices": "cisco-module",
        "match_field": "id",
        "enrich_fields": ["ecs.type", "event.category", "event.type"]
    }
}
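
With such a policy in place, the pipeline side shrinks to a single enrich processor instead of per-message-ID conditions. A minimal sketch, assuming the users-policy example above has already been created and executed (the field and target names are only placeholders):

  - enrich:
      policy_name: users-policy
      field: id
      target_field: enrichment
      max_matches: 1

The matched enrich fields (ecs.type, event.category, event.type) would then land under the target field and could be copied or renamed into place with ordinary processors.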

Impact of enrichment index support in modules:
If we decide to allow a module itself to include a dataset containing the keys and values we want to use for enrichment, we would be able to remove a very large amount of logic from all modules that currently require it. It also has many other benefits:

  • All modules across all Beats that would benefit from it would manage ECS mappings the exact same way, making it much less of a strain for engineers to keep them updated.
  • Customers sometimes modify pipelines themselves; this would allow a fair amount of our updates, especially ECS-related ones, to never touch the pipeline.
  • It could allow ECS mappings to be updated in the future as telemetry, not just through package updates.
  • Script compilations are significantly reduced.
  • Customers can reuse the methodology, allowing them to add custom mappings if needed.
  • Running a large number of different modules/filesets will put much less strain on the stack.
  • Enrichment can either be unified between modules by reusing the same mapping, or kept module specific.

What would an implementation look like?
In addition to the current Filebeat (as an example) directory structure, it should also include a folder for enrichment. This folder could contain multiple datasets, a config file that defines which index should be created for each dataset, and a config file for the enrichment policies to be applied. A rough sketch of such a layout follows below.
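
As a purely illustrative sketch (none of these folder or file names are decided, they are just assumptions to make the idea concrete), the layout could look something like:

cisco/
  dataset/
    asa/
      elasticsearch/
        ingest-pipeline/
          default.yml
      enrichment/
        data/
          message-id-mappings.ndjson   # documents to index into the temporary enrichment index
        config.yml                     # target index name + enrich policy definition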

Installation
When a package is installed through Ingest Management, the same process as today will be followed, but with three added steps (see the sketch after this list):

  1. Iterate over the datasets and create a temporary index for each dataset.
  2. Create the enrichment policies defined by the module.
  3. Execute each policy and then delete the temporary indices, as they are not needed anymore.
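
In terms of concrete Elasticsearch calls, those added steps might look roughly like this (all index, policy and field names are invented for illustration only):

PUT /cisco-asa-enrich-source
{
    "mappings": { "properties": { "id": { "type": "keyword" } } }
}

POST /cisco-asa-enrich-source/_bulk
{ "index": {} }
{ "id": "338301", "event.category": ["network"], "event.type": ["connection"] }

PUT /_enrich/policy/cisco-asa-policy
{
    "match": {
        "indices": "cisco-asa-enrich-source",
        "match_field": "id",
        "enrich_fields": ["event.category", "event.type"]
    }
}

POST /_enrich/policy/cisco-asa-policy/_execute

DELETE /cisco-asa-enrich-source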

Package update
When a package is updated, the enrichment index, policy and datasets should be deleted and recreated.

Package deletion
All related enrichment policies and indices should be removed as well.

Hopefully this makes sense, and I'm looking forward to hearing any thoughts!

@leehinman
Contributor

Use case

In AWS CloudTrail we have the situation where one of the log fields allows us to set three of the ECS categorization fields. For example: a CloudTrail eventName (which gets mapped to event.action) of "AddUserToGroup" would tell us to set event.kind to "event", event.category to ["iam"], and event.type to ["change", "group"].

This pattern has resulted in the following script processor in the ingest pipeline, which happens to look an awful lot like an enrichment:

  - script:
      lang: painless
      ignore_failure: true
      params:
        AddUserToGroup:
          category:
            - iam
          kind: event
          type:
            - group
            - change
        .
        .
        .
        UpdateUser:
          category:
            - iam
          kind: event
          type:
            - user
            - change			
      source: >-
        def hm = new HashMap(params.get(ctx.event.action));
        hm.forEach((k, v) -> ctx.event[k] = v);

AWS CloudTrail has thousands of eventNames, so it would be nice to replace this script processor with an enrich processor. This would help separate the data from the code and make it easier for users to update the mappings in the field.

pipeline

- enrich:
    policy_name: cloudtrail-eventname-policy
    field: event.action
    target_field: event
    max_matches: 1

policy

- match:
    indices: cloudtrail-eventname
    match_field: action
    enrich_fields:
      - category
      - kind
      - type

doc

action: UpdateUser
category:
  - iam
kind: event
type:
  - user
  - change			

@stacey-gammon

Hope this is the right place to chime in with a couple of questions and ideas.

  • Will developers external to Elastic be able to add their own packages? Right now, the only way seems to be to add code directly to the package-storage repo. If we ever get to a place where we have a marketplace of Kibana plugins, these Kibana plugins could include data ingestion packages, or sample data. Would this flow be supported?

  • Would this support a use case where I want to poll a third-party API to ingest data? I'm currently playing around with a side project of writing a Kibana plugin that a user can set up to ingest data from the Twitter API. I'm planning on using the Task Manager to periodically poll for new information. Is "ingest data management" the right place for something like that? Or would it be easier to just do what I plan on doing and write a custom plugin using Task Manager?

Other ideas I've had for data ingestion:

  • Data ingestion that creates massive amounts of data for performance testing. Give it configuration options like: number of fields, number of indices, ingestion time interval/rate.

  • Easy ways to ingest data from other public datasets, like one-click installations for all of these datasets:

[Screenshot: a list of public datasets that can be explored via Google BigQuery]

That link lets you use Google BigQuery to explore any of those datasets. It'd be super cool if we had a public hosted Kibana/Elasticsearch instance that let users play around with these datasets.

I guess my question is: is the "ingest data manager" project headed towards something like that? Or, if we ever wanted to have one-click installs, should they be separate Kibana plugins exposed in a marketplace?

@ruflin
Contributor Author

ruflin commented Jul 1, 2020

Thanks everyone for chiming in. This is VERY useful. We will be looking into adding data in a few weeks, as the focus is on 7.9 at the moment.

@stacey-gammon For your first use case: Yes. For the third-party API: Did not think of this use case yet. Could be interesting, but would it also mean shipping code as part of the package?

@ruflin
Contributor Author

ruflin commented Aug 12, 2020

I finally got back to this issue and I have a few follow-up questions:

@P1llus

  • What would be the name of the indices? Would they follow the indexing strategy naming or something else?
  • Upgrade path: Everything is wiped on upgrade, if I understand this correctly?
  • Will multiple packages use the same enrichment data? Would it be a problem if it is per package and perhaps some of it is duplicated? This removes many version challenges.
  • We need support for enrichment policies in any case. Is the enrichment policy per dataset or global per package?
  • Can you dig into a bit more detail around this temp index and why it is removed again afterwards?

@leehinman Is your use case also covered by what @P1llus describes above?

@stacey-gammon

  • For polling external data sources, packages are probably not the right place (at the moment); a separate plugin is likely the better fit.
  • Data ingestion performance testing: Interesting idea. I think packages could serve the content itself, but specifying ingestion rate and measuring impact is out of scope; that sounds more like Rally.
  • Public datasets: This is super interesting. We could take these datasets, have a script convert them into Elasticsearch-compatible data, and then serve them in the registry as packages. That means everyone could pull this data down into their own cluster and play with it.

@ruflin
Contributor Author

ruflin commented Aug 12, 2020

@ycombinator For awareness: data + enrich are on the radar and will potentially need to be added to the package-spec if we move forward.

@P1llus
Member

P1llus commented Aug 12, 2020

@ruflin
What would be the name of the indices? Would they follow the indexing strategy naming or something else?
For the naming convention I would most likely point to @leehinman, as I am still getting familiar with whether we follow the same pattern in packages as we do for standalone Filebeat modules.
My opinion is that it should follow the same/similar naming convention we already have, which is something like BEATNAME-VERSION-DATASET-ENRICHPOLICYNAME. That way we can ensure it supports multiple versions at the same time, I believe.
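Purely as an illustration of that pattern (not a decided name), an enrich policy for a Cisco ASA dataset could then end up being called something like:

filebeat-7.9.0-cisco.asa-message-id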

Upgrade path: Everything is wiped on upgrade if I understand this correct
Yes. Enrichment policies are static and read-only, so once created they can never be changed. That's why, when a package is updated, everything should always be recreated; it is also deliberately not modifiable by end users.

Will multiple packages use the same enrichment data? Would it be a problem if it is per package and perhaps some of it is duplicated? This removes many version challenges.
Multiple packages should not share the same enrichment policy. The reason is that it can become cumbersome to manage and makes the enrichment indices larger. Since they are only a few KB in size, it should not be a problem to keep them separate. That also goes for datasets in the same package: they should not be able to see each other, because then the naming convention breaks down.

We need support for enrichment policies in any case. Is the enrichment policy per dataset or global per package?
I would recommend having the enrichment policy definition per dataset. Currently I am not aware of any use cases specific to global-per-package, but if anyone is more in favor of that, I would hope we could still make it possible to do either/or.
One niche use case that comes to mind is Zeek, since it has a massive number of smaller datasets and would certainly benefit from one global policy. But in almost all other cases that won't be relevant.

Can you dig into a bit more detail around this temp index and why it is removed again afterwards?
To explain the temp index, I think it's better to walk through the workflow of creating an enrichment policy, as it makes more sense that way. Everything is documented here: https://www.elastic.co/guide/en/elasticsearch/reference/current/geo-match-enrich-policy-type.html

When you want to create an enrichment index and policy, these would be the steps:

  1. Create a normal index just like any other, using the create index API.
  2. Attach a mapping to that index for the data we want to put into it.
  3. Index the documents we want to use for enrichment.
  4. Run the put enrich policy API; the body of the request includes the name of the policy you want to create plus the name of the index created in steps 1 and 2.
  5. Run the execute enrich policy API, referencing the newly created policy name.
    When executing the enrich policy, Elasticsearch takes a snapshot of all the current data in the index created in steps 1-3 and creates a duplicate read-only enrich index. This new enrich index cannot be modified itself, but if we add more data to the original index from steps 1-3 and re-run the execute enrich policy API, it creates a new enrich index with the updated data and deletes the old one.

Hopefully this makes sense. As you might notice, after we have executed steps 4-5 we never need the initial "temp" index again, as we won't be updating it anytime soon; it would just sit on the cluster until it has to be recreated during an upgrade anyway. Deleting it also makes things easier, since we don't need to deal with things like ILM for these indices.
Upon package installation/upgrade it should run through all the steps each time, to ensure nothing is left behind from previous runs; this also stops the user from modifying the original index.

@leehinman
Contributor

  • What would be the name of the indices? Would they follow the indexing strategy naming or something else?

I think it should be named in a similar manner to the ingest pipeline for the dataset.

  • Upgrade path: Everything is wiped on upgrade if I understand this correct

Yes

  • Will multiple packages use the same enrichment data? Would it be a problem if it is per package and perhaps some of it is duplicated? This removes many version challenges.
  • We need support for enrichment policies in any case. Is the enrichment policy per dataset or global per package?

I think it is OK to start with enrichment data & policies per dataset. I don't know yet whether we will have more cases where we want the policy/data to be shared versus independent per dataset.

  • Can you dig into a bit more detail around this temp index and why it is removed again afterwards?

@leehinman Is your use case also covered by what @P1llus describes above?

Yes

@ruflin
Contributor Author

ruflin commented Aug 18, 2020

I'm strongly in favor of having everything per dataset, as I think it simplifies things. As we are currently also discussing some other assets to be added to packages, I started a checklist of what needs to be done / discussed: #27

@leehinman @P1llus To keep this moving, I'm wondering if you would have the time / possibility to move this forward? My proposal would be similar to elastic/kibana#75153 (comment)

ruflin transferred this issue from elastic/package-registry on Aug 18, 2020
@ruflin
Contributor Author

ruflin commented Aug 18, 2020

I moved this to the package-spec repo as I think it fits better here. This should not change the conversation.

@P1llus
Member

P1llus commented Aug 20, 2020

Is there anything more you would need from me @ruflin ? I am happy with the current comments from the others and the current state as long as we all agree on this topic.

I didn't notice the notification; when you moved the issue to another repo, it disappeared from my notifications.

@ruflin
Contributor Author

ruflin commented Aug 20, 2020

The question now is, who drives this forward :-) @ph FYI

@P1llus Did not know that moving it to a different repo kills the notifications :-( Sorry about that.

@ph

ph commented Aug 21, 2020

@P1llus @leehinman, as @ruflin asked, will you be driving this on your side?

@ruflin
Contributor Author

ruflin commented Aug 24, 2020

Happy to coordinate all the efforts, but it would be great if @P1llus @leehinman could figure out who does the work, especially the part on the Kibana side.

@webmat
Contributor

webmat commented Aug 25, 2020

Here's another use case for having the ability to have data as part of a package.

I would like to be able to create a documentation/tutorial package for ECS. This package could contain assets that are meant for exploring what's in ECS. One part of this would be an index that contains all of the ECS fields, their definitions, datatypes and other details.

This way users could explore ECS or answer questions such as:

  • Display only numeric fields, wildcard fields, or fields that have multi-fields.
  • Search for a specific word in any of the descriptions.

This could also come with a dashboard to get started exploring.

This would essentially be a more polished version of what I showed in last holiday season's ECS Advent blog post :-)

The "data" part of the package would simply be all ECS field definitions of a given ECS release.

@pmusa

pmusa commented Sep 4, 2020

The training team is really interested in this effort. It is currently cumbersome to add a "demo" dataset for users to play with. It would be great if we could generate packages that anyone can consume to play and learn new features.
