
[Fleet] Add usage telemetry for package policy upgrade conflicts #109870

Closed
Tracked by #106048 ...
joshdover opened this issue Aug 24, 2021 · 14 comments
Assignees
Labels
enhancement (New value added to drive a business result) · Team:Fleet (Team label for Observability Data Collection Fleet team) · telemetry (Issues related to the addition of telemetry to a feature) · v7.15.0

Comments

@joshdover
Contributor

joshdover commented Aug 24, 2021

In #106048 we're adding the ability to upgrade package policies, both manually and automatically when possible. During some package policy upgrades, users will be required to take manual action when a package's inputs change in a way that makes the updated package policy fail validation. This can happen due to changes like:

  • An input field name was changed
  • An input field type was changed
  • A new required input field was added

In some of these scenarios, there are additional enhancements we may want to consider to eliminate these conflict scenarios and make package upgrades as seamless as possible for users. In order to know where to focus our efforts, we should collect telemetry on package policy upgrades so we can answer questions like:

  • What % of package policy upgrades require manual user intervention due to conflicts?
  • What types of conflicts are most common?
  • Which packages are users encountering conflicts on most often?
  • How often are users upgrading package policies?

Related:

@joshdover added the Team:Fleet and telemetry labels Aug 24, 2021
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@mostlyjason
Contributor

This is great, thanks for filing this!

@joshdover added the enhancement label Sep 7, 2021
@jen-huang changed the title from "Add usage telemetry for package policy upgrade conflicts" to "[Fleet] Add usage telemetry for package policy upgrade conflicts" Sep 29, 2021
@juliaElastic self-assigned this Oct 13, 2021
@juliaElastic
Contributor

juliaElastic commented Oct 14, 2021

I came up with the following format to add to a new custom collector under the Fleet usage collectors.
A few considerations:

  • The plan is to store these objects as saved objects until the collector is invoked.

  • Generate the ID of the objects as name+current_version+new_version+status to avoid duplicates, even if the dry run/update is called multiple times (a rough sketch of this follows the example below).

  • After the collector runs, the saved objects should be cleared to avoid duplication.

  • Capture error messages during the dry run of the package policy upgrade.

  • Auto upgrades and manual upgrades might need to be distinguished in telemetry (this might only be determinable from the UI).

{ "package_policy_upgrades": [
  {
    "package_name": "apache",
    "current_version": "0.3.3",
    "new_version": "1.1.1",
    "status": "success"
  },
  {
    "package_name": "aws",
    "current_version": "0.3.3",
    "new_version": "1.1.1",
    "status": "failure",
    "error": [{
      "key":"inputs.cloudtrail-aws-s3.streams.aws.cloudtrail.vars.queue_url",
      "message":["Queue URL is required"]
      }]
  }
]}
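For illustration, the deduplication idea could look roughly like this sketch (the saved object type name and helper are placeholders, not actual Fleet code); the document ID is derived from name+current_version+new_version+status, so repeated dry runs overwrite the same record instead of creating duplicates:

// Hypothetical sketch - the saved object type name and helper are illustrative only.
import type { SavedObjectsClientContract } from 'kibana/server';

interface UpgradeUsageRecord {
  package_name: string;
  current_version: string;
  new_version: string;
  status: 'success' | 'failure';
  error?: Array<{ key?: string; message: string[] }>;
}

// Assumed saved object type used to buffer upgrade results until the collector runs.
const UPGRADE_USAGE_TYPE = 'fleet-package-policy-upgrade-usage';

export async function recordUpgradeResult(
  soClient: SavedObjectsClientContract,
  record: UpgradeUsageRecord
): Promise<void> {
  // Deterministic ID: the same package/version/status combination always maps to the
  // same document, so calling the dry run multiple times does not create duplicates.
  const id = [
    record.package_name,
    record.current_version,
    record.new_version,
    record.status,
  ].join(':');

  await soClient.create(UPGRADE_USAGE_TYPE, record, { id, overwrite: true });
}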

@juliaElastic
Contributor

juliaElastic commented Oct 15, 2021

There are 2 ways to add upgrade telemetry:

  • Add a new usage collector to be included in Kibana telemetry
  • Create an event-based service in Fleet to directly publish upgrade telemetry to the new telemetry cluster

Pros and cons of the event-based service:
Pros:

  • No need to store telemetry in saved objects until the collector is invoked
  • Send events as they happen, rather than waiting for the collector
  • Simpler data model

Cons:

  • Create, test, and maintain an event-based service (Security Solution has already done this)
  • The Telemetry team might implement their own event-based service (not likely anytime soon)
  • Maintain an additional index for upgrade telemetry

Technically, for each data type we could create a new channel, indexer, and job in the new telemetry cluster.
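To make the event-based option concrete, a minimal sender could look roughly like the sketch below; the channel endpoint, payload shape, and names are assumptions for illustration, not the shipped implementation:

// Illustrative only - the channel endpoint and payload shape are assumptions.
import fetch from 'node-fetch';

// Assumed dedicated channel for Fleet upgrade events on the telemetry cluster.
const FLEET_UPGRADES_CHANNEL_URL =
  'https://telemetry-staging.elastic.co/v3/send/fleet-upgrades';

interface PackagePolicyUpgradeEvent {
  package_name: string;
  current_version: string;
  new_version: string;
  status: 'success' | 'failure';
  error?: Array<{ key?: string; message: string[] }>;
}

// Publish a single upgrade event as it happens, instead of buffering it in
// saved objects and waiting for a usage collector to run.
export async function sendUpgradeEvent(event: PackagePolicyUpgradeEvent): Promise<void> {
  await fetch(FLEET_UPGRADES_CHANNEL_URL, {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify(event),
  });
}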

@joshdover
Contributor Author

Telemetry is challenging with our release model because any bugs in the collection process cannot be fixed for long periods of time. Processing events on the ingest side also feels much simpler when we're trying to answer questions like "how often does X happen?". Basic counts are simple enough to do with a usage collector, but it breaks down quickly if you want to segment on any property (e.g. package name, package version, user role, etc.).

Events naturally give us this with a very simple collection mechanism that is unlikely to have bugs, plus the flexibility to massage the data afterwards (if needed; often it won't be). IMO experimenting with an event approach could be well worth the effort and give us deeper insight into how users are using our application at a lower maintenance cost.

I lean towards writing a very simple event sender, largely based on the one that Security Solution has already built.

For reference, Security Solution's implementation lives here:

@joshdover
Contributor Author

How would this scale for different types of events?

I think we'll want some guardrails in the initial implementation to be sure we don't send too much data that could:

  • Overwhelm the network
  • Consume too much of the customer's bandwidth
  • Overwhelm the Telemetry API or pipelines

I think a cap on payload size and periodic batching would be adequate for this purpose.
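As a rough sketch of such guardrails (the class, caps, and interval below are made-up values for illustration, not an agreed design), a batching queue could cap both the number of buffered events and the serialized payload size, and flush on a timer:

// Illustrative guardrails - the caps and interval are placeholder values.
const MAX_QUEUED_EVENTS = 100;
const MAX_PAYLOAD_BYTES = 512 * 1024; // ~512 KiB per batch
const FLUSH_INTERVAL_MS = 60_000;

export class BatchedEventQueue<T> {
  private events: T[] = [];
  private timer?: NodeJS.Timeout;

  constructor(private readonly send: (batch: T[]) => Promise<void>) {}

  public start(): void {
    // Periodic batching: events are flushed on an interval, not one request per event.
    this.timer = setInterval(() => void this.flush(), FLUSH_INTERVAL_MS);
  }

  public stop(): void {
    if (this.timer) clearInterval(this.timer);
  }

  public enqueue(event: T): void {
    // Drop events beyond the cap rather than growing memory and bandwidth unbounded.
    if (this.events.length < MAX_QUEUED_EVENTS) {
      this.events.push(event);
    }
  }

  private async flush(): Promise<void> {
    if (this.events.length === 0) return;
    const batch = this.events.splice(0, this.events.length);
    // Enforce a payload-size budget so a single batch cannot overwhelm the network
    // or the Telemetry API; oversized batches are simply dropped in this sketch.
    if (Buffer.byteLength(JSON.stringify(batch), 'utf8') > MAX_PAYLOAD_BYTES) return;
    await this.send(batch);
  }
}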

@juliaElastic
Contributor

juliaElastic commented Oct 18, 2021

An update on this:
I learned that the Kibana Stack team is also working on creating a new index for Kibana telemetry; it is in progress and would be called something like the kibana-snapshot channel. #113525
I think it might make sense to use that for the existing Fleet telemetry in Kibana (if adding Fleet fields to that mapping is recommended).

As for the upgrade telemetry, I got it working locally both with a collector and by sending directly to a Fleet channel.
To see the data on the new cluster, we need a custom indexer to be merged: https://github.com/elastic/telemetry/pull/637

@jen-huang @mostlyjason @joshdover
Which approach do you think we should take? See the pros and cons above, though it turned out that creating an event-based sender is quite simple (by reusing Security Solution's implementation).
So I am quite happy with the sender approach; it gives more control over publishing events.

Example by using collectors:

 {
    "stack_stats": {
        "kibana": {
            "plugins": {
                "fleet": {
                    "package_policy_upgrades": [
                        {
                            "package_name": "apache",
                            "current_version": "0.3.3",
                            "new_version": "1.1.1",
                            "status": "success"
                        },
                        {
                            "package_name": "aws",
                            "current_version": "0.6.1",
                            "new_version": "1.3.0",
                            "status": "failure",
                            "error": [
                                {
                                    "key": "inputs.cloudtrail-aws-s3.streams.aws.cloudtrail.vars.queue_url",
                                    "message": [
                                        "Queue URL is required"
                                    ]
                                }
                            ]
                        }
                    ]
                }
            }
        }
    }
}

Example using event based service:

{
    "package_policy_upgrade": {
        "package_name": "apache",
        "current_version": "0.3.3",
        "new_version": "1.1.1",
        "status": "success"
    }
}

Also, I have been playing around with the data, and it might make sense to add some categorization to error messages. For example, required-field errors contain the field name in the error message; at a minimum we could change these to a generic "Field is required" message. A sketch of this categorization follows the list of messages below.

These are the possible input validation errors that I found in the code: https://github.com/elastic/kibana/blob/master/x-pack/plugins/fleet/common/services/validate_package_policy.ts#L227

  • "Queue URL is required",
  • "Invalid YAML format",
  • "Invalid format",
  • "Strings starting with special YAML characters like * or & need to be enclosed in double quotes.",
  • "Boolean values must be either true or false"

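As an illustration of that categorization (the bucket names and the mapping function are hypothetical, not existing Fleet code), the raw messages above could be normalized before being attached to the telemetry event:

// Hypothetical categorization of the validation messages listed above.
type UpgradeErrorCategory =
  | 'field_required'
  | 'invalid_yaml_format'
  | 'invalid_format'
  | 'yaml_special_characters'
  | 'invalid_boolean'
  | 'unknown';

export function categorizeValidationError(message: string): UpgradeErrorCategory {
  // e.g. "Queue URL is required" collapses into a generic "field is required" bucket,
  // so field names are not sent verbatim in the telemetry payload.
  if (/ is required$/.test(message)) return 'field_required';
  if (message === 'Invalid YAML format') return 'invalid_yaml_format';
  if (message === 'Invalid format') return 'invalid_format';
  if (message.startsWith('Strings starting with special YAML characters')) {
    return 'yaml_special_characters';
  }
  if (message === 'Boolean values must be either true or false') return 'invalid_boolean';
  return 'unknown';
}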

@joshdover
Contributor Author

@juliaElastic Is this now closed by #115180?

@juliaElastic
Contributor

Closing this as the changes are done.
@mostlyjason let me know if any changes to the data format are needed.

@amolnater-qasource

Hi @juliaElastic
We have attempted to retest this on the latest 7.16.0 snapshot; however, we didn't get the expected results under telemetry.

Build details:
Build: 45910
Commit: af229de

Steps followed:

  1. Installed version 0.3.3 of the Apache integration.

  2. Observed the integration data available under telemetry (kibana-url_port/api/stats?extended=true).

  3. Upgraded the Apache integration and checked telemetry again.

  4. We didn't get the expected outcome as shared in the comment above:

 {
    "stack_stats": {
        "kibana": {
            "plugins": {
                "fleet": {
                    "package_policy_upgrades": [
                        {
                            "package_name": "apache",
                            "current_version": "0.3.3",
                            "new_version": "1.1.1",
                            "status": "success"
                        },

Could you please confirm if we are missing anything?

cc: @EricDavisX
Thanks

@juliaElastic
Contributor

@amolnater-qasource As discussed on Slack, the description was outdated: the solution is not using collectors, it only sends events directly to the new telemetry API. So the only way to verify this is to check the debug logs and check the events on the telemetry staging link.

@amolnater-qasource

Hi @juliaElastic
Thanks for sharing the details about this feature.

As discussed, the telemetry staging link is not accessible to us.
Further, this feature isn't testable at kibana-url/api/stats?extended=true.

@EricDavisX Please let us know if we can skip this test.

Thanks

@EricDavisX
Contributor

Hi - I'll ask the team whether they have coverage over telemetry, or whether the risk is so minimal that they do not need manual tests. @juliaElastic and @jen-huang, it's your call. We can deprecate this case or modify it to a 'best effort', or we can submit whatever request is needed to grant access to the telemetry cluster if desired/appropriate. Please advise.

@juliaElastic
Contributor

@EricDavisX I think the risk is minimal here, since we are only adding telemetry. To verify that the events were sent, you can check the Kibana debug logs. I don't recall having to request access to the telemetry staging link; you could ask in the #telemetry channel about what is needed.
