
[Fleet] Add usage telemetry for package policy upgrade conflicts #109870

Closed
Tracked by #106048 ...
joshdover opened this issue Aug 24, 2021 · 14 comments
Assignees
Labels
enhancement (New value added to drive a business result) · Team:Fleet (Team label for Observability Data Collection Fleet team) · telemetry (Issues related to the addition of telemetry to a feature) · v7.15.0

Comments

@joshdover
Contributor

joshdover commented Aug 24, 2021

In #106048 we're adding the ability to upgrade package policies, both manually and automatically when possible. During some package policy upgrades, users will be required to take manual action when a package's inputs change in a way that makes the updated package policy fail validation. This can happen due to changes like:

  • An input field name was changed
  • An input field type was changed
  • A new required input field was added

In some of these scenarios, there are additional enhancements we may want to consider to eliminate these conflict scenarios and make package upgrades as seamless as possible for users. In order to know where to focus our efforts, we should collect telemetry on package policy upgrades so we can answer questions like:

  • What % of package policy upgrades require manual user intervention due to conflicts?
  • What types of conflicts are most common?
  • Which packages are users encountering conflicts on most often?
  • How often are users upgrading package policies?

Related:

@joshdover added the Team:Fleet and telemetry labels Aug 24, 2021
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@mostlyjason
Contributor

This is great, thanks for filing this!

@joshdover added the enhancement label Sep 7, 2021
@jen-huang changed the title from "Add usage telemetry for package policy upgrade conflicts" to "[Fleet] Add usage telemetry for package policy upgrade conflicts" Sep 29, 2021
@juliaElastic self-assigned this Oct 13, 2021
@juliaElastic
Contributor

juliaElastic commented Oct 14, 2021

I came up with the following format to add to a new custom collector under the Fleet usage collectors.
A few considerations:

  • The plan is to store these objects as saved objects until the collector is invoked.

  • Generate the ID of the objects as name+current_version+new_version+status to avoid duplicates, even if the dry run/update is called multiple times (a rough sketch of this follows the example below).

  • After the collector runs, the saved objects should be cleared to avoid duplication.

  • Capture error messages during the dry run of the package policy upgrade.

  • Auto upgrades and manual upgrades might need to be distinguished in telemetry (this might only be determinable from the UI).

{ "package_policy_upgrades": [
  {
    "package_name": "apache",
    "current_version": "0.3.3",
    "new_version": "1.1.1",
    "status": "success"
  },
  {
    "package_name": "aws",
    "current_version": "0.3.3",
    "new_version": "1.1.1",
    "status": "failure",
    "error": [{
      "key":"inputs.cloudtrail-aws-s3.streams.aws.cloudtrail.vars.queue_url",
      "message":["Queue URL is required"]
      }]
  }
]}
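For illustration, the deduplication idea could look roughly like this sketch (the saved object type name and helper are placeholders, not actual Fleet code); the document ID is derived from name+current_version+new_version+status, so repeated dry runs overwrite the same record instead of creating duplicates:

// Hypothetical sketch - the saved object type name and helper are illustrative only.
import type { SavedObjectsClientContract } from 'kibana/server';

interface UpgradeUsageRecord {
  package_name: string;
  current_version: string;
  new_version: string;
  status: 'success' | 'failure';
  error?: Array<{ key?: string; message: string[] }>;
}

// Assumed saved object type used to buffer upgrade results until the collector runs.
const UPGRADE_USAGE_TYPE = 'fleet-package-policy-upgrade-usage';

export async function recordUpgradeResult(
  soClient: SavedObjectsClientContract,
  record: UpgradeUsageRecord
): Promise<void> {
  // Deterministic ID: the same package/version/status combination always maps to the
  // same document, so calling the dry run multiple times does not create duplicates.
  const id = [
    record.package_name,
    record.current_version,
    record.new_version,
    record.status,
  ].join(':');

  await soClient.create(UPGRADE_USAGE_TYPE, record, { id, overwrite: true });
}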

@juliaElastic
Contributor

juliaElastic commented Oct 15, 2021

There are 2 ways to add upgrade telemetry:

  • Add a new usage collector to be included in Kibana telemetry
  • Create an event-based service in Fleet to directly publish upgrade telemetry to the new telemetry cluster

Pros and cons of the event-based service:
Pros:

  • No need to store telemetry in saved objects until the collector is invoked
  • Send events as they happen, rather than waiting for the collector
  • Simpler data model

Cons:

  • Create, test, and maintain an event-based service (Security Solution has already done this)
  • The Telemetry team might implement their own event-based service (not likely anytime soon)
  • Maintain an additional index for upgrade telemetry

Technically, for each data type we could create a new channel, indexer, and job in the new telemetry cluster.
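To make the event-based option concrete, a minimal sender could look roughly like the sketch below; the channel endpoint, payload shape, and names are assumptions for illustration, not the shipped implementation:

// Illustrative only - the channel endpoint and payload shape are assumptions.
import fetch from 'node-fetch';

// Assumed dedicated channel for Fleet upgrade events on the telemetry cluster.
const FLEET_UPGRADES_CHANNEL_URL =
  'https://telemetry-staging.elastic.co/v3/send/fleet-upgrades';

interface PackagePolicyUpgradeEvent {
  package_name: string;
  current_version: string;
  new_version: string;
  status: 'success' | 'failure';
  error?: Array<{ key?: string; message: string[] }>;
}

// Publish a single upgrade event as it happens, instead of buffering it in
// saved objects and waiting for a usage collector to run.
export async function sendUpgradeEvent(event: PackagePolicyUpgradeEvent): Promise<void> {
  await fetch(FLEET_UPGRADES_CHANNEL_URL, {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify(event),
  });
}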

@joshdover
Contributor Author

Telemetry is challenging with our release model because any bugs in the collection process cannot be fixed for long periods of time. Processing events on the ingest side also feels much simpler when we're trying to answer questions like "how often does X happen?". Basic counts are simple enough to do with a usage collector, but it breaks down quickly if you want to segment on any property (e.g. package name, package version, user role, etc.).

Events naturally give us this with a very simple collection mechanism that is unlikely to have bugs, plus the flexibility to massage the data afterwards (if needed; often it won't be). IMO experimenting with an event approach could be well worth the effort and give us deeper insight into how users are using our application at a lower maintenance cost.

I lean towards writing a very simple event sender, largely based on the one that Security Solution has already built.

For reference, Security Solution's implementation lives here:

@joshdover
Contributor Author

How would this scale for different types of events?

I think we'll want some guardrails in the initial implementation to be sure we don't send too much data that could:

  • Overwhelm the network
  • Consume too much of the customer's bandwidth
  • Overwhelm the Telemetry API or pipelines

I think a cap on payload size and periodic batching would be adequate for this purpose.
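As a rough sketch of such guardrails (the class, caps, and interval below are made-up values for illustration, not an agreed design), a batching queue could cap both the number of buffered events and the serialized payload size, and flush on a timer:

// Illustrative guardrails - the caps and interval are placeholder values.
const MAX_QUEUED_EVENTS = 100;
const MAX_PAYLOAD_BYTES = 512 * 1024; // ~512 KiB per batch
const FLUSH_INTERVAL_MS = 60_000;

export class BatchedEventQueue<T> {
  private events: T[] = [];
  private timer?: NodeJS.Timeout;

  constructor(private readonly send: (batch: T[]) => Promise<void>) {}

  public start(): void {
    // Periodic batching: events are flushed on an interval, not one request per event.
    this.timer = setInterval(() => void this.flush(), FLUSH_INTERVAL_MS);
  }

  public stop(): void {
    if (this.timer) clearInterval(this.timer);
  }

  public enqueue(event: T): void {
    // Drop events beyond the cap rather than growing memory and bandwidth unbounded.
    if (this.events.length < MAX_QUEUED_EVENTS) {
      this.events.push(event);
    }
  }

  private async flush(): Promise<void> {
    if (this.events.length === 0) return;
    const batch = this.events.splice(0, this.events.length);
    // Enforce a payload-size budget so a single batch cannot overwhelm the network
    // or the Telemetry API; oversized batches are simply dropped in this sketch.
    if (Buffer.byteLength(JSON.stringify(batch), 'utf8') > MAX_PAYLOAD_BYTES) return;
    await this.send(batch);
  }
}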

@juliaElastic
Contributor

juliaElastic commented Oct 18, 2021

An update on this:
I learned that the Kibana Stack team is also working on creating a new index for Kibana telemetry; it is in progress and would be called something like the kibana-snapshot channel. #113525
I think it might make sense to use that for the existing Fleet telemetry in Kibana (if adding Fleet fields to that mapping is recommended).

As for the upgrade telemetry, I got it working locally both with a collector and by sending directly to a Fleet channel.
To see the data on the new cluster, we need a custom indexer to be merged: https://github.com/elastic/telemetry/pull/637

@jen-huang @mostlyjason @joshdover
Which approach do you think we should take? See the pros and cons above, though it turned out that creating an event-based sender is quite simple (by reusing Security Solution's implementation).
So I am quite happy with the sender approach; it gives more control over publishing events.

Example by using collectors:

 {
    "stack_stats": {
        "kibana": {
            "plugins": {
                "fleet": {
                    "package_policy_upgrades": [
                        {
                            "package_name": "apache",
                            "current_version": "0.3.3",
                            "new_version": "1.1.1",
                            "status": "success"
                        },
                        {
                            "package_name": "aws",
                            "current_version": "0.6.1",
                            "new_version": "1.3.0",
                            "status": "failure",
                            "error": [
                                {
                                    "key": "inputs.cloudtrail-aws-s3.streams.aws.cloudtrail.vars.queue_url",
                                    "message": [
                                        "Queue URL is required"
                                    ]
                                }
                            ]
                        }
                    ]
                }
            }
        }
    }
}

Example using event based service:

{
    "package_policy_upgrade": {
        "package_name": "apache",
        "current_version": "0.3.3",
        "new_version": "1.1.1",
        "status": "success"
    }
}

Also, I have been playing around with the data, and it might make sense to add some categorization to error messages. For example, required-field errors contain the field name in the error message; at a minimum we could change these to a generic "Field is required" message. A sketch of this categorization follows the list of messages below.

These are the possible input validation errors that I found in the code: https://github.com/elastic/kibana/blob/master/x-pack/plugins/fleet/common/services/validate_package_policy.ts#L227

  • "Queue URL is required",
  • "Invalid YAML format",
  • "Invalid format",
  • "Strings starting with special YAML characters like * or & need to be enclosed in double quotes.",
  • "Boolean values must be either true or false"

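As an illustration of that categorization (the bucket names and the mapping function are hypothetical, not existing Fleet code), the raw messages above could be normalized before being attached to the telemetry event:

// Hypothetical categorization of the validation messages listed above.
type UpgradeErrorCategory =
  | 'field_required'
  | 'invalid_yaml_format'
  | 'invalid_format'
  | 'yaml_special_characters'
  | 'invalid_boolean'
  | 'unknown';

export function categorizeValidationError(message: string): UpgradeErrorCategory {
  // e.g. "Queue URL is required" collapses into a generic "field is required" bucket,
  // so field names are not sent verbatim in the telemetry payload.
  if (/ is required$/.test(message)) return 'field_required';
  if (message === 'Invalid YAML format') return 'invalid_yaml_format';
  if (message === 'Invalid format') return 'invalid_format';
  if (message.startsWith('Strings starting with special YAML characters')) {
    return 'yaml_special_characters';
  }
  if (message === 'Boolean values must be either true or false') return 'invalid_boolean';
  return 'unknown';
}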

@joshdover
Contributor Author

@juliaElastic Is this now closed by #115180?

@juliaElastic
Contributor

Closing this as the changes are done.
@mostlyjason let me know if any changes to the data format are needed.

@amolnater-qasource

Hi @juliaElastic
We have attempted to retest this on the latest 7.16.0 snapshot; however, we didn't get the expected results under telemetry.

Build details:
Build: 45910
Commit: af229de

Steps followed:

  1. Installed version 0.3.3 of the Apache integration.

  2. Observed the integration data available under telemetry (kibana-url_port/api/stats?extended=true).

  3. Upgraded the Apache integration and checked telemetry again.

  4. We didn't get the expected outcome as shared in the comment above:

 {
    "stack_stats": {
        "kibana": {
            "plugins": {
                "fleet": {
                    "package_policy_upgrades": [
                        {
                            "package_name": "apache",
                            "current_version": "0.3.3",
                            "new_version": "1.1.1",
                            "status": "success"
                        },

Could you please confirm if we are missing anything?

cc: @EricDavisX
Thanks

@juliaElastic
Contributor

@amolnater-qasource As discussed on Slack, the description was outdated: the solution is not using collectors, it only sends events directly to the new telemetry API. So the only way to verify this is to check the debug logs and check the events on the telemetry staging link.

@amolnater-qasource

Hi @juliaElastic
Thanks for sharing the details about this feature.

As discussed, the telemetry staging link is not accessible to us.
Further, this feature isn't testable at kibana-url/api/stats?extended=true.

@EricDavisX Please let us know if we can skip this test.

Thanks

@EricDavisX
Contributor

Hi - I'll ask the team whether they have coverage over telemetry, or whether the risk is so minimal that they do not need manual tests. @juliaElastic and @jen-huang, it's your call. We can deprecate this case or modify it to a 'best effort', or we can submit whatever request is needed to grant access to the telemetry cluster if desired/appropriate. Please advise.

@juliaElastic
Contributor

@EricDavisX I think the risk is minimal here, since we are only adding telemetry. To verify that the events were sent, you can check the Kibana debug logs. I don't recall having to request access to the telemetry staging link; you could ask in the #telemetry channel about what is needed.
