
[Fleet] Package policy upgrade telemetry with sender #115180

Merged
merged 39 commits into elastic:main from upgrade-telemetry on Nov 3, 2021

Conversation

juliaElastic
Contributor

@juliaElastic juliaElastic commented Oct 15, 2021

Summary

Collecting telemetry of package policy upgrades: #109870

See the latest commits for the Event Sender implementation (in this case we don't need a collector).
Took it from security_solution and simplified for our use case.

See sender.ts
To test:
Install older versions of packages, e.g. apache-0.3.3 (no conflicts) and aws-0.6.1 (conflicts).
Add policies for each and upgrade them.
The AWS policy can only be upgraded manually, by resolving conflicts.

For event sender:
Add the following to kibana.dev.yml to opt in to telemetry and see debug logs:

telemetry.optIn: true
logging.loggers:
 - name: plugins.fleet
   level: debug

On success or failure of an upgrade, the debug log should include the events sent to the telemetry v3 endpoint.

[2021-10-18T16:49:12.391+02:00][DEBUG][plugins.fleet.telemetry_events] Telemetry URL: https://telemetry-staging.elastic.co/v3-dev/send/fleet-stats
[2021-10-18T16:49:12.391+02:00][DEBUG][plugins.fleet.telemetry_events] cluster_uuid: vvUkWqalQJara7GMDQnE3Q cluster_name: elasticsearch
[2021-10-18T16:49:12.392+02:00][DEBUG][plugins.fleet.telemetry_events] [{"package_policy_upgrade":{"package_name":"aws","current_version":"0.6.1","new_version":"1.3.0","status":"failure","error":[{"key":"inputs.cloudtrail-aws-s3.streams.aws.cloudtrail.vars.queue_url","message":["Queue URL is required"]},{"key":"inputs.cloudwatch-aws-s3.streams.aws.cloudwatch_logs.vars.queue_url","message":["Queue URL is required"]},{"key":"inputs.ec2-aws-s3.streams.aws.ec2_logs.vars.queue_url","message":["Queue URL is required"]},{"key":"inputs.elb-aws-s3.streams.aws.elb_logs.vars.queue_url","message":["Queue URL is required"]},{"key":"inputs.s3-aws-s3.streams.aws.s3access.vars.queue_url","message":["Queue URL is required"]},{"key":"inputs.vpcflow-aws-s3.streams.aws.vpcflow.vars.queue_url","message":["Queue URL is required"]},{"key":"inputs.waf-aws-s3.streams.aws.waf.vars.queue_url","message":["Queue URL is required"]}]},"id":"aws_0.6.1_1.3.0_failure","license":{"uid":"c0f3d843-30fd-44f2-9774-95e02db80d70","status":"active","type":"trial","issued_to":"elasticsearch","issuer":"elasticsearch"},"cluster_uuid":"vvUkWqalQJara7GMDQnE3Q","cluster_name":"elasticsearch"}]
[2021-10-18T16:49:12.801+02:00][DEBUG][plugins.fleet.telemetry_events] Events sent!. Response: 200 {"status":"ok"}


@juliaElastic juliaElastic marked this pull request as ready for review October 19, 2021 12:11
@juliaElastic juliaElastic requested a review from a team as a code owner October 19, 2021 12:11
@juliaElastic juliaElastic added v8.0.0 release_note:skip Skip the PR/issue when compiling release notes labels Oct 19, 2021
@juliaElastic juliaElastic changed the title draft of upgrade usage collector [Fleet] Package policy upgrade telemetry with sender Oct 19, 2021
@botelastic botelastic bot added the Team:Fleet Team label for Observability Data Collection Fleet team label Oct 19, 2021
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@juliaElastic
Contributor Author

@elasticmachine merge upstream

Contributor

@joshdover joshdover left a comment


The biggest architectural decision I'd like to discuss before moving forward with this is how we want to organize our different events. With the current state of this PR, we'll be adding additional channels for each event type that we implement. Concerns I have with this:

  • It requires that we create a new indexer for each channel since I believe indexers can only read from a single channel
  • Each channel is stored separately in the GCP storage bucket. Would this make other types of processing that we want to do in the future difficult? For example, using tools like Apache Spark.

Alternatively, we could use a single channel for all Fleet telemetry events and use a well-known field name, such as event.type (to borrow from ECS), to describe the schema of the event's properties. This would allow us to use a single indexer for all of our events, which I assume would involve less boilerplate in creating all the necessary bits to stand up a new indexer. Questions I have with this alternative approach:

  • If we use a single channel and single indexer, can we still ingest the documents into separate indices depending on the event.type?
  • How practical would it be to use a single index with dynamic mappings instead? We could even define in our collection API which fields will be dynamically mapped and which will not. This would allow us to more rapidly add new event types without having to touch the indexer at all. For instance, we could index events in this type of format:
{
  "event.type": "package_policy_upgrade_result",
  "cluster_uuid": "...",
  "cluster_name": "...",
  "event.properties.mapped": {
     // We could have `dynamic: true` specified for these fields, with the option to explicitly map them in the indexer when the dynamic mappings are not good enough
     // All field names have the event.type prepended to avoid overlapping field names
     "package_policy_upgrade_result.package_name": "system",
     "package_policy_upgrade_result.current_version": "1.1.0",
     ...
  },
  "event.properties.raw": {
    // In the mappings these fields would not be mapped by default, but could be mapped later or used by runtime fields
    "package_policy_upgrade_result.error": "...",
  }
}

The JavaScript API for queuing such events would be something like:

sender.queueEvent('package_policy_upgrade_result', {
  mapped: {
    package_name: 'system',
    current_version: '1.1.0',
  },
  raw: {
    error: '...' // just an example, we may want to actually map this particular field
  }
});

@@ -146,7 +146,8 @@ export const updatePackagePolicyHandler: RequestHandler<
       esClient,
       request.params.packagePolicyId,
       { ...newData, package: pkg, inputs },
-      { user }
+      { user },
+      packagePolicy.package?.version
Contributor

Should we switch to named parameters / object on this API? The list is quite long and it's not obvious what each arg is for.

Contributor Author

I'm a bit cautious about changing this; the update method is called from a few plugins outside of Fleet, like apm and security_solution.
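For illustration, a sketch of what the options-object version could look like (the names here are hypothetical, not the actual PackagePolicyService signature):

// Hypothetical shape only; the real update() lives in
// x-pack/plugins/fleet/server/services/package_policy.ts, and all external
// callers (apm, security_solution) would need migrating at the same time.
interface UpdatePackagePolicyArgs {
  soClient: SavedObjectsClientContract;
  esClient: ElasticsearchClient;
  id: string;
  packagePolicyUpdate: UpdatePackagePolicy;
  options?: { user?: AuthenticatedUser };
  currentVersion?: string;
}

async function update(args: UpdatePackagePolicyArgs) {
  // ...same body as before; call sites become self-documenting:
  // update({ soClient, esClient, id, packagePolicyUpdate, options: { user }, currentVersion })
}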

-        `Package policy successfully upgraded ${JSON.stringify(upgradeMeta)}`,
-        upgradeMeta
+    if (packagePolicy.package.version !== currentVersion) {
+      appContextService.getLogger().info(`Package policy upgraded successfully`);
Contributor

nit: let's go ahead and keep the detailed log info as well, but maybe it should be at the debug level instead?

@@ -184,6 +193,8 @@ export class FleetPlugin
     this.kibanaBranch = this.initializerContext.env.packageInfo.branch;
     this.logger = this.initializerContext.logger.get();
     this.configInitialValue = this.initializerContext.config.get();
+    this.telemetryEventsSender = new TelemetryEventsSender(this.logger);
Contributor

I think we should use a child logger here so the logs are clearly labeled as coming from our telemetry code. Right now this is done inside the TelemetryEventsSender class, but I think it's more clear and explicit if we do it from this level.

Suggested change
- this.telemetryEventsSender = new TelemetryEventsSender(this.logger);
+ this.telemetryEventsSender = new TelemetryEventsSender(this.logger.get('telemetry_events'));

return;
}

const clusterInfo = await this.receiver?.fetchClusterInfo();
Contributor

I'm not sure I quite understand the purpose of the receiver class or its naming (what is it receiving?). Should we just move this fetchClusterInfo function into this class?

Contributor Author

security had a lot more functionality in the receiver; I can simplify here since we only have fetchClusterInfo
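A minimal sketch of that simplification, assuming the standard Kibana Elasticsearch client and its 7.x-style response shape (not necessarily the final code in this PR):

import type { ElasticsearchClient } from 'src/core/server';

// Fetch cluster info directly in the sender instead of a separate receiver
// class; esClient.info() returns the cluster_name, cluster_uuid and version
// used to enrich outgoing events.
async function fetchClusterInfo(esClient: ElasticsearchClient) {
  const { body } = await esClient.info();
  return body;
}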

}
}

public queueTelemetryEvents(events: TelemetryEvent[], channel: string) {
Contributor

I think it would be a good idea to improve the type safety here to ensure that events passed in conform to an expected type for the channel. I also slightly prefer having the channel name as the first argument to better match other popular telemetry APIs. Maybe something like this:

interface FleetTelemetryChannelEvents {
  // channel name => event type
  'fleet-upgrades': PackagePolicyUpgradeUsage;
}

type FleetTelemetryChannel = keyof FleetTelemetryChannelEvents;

public queueTelemetryEvents<T extends FleetTelemetryChannel>(
  channel: T,
  events: Array<FleetTelemetryChannelEvents[T]>
) { ... }
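With that signature, a call site might look like this (upgradeUsage standing in for a PackagePolicyUpgradeUsage value; hypothetical usage, shown only to illustrate the type checking):

// OK: the event type matches the 'fleet-upgrades' channel
sender.queueTelemetryEvents('fleet-upgrades', [upgradeUsage]);

// Compile error: 'unknown-channel' is not a key of FleetTelemetryChannelEvents
sender.queueTelemetryEvents('unknown-channel', [upgradeUsage]);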

private isOptedIn?: boolean = true; // Assume true until the first check

constructor(logger: Logger) {
this.logger = logger.get('telemetry_events');
Contributor

See my comment in server/plugin.ts. I think we should create the child logger from the calling code rather than do this here.

Comment on lines 23 to 26
// do not add events with same id
events = events.filter(
(event) => !this.queue.find((qItem) => qItem.id && event.id && qItem.id === event.id)
);
Contributor

Do we need this de-duplication functionality? Can we also support events without ids?

It's worth noting that this de-duplication only works if the duplicate event hasn't been sent yet.

Contributor Author

I added this because the dry run might run multiple times when someone navigates to the Integration details - Settings page or upgrades a policy. I added telemetry at dry run time because that is where conflicts are checked. This logic does not filter out events without an id.

It's true that this won't filter out duplicates if the events have already been sent. I was thinking of grouping by cluster uuid when building dashboards in telemetry, because in the real world the same package version won't be upgraded more than once.
We could remove this logic and see how many duplicates are sent in normal usage.

Contributor

Got it, makes more sense now. I'm ok leaving this here in that case, as long as we consider it a best-effort de-duplication / 'spam' prevention and don't rely on it on the analysis side.

because in the real world the same package version won't be upgraded more than once.

nit: since these failures happen on package policy upgrades there may actually be more than one per cluster if they have configured the same integration for multiple policies.

Contributor Author

Apologies, I'm only seeing this now.
I removed the id filtering logic. Instead, I added an id function on the telemetry side that creates the doc id from the package name, version, status, cluster id, and date, so we won't get multiple docs with the same properties from a cluster on the same day. Even if multiple policies are upgraded, they will return the same results on dry run, so hopefully we don't need those extra events.
https://github.com/elastic/telemetry/blob/main/kpi-engine/src/telemetry/fleet.clj#L38
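For illustration, the id derivation amounts to something like the following (a sketch; the real implementation is the linked Clojure function, the field names follow the event sample earlier in this PR, and the date handling here is an assumption):

// Deterministic doc id: the same package/version/status from the same cluster
// on the same day collapses into a single document.
const docId = [
  event.package_policy_upgrade.package_name,
  event.package_policy_upgrade.new_version,
  event.package_policy_upgrade.status,
  event.cluster_uuid,
  new Date().toISOString().slice(0, 10), // YYYY-MM-DD
].join('_');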

try {
this.logger.debug(`Telemetry URL: ${telemetryUrl}`);

const toSend: TelemetryEvent[] = cloneDeep(events).map((event) => ({
Contributor

IMO we should do the cloning when the event is queued rather than when it is being sent. That way no modifications to the objects from calling code can cause any problems after they have been queued.
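A sketch of the suggested change (assumed method shape, mirroring the existing queueTelemetryEvents API):

// Using lodash's cloneDeep at enqueue time so the caller can no longer mutate
// an event after it has been queued; sending can then use the queued objects
// directly without cloning again.
public queueTelemetryEvents(events: TelemetryEvent[], channel: string) {
  this.queue.push(...cloneDeep(events));
}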

export type SearchTypes = BaseSearchTypes | BaseSearchTypes[] | undefined;

// For getting cluster info. Copied from telemetry_collection/get_cluster_info.ts
export interface ESClusterInfo {
Contributor

Is there a type available in @elastic/elasticsearch that we can just re-use?

Contributor Author

yes, I'll update

@juliaElastic
Contributor Author

juliaElastic commented Oct 27, 2021

@joshdover
Thanks for the review! I'll get the PR updated based on your comments.

Regarding support for different event types:
I agree about the overhead of creating a new indexer + channel for each new event type; this is what security is doing and what the telemetry team suggested, though I'm open to a better approach.
See existing indexers here: https://github.com/elastic/telemetry/tree/e37eee1d5159983f7b166dae1f75268474fef8ce/kpi-engine/resources/indices/telemetry

If we use a single channel and single indexer, can we still ingest the documents into separate indices depending on the event.type?

AFAIK you need multiple indexers to ingest into separate indices (the indexer config has to define the index/alias to ingest into), though you can have multiple indexers reading from the same channel:
https://github.com/elastic/telemetry/blob/e37eee1d5159983f7b166dae1f75268474fef8ce/kpi-engine/resources/indices/telemetry/fleet-upgrades.edn

How practical would it be to use a single index with dynamic mappings instead?

I'm not sure whether we can specify dynamic mappings in the indexer; I'll ask around.
We can map an object, though; I'm checking how it is mapped on the Elasticsearch side.

 :field-map {
   [:event.properties :mapped] [:original-body :event.properties :mapped]

If we go with this unified channel approach, should we rename the channel to something more generic?

@juliaElastic
Contributor Author

@elasticmachine merge upstream

@juliaElastic juliaElastic added the auto-backport Deprecated - use backport:version if exact versions are needed label Oct 28, 2021
@juliaElastic
Contributor Author

As discussed on slack, we will stay with this solution for now, and come back to it when we have more use cases for telemetry to decide whether we want to create a generic message type or use different messages and channels.

* 2.0.
*/

export const TELEMETRY_MAX_BUFFER_SIZE = 100;
Member

(nit) It might be more accurate to name this TELEMETRY_MAX_QUEUE_SIZE

x-pack/plugins/fleet/server/telemetry/queue.ts (outdated, resolved)

import { loggingSystemMock } from 'src/core/server/mocks';

import axios from 'axios';
Member

(nit) Group imports from installed packages above imports from local modules.
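i.e. in this test file the axios import would move above the src/core import:

import axios from 'axios'; // installed package first

import { loggingSystemMock } from 'src/core/server/mocks'; // local module second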

@joshdover joshdover self-requested a review November 1, 2021 13:18
@juliaElastic
Contributor Author

juliaElastic commented Nov 1, 2021

@mostlyjason @joshdover
Take a look at this; the data is now flowing to the new telemetry cluster.
I pushed some data from my local setup to demonstrate what the upgrade telemetry looks like.

Discover
https://telemetry-v2-staging.elastic.dev/goto/aaf2772eb54a6f16d34899dfb3fe7866
Dashboard
https://telemetry-v2-staging.elastic.dev/goto/83df7acd68bbe280c2c0525a3eaf39fe

@joshdover
Contributor

ack: will review tomorrow

@juliaElastic
Contributor Author

@elasticmachine merge upstream

@kibanamachine
Contributor

💛 Build succeeded, but was flaky

Test Failures

  • [job] [logs] OSS CI Group #7 / visualize app new charts library visualize ciGroup7 Timelion visualization expression typeahead dynamic suggestions for argument values .es() should show index pattern suggestions for index argument

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
fleet 1120 1121 +1
Unknown metric groups

API count

id before after diff
fleet 1222 1223 +1

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @juliaElastic

@juliaElastic juliaElastic merged commit 654e75a into elastic:main Nov 3, 2021
@juliaElastic juliaElastic deleted the upgrade-telemetry branch November 3, 2021 12:30
kibanamachine added a commit to kibanamachine/kibana that referenced this pull request Nov 3, 2021
* draft of upgrade usage collector

* telemetry sender service

* fixed tests and types

* cleanup

* type fix

* removed collector

* made required field message generic, added test

* cleanup

* cleanup

* cleanup

* removed v3-dev as outdated

* removed conditional from telemetry url creation

* supporting multiple channels in sender

* fix types

* refactor

* using json content type

* fix test

* simplified telemetry url

* fixed type

* added back ndjson

* moved telemetry to update, added dryrun

* fix types

* fix prettier

* updated after review

* fix imports

* added error_message field

* review fixes

Co-authored-by: Kibana Machine <[email protected]>
@kibanamachine
Contributor

💚 Backport successful

Branch: 8.0

This backport PR will be merged automatically after passing CI.

kibanamachine added a commit that referenced this pull request Nov 3, 2021
Labels

auto-backport (Deprecated - use backport:version if exact versions are needed)
release_note:skip (Skip the PR/issue when compiling release notes)
Team:Fleet (Team label for Observability Data Collection Fleet team)
v8.0.0
v8.1.0