[POC] Event-based Telemetry #95960

Closed
afharo wants to merge 7 commits

@afharo afharo commented Mar 31, 2021

Summary

This PR implements a POC for generally available Event-based telemetry APIs.

The reasoning for adding this new set of APIs will be further explained in a subsequent RFC. The TL;DR reason: there are some use cases where we need more granular data as opposed to the current snapshot-aggregated telemetry.

The goal of this PR is to showcase a Proof-of-Concept of how this API will work while maintaining the main constraints we want to enforce:

  1. Low impact on our users: we should use as few resources as possible so that we affect neither the performance of the product nor its maintenance cost.
  2. Trust: Kibana is connected to Elasticsearch, which holds customers’ data. We want our users to be confident that we are not collecting any of that information or any other PII.
  3. Transparency: building on trust, we want to share with our users the data that we collect about them if they request it.

To enforce those constraints, this PR implements:

  • A set of Byte-Sized in-memory queues that limit the amount of data each plugin can send [Low impact on resources]:
    • Plugins are assigned an allowance of 1MB each to be split across all the channels they register.
    • Whenever a channel enqueues a new event and its queue is already full, the oldest event is removed to accommodate the new one.
    • When registering the channels, plugins can optionally assign a quota percentage to a specific channel. This helps developers fine-tune the drop rate they're willing to accept in specific channels in favour of others.
    • When registering the channels, plugins must provide a schema that will be audited to ensure no PII-related data is collected [Trust]. The schema is used to validate the incoming events, so we'll discard anything that doesn't match.
  • Extends core's APIs to provide the new PluginScopedAPIs. This prepends the caller's plugin name as an argument to the callee API so plugins cannot impersonate others to gain access to a larger allowance.
  • A Leaky Bucket to send the events at a maximum constant rate:
    • If enough data is enqueued, it defaults to 10kb payloads sent every 10s.
    • Otherwise, it sends any remainder that has been in the queues for longer than 1h.
    • It round-robins through all the plugins, and then through each plugin's channels, so that no single plugin takes all the bandwidth. E.g.: Plugin A-Channel 1 of 2, Plugin B-Channel 1 of 1, Plugin C-Channel 1 of 3, Plugin A-Channel 2 of 2, Plugin B-Channel 1 of 1, Plugin C-Channel 2 of 3, ...
  • Sets the telemetry logger to silent by default in production ([Telemetry] Custom logger appender? #89839), and allows users to explicitly enable it on demand by setting the telemetry.logging config according to the logging settings. When enabled, it logs the events so that users can see what we collect about them [Transparency].
  • Adds 2 example plugins to demonstrate all these capabilities.
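
The round-robin order described in the Leaky Bucket bullet can be sketched as follows. This is a minimal illustration, not the code in this PR: `ChannelsByPlugin`, `Slot` and `roundRobinOrder` are hypothetical names, and the sketch assumes each plugin keeps its own channel cursor that advances by one every time the plugin gets a turn.

```typescript
// Hypothetical shape: plugin name -> ordered list of the channels it registered.
type ChannelsByPlugin = Record<string, string[]>;

interface Slot {
  plugin: string;
  channel: string;
}

// Returns the first `count` slots of the round-robin order:
// one turn per plugin per round, rotating through that plugin's channels.
function roundRobinOrder(channelsByPlugin: ChannelsByPlugin, count: number): Slot[] {
  const plugins = Object.keys(channelsByPlugin).filter(
    (plugin) => channelsByPlugin[plugin].length > 0
  );
  const cursors: Record<string, number> = {};
  const order: Slot[] = [];
  if (plugins.length === 0) return order;

  while (order.length < count) {
    for (const plugin of plugins) {
      if (order.length >= count) break;
      const channels = channelsByPlugin[plugin];
      const cursor = cursors[plugin] ?? 0;
      order.push({ plugin, channel: channels[cursor] });
      // Advance this plugin's cursor, wrapping around its own channel list.
      cursors[plugin] = (cursor + 1) % channels.length;
    }
  }
  return order;
}

const firstSlots = roundRobinOrder({ A: ['1', '2'], B: ['1'], C: ['1', '2', '3'] }, 6);
console.log(firstSlots.map((s) => `${s.plugin}/${s.channel}`).join(', '));
// prints "A/1, B/1, C/1, A/2, B/1, C/2"
```

Keeping one cursor per plugin means a plugin with many channels cannot get more turns than a plugin with a single channel.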

TODOs

  • Document the new Core API
  • Update telemetry README file
  • Update telemetry settings asciidoc to explain about the new logging features


@afharo afharo added the release_note:skip Skip the PR/issue when compiling release notes label Mar 31, 2021

@afharo afharo left a comment

Self-review

* @private
*/
private ensureQueueSize() {
while (this.queue.length > 1 && this.size > this.maxByteSize) {

Since this is the only place where we check this.maxByteSize, we might be able to update the value dynamically once we implement the custom disabling of queues. E.g.: if a plugin registers 3 channels but we disable 2 of them, the remaining channel can get the plugin's entire allowance for itself instead of the initial 33%.
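
One way to picture that redistribution is sketched below. It is only an illustration under stated assumptions: `ChannelConfig` and `computeChannelAllowances` are hypothetical names, channels without an explicit quota are assumed to share the leftover percentage equally, and the 1MB allowance is assumed to mean 1024 * 1024 bytes.

```typescript
// Hypothetical channel registration data; not the types used in this PR.
interface ChannelConfig {
  name: string;
  quotaPercentage?: number; // optional explicit quota, 0..100
  enabled: boolean;
}

// Assumption: "1MB" is taken as 1024 * 1024 bytes.
const PLUGIN_ALLOWANCE_BYTES = 1024 * 1024;

// Splits a plugin's allowance across its enabled channels.
// Disabled channels get 0 bytes and their share is redistributed proportionally.
function computeChannelAllowances(
  channels: ChannelConfig[],
  pluginAllowanceBytes: number = PLUGIN_ALLOWANCE_BYTES
): Record<string, number> {
  const explicitTotal = channels.reduce((acc, c) => acc + (c.quotaPercentage ?? 0), 0);
  if (explicitTotal > 100) throw new Error('Quota percentages add up to more than 100%');

  // Channels without an explicit quota share whatever percentage is left.
  const withoutQuota = channels.filter((c) => c.quotaPercentage === undefined).length;
  const defaultShare = withoutQuota > 0 ? (100 - explicitTotal) / withoutQuota : 0;
  const weightOf = (c: ChannelConfig): number => c.quotaPercentage ?? defaultShare;

  const enabledWeight = channels
    .filter((c) => c.enabled)
    .reduce((acc, c) => acc + weightOf(c), 0);

  const allowances: Record<string, number> = {};
  for (const c of channels) {
    allowances[c.name] =
      c.enabled && enabledWeight > 0
        ? Math.floor(pluginAllowanceBytes * (weightOf(c) / enabledWeight))
        : 0;
  }
  return allowances;
}

// 3 channels registered, 2 disabled: the remaining one gets the whole allowance.
console.log(
  computeChannelAllowances([
    { name: 'clicks', enabled: true },
    { name: 'errors', enabled: false },
    { name: 'timings', enabled: false },
  ])
);
```

The resulting per-channel value is what a queue's `maxByteSize` could be updated to whenever a channel is enabled or disabled.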

this.queue.push(buffer);
this.ensureQueueSize();
}

NIT: a public clear method would be useful for:

  • Clearing queues when opted-out
  • Clearing queues when a channel is disabled
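
For context, a byte-sized queue with the suggested `clear` method might look like the sketch below. It is not the class from this PR: `ByteSizedQueue`, `setMaxByteSize`, `byteSize` and `length` are illustrative names, events are assumed to arrive already serialized as `Uint8Array`, and the eviction loop mirrors the `queue.length > 1` condition visible in the diff, which keeps a single oversized event rather than dropping it.

```typescript
// Illustrative byte-sized FIFO queue; names and shape are assumptions.
class ByteSizedQueue {
  private queue: Uint8Array[] = [];
  private size = 0; // running total of queued bytes
  private maxByteSize: number;

  constructor(maxByteSize: number) {
    this.maxByteSize = maxByteSize;
  }

  public push(buffer: Uint8Array): void {
    this.queue.push(buffer);
    this.size += buffer.length;
    this.ensureQueueSize();
  }

  // Suggested in the review: drop everything, e.g. on opt-out or channel disable.
  public clear(): void {
    this.queue = [];
    this.size = 0;
  }

  // Allows the allowance to change at runtime, then re-applies the limit.
  public setMaxByteSize(maxByteSize: number): void {
    this.maxByteSize = maxByteSize;
    this.ensureQueueSize();
  }

  public get byteSize(): number {
    return this.size;
  }

  public get length(): number {
    return this.queue.length;
  }

  // Evicts the oldest entries until the queue fits, always keeping at least one.
  private ensureQueueSize(): void {
    while (this.queue.length > 1 && this.size > this.maxByteSize) {
      const dropped = this.queue.shift();
      if (dropped) this.size -= dropped.length;
    }
  }
}

const exampleQueue = new ByteSizedQueue(10);
exampleQueue.push(new Uint8Array(4));
exampleQueue.push(new Uint8Array(4));
exampleQueue.push(new Uint8Array(4)); // 12 bytes > 10, so the oldest entry is evicted
console.log(exampleQueue.length, exampleQueue.byteSize); // prints "2 8"
```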

* Side Public License, v 1.
*/

import * as t from 'io-ts';

Using io-ts as the validation tool because it's orders of magnitude more performant than joi/@kbn/config-schema

*/
export type TelemetrySchemaValue =
| {
type: AllowedSchemaTypes | 'pass_through';

Intentionally allowing pass_through for now, for specific use cases where we want somewhat weaker validation.

| {
type: AllowedSchemaTypes | 'pass_through';
_meta: {
description: string; // Intentionally enforcing the descriptions here

Description will be mandatory on the leaves
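
To illustrate the idea, here is a hedged sketch of a schema whose leaves must carry a description, plus a validator that discards events that do not match. It deliberately uses a hand-rolled check instead of io-ts so that it stays self-contained. The members of `AllowedSchemaTypes`, the `SchemaLeaf`/`SchemaObject` split and `validateEvent` are assumptions, as is treating every declared field as required.

```typescript
// Assumption: the real AllowedSchemaTypes list is not shown in this diff.
type AllowedSchemaTypes = 'keyword' | 'text' | 'long' | 'double' | 'boolean' | 'date';

interface SchemaLeaf {
  type: AllowedSchemaTypes | 'pass_through';
  _meta: { description: string }; // mandatory: omitting it is a compile-time error
}
interface SchemaObject {
  properties: Record<string, SchemaNode>;
}
type SchemaNode = SchemaLeaf | SchemaObject;

function isLeaf(node: SchemaNode): node is SchemaLeaf {
  return 'type' in node;
}

function matchesLeaf(leaf: SchemaLeaf, value: unknown): boolean {
  switch (leaf.type) {
    case 'pass_through':
      return true; // weaker validation: anything is accepted
    case 'keyword':
    case 'text':
    case 'date':
      return typeof value === 'string';
    case 'long':
      return typeof value === 'number' && Number.isInteger(value);
    case 'double':
      return typeof value === 'number' && Number.isFinite(value);
    case 'boolean':
      return typeof value === 'boolean';
    default:
      return false;
  }
}

const hasOwn = (obj: object, key: string): boolean =>
  Object.prototype.hasOwnProperty.call(obj, key);

// Returns true only if the event has exactly the declared fields with matching types.
function validateEvent(schema: SchemaObject, event: unknown): boolean {
  if (typeof event !== 'object' || event === null || Array.isArray(event)) return false;
  const record = event as Record<string, unknown>;
  for (const key of Object.keys(record)) {
    if (!hasOwn(schema.properties, key)) return false; // undeclared field
  }
  for (const key of Object.keys(schema.properties)) {
    if (!hasOwn(record, key)) return false; // assumption: all fields are required
    const node = schema.properties[key];
    const ok = isLeaf(node) ? matchesLeaf(node, record[key]) : validateEvent(node, record[key]);
    if (!ok) return false;
  }
  return true;
}

const exampleSchema: SchemaObject = {
  properties: {
    action: { type: 'keyword', _meta: { description: 'Name of the UI action' } },
    duration_ms: { type: 'long', _meta: { description: 'Time spent, in milliseconds' } },
    context: {
      properties: {
        raw: { type: 'pass_through', _meta: { description: 'Unvalidated extra context' } },
      },
    },
  },
};

console.log(validateEvent(exampleSchema, { action: 'click', duration_ms: 12, context: { raw: [1] } })); // prints "true"
```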


private isFullQueue(): boolean {
const queueSize = this.queue.reduce((acc, buffer) => acc + buffer.length, 0);
return queueSize >= this.config.threshold.getValueInBytes();

NIT: Possible performance improvement: precalculate this.config.threshold.getValueInBytes() into a private variable to avoid calling it on every check.
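
A sketch of what that could look like is below. It goes one step further than the NIT by also keeping a running byte count instead of reducing over the queue on every call. `SendBuffer`, `ByteSizeLike`, `add` and `drain` are hypothetical names, and `ByteSizeLike` only models the one method used in the diff.

```typescript
// Hypothetical stand-in for the config value: only getValueInBytes() is assumed.
interface ByteSizeLike {
  getValueInBytes(): number;
}

class SendBuffer {
  private queue: Uint8Array[] = [];
  private queueSize = 0; // running total, updated on add/drain
  private readonly thresholdInBytes: number; // computed once in the constructor

  constructor(threshold: ByteSizeLike) {
    this.thresholdInBytes = threshold.getValueInBytes();
  }

  public add(buffer: Uint8Array): void {
    this.queue.push(buffer);
    this.queueSize += buffer.length;
  }

  // Hands over everything queued so far and resets the counters.
  public drain(): Uint8Array[] {
    const drained = this.queue;
    this.queue = [];
    this.queueSize = 0;
    return drained;
  }

  public isFullQueue(): boolean {
    return this.queueSize >= this.thresholdInBytes;
  }
}

const exampleBuffer = new SendBuffer({ getValueInBytes: () => 10 });
exampleBuffer.add(new Uint8Array(6));
exampleBuffer.add(new Uint8Array(4));
console.log(exampleBuffer.isFullQueue()); // prints "true"
```

The trade-off is that a cached threshold will not pick up config changes at runtime unless the cached value is refreshed explicitly.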

Comment on lines +95 to +104
private hasMaxWaitExpired(): boolean {
const diff = moment().diff(this.lastSend, 'milliseconds');
return diff >= this.maxWaitInMs;
}

NIT: Maybe Date.now() is more performant? We don't call it often enough for it to be a problem, though.
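
A possible shape for the `Date.now()` variant is sketched here. `MaxWaitTimer` and `markSent` are hypothetical names, and the injectable clock is an addition for testability rather than something present in this PR.

```typescript
// Illustrative max-wait check based on epoch milliseconds instead of moment().
class MaxWaitTimer {
  private lastSend: number;
  private readonly maxWaitInMs: number;
  private readonly now: () => number;

  // `now` defaults to the system clock; tests can pass a fake one.
  constructor(maxWaitInMs: number, now: () => number = () => Date.now()) {
    this.maxWaitInMs = maxWaitInMs;
    this.now = now;
    this.lastSend = now();
  }

  // Call after every successful send to restart the wait window.
  public markSent(): void {
    this.lastSend = this.now();
  }

  public hasMaxWaitExpired(): boolean {
    return this.now() - this.lastSend >= this.maxWaitInMs;
  }
}

// 1h maximum wait, as in the PR description.
const exampleTimer = new MaxWaitTimer(60 * 60 * 1000);
console.log(exampleTimer.hasMaxWaitExpired()); // prints "false" right after creation
```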

@afharo afharo force-pushed the event-based-telemetry branch from 43e34fd to b90fdec Compare April 1, 2021 17:29
kibanamachine commented Apr 6, 2021

💔 Build Failed

Test Failures

Kibana Pipeline / general / Jest Integration Tests.src/core/server/logging/integration_tests.RollingFileAppender `size-limit` policy with `numeric` strategy rolls the log file in the correct order

Link to Jenkins

Standard Out

Failed Tests Reporter:
  - Test has not failed recently on tracked branches


Stack Trace

Error: expect(received).toEqual(expected) // deep equality

- Expected  - 1
+ Received  + 0

  Array [
    "kibana.1.log",
-   "kibana.2.log",
    "kibana.log",
  ]
    at Object.<anonymous> (/dev/shm/workspace/parallel/4/kibana/src/core/server/logging/integration_tests/rolling_file_appender.test.ts:103:28)

Kibana Pipeline / general / Jest Integration Tests.src/core/server/logging/integration_tests.RollingFileAppender `size-limit` policy with `numeric` strategy only keep the correct number of files

Link to Jenkins

Standard Out

Failed Tests Reporter:
  - Test has not failed recently on tracked branches


Stack Trace

Error: expect(received).toEqual(expected) // deep equality

- Expected  - 1
+ Received  + 0

  Array [
    "kibana-1.log",
-   "kibana-2.log",
    "kibana.log",
  ]
    at Object.<anonymous> (/dev/shm/workspace/parallel/4/kibana/src/core/server/logging/integration_tests/rolling_file_appender.test.ts:150:28)

Metrics [docs]

✅ unchanged

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

afharo commented Apr 20, 2022

Closing as this POC is no longer relevant.

@afharo afharo closed this Apr 20, 2022
@afharo afharo deleted the event-based-telemetry branch October 1, 2024 08:35
Labels
release_note:skip Skip the PR/issue when compiling release notes

2 participants