Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[On-Week] Hot update of APM/EBT labels #157093

Merged
merged 50 commits into from
Aug 31, 2023

Conversation

afharo
Copy link
Member

@afharo afharo commented May 8, 2023

Summary

This PR is exploring the ability to update the APM/EBT labels on the fly.

Context

During test executions and, more specifically, during Perf and Scalability test executions, we rely on APM labels to tag the transactions that belong to each tested scenario, PR, branch, and test stage (warmup vs. actual test). However, up until now, we only could update these via kibana.yml config, which implies restarting Kibana whenever anything of the tagging info needs to be updated.

This is also a problem when testing on Cloud and Serverless because restarting is even more expensive.

Alternatively, we could update the file kibana.yml and send the SIGHUP signal to Kibana. However, this solution wouldn't work on Cloud nor Serverless because we don't have access to either re-writing the kibana.yml or triggering the SIGHUP signal to the running process.

The adopted solution

We are exploring adding an HTTP API that CI can call whenever it needs to update the context. In order to do that, we need to be able to extend the configuration loaded via kibana.yml, CLI, and Env vars in a dynamic way.

This PR introduces the concept of dynamicConfiguration for plugins to opt-in for specific settings. They can be updated via PUT /internal/core/_settings. The settings in the body are validated against the dynamicConfiguration list and the actual config schemas.

In this PR, we only opt-in for telemetry.labels (validation schema).

It can be used in any of the forms:

PUT kbn:/internal/core/_settings
{
  "telemetry.labels": {
    "branch": "my-test-branch"
  }
}

PUT kbn:/internal/core/_settings
{
  "telemetry.labels.branch": "afharo-test-1"
}

PUT kbn:/internal/core/_settings
{
  "telemetry": {
    "labels": {
      "branch": "afharo-test-1"
    }
  } 
}

Why this solution?
This solution allows us to:
✅ Temporarily change certain settings: the setting overrides are only stored in memory, so restarts would wipe them.
✅ Control the settings of each individual node: similar to kibana.yml, these settings are applied to the node handling the request.
Overriding config means the UI has it during bootstrap: the page-load transactions include the correct configuration because they don't require an additional HTTP request to fetch the config from another plugin/service.
✅ Immediately refreshed config

What's not that good about this solution?
Some flaws of this solution:
⚠️ Requires calling it for every node: if we test any autoscaling capabilities, we'll need to apply the settings to the new nodes as soon as they come up.
⚠️ Despite being flagged as an internal API, external users may think this is persisted and shared in some way.

👆 If we find that persistence is a requirement, we could store it as a SO (although we may miss the immediately refreshed config capability).

Alternative options considered

Exposing a custom API inside the telemetry plugin

The telemetry plugin could expose the API PUT /internal/telemetry/labels to update those.

✅ Separation of concerns
✅ No need to support this from core
🚫 The UI needs to call GET /api/v2/telemetry/config to fetch the dynamically changed values. This means early transactions/events like page-load might miss the dynamically-appended labels.

UI Settings

We could use Globally-Scoped UI Settings for this.

✅ They are injected into the page load
✅ They are persisted as SO and shared to all instances
✅ Good for autoscaling
✅ Good for updating through proxies
⚠️ Can't control individual tags per instance
🚫 They persist through restarts (not so good for testing multiple scenarios for the same ES instance)
🚫 They are visible in the Stack Management view
🚫 It requires plugins to handle and merge both config providers.

TODO

  • @dmlemeshko to validate this API is valid to be called from the CI/tests runner
  • @afharo to explore how to push the update to the browser
  • Update APM version once feat: add apm.setGlobalLabel() apm-agent-nodejs#3337 is published
  • Validate with the Security folks that this new HTTP API is safe to use
  • Validate with Kibana Core folks if we want to open this pandora box
  • @afharo to add more tests

Checklist

For maintainers

@afharo afharo added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc wg:performance Work tracked by the performance workgroup Spacetime Team:Performance labels May 8, 2023
@afharo afharo self-assigned this May 8, 2023
@dmlemeshko dmlemeshko added the ci:cloud-deploy Create or update a Cloud deployment label May 10, 2023
package.json Outdated
@@ -788,7 +788,7 @@
"deepmerge": "^4.2.2",
"del": "^6.1.0",
"elastic-apm-http-client": "^11.0.1",
"elastic-apm-node": "^3.45.0",
"elastic-apm-node": "git+https://github.com/afharo/apm-agent-nodejs.git#add-agent.setGlobalLabel",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can't be merged until elastic/apm-agent-nodejs#3337 is merged and published

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR is merged. Do we know when we're planning on publishing the version shipping the change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@david-luna can you help us with this?


router.versioned
.put({
path: '/_settings',
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using PUT /_settings to update the dynamic options. I went with this approach instead of UI Settings because we don't need persistence through restarts... Also, this allows us to set specific settings for each Kibana node in case we have multiple instances.

But happy to iterate on that and move this to UI Settings if we don't want to open this pandora box.

@@ -6,6 +6,7 @@
"id": "telemetry",
"server": true,
"browser": true,
"enabledOnAnonymousPages": true,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this allows us to send telemetry on the login screen

Copy link
Contributor

@pgayvallet pgayvallet May 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the new config keys that are exposed to anon page, I wonder, Is this fine, from a security standpoint?

E.g we're now surfacing the following config props to anon pages:

'telemetry.sendUsageFrom (alternatives)',
'telemetry.sendUsageTo (any)',

Are those safe to be accessed anonymously?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they are safe:

  • sendUsageFrom can be either server or browser
  • sendUsageTo is either prod or staging

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, then I think so too yeah. I was afraid those might be urls or something

tap((labels) => {
// Hack to update the APM agent's labels.
// In the future we might want to expose APM as a core service to make reporting metrics much easier.
window.elasticApm?.addLabels(labels);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the UI, the agent is global, so addLabels works great.

However, on the server, the nodejs agent needs the global labels setter because addLabels only applies to the current transaction.

Comment on lines 39 to 41
options: {
authRequired: 'optional',
},
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this allows us to fetch configuration while on the login screen.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume there is nothing sensitive in the response, right?

Copy link
Member Author

@afharo afharo Aug 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The structure of the output is declared in

body: schema.object({
allowChangingOptInStatus: schema.boolean(),
optIn: schema.oneOf([schema.boolean(), schema.literal(null)]),
sendUsageFrom: schema.oneOf([schema.literal('server'), schema.literal('browser')]),
telemetryNotifyUserAboutOptInDefault: schema.boolean(),
labels: labelsSchema,
}),

And our awesome tests for the versioned router will fail validation (proved) if any additional info is exposed (kudos to @jloleysens 🧡)

@afharo afharo force-pushed the onweek/hot-update-of-apm-labels branch from 01c251e to bdadfab Compare August 4, 2023 16:35
@afharo
Copy link
Member Author

afharo commented Aug 4, 2023

Also, since this functionality is currently only needed during tests, have we considered gating it with a "dev-only"-like flag (regardless of the approach we take here)?

I've hidden the feature behind a feature flag that it's only enabled on Serverless (we could decide to disable it on Serverless Prod as well if necessary). On top of that, it's an internal API, so external actors shouldn't be able to call the API.

Copy link
Member

@azasypkin azasypkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I won't comment on the general functionality of coreApp.allowDynamicConfigOverrides. If it's indeed a necessity for your team, it's acceptable from a security standpoint as long as you:

  1. ✔️ Ensure that this API is hidden behind a non-default flag.
  2. ✔️ Require consumers to explicitly opt-in for configuration values that can be altered using this API.
  3. ✔️ Restrict access to this API to superusers only. While less stringent than our prior requirement of filesystem access, this is acceptable given our other implemented restrictions.
  4. ⛔ Introduce a snapshot-based test for dynamicConfig, similar to the one we have for exposeToBrowser so that we're notified whenever someone decides to utilize it.

Comment on lines 39 to 41
options: {
authRequired: 'optional',
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume there is nothing sensitive in the response, right?

Comment on lines +57 to +71
/**
* This test is meant to fail when any setting is flagged as capable
* of dynamic configuration {@link PluginConfigDescriptor.dynamicConfig}.
*
* Please, add your settings to the list with a comment of why it's required to be dynamic.
*
* The intent is to trigger a code review from the Core and Security teams to discuss potential issues.
*/
test('detecting all the settings that have opted-in for dynamic in-memory updates', () => {
expect(getListOfDynamicConfigPaths()).toStrictEqual([
// We need this for enriching our Perf tests with more valuable data regarding the steps of the test
// Also helpful in Cloud & Serverless testing because we can't control the labels in those offerings
'telemetry.labels',
]);
});
Copy link
Member Author

@afharo afharo Aug 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@azasypkin, I added the test you requested 😇 in the commit 2bbc917

Copy link
Member

@azasypkin azasypkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM from the security point of view 👍 Thanks for adding tests!

Copy link
Contributor

@pgayvallet pgayvallet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@TinaHeiligers TinaHeiligers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@dmlemeshko dmlemeshko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested locally and for Cloud deployment using kibana-a-la-carte. It works as expected. Great work @afharo !

@kibana-ci
Copy link
Collaborator

kibana-ci commented Aug 31, 2023

💚 Build Succeeded

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
@kbn/core-apps-server-internal 8 9 +1

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
telemetry 19.2KB 19.4KB +259.0B
Unknown metric groups

API count

id before after diff
@kbn/core-apps-server-internal 8 9 +1
@kbn/core-plugins-server 58 60 +2
total +3

ESLint disabled line counts

id before after diff
@kbn/core 8 9 +1

Total ESLint disabled count

id before after diff
@kbn/core 9 10 +1

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @afharo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting ci:cloud-deploy Create or update a Cloud deployment release_note:skip Skip the PR/issue when compiling release notes Spacetime Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Team:Performance Team:QA Team label for QA Team Team:Security Team focused on: Auth, Users, Roles, Spaces, Audit Logging, and more! v8.11.0 wg:performance Work tracked by the performance workgroup
Projects
None yet
Development

Successfully merging this pull request may close these issues.