-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[APM] Collect telemetry about data/API performance #51612
Conversation
07a0d9b
to
0d39b5a
Compare
💔 Build Failed
|
Spoke to the telemetry folks, we should wrap this in a task that stores data to saved objects (similar to what we had), and we should investigate more aggressive pre-processing on our side (e.g., min/max/avg/percentiles) |
0d39b5a
to
d83b37e
Compare
💔 Build Failed
|
d83b37e
to
a2d96cc
Compare
💔 Build Failed
|
a2d96cc
to
caee49b
Compare
Pinging @elastic/apm-ui (Team:apm) |
x-pack/legacy/plugins/apm/scripts/upload-telemetry-data/download-telemetry-mapping.ts
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm concerned over the current implementation, specifically about:
- writing directly to the
xpack-phone-home
index - using a dynamic mapping for the
apm
data that's being overwritten in thexpack-phone-home
index which is, on purpose, not mapped dynamically. - At the moment, the
xpack-home-phone
template is updated when new mappings are added - Adding yet another collector
I'd like some other members of the Pulse team to take a look before specifically approving/requesting changes and, while I realize the target is for 7.7, I hope you understand the concerns.
} | ||
"apm-telemetry": { | ||
"properties": {}, | ||
"dynamic": true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dynamic mappings aren't set within the current telemetry service. The properties have to be defined explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I understand these mappings are for the SavedObjects. It should not cause any effect on the telemetry service. But it's true I think it's considered an antipattern the dynamic: true
mappings in the SavedObjects as well (maybe the @elastic/kibana-platform team can confirm).
@dgieselaar mentioned it was dynamic only during the development phase. Just please don't forget to change this to the actual mapping when done :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason for having dynamic:true
was given in a Slack message:
...the dynamic mapping is temporary. Once the data that we are collecting is finalized I was planning on adding a static mapping
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The mapping for saved objects is no longer dynamic, thanks!
x-pack/legacy/plugins/apm/scripts/upload-telemetry-data/index.ts
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, it looks great! I only added some FYIs and nits.
The only main thing I'd personally don't do here is the upload-telemetry-data
script. Unless there's a reason to have it I can't think of at the moment.
ruby: 'ruby', | ||
go: 'go' | ||
}; | ||
export type AgentName = typeof AGENT_NAMES[number]; | ||
|
||
export function isAgentName(agentName: string): boolean { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
export function isAgentName(agentName: string): boolean { | |
export function isAgentName(agentName: string): agentName is AgentName { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏻
); | ||
export function isRumAgentName( | ||
agentName: string | undefined | ||
): agentName is AgentName { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
): agentName is AgentName { | |
): agentName is 'js-base' | 'rum-js' { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🙌🏻
export const AGENT_NAMES = [ | ||
'java', | ||
'js-base', | ||
'rum-js', | ||
'dotnet', | ||
'go', | ||
'java', | ||
'nodejs', | ||
'python', | ||
'ruby' | ||
] as const; | ||
|
||
const agentNames: { [agentName in AgentName]: agentName } = { | ||
python: 'python', | ||
java: 'java', | ||
nodejs: 'nodejs', | ||
'js-base': 'js-base', | ||
'rum-js': 'rum-js', | ||
dotnet: 'dotnet', | ||
ruby: 'ruby', | ||
go: 'go' | ||
}; | ||
export type AgentName = typeof AGENT_NAMES[number]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
} | ||
"apm-telemetry": { | ||
"properties": {}, | ||
"dynamic": true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I understand these mappings are for the SavedObjects. It should not cause any effect on the telemetry service. But it's true I think it's considered an antipattern the dynamic: true
mappings in the SavedObjects as well (maybe the @elastic/kibana-platform team can confirm).
@dgieselaar mentioned it was dynamic only during the development phase. Just please don't forget to change this to the actual mapping when done :)
bool: { | ||
filter: [{ range: { '@timestamp': { gte: 'now-30d' } } }] | ||
} | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
size: 0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see other reply
taskManager.ensureScheduled({ | ||
id: APM_TELEMETRY_TASK_NAME, | ||
taskType: APM_TELEMETRY_TASK_NAME, | ||
interval: '1m', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love this taskManager approach to get all the stats ready to consume from the savedObject at any point when the usage collector requires it.
Especially taking into account all the requests we do to retrieve this telemetry. That's a good amount of information! Well done! 👏
Just FYI, we usually report telemetry every 24h so, maybe running the collection logic every minute is a bit too much? ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed this to 24h - another artifact of "dev mode" 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw, let me know if it should be anything else than 24h
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it to 720m (12 hours, the task manager only supports seconds and minutes)
} | ||
APM_TELEMETRY_SAVED_OBJECT_TYPE, | ||
dataTelemetry, | ||
{ id: APM_TELEMETRY_SAVED_OBJECT_TYPE, overwrite: true } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a risk we might delete existing telemetry because of
https://github.com/elastic/kibana/blob/caee49b307bf9a6c8c1c146b1a7c35fd044fd894/x-pack/legacy/plugins/apm/server/lib/apm_telemetry/collect_data_telemetry/index.ts#L64-L68
But the chances should be low so I wouldn't worry too much about it (just a heads up) ;)
export type APMDataTelemetry = DeepPartial<{ | ||
has_any_services: boolean; | ||
services_per_agent: Record<AgentName, number>; | ||
versions: { | ||
apm_server: { | ||
minor: number; | ||
major: number; | ||
patch: number; | ||
}; | ||
}; | ||
counts: { | ||
transaction: TimeframeMap; | ||
span: TimeframeMap; | ||
error: TimeframeMap; | ||
metric: TimeframeMap; | ||
sourcemap: TimeframeMap; | ||
onboarding: TimeframeMap; | ||
agent_configuration: TimeframeAll; | ||
max_transaction_groups_per_service: TimeframeAll; | ||
max_error_groups_per_service: TimeframeAll; | ||
traces: TimeframeAll; | ||
services: TimeframeAll; | ||
}; | ||
integrations: { | ||
ml: { | ||
has_anomalies_indices: boolean; | ||
}; | ||
}; | ||
agents: Record< | ||
AgentName, | ||
{ | ||
agent: { | ||
version: string[]; | ||
}; | ||
service: { | ||
framework: { | ||
name: string[]; | ||
version: string[]; | ||
}; | ||
language: { | ||
name: string[]; | ||
version: string[]; | ||
}; | ||
runtime: { | ||
name: string[]; | ||
version: string[]; | ||
}; | ||
}; | ||
} | ||
>; | ||
indices: { | ||
shards: { | ||
total: number; | ||
}; | ||
all: { | ||
total: { | ||
docs: { | ||
count: number; | ||
}; | ||
store: { | ||
size_in_bytes: number; | ||
}; | ||
}; | ||
}; | ||
}; | ||
tasks: Record< | ||
| 'processor_events' | ||
| 'agent_configuration' | ||
| 'services' | ||
| 'versions' | ||
| 'groupings' | ||
| 'integrations' | ||
| 'agents' | ||
| 'indices_stats', | ||
{ took: { ms: number } } | ||
>; | ||
}>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's amazing to have this! This will help us much to create the mapping in the telemetry cluster! Thanks for being so thoughtful and strict with the types! ❤️
x-pack/plugins/apm/kibana.json
Outdated
"apm_oss", | ||
"data", | ||
"home", | ||
"usageCollection" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the recommendation is to import this plugin as optional:
https://github.com/elastic/kibana/blob/master/src/plugins/usage_collection/README.md#new-platform
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, thanks!
From a Slack message:
|
5b0c188
to
2197df8
Compare
65f17d2
to
55355a6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
@@ -1,42 +1,734 @@ | |||
{ | |||
"apm-services-telemetry": { | |||
"apm-telemetry": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you! this is so awesome will be so useful when we add the mapping in the telemetry cluster. Please refer to this file when requesting it :)
}, | ||
"java": { | ||
"type": "long", | ||
"null_value": 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if the removal of the "null_value" might cause any issues?
Feel free to dismiss this if this change is intended :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch - I don't think we need a fallback anymore but I'll put it there to prevent any backwards compat issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm 👍
*/ | ||
|
||
// compile typescript on the fly | ||
// eslint-disable-next-line import/no-extraneous-dependencies |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a description to this script about what it does
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aye, I could do it in this file, in dev_docs or in a README.md file in this folder, any preferences?
...(githubToken | ||
? { | ||
auth: githubToken | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is githubToken
not required? If the repo is private I'd expect this to require a token.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'll just fail the process in that case. Can't remember why I made it optional.
...(yaml.safeLoad( | ||
fs.readFileSync( | ||
fs.existsSync(kibanaDevConfig) ? kibanaDevConfig : kibanaConfig, | ||
'utf8' | ||
) | ||
) as {}), | ||
...(pick( | ||
{ | ||
'elasticsearch.username': process.env.ELASTICSEARCH_USERNAME, | ||
'elasticsearch.password': process.env.ELASTICSEARCH_PASSWORD, | ||
'elasticsearch.hosts': process.env.ELASTICSEARCH_HOST | ||
}, | ||
identity | ||
) as { | ||
'elasticsearch.username': string; | ||
'elasticsearch.password': string; | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps it would be cleaner with safeLoad
done separately?
Also: shouldn't elasticsearch.hosts
be in the explicit type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll split it up a bit. elasticsearch.hosts
is already in the type because it has a default value, but it doesn't hurt to have it in both, I'll add it there as well (it looks odd).
'nodejs', | ||
'python', | ||
'ruby' | ||
]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorting <3
@@ -7,7 +7,13 @@ | |||
export const SERVICE_NAME = 'service.name'; | |||
export const SERVICE_ENVIRONMENT = 'service.environment'; | |||
export const SERVICE_AGENT_NAME = 'agent.name'; | |||
export const SERVICE_AGENT_VERSION = 'agent.version'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think SERVICE_AGENT_NAME
and SERVICE_AGENT_VERSION
should be called AGENT_NAME
and AGENT_VERSION
since they are no longer on the "service" context (they used to be afair)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah yeah was wondering why that was the case. I'll update
} | ||
) { | ||
const logger = this.initContext.logger.get('apm'); | ||
const logger = this.initContext.logger.get(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why no longer in the context of 'apm'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's already scoped, so scoping it again leads to [apm] [apm]
output
642543f
to
cd8d753
Compare
cd8d753
to
72b7ae9
Compare
serviceMapEnabled: Joi.boolean().default(true), | ||
|
||
// telemetry | ||
telemetryCollectionEnabled: Joi.boolean().default(true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we want a specific flag to disable APM-only telemetry (while the rest of the telemetry is enabled)? Wouldn't that complicate the analysis about the usage? aka: how can I tell APM is used but not reporting telemetry because it's disabled?
We could do that by checking the application_usage.apm
stats... but I'd like @alexfrancoeur thoughts on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be great to get some additional feedback 👍🏻the reason why I've added it because we want an escape hatch in case the queries are too expensive, without requiring our users to disable telemetry altogether in those cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@afharo If you don't mind, I'll merge this in given the time constraints, but let me know if there's anything we need to change.
💚 Build SucceededHistory
To update your PR or re-run it, just comment with: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yarn.lock config changes LGTM
This was causing type check failures in master once it was merged so it was reverted in 6de7f2a |
* master: (34 commits) [APM] add service map config options to legacy plugin (elastic#61002) [App Arch] migrate legacy CSS to new platform (core_plugins/kibana_react) (elastic#59882) Migrated styles for "share" plugin to new platform (elastic#59981) [ML] Module setup with dynamic model memory estimation (elastic#60656) Drilldowns (elastic#59632) Upgrade mocha dev-dependency from 6.2.2 to 7.1.1 (elastic#60779) [SIEM] Overview: Recent cases widget (elastic#60993) [ML] Functional tests - stabilize df analytics clone tests (elastic#60497) [SIEM] Updates process and TLS tables to use ECS 1.5 fields (elastic#60854) Migrate doc view part of discover (elastic#58094) Revert "[APM] Collect telemetry about data/API performance (elastic#51612)" fix(NA): log rotation watchers usage (elastic#60956) [SIEM] [CASES] Build lego blocks case details view (elastic#60864) Create Painless Lab app (elastic#57538) [SIEM] Move Timeline Template field to first step of rule creation (elastic#60840) [Reporting/New Platform Migration] Use a new config service on server-side (elastic#55882) [Alerting] allow email action to not require auth (elastic#60839) [Maps] Default ES document layer scaling type to clusters and show scaling UI in the create wizard (elastic#60668) [APM] Collect telemetry about data/API performance (elastic#51612) Implement Kibana Login Selector (elastic#53010) ...
…lastic#51612)"" This reverts commit 6de7f2a.
…ic#61030) * Revert "Revert "[APM] Collect telemetry about data/API performance (elastic#51612)"" This reverts commit 6de7f2a. * Update transaction mock data to reflect the type
Closes #50757.
An example of collected telemetry: