feat: fetch sourcemaps from Elasticsearch #9722
Add a new caching metadata fetcher that fetches sourcemap metadata from ES and forwards the request to the backend fetcher if the sourcemap exists. The backend fetcher is an ES fetcher wrapped in an LRU cache that caches the sourcemap body. Sourcemap metadata is periodically synced with ES by the caching metadata fetcher. Keep the Fleet and Kibana fetchers until the UI team has updated the sourcemap uploading flow. Update the ES sourcemap response to include sourcemap metadata.
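For readers following along, here is a minimal sketch of the "backend fetcher wrapped in an LRU cache" idea from the description. It is not the code in this PR: `lruFetcher`, `fetchFunc`, `entry` and the string-keyed `Fetch` signature are hypothetical names chosen for illustration, and the cache is hand-rolled on `container/list` rather than taken from any particular LRU library.

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
)

// fetchFunc stands in for the ES fetcher (hypothetical signature).
type fetchFunc func(key string) (string, error)

// entry is what each list element stores.
type entry struct{ key, body string }

// lruFetcher caches sourcemap bodies and evicts the least recently used one.
type lruFetcher struct {
	mu       sync.Mutex
	capacity int
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // key -> element holding *entry
	backend  fetchFunc
}

func newLRUFetcher(capacity int, backend fetchFunc) *lruFetcher {
	return &lruFetcher{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[string]*list.Element),
		backend:  backend,
	}
}

// Fetch returns the cached body or asks the backend and caches the result.
// For simplicity the lock is held during the backend call.
func (l *lruFetcher) Fetch(key string) (string, error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if el, ok := l.items[key]; ok {
		l.order.MoveToFront(el)
		return el.Value.(*entry).body, nil
	}
	body, err := l.backend(key)
	if err != nil {
		return "", err
	}
	l.items[key] = l.order.PushFront(&entry{key, body})
	if l.order.Len() > l.capacity {
		oldest := l.order.Back()
		l.order.Remove(oldest)
		delete(l.items, oldest.Value.(*entry).key)
	}
	return body, nil
}

func main() {
	calls := 0
	f := newLRUFetcher(2, func(key string) (string, error) {
		calls++ // counts round trips to the (simulated) ES backend
		return "body of " + key, nil
	})
	for _, key := range []string{"a.js.map", "b.js.map", "a.js.map"} {
		if _, err := f.Fetch(key); err != nil {
			panic(err)
		}
	}
	fmt.Println("backend calls:", calls)
}
```

Errors are deliberately not cached in this sketch, so a failed ES request is retried on the next fetch; whether that matches the intended behavior here is an open design choice.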
This pull request does not have a backport label. Could you fix it @kruskall? 🙏
NOTE: sourcemaps are compressed with zlib and encoded as base64; keep that in mind when retrieving the content from ES. The document structure has been updated to reflect the latest commit in the Kibana PR. Use the _id to fetch the actual sourcemap, since the service fields are doc-value-only and cannot be used for filtering. Log an error whenever sync fails.
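As a rough illustration of the note above, decoding a stored sourcemap means base64-decoding first and then inflating with zlib. The sketch below uses only the Go standard library; `decodeSourcemap` and `encodeSourcemap` are hypothetical helper names, and the use of the standard base64 alphabet (`base64.StdEncoding`) is an assumption, since the note does not say which variant Kibana writes.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"encoding/base64"
	"fmt"
	"io"
)

// encodeSourcemap compresses raw with zlib and base64-encodes the result.
// It exists only so the example can round-trip without a live ES cluster.
func encodeSourcemap(raw []byte) (string, error) {
	var buf bytes.Buffer
	zw := zlib.NewWriter(&buf)
	if _, err := zw.Write(raw); err != nil {
		return "", err
	}
	if err := zw.Close(); err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(buf.Bytes()), nil
}

// decodeSourcemap reverses the encoding: base64-decode, then zlib-inflate.
// Assumption: the standard base64 alphabet with padding.
func decodeSourcemap(encoded string) ([]byte, error) {
	compressed, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil {
		return nil, fmt.Errorf("base64 decode: %w", err)
	}
	zr, err := zlib.NewReader(bytes.NewReader(compressed))
	if err != nil {
		return nil, fmt.Errorf("zlib reader: %w", err)
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

func main() {
	enc, err := encodeSourcemap([]byte(`{"version":3,"sources":["app.js"]}`))
	if err != nil {
		panic(err)
	}
	dec, err := decodeSourcemap(enc)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(dec))
}
```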
The new Kibana PR uploads sourcemaps to ES and migrates old ones to it. We can restore the sourcemap test to use the standard method instead of uploading the sourcemap manually to ES.
The sourcemap cache is now refreshed whenever the content hash changes or a sourcemap is removed from ES. Update the sourcemap cache to use an LRU cache.
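One way to picture that refresh rule: on every sync, compare the cached metadata with what ES currently reports and invalidate any entry that disappeared or whose content hash differs. This is a sketch under assumptions, not the PR's implementation; `identifier`, `metadata`, `contentHash` and `syncMetadata` are hypothetical names.

```go
package main

import "fmt"

// identifier is a hypothetical cache key: service name, version and bundle path.
type identifier struct {
	name, version, path string
}

// metadata is a hypothetical cache entry holding the ES _id and content hash.
type metadata struct {
	id          string
	contentHash string
}

// syncMetadata returns a copy of the latest ES view together with the
// identifiers whose cached bodies must be dropped: entries removed from ES
// and entries whose content hash changed since the previous sync.
func syncMetadata(cached, latest map[identifier]metadata) (map[identifier]metadata, []identifier) {
	var invalidate []identifier
	for key, old := range cached {
		cur, ok := latest[key]
		if !ok || cur.contentHash != old.contentHash {
			invalidate = append(invalidate, key)
		}
	}
	next := make(map[identifier]metadata, len(latest))
	for key, m := range latest {
		next[key] = m
	}
	return next, invalidate
}

func main() {
	a := identifier{"web", "1.0", "/static/a.js"}
	b := identifier{"web", "1.0", "/static/b.js"}
	cached := map[identifier]metadata{
		a: {id: "doc-a", contentHash: "h1"},
		b: {id: "doc-b", contentHash: "h2"},
	}
	latest := map[identifier]metadata{
		a: {id: "doc-a", contentHash: "h1-changed"}, // re-uploaded
		// b was deleted from ES
	}
	next, invalidate := syncMetadata(cached, latest)
	fmt.Println(len(next), len(invalidate))
}
```

The order of the returned identifiers follows Go's map iteration and is therefore not stable; callers should treat it as a set.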
How to test:
This pull request is now in conflicts. Could you fix it @kruskall? 🙏
Generally looks good; I left some comments related to updating the cache. Please also update the description, as the fallback logic is now removed.
We should also add metrics for better insight into the sourcemap handling: at least track the number of requests to ES for sourcemaps and the success of metadata requests.
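To make the metrics suggestion concrete, a minimal sketch could keep one counter for ES sourcemap requests and a success/failure pair for metadata syncs. Everything here is hypothetical (the `fetcherMetrics` type, its field names and `errUnreachable`); it uses plain `sync/atomic` counters (Go 1.19+) rather than whatever monitoring registry APM Server actually exposes.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// fetcherMetrics is a hypothetical set of counters for sourcemap handling.
type fetcherMetrics struct {
	esRequests      atomic.Int64 // sourcemap body requests sent to ES
	metadataSuccess atomic.Int64 // metadata syncs that completed
	metadataFailure atomic.Int64 // metadata syncs that returned an error
}

// errUnreachable is a placeholder error used only by this example.
var errUnreachable = errors.New("elasticsearch unreachable")

// recordESRequest is called once per sourcemap request forwarded to ES.
func (m *fetcherMetrics) recordESRequest() {
	m.esRequests.Add(1)
}

// recordMetadataSync is called after each periodic metadata sync.
func (m *fetcherMetrics) recordMetadataSync(err error) {
	if err != nil {
		m.metadataFailure.Add(1)
		return
	}
	m.metadataSuccess.Add(1)
}

func main() {
	var m fetcherMetrics
	m.recordESRequest()
	m.recordMetadataSync(nil)
	m.recordMetadataSync(errUnreachable)
	fmt.Printf("es_requests=%d metadata_success=%d metadata_failure=%d\n",
		m.esRequests.Load(), m.metadataSuccess.Load(), m.metadataFailure.Load())
}
```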
internal/sourcemap/metadata.go
Outdated
if len(s.set) == 0 {
	// If cache is empty and we're trying to fetch a sourcemap
	// make sure the initial cache population has ended.
	s.once.Do(func() {
What happens if ES is not reachable at this point but becomes reachable later?
Looking back at this, I don't think it's a good idea: this will basically block until the cache is populated, since we cannot retrieve sourcemaps until then with the current implementation.
I'm not sure what the best way to proceed is here. Returning nothing might be misleading. Should we bypass the cache and send the request to ES if the cache is not populated yet?
This would be if the metadata cache is not ready. If a sourcemap is part of the metadata cache but the body is not yet cached, it will be fetched from ES on request.
So let's take a look at the metadata cache. What are the scenarios for when the cache might not be ready?
- ES is not reachable, due to a restart, misconfiguration, etc.: in this case, trying to fetch directly from ES would not succeed either.
- Fetching metadata from ES is ongoing. The fetching and cache update for metadata can be expected to be fairly quick, even when many sourcemaps are stored. Bypassing the metadata cache would work here, but seems like a hack. I'd rather either block the request processing (and collect logs & metrics so we get an understanding of how long this blocking period is) or treat it as if no sourcemap is available (and again, collect logs & metrics for the scenario). The third option would be to block the APM Server listeners until everything is set up, but this would definitely lead to APM event loss, which is definitely worse.
> ES is not reachable, due to a restart, misconfiguration, etc.: in this case, trying to fetch directly from ES would not succeed either.
I don't think the metadata cache should be concerned if ES is not ready/reachable/available. A failure in that scenario would be "working as intended" imo, because the metadata cache is proxying what it got from the ES fetcher (error, timeout, etc.).
> Fetching metadata from ES is ongoing. The fetching and cache update for metadata can be expected to be fairly quick, even when many sourcemaps are stored. Bypassing the metadata cache would work here, but seems like a hack. I'd rather either block the request processing (and collect logs & metrics so we get an understanding how long this blocking period is) or treat it as no sourcemap is available (and again, collect logs & metrics for the scenario). The third option would be to block the APM Server listeners until everything is readily set up, but this would definitely lead to apm event loss, which is definitely worse.
I was thinking of "if init is not completed, forward the request to the backend fetcher". In theory the first request for each sourcemap would be forwarded to ES anyway, so this wouldn't create additional work unless we get a request for a sourcemap that is missing on ES during init. The reasoning behind this was that when the metadata cache is not initialized:
- if the sourcemap exists on ES: the metadata fetcher will send the request to the ES fetcher anyway. No additional time/work/blocking: this would be the same as the metadata cache receiving a request for the first time.
- if the sourcemap does not exist on ES: the metadata fetcher will send the request to the ES fetcher, creating additional work. The additional processing time is offset by the fact that we would have had to wait for the init phase to finish if we had blocked, so my assumption was that there wouldn't be a time difference between the two in the end.
WDYT?
So this is really used for triggering the first fetch, in case it hasn't yet been initiated?
The logic has been reworked: we now forward the request to ES if the metadata cache is not ready.
This does not create additional costs since the response is cached in the LRU cache.
The existing behavior pre |
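The reworked behavior described in this comment could look roughly like the sketch below: while the metadata cache is not yet initialized, every request is forwarded to the backend; once it is ready, only sourcemaps known to the metadata set are forwarded. This is an illustration under assumptions, not the merged code: `metadataFetcher`, `markReady`, `countingBackend`, `errNoSourcemap` and the simplified string-based `Fetch` signature are all hypothetical, and the LRU layer is left out.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errNoSourcemap is a hypothetical sentinel for "nothing to apply".
var errNoSourcemap = errors.New("no sourcemap for identifier")

// fetcher is a simplified, hypothetical fetcher interface.
type fetcher interface {
	Fetch(name, version, path string) (string, error)
}

// key builds a hypothetical cache key from the sourcemap identifier.
func key(name, version, path string) string {
	return name + "|" + version + "|" + path
}

// countingBackend simulates the ES fetcher and counts requests.
type countingBackend struct {
	calls  int
	bodies map[string]string
}

func (b *countingBackend) Fetch(name, version, path string) (string, error) {
	b.calls++
	body, ok := b.bodies[key(name, version, path)]
	if !ok {
		return "", errNoSourcemap
	}
	return body, nil
}

// metadataFetcher gates requests on the metadata set once it is ready.
type metadataFetcher struct {
	mu      sync.RWMutex
	ready   bool
	set     map[string]struct{}
	backend fetcher
}

// markReady stores the synced metadata keys and flips the ready flag.
func (m *metadataFetcher) markReady(keys ...string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.set = make(map[string]struct{}, len(keys))
	for _, k := range keys {
		m.set[k] = struct{}{}
	}
	m.ready = true
}

// Fetch forwards to the backend when init is not done or the sourcemap is known.
func (m *metadataFetcher) Fetch(name, version, path string) (string, error) {
	m.mu.RLock()
	ready := m.ready
	_, known := m.set[key(name, version, path)]
	m.mu.RUnlock()
	if !ready || known {
		return m.backend.Fetch(name, version, path)
	}
	return "", errNoSourcemap // ready and unknown: skip the round trip to ES
}

func main() {
	backend := &countingBackend{bodies: map[string]string{key("web", "1.0", "/a.js"): "{}"}}
	m := &metadataFetcher{backend: backend}
	_, err := m.Fetch("web", "1.0", "/missing.js") // forwarded: cache not ready
	fmt.Println(err, backend.calls)
	m.markReady(key("web", "1.0", "/a.js"))
	_, err = m.Fetch("web", "1.0", "/missing.js") // answered locally
	fmt.Println(err, backend.calls)
}
```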
Changes look good now!
The sourcemap_fetcher_tests only test the happy path; some checks for error cases would also be good, but test coverage looks good enough to be merged now.
Thanks for all the effort that went into this!
@@ -3,6 +3,8 @@ apm_server:
   indices:
     - names: ['apm-*', 'traces-apm*', 'logs-apm*', 'metrics-apm*']
       privileges: ['write','create_index','manage','manage_ilm']
     - names: ['.apm-source-map']
> make the tests use admin credentials for sourcemap only?
I wouldn't suggest this; with admin credentials it is easy to miss breaking changes.
IMO it's fine to define the privilege here, in alignment with the privilege definition for the APM data streams.
First test with Fleet-managed APM Server is successful:
I will run some more tests without Fleet.
I tried running Elastic Agent locally in Docker, communicating with Elasticsearch in an Elastic Cloud us-west2 deployment. Errors were indexed with:
Looking at the logs, this is coming from the metadata fetcher, which has a 1 second ping timeout. Opened #10338
Related issues
Closes #9643