Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve eBPF context propagation stability #368

Merged
merged 72 commits into from
Oct 15, 2023

Conversation

RonFed
Copy link
Contributor

@RonFed RonFed commented Oct 10, 2023

The main goal of this PR is to improve the context propagation in the eBPF code by making it more stable across kernel versions and across instrumented libraries. Those fixes are the result of investigation of Keyval's customers running go auto instrumentation in production.
The changes can be divided into a few categories:

  1. Memory allocation: we use bpd_probe_write_user helper function in order to modify user memory. The main challenge is to write to memory which we "allocate" since page faults won't get handled if they occur while the eBPF code attempts to write to such addresses. In order to reduce the page faults I added the following changes:
  • madvise and mlock syscalls after we perform the mmap.
  • allocate 8 pages per CPU. This number is a result of some experiments I did but the tradeoff as I understand it is: on one side we want to have enough memory available for handling a large number of instrumneted functions without overriding exiting trace memory. On the other side, allocating too much memory increases the chance of page faults - which will cause the write operations to fail, and damage the context propagation.
  • Since the added code, I added allocate.go
  1. Improving the resiliancy of context propagation relevant functions in the eBPF code. This mainly include error handling and a unified API for all the uprobes.
  2. Extracting the Go context key via a new util function.
  3. Adding utils function in the eBPF code for reading user memory strings.
  4. Change the http server uprobe name to better fit the instrumented function

@RonFed RonFed marked this pull request as ready for review October 10, 2023 21:39
@RonFed RonFed requested a review from a team October 10, 2023 21:39
@edeNFed
Copy link
Contributor

edeNFed commented Oct 14, 2023

Great work! The code looks much better now. I left a small question regarding stop_tracking_span

@edeNFed edeNFed merged commit 9a08e90 into open-telemetry:main Oct 15, 2023
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants