-
Notifications
You must be signed in to change notification settings - Fork 0
/
BPF-NOTES
144 lines (119 loc) · 7.87 KB
/
BPF-NOTES
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
- Perf events are created in enabled (inactive) state when you do not set
attr.disabled = 1. That is convenient because no explicit enabling is
needed. However, we may want to created probe events in disabled state in
the future (i.e. avoid them from firing) until we are ready to start the
tracing for real (equiv. to the old GO ioctl). We'd need to run through the
list of probes we need (created perf events), enable them, and then flip
the master switch.
- A useful feature (that I think current DTrace cannot even do - need to
check) is listing which executable that are currently running on the system
have USDT support.
- Probing a running executable using USDT probes that are defined in it is
an important use case. Can we determine the USDT info from the running
executable as an alternative to the more clunky DTrace-style constructor
that feeds the USDT info to DTrace as DOF through the helper ioctl interface?
- Possible patterns that could be optimized:
mov %rY, IMM becomes add %rX, IMM
add %rX, %rY
Another one (IMM fits in a 32-bit signed value):
mov %rY, IMM becomes stw [%rX+OFF], IMM
stw [%rX+OFF], %rY
Also, if IMM fits in a 32-bit signed value, the following optimization can be
performed:
lddw %rX, ... becomes lddw %rX, ...
add %rX, IMM stb [%rX+IMM], %rY
stb [%rX+0], %rY
- Global variables can be stored in a BPF map, which can be a large, single
block of memory. Variables can be accessed with three BPF instructions:
- load the DTrace context address (on the stack)
- dereference at the right offset for the map value pointer
- dereference using the variable's offset
No function calls are needed (whether to our function or a BPF helper
function), and so these few instructions can be generated each time a
global variable is accessed.
- Local variables can be stored in a per-CPU BPF map, which can be a large,
single block of memory. Access to local variables can be similar to how
we handle global variables.
- In the new design, there is no need for shifting variable ids up because of the
range of built-in variables because we can generate code to obtain the value
of the built-in variables as we compile programs (most likely using a call to
a BPF function, passing the id of the built-in variable).
- Constants should be defined for the standard maps that we will be creating
eBPF programs. The program loading step should then resolve the map values
we put in instructions (using a reloc table style approach) by putting in the
actual fds of the maps that were created.
- The TLS key needs to be carefully calculated because we want to ensure that
it uniquely (and correctly) represents task execution context. The TLS key
stores an adjusted pid value in the lower 60 bits, prefixed by a 4 bit value
indicating whether a hardirq is active. We use 4 bits because the kernel
supports up to 16 levels of nested IRQ (even though it is quite rare).
All per-cpu idle threads share the same pid (0), which means that we need to
adjust the pid to ensure that the idle threads have their own TLS key. We
do this by replacing the pid value (0) by the active CPU id. For tasks where
the pid is not 0 (all tasks other than the idle threads), we add NR_CPUS.
This ensures that the TLS key for idle threads will never conflict with any
regular pid value.
We also add DIF_VARIABLE_MAX to the pid value to assure that the TLS key is
never equal to a variable id. This is necessary to ensure that keys for
global associative array elements will not collide with TLS variables. To
completely guarantee that they cannot collide, we also strictly define the
order of key elements (placing the variable id and TLS key at the end of the
key sequence).
(The last part is only relevant if we stick to the DTrace approach of having
one dynamic variable storage construct. If we were to give each associative
array or TLS variable its own map, there are no collisions possible between
arrays and TLS variables.)
- Dynamic variables are allocated (in legacy DTrace) from a per-cpu buffer
space if at all possible. When the space on the current CPU has been
exhausted, it will try to allocate space from another CPU. If we use BPF
hashmaps for dynamic variables that is no longer possible and we need to
allocate enough entries to accomodate the anticipated usage. Perhaps we can
add a pragma to set the amount of entries per map to allow the user more
control over this. A global hashmap is of course slower than per-cpu storage
but per-cpu BPF maps are AFAIK not able to 'borrow' slots from another CPU.
- While it is tempting to actually implement the actions as function calls, or
at least in a way where all parts of a clause are compiled into the same eBPF
function, this is a bit more complex because DTrace was not designed that
way. It also requires significant changes in the probe data consumption code
because instead of a single data item per ECB, we will now need to support
a tuple of values. The userspace component needs to maintain a list of what
data items are expected in the tuple associated with the ECB because the data
stream does not include any metadata.
Perhaps it would be best if a first version sticks to the one data item per
ECB approach. We can compile each D expression into an eBPF function and
emit a main program that simply calls each eBPF function in turn. That may
give us a sub-optimal version of DTrace based on eBPF, but it allows us to
offer something that semi-works sooner.
However, as it turns out, the legacy code generation associates a strtab with
every D expression that is compiled into its own DIF object. That would
require us to create a map to hold the strtab for every DUF object which is a
massive overuse of resources. It may be possible to implement a solution
that sits in the middle: construct the strtab as a shared resource, so we
only create a single map (with a single element) to hold the strtab, and all
generated code portions can access it. We can still concatenate all the code
fragments into a single larger function (but each fragment is generated as if
it stands on its own, so it will generate its own output.) The yet to be
solved problem with that approach is that code sharing because a real pain
because each fragment will have its own id that is dependent both on the
probe and the clause.
- It is not possible to trigger eBPF program execution for the ERROR probe in
a true sense because triggering a probe that we hijack to implement the ERROR
probe would cause reentrant eBPF program execution which is not possible when
handling trace events.
However, because the DTrace implementation based on eBPF works with an eBPF
program as trampoline to set up a context for DTrace and call the actual
program as a BPF-to-BPF function call, we can actually implement the ERROR
probe handling as an entry function that create the same kind of DTrace
context (for the ERROR probe) and then calls the default or user-specified
eBPF function that actually implements tha clauses associated with the probe.
All DTrace generated eBPF code should flag faults and at appropriate points
in execution a check should be made to determine whether a fault occured. At
that point, we need to call the entry function into the ERROR probe handling.
The ERROR probe clauses will generate probe data indicating the fault, and
the probe that caused the fault should bypess generating output because of
the fault that occured. If we ensure that writes to the buffer happen at the
clause level, we can guarantee that clauses either generate data or trigger
an ERROR (which will generate data). There is no possibility for partial
data to be written to the buffer. (Even if we decide to write out data per
individual action, again, either an action generates data or it will cause an
ERROR probe invocation to generate data.)